It's almost a given that both how low-level the language is (the likes of C/C++) and how the code that reads a big file is written (the algorithm) play a part in performance.
The latter far more so than the former, even in this case.
C# is compiled to IL (Intermediate Language) and is subsequently run on the .NET CLR.
I created an 8-million-line file consisting of "this is an example line of text, line #<num>", where <num> was of course the current iteration, starting from 0. The resulting size of the file was 382,888,890 bytes (which works out to exactly 8,000,000 CRLF-terminated lines, numbered 0 through 7,999,999).
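For anyone who wants to reproduce the test, here is a minimal C++ sketch of how such a file could be generated and its size checked. It is not the generator used for the numbers in this post: the helper names `write_test_file` and `expected_size` are made up for illustration, and it assumes Windows-style CRLF line endings. Under that assumption, 8,000,000 lines numbered from 0 come to exactly 382,888,890 bytes.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: writes `count` lines numbered from 0, each ending in CRLF.
// The file is opened in binary mode so "\r\n" is written as-is on every platform.
void write_test_file(const std::string& path, std::uint64_t count) {
    std::ofstream out(path, std::ios::binary);
    for (std::uint64_t i = 0; i < count; ++i) {
        out << "this is an example line of text, line #" << i << "\r\n";
    }
}

// Hypothetical helper: computes the size such a file should have without writing it.
std::uint64_t expected_size(std::uint64_t count) {
    const std::string prefix = "this is an example line of text, line #";
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < count; ++i) {
        // prefix + decimal digits of the line number + 2 bytes for CRLF
        total += prefix.size() + std::to_string(i).size() + 2;
    }
    return total;
}
```

Calling `write_test_file("D:\\testoutput.txt", 8000000)` should then produce a file whose size matches `expected_size(8000000)`.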
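A C# program that simply reads it in and counts lines:
using System;
using System.Diagnostics;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        // Time the whole read, including opening the file.
        Stopwatch watch = new Stopwatch();
        watch.Start();
        int linecount = 0;
        using (StreamReader reader = new StreamReader("D:\\testoutput.txt"))
        {
            while (!reader.EndOfStream)
            {
                // The line itself is discarded; only the count matters here.
                String currentline = reader.ReadLine();
                linecount++;
            }
        }
        watch.Stop();
        Console.WriteLine("Finished. Total Time:" + watch.Elapsed.ToString() + ", read " + linecount + " Lines.");
        // Keep the console window open until a key is pressed.
        Console.ReadKey();
    }
}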
Output:
Finished. Total Time:00:00:07.8800219, read 8000000 Lines.
About 8 seconds to process the entire file.
My VBScript is a bit rusty, but I came up with this:
Dim FSO
Dim TStream
Dim StringRead, CurrentLine
Dim StartTime, EndTime
Set FSO = CreateObject("Scripting.FileSystemObject")
Set TStream = FSO.OpenTextFile("D:\testoutput.txt")
CurrentLine = 0
' Timer returns the number of seconds since midnight.
StartTime = Timer
Do While Not TStream.AtEndOfStream
    ' The line itself is discarded; only the count matters here.
    StringRead = TStream.ReadLine()
    CurrentLine = CurrentLine + 1
Loop
TStream.Close()
EndTime = Timer
' Elapsed time in seconds.
WScript.Echo EndTime - StartTime
Which should be functionally similar. It gave me this back:
47.14844
So the first thought would be that this extra time must be because VBScript is interpreted.
However, I'm not entirely certain this is the case, and that suspicion is borne out to some degree by putting the same code into a Visual Basic 6 project. Visual Basic 6 supports compiling to native code. Doing so yields a time of 55 seconds, about 8 seconds slower than VBScript. Interestingly, having it compile to P-code (an intermediate language of sorts) resulted in the program finishing about two seconds faster (53.2 seconds).
In VBScript, all variables use the 'Variant' data type. This effectively means that any access or assignment to a variable needs to package and unpackage an OLE VARIANT structure (internally, of course). Additionally, VBScript is late-bound, which means that its access to COM objects (such as the FileSystemObject) is all performed through IDispatch. Suffice it to say that this is much slower than an early-bound call, and it pretty much means it has to look up the method name each time it's used. In this case that's a problem, since there is both variable access (incrementing the line count) and late-bound method calls (the termination expression as well as the actual ReadLine() call) inside the loop body.
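To make the early-bound versus late-bound difference concrete, here is a rough C++ analogy. It is only a sketch: `FakeStream`, `count_early` and `count_late` are invented names, and the string-keyed map merely stands in for the "look the method up by name, then invoke it" pattern. It is not how IDispatch is actually implemented.

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Stand-in for a COM text stream; purely illustrative, not a real COM object.
struct FakeStream {
    int remaining;
    bool at_end() const { return remaining == 0; }
    void read_line() { --remaining; }
};

// Early-bound: every call is resolved by the compiler at build time.
int count_early(FakeStream s) {
    int lines = 0;
    while (!s.at_end()) {
        s.read_line();
        ++lines;
    }
    return lines;
}

// Late-bound: every call is found by name at run time, so each loop
// iteration pays for two string hashes and lookups before doing any work.
int count_late(FakeStream s) {
    std::unordered_map<std::string, std::function<int(FakeStream&)>> methods = {
        {"AtEndOfStream", [](FakeStream& f) { return f.at_end() ? 1 : 0; }},
        {"ReadLine", [](FakeStream& f) { f.read_line(); return 0; }},
    };
    int lines = 0;
    while (methods.at("AtEndOfStream")(s) == 0) {
        methods.at("ReadLine")(s);
        ++lines;
    }
    return lines;
}
```

Both functions return the same count for the same input; the only difference is how much bookkeeping happens per call, which is the overhead described above.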
Within Visual Basic 6, I made two changes: I referenced the Scripting Runtime (allowing early-bound calls) and made all variables strongly typed. This reduced processing time to 28.6 seconds. Still not as fast as C#; but the thing is that C# is always compiled to IL and only JIT-compiled to machine code at run time, while in this case Visual Basic 6 is compiling straight to native code, so clearly "lower-level" doesn't translate directly to faster performance. In this case the C# version is most likely faster because the JIT compiler is able to use newer processor features and run in 64-bit long mode (rather than under the 32-bit WoW64 layer), and that changes what the native code output by the Jitter contains. Visual Basic 6 has a native code compiler, but it will always optimize for a Pentium-class processor. Even enabling all advanced optimizations and the "Favor Pentium Pro" option didn't allow it to run faster than about 26 seconds.
You might think this is related to Visual Basic itself. That appears to be only partly true. With Visual Studio 2013, the following C++ code, built in Release mode with all optimizations set to full:
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string line;
    int linecount = 0;
    ifstream myfile;
    myfile.open("D:\\testoutput.txt");
    cout << "processing..." << endl;
    clock_t startTime = clock();
    // Note: testing good() before calling getline() runs the body one extra
    // time when the file ends with a newline, so the reported count is one
    // too high. Looping on while (getline(myfile, line)) avoids that.
    while (myfile.good()) {
        getline(myfile, line);
        linecount++;
    }
    cout << double(clock() - startTime) / CLOCKS_PER_SEC << " seconds." << endl;
    cout << "Finished." << endl;
    cout << "processed " << linecount << " lines.";
    // Wait for input so the console window stays open.
    int test;
    cin >> test;
}
resulted in this output:
processing...
9.535 seconds.
Finished.
processed 8000001 lines.
(This was with all optimizations set to full and favoring speed: /Ox, /Ot.) The only thing I can think of that would account for the small difference is that the C# program ran in native 64-bit mode, whereas the C++ project compiles to 32-bit by default; but switching the C++ program to x64 caused it to take about twice as long to complete. My guess as to why it's slower than C# in this case would have to be the standard library's ifstream implementation.
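One way to probe that guess is to compare line-by-line reading against reading the file in large blocks and counting newline bytes. The sketch below is untimed and makes no claim about which variant wins; the function names and the 1 MiB buffer size are arbitrary choices for illustration. It also includes the `good()`-style loop so the extra line in the count above can be reproduced.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Same loop shape as the timed program: checks good() before reading,
// so the failed read at end of file is counted as a line.
std::size_t count_lines_good_check(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    std::size_t count = 0;
    while (in.good()) {
        std::getline(in, line);
        ++count;
    }
    return count;
}

// Usual idiom: the getline() call is the loop condition, so a failed
// read is never counted.
std::size_t count_lines_getline(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        ++count;
    }
    return count;
}

// Block variant: reads raw bytes in large chunks and counts '\n'.
// Assumes every line, including the last one, ends with a newline.
std::size_t count_lines_block(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(1 << 20); // 1 MiB, an arbitrary size
    std::size_t count = 0;
    // A short final read sets failbit but still leaves data in the buffer,
    // so gcount() is checked as well.
    while (in.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) ||
           in.gcount() > 0) {
        count += static_cast<std::size_t>(
            std::count(buffer.begin(), buffer.begin() + in.gcount(), '\n'));
    }
    return count;
}
```

If the block variant turns out to be much faster on the same file, that would point at per-line overhead in getline and the stream layer rather than at raw disk throughput.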