I reiterate my point: the size of a file does not matter if what you are comparing is the output of the two pieces of code.
That's what I said.
You want to make sure the output produced by the two pieces of code is the same.
Agree.
The size of a file does matter in a benchmark when you are concerned with the way the program is written and the algorithm used. That is, whether you have used the most optimized method when dealing with big files.
Yes, it does, but only if you perform a <single> benchmark. ST did two, so there are two points of reference, and as I noted, a linear formula can be derived from those two data points that roughly approximates, over a short range, whatever the actual relationship between them is. A third data point would be enough to fit a parabola, but that doesn't mean the performance relationship is a parabola; it's just all you can do with three points. It could very well be a cubic function of the size of the input.
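To make that concrete, here is a small sketch with made-up numbers (nothing from ST's actual runs): a line fitted through two benchmark points agrees with both measurements exactly, yet badly mispredicts a much larger input when the true cost happens not to be linear.

```python
def fit_linear(p1, p2):
    """Return slope and intercept of the line through two (size, time) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# Pretend the true cost is quadratic in the input size (hypothetical):
def true_time(n):
    return 1e-12 * n ** 2

# Two "measurements", at 1 million and 2 million bytes:
p1 = (1_000_000, true_time(1_000_000))
p2 = (2_000_000, true_time(2_000_000))
slope, intercept = fit_linear(p1, p2)

# The line reproduces both measured points exactly...
assert abs(slope * 1_000_000 + intercept - true_time(1_000_000)) < 1e-9
assert abs(slope * 2_000_000 + intercept - true_time(2_000_000)) < 1e-9

# ...but extrapolated to 1 billion bytes it is off by more than 100x.
predicted = slope * 1_000_000_000 + intercept
actual = true_time(1_000_000_000)
```

Two points always fit a line, three always fit a parabola; neither tells you the real growth curve, only a local approximation of it.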
The thing is, here, we can <SEE> the code. We can see, right off, why a larger file would make a difference. It doesn't matter whether that larger file is larger by a million or a billion bytes; it's still larger, that difference is reflected in the timings, and the reason is rather obvious, as partially evidenced by your quick mention of it.
Because of the size of the file, you have chosen to read the files in chunks.
That's a direct consequence of taking size into consideration when designing your program. That's why size does matter in a benchmark. 1 million is way different from 1 billion!
Oh, yes, of course, because everybody knows that you can't read in chunks for both 1 million and 1 billion. I obviously designed it specifically for the exact size that ST gave, and was in no way trying to make it more generic and efficient for smaller files (which it is; even a 128K file will benefit from chunked reading, because it causes less stress on the task allocator and less process memory fragmentation).
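To illustrate, here is a minimal sketch of chunked reading versus whole-file reading, using an MD5 checksum as a stand-in for whatever the real processing is (the chunk size, helper names, and test file are mine, not from the code under discussion). The point is that the two approaches produce identical output at any file size; only the peak memory behaviour differs.

```python
import hashlib
import os
import tempfile

CHUNK_SIZE = 64 * 1024  # 64K per read; the exact size is a tuning choice

def checksum_chunked(path):
    """Hash the file in fixed-size chunks: memory use is constant."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def checksum_whole(path):
    """Hash the file in one go: memory use grows with file size."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# A ~300K throwaway file is enough to show the outputs match.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(300_000))
    name = tmp.name
try:
    digest_chunked = checksum_chunked(name)
    digest_whole = checksum_whole(name)
finally:
    os.unlink(name)

assert digest_chunked == digest_whole
```

Nothing in the chunked version depends on whether the input is 128K, 1 million, or 1 billion bytes; that's exactly why it is the more generic design.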
If you want to get right down to it, all benchmarks are flawed because of the timing code; it changes the results by being there, but you can't get results without it. The difference is that the benchmark code surrounds all the different timed blocks equally, so that fact can be ignored when comparing the results.
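A rough sketch of that idea: the timing harness itself costs something on every measurement, but because every timed block pays the same fixed overhead, the relative comparison between blocks still holds (the workloads here are arbitrary placeholders).

```python
import time

def measure(fn, repeats=1000):
    """Average wall-clock time per call of fn, including harness overhead."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# The fixed per-call cost of the harness itself: time an empty function.
harness_overhead = measure(lambda: None)

# Two placeholder workloads of clearly different cost:
fast = measure(lambda: sum(range(100)))
slow = measure(lambda: sum(range(10_000)))

# Both measurements include the same fixed overhead, so subtracting it
# (or ignoring it entirely) doesn't change which block is faster.
assert slow > fast
assert (slow - harness_overhead) > (fast - harness_overhead)
```

The absolute numbers are contaminated by the harness; the ordering is not.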
I will agree that there are certainly instances where a million and a billion make a significant difference algorithm-wise. But at the same time, isn't even the slightest floating point error a huge difference when it comes to algorithms for trigonometric functions? What constitutes a significant difference depends on the goal of the code in question. In this case, all I considered was that ST was essentially testing a large file; I wasn't making sweeping design changes based on the fact that it was millions as opposed to billions, but rather generic changes where it won't matter whether it was a million or a billion. Will the timing be different for a billion and a million? Of course it will, and I will agree that in that sense the results are flawed. But you assume that my changes are based on his results, when in fact they are based merely on the simple premise that the code doesn't work properly for large files. I didn't pay very close attention to the specific timings, because all I needed to know was that it was slower with larger files; I didn't need to know how many milliseconds it took to process X characters.