Author Topic: how to get the Count of string in file (Read 36672 times)

vishuvishal · « **Reply #90 on:** August 15, 2010, 05:45:55 PM »

He he...
I hope this is a forum for DOS.
Not for VBS or VB or C.

Don't mind it.
But I have started liking batch programming.
I really appreciate your knowledge and expertise.
As I think, Windows functionality can be operated from DOS, because Windows itself is a DOS-operated operating system.
So I think you should count on batch rather than other languages.

If I said anything that hurt the integrity of any programmer, I really apologize for that.
I didn't mean it that way.
But can you point out which is the best IDE for the season?
Like, C is the best language.

Comments appreciated.

I know this is going off topic.

Thanks and regards.
Vishu

ghostdog74 · « **Reply #91 on:** August 15, 2010, 06:11:30 PM »

Quote from: Salmon Trout on August 15, 2010, 11:55:39 AM

One thousand and twenty-five thousand million and two bytes (1,025,000,002) as I posted above.

so it's 1 million (but the filename passed to your vbscript states 1 billion.)

Quote

Did I imply that I did not already realise this?

It appears so to me. You showed a benchmark between BCP's code and yours, then said BCP's is sluggish after a while, without stating your reasons or the conclusion of your findings. Makes one wonder why it happens, right?

vishuvishal · « **Reply #92 on:** August 15, 2010, 06:15:13 PM »

Quote from: ghostdog74 on August 15, 2010, 06:11:30 PM

so it's 1 million (but the filename passed to your vbscript states 1 billion.)

It appears so to me. You showed a benchmark between BCP's code and yours, then said BCP's is sluggish after a while, without stating your reasons or the conclusion of your findings. Makes one wonder why it happens, right?

Really don't know what you are talking about.

ghostdog74 · « **Reply #93 on:** August 15, 2010, 06:17:40 PM »

Quote from: victoria on August 15, 2010, 10:16:42 AM

The Sed solution is the best solution.

Not really! If it's a big file, your method of substituting the word to include newlines (which is expensive compared to pure string counting) and then piping to 2 calls of the find command to get the count is not the best way to go. The best way is to count the words found AS YOU ITERATE THE FILE (with whatever tool is processing it) and keep the count in memory. That said, sed is not the best tool to use in this case.
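A minimal sketch of counting while iterating, written in Python only for illustration (the thread itself uses batch, sed and VBScript). The function name `count_occurrences` and the in-memory sample are assumptions, not anything posted above. It holds one line at a time, so it cannot see a match that spans a line break.

```python
import io

def count_occurrences(lines, needle):
    """Count non-overlapping occurrences of needle, one line at a time."""
    total = 0
    for line in lines:              # only the current line is held in memory
        total += line.count(needle)
    return total

# Same contents as the yz.txt shown in this thread, kept in memory for the demo.
sample = io.StringIO("the\nthe\nthe\nthe\nthe the the\nthe the the\n")
print(count_occurrences(sample, "the"))   # prints 10
```

With a real file, passing the object returned by `open(path)` works the same way, since iterating over a file object yields its lines lazily.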

ghostdog74 · « **Reply #94 on:** August 15, 2010, 06:18:37 PM »

Quote from: vishuvishal on August 15, 2010, 06:15:13 PM

Really don't know what you are talking about.

sorry i don't care if you know or not. My words are not for you.

BC_Programmer · « **Reply #95 on:** August 15, 2010, 06:19:59 PM »

Quote from: ghostdog74 on August 15, 2010, 06:11:30 PM

Quote
One thousand and twenty-five thousand million and two bytes (1,025,000,002)
so its 1 million (but your filename passed to your vbscript states 1 billion. )

a Billion is a thousand millions... (In North America, at least)

ghostdog74 · « **Reply #96 on:** August 15, 2010, 06:25:33 PM »

Quote from: BC_Programmer on August 15, 2010, 06:19:59 PM

so its 1 million (but your filename passed to your vbscript states 1 billion. )

a Billion is a thousand millions... (In North America, at least)

ok ok. But I am talking about post #83, where ST said he downloaded "1 million places of pi", then his file name for testing the benchmark is "1 billion places of pi". He is showing a benchmark, and when there are ambiguities, it's only natural for the inquisitive mind to ask questions.

BC_Programmer · « **Reply #97 on:** August 15, 2010, 06:31:11 PM »

Quote from: ghostdog74 on August 15, 2010, 06:25:33 PM

ok ok. But i am talking about post #83. where ST said he download "1 million places of pi", then his file name for testing the benchmark is "1 billion places of pi". He is showing a benchmark, and when there are ambiguities, someone like me will question.

Doesn't much matter if it's a billion or a million, as long as the same inputs were used to test both. The exact size is more a curiosity (except in some cases).

ghostdog74 · « **Reply #98 on:** August 15, 2010, 06:37:35 PM »

a billion and a million is different.

ghostdog74 · « **Reply #99 on:** August 15, 2010, 06:42:03 PM »

Quote from: victoria on August 07, 2010, 02:21:14 PM

Two \\ should be one

C:\test>type cntstr.bat
rem @echo off
sed s/%1/%1\n/g %2 | egrep -c %1

C:\test>cntstr.bat the yz.txt

C:\test>rem @echo off

C:\test>sed s/the/the\n/g yz.txt | egrep -c the
10

C:\test>type yz.txt
the
the
the
the
the the the
the the the

This example will also count words that merely contain "the", such as thesis or other, which are not exactly the word "the". egrep is also deprecated; use grep -E.

Code:

grep -Eo "\bthe\b" file|wc -l

the above does not need to do substitution on the entire file and gets the exact string.
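For comparison, the same whole-word idea can be sketched in Python with the standard `re` module, where `\b` is also a word boundary. The helper name `count_whole_word` and the sample sentence are made up for illustration.

```python
import re

def count_whole_word(text, word):
    """Count occurrences of word that stand alone, not inside a longer word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return sum(1 for _ in pattern.finditer(text))

sample = "the thesis of the other theory; the end"
print(count_whole_word(sample, "the"))   # whole-word matches only: 3
print(sample.count("the"))               # plain substring count: 6
```

The gap between the two numbers comes from thesis, other and theory, which contain "the" without being the word itself.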

BC_Programmer · « **Reply #100 on:** August 15, 2010, 06:54:07 PM »

Quote from: ghostdog74 on August 15, 2010, 06:37:35 PM

a billion and a million is different.

Not in this case. What difference would it have on the results? Sure, the numbers will be larger for a billion than for a million, but it's not the actual number that's important, it's how the two numbers compare.

ST performed two tests: one with a smaller file, and one with a larger file. The two tests revealed that with a larger amount of data to read, my method causes a large IO bottleneck. Two points of reference are enough for a crude line-chart comparison of the two, and while it may not be entirely accurate, it can reveal specific trends in the two functions. For example, we can determine that my routine seems to run at something like O((n/4)^2), whereas his is a more linear method whose time taken is linearly related to the length of the file. In mine, this is not the case because additional overhead is required for the system to properly manage the larger amount of memory being used to store the entire string.
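As a rough illustration of reading a trend from two measurements: if running time is assumed to grow like c·n^k, then two (size, time) pairs give k = log(t2/t1) / log(n2/n1). The sketch below is Python, and every figure in it is hypothetical; none of them are ST's actual benchmark results.

```python
import math

def growth_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ c * n**k from two (input size, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# Hypothetical timings in seconds for a 10x larger input.
linear_like = growth_exponent(1_000_000, 2.0, 10_000_000, 20.0)
quadratic_like = growth_exponent(1_000_000, 2.0, 10_000_000, 200.0)

print(round(linear_like, 2))      # about 1: time grows in step with size
print(round(quadratic_like, 2))   # about 2: time grows with the square of size
```

Two points fix only a single slope, so this is a crude estimate and says nothing about constant factors or behaviour at other sizes.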

What is important here is that we are comparing the programs used. As long as the inputs are the same, the comparisons are valid.

If you test program A and Program B with Input C, it's a fair comparison between A and B as long as C is the same for both.

It doesn't matter if there was a mixup over the specifics of the size of C. The comparison was between A and B.

If you compare a Quick Sort with a Merge Sort, whether you are testing with a million or a billion elements is largely redundant; what's important is the comparison. If there was confusion over the layout of the data (such as how a quicksort takes longer than a merge sort with a nearly sorted array) and it was relevant, then yes, I would agree. But while there is indeed some ambiguity, it's irrelevant.

ghostdog74 · « **Reply #101 on:** August 15, 2010, 07:04:58 PM »

Quote from: BC_Programmer on August 15, 2010, 06:54:07 PM

Not in this case. What difference would it have on the results? Sure, the numbers will be larger for a billion than for a million, but it's not the actual number that's important, it's how the two numbers compare.

If it's a larger file, then your method of slurping it all into memory is not a good solution. That's the difference. Why do you say it's not important? If the test files are like 1 thousand vs 100, then of course your method will work. Size of the test samples does matter when doing benchmarks, as it will affect the design of the algorithm being used.

victoria · « **Reply #102 on:** August 15, 2010, 07:52:35 PM »

Quote from: ghostdog74 on August 15, 2010, 06:42:03 PM

This example will also count words that merely contain "the", such as thesis or other, which are not exactly the word "the". egrep is also deprecated; use grep -E.
Code:
grep -Eo "\bthe\b" file|wc -l
the above does not need to do substitution on the entire file and gets the exact string.

Ghost,
Your grep works. I had an old 2005 version.

Your skill level has improved. Who is your Tutor?

C:\test>grep -Eo the yz.txt
the
the
the
the
the
the
the
the
the
the

C:\test>grep -Eo the yz.txt | wc -l
10

C:\test>type yz.txt
the
the
the
the
the the the
the the the
C:\test>

BC_Programmer · « **Reply #103 on:** August 15, 2010, 08:04:13 PM »

Quote from: ghostdog74 on August 15, 2010, 07:04:58 PM

If it's a larger file, then your method of slurping it all into memory is not a good solution. That's the difference. Why do you say it's not important? If the test files are like 1 thousand vs 100, then of course your method will work. Size of the test samples does matter when doing benchmarks, as it will affect the design of the algorithm being used.

Reread my post.

ghostdog74 · « **Reply #104 on:** August 15, 2010, 08:54:59 PM »

Quote from: BC_Programmer on August 15, 2010, 08:04:13 PM

Reread my post.

I reiterate my point. Size of a file does not matter if what you are comparing is the output of the 2 pieces of code. That is, you want to make sure the output produced by the 2 pieces of code is the same. Size of file does matter in a benchmark when you are concerned about the way the program is written and the algorithm used. That is, whether you have used the most optimized method when dealing with big files.

Because of the size of the file, you have chosen to read the files in chunks. That's a direct consequence of taking size into consideration when designing your program. That's why size does matter in a benchmark. 1 million is way different from 1 billion!
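To make the chunked approach concrete, here is a hedged Python sketch. It is not BC_Programmer's code; the name `count_in_chunks` and the default chunk size are assumptions. The one subtle point is a match that straddles two chunks, so the last few characters of each buffer are carried over to the next read.

```python
import io

def count_in_chunks(stream, needle, chunk_size=64 * 1024):
    """Count non-overlapping occurrences of needle, reading fixed-size chunks."""
    total = 0
    tail = ""                              # text carried over from the last read
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        pos = 0
        while True:
            hit = buf.find(needle, pos)
            if hit == -1:
                break
            total += 1
            pos = hit + len(needle)        # continue after this match
        # Keep only the characters that could begin a match crossing the boundary.
        keep_from = max(pos, len(buf) - (len(needle) - 1))
        tail = buf[keep_from:]
    return total

text = "the\nthe\nthe\nthe\nthe the the\nthe the the\n"
print(count_in_chunks(io.StringIO(text), "the", chunk_size=2))   # prints 10
```

Memory use stays bounded by the chunk size plus the length of the search string, whatever the file size, which is the design consequence being argued about here.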
