Computer Hope
Microsoft => Microsoft DOS => Topic started by: arunavlp on August 03, 2010, 04:43:17 AM
-
hi,
I have a file consisting of a single line, 35 MB in size.
Eg:-
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
I need to get a count of INS* in the above file. I am new to DOS commands.
Please help me.
Thanks in Advance.
Regards,
Arun S.
-
Download gawk for Windows (http://gnuwin32.sourceforge.net/packages/gawk.htm),
then
c:\test> gawk "{m=gsub("INS",""); total+=m}END{print "total:" total}" file
-
hi ,
Thanks for the suggestion, but I got an error message like this:
30.834
gawk: {m=gsub(INS,");
gawk: ^ unterminated string
I don't know what this error means. Please help me with this.
Regards,
Arun S.
-
Escape your double quotes
c:\test> gawk "{m=gsub(\"INS\",\"\"); total+=m}END{print \"total:\" total}" file
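A note on the pattern: gawk's gsub treats its first argument as a regular expression, so searching for the literal text INS* (with the asterisk) would need the asterisk escaped. The following is a minimal Python sketch of the difference, not part of the original replies; the function names and the second sample string are made up for illustration.

```python
import re

def count_literal(text: str, needle: str) -> int:
    """Count non-overlapping occurrences of needle, treated as plain text."""
    return len(re.findall(re.escape(needle), text))

def count_regex(text: str, pattern: str) -> int:
    """Count non-overlapping matches of pattern, treated as a regular expression."""
    return len(re.findall(pattern, text))

# Sample line from the original question.
thread_sample = "arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~"
# Hypothetical data chosen so that the two interpretations disagree.
tricky_sample = "INS*a~IN~INSS*b"

print(count_literal(thread_sample, "INS*"))  # 2
print(count_regex(thread_sample, "INS*"))    # 2 (happens to agree here)
print(count_literal(tricky_sample, "INS*"))  # 1
print(count_regex(tricky_sample, "INS*"))    # 3 ("INS", "IN" and "INSS" all match)
```

On the sample from the question both readings give 2, which is why the difference is easy to miss.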
-
hi,
Thanks, it works. :) But please let me know if we can do it with the Find command.
Regards,
Arun S.
-
I personally wouldn't bother. find (or findstr) just finds the lines that contain the string for you. It won't count how many occurrences there are. More involved programming is needed (which I will leave to someone else who has the expertise and time to show you).
When parsing files and doing string manipulation, use a good tool for the job.
-
The find command will count the lines with the search argument. If a line has more than one occurrence of the search argument, it still counts for one. Findstr does not do counting but allows for multiple search arguments and a limited form of regular expressions.
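To make the distinction concrete, here is a small Python sketch (an illustration only, not one of the tools discussed in this thread) that contrasts counting matching lines, which is what find /c reports, with counting every occurrence. The function names are hypothetical; the sample text mirrors the yz.txt file shown later in the thread.

```python
def count_matching_lines(text: str, needle: str) -> int:
    """Count lines that contain needle at least once (what find /c reports)."""
    return sum(1 for line in text.splitlines() if needle in line)

def count_occurrences(text: str, needle: str) -> int:
    """Count every non-overlapping occurrence of needle in the whole text."""
    return text.count(needle)

# Four lines with one "the" each, then two lines with three each.
sample = "the\nthe\nthe\nthe\nthe the the\nthe the the\n"

print(count_matching_lines(sample, "the"))  # 6
print(count_occurrences(sample, "the"))     # 10
```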
You can use VBScript which came with your Windows machine. The little demo script will prompt the user for the file name and the search argument. It can be tweaked to remove the prompts (which will probably gut the majority of the script). ;D
Const ForReading = 1
Set fso = CreateObject("Scripting.FileSystemObject")
Do
    WScript.StdOut.Write "Please enter file name: "
    strFile = WScript.StdIn.ReadLine
    If fso.FileExists(strFile) Then
        Set objFile = fso.OpenTextFile(strFile, ForReading)
        strCharacters = objFile.ReadAll
        Exit Do
    Else
        WScript.StdOut.Write "Invalid file name ... Try Again" & vbCrLf
    End If
Loop
Do
    WScript.StdOut.Write "Please enter character string: "
    strToCount = WScript.StdIn.ReadLine
    If strToCount <> "" Then Exit Do
Loop
strTemp = Replace(LCase(strCharacters), LCase(strToCount), "")
WScript.Echo "Occurences of:", strToCount, "=", (Len(strCharacters) - Len(strTemp)) / Len(strToCount)
objFile.Close
Save the script with a vbs extension and run only from the command prompt as: cscript scriptname.vbs
Good luck. ;D
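The counting trick in the script above (remove every occurrence, then divide the drop in length by the length of the search string) can be sketched in Python as follows. This is only an illustration under the same assumptions as the script: the whole file fits in memory and the comparison ignores case by default. The function name is hypothetical.

```python
def count_by_length_difference(text: str, needle: str, ignore_case: bool = True) -> int:
    """Count non-overlapping occurrences of needle by measuring how much
    shorter the text becomes once every occurrence is removed."""
    if not needle:
        raise ValueError("needle must not be empty")
    if ignore_case:
        # Lower-case both sides first, as the VBScript does with LCase.
        text, needle = text.lower(), needle.lower()
    stripped = text.replace(needle, "")
    return (len(text) - len(stripped)) // len(needle)

line = "arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~"
print(count_by_length_difference(line, "INS*"))  # 2
```

Because the search string is treated as plain text, the asterisk in INS* needs no escaping here.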
-
I can give you an idea of what it should look like:
set /p pass= <string.txt
echo %pass%
call set new=%%pass:~%a%,1%%
set /a a=%a% + 1
set key=%key%%new%
echo %new%
This new will give you the number of string.
However, I will give you further details tomorrow.
Thanks and regards
vishu
-
set /p pass=<string.txt
echo %pass%
:st
call set new=%%pass:~%a%,1%%
echo a=%a% + 1
echo %a%
set key=%key%%new%
echo %new%
echo %key%
pause
::if %new% ==; goto :EOF
pause
goto :st
All we need to fix is the loop.
Change string.txt to your file's drive:path\file name.
I gave you the best option.
-
@echo off
sed s/the/the\n/g yz.txt | egrep -c the
counthe.bat
10
type yz.txt
the
the
the
the
the the the
the the the
-
The sed command takes only one backslash before the n:
C:\test>type cntstr.bat
rem @echo off
sed s/%1/%1\n/g %2 | egrep -c %1
C:\test>cntstr.bat the yz.txt
C:\test>rem @echo off
C:\test>sed s/the/the\n/g yz.txt | egrep -c the
10
C:\test>type yz.txt
the
the
the
the
the the the
the the the
-
Only one backslash each time:
type cntstr.bat
rem @echo off
sed s/%1/%1\n/g %2 | egrep -c %1
cntstr.bat 22 yr2010.doc
rem @echo off
sed s/22/22\n/g yr2010.doc | egrep -c 22
12
(http://i7.photobucket.com/albums/y268/billrich/2010.jpg)
-
(http://i7.photobucket.com/albums/y268/billrich/generic-1.jpg)
-
Output for reply #6 by sidewinder
cscript swcnt.vbs
Microsoft (R) Windows Script Host Version 5.8
Copyright (C) Microsoft Corporation. All rights reserved.
Please enter file name: yr2010.doc
Please enter character string: 22
Occurences of: 22 = 12
-
Victoria, I really do not understand what these commands will do.
This does not seem like a proper bat file.
-
Victoria, I really do not understand what these commands will do.
Seems like not a proper bat file.
It is a VBS written by Sidewinder in Reply #6. It works perfectly.
I do not write VBS.
Many ways to skin a cat.
-
I use a Proxy Server to reach Computer Hope. The editor used to post this does not work well. I get random extra backslashes. There was only one backslash each time here.
Nevertheless, the string count by me and sidewinder is the same.
Please do not question my sanity.
-
I use a Proxy Server to reach Computer Hope.
We know.
-
hi,
It worked, thanks. Smart work.
Regards,
Arun S.
-
We know.
How do I avoid random backslashes when I post?
-
victoria...just wondering, what's the proxy server's IP?
-
victoria...just wondering, what is the proxy server IP?
????
The Ip address is dynamic and several Proxy Servers rotate the IP address?
Not really sure?
-
We can do this using DOS also.
Please run this in a loop, and add a condition to stop once we reach EOF, such as '\0', null, or " ". I do not know exactly what to use in batch.
Please complete or correct my code.
set /p pass=<string.txt
echo %pass%
:st
call set new=%%pass:~%a%,1%%
echo a=%a% + 1
echo %a%
Thanks and regards
vishu
-
(http://i7.photobucket.com/albums/y268/billrich/vincnt89.jpg)
-
To Vis,
(http://i7.photobucket.com/albums/y268/billrich/vincnt89.jpg)
-
Hello. Wrong Thread.
-
C:\test>type viscnt.bat
@echo off
set /a a=2
echo Here is a string > string.txt
set /p pass=<string.txt
echo pass=%pass%
REM :st is label or a point in the code
REM where we jump to or return to
call :st %a%
echo return from :st
rem we may use call to jump to or return to a location
rem ( a label ) in the code or to rem another batch file
rem set new=%%pass:~%a%,1%% I do not know what this does
set new=%pass:~%a%,1%
echo new=%new%
rem set assigns a value to a variable.
rem A variable is a location in RAM where the value is stored
set /a a=%a% + 1
echo a=%a%
goto :end
:st %a%
echo a=%1
echo We are at the :st label location
echo a=%1
exit /b
:end
Output:
C:\test>viscnt.bat
pass=Here is a string
a=2
We are at the :st label location
a=2
return from :st
new=Here is a string a
a=3
-
Hello. Wrong Thread.
Getting confused, Bill? :)
-
Say What?
-
Say What?
We know it's you, Bill
-
We know it's you, Bill?
What is Salmon Trout talking about?
Is Salmon Trout part of the ComputerHope.com Staff?
-
That third person thing is a "dead giveaway" as Londoners say, Bill.
-
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
I need to get a count of INS* in the above file. Am new to DOS Commands.
Arunavlp,
I'm sorry the thread got off topic.
Sidewinder* and Ghostdog provided excellent methods for counting the number of times a string appears in a document.
Please ignore the off topic posts.
Good Luck
* Reply #6 on: August 05, 2010, 04:31:42 AM
-
Victoria, if you aren't Billrich, how come you have access to his Photobucket account?
-
Thanks It works.. but please let me know if we can do it in Find Command....
Arun S.
Arun,
I'm sorry, Arun, that the off-topic posts continue.
Some posters never make any suggestions for counting strings in a document.
These posters write about topics completely unrelated to counting strings.
_____________________________
sed s/22/22\n/g yr2010.doc | egrep -c 22
The number of 22 strings in the calendar is 12: one 22 for each month.
The backslash followed by n adds a newline after each match, so strings that share a line are counted separately.
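The two-step idea behind the sed and egrep pipeline (put a newline after every match, then count the lines that still contain the string) can be sketched in Python like this. It is an illustration only: the function name is hypothetical, and unlike sed it treats the search string as plain text rather than as a regular expression.

```python
def count_via_line_split(text: str, needle: str) -> int:
    """Count occurrences by giving each match its own line, then counting
    the lines that contain the needle."""
    # Step 1, like sed s/needle/needle\n/g: append a newline after each match.
    split_text = text.replace(needle, needle + "\n")
    # Step 2, like egrep -c needle: count lines containing the needle.
    return sum(1 for line in split_text.split("\n") if needle in line)

yz = "the\nthe\nthe\nthe\nthe the the\nthe the the\n"
print(count_via_line_split(yz, "the"))  # 10
```

After step 1 every line holds at most one match, which is why a per-line counter then gives the per-occurrence total.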
-
Can we use FINDSTR
-
I have an idea: we search for the end of the file
and run this command in a loop until then.
Keep a counter variable running while the loop runs.
Once we reach the end of the file, stop the loop.
Check the variable. That would be the number of strings.
We can also use
for /f "delims=" %%i in (id.txt) do (
echo i = %%i
)
if we can apply all of these.
Can anyone create this?
That would do it.
Thanks and regards
vishu
-
These posters write about topics completely unrelated to counting strings.
Like your breach of the forum rules. Squirm how you like, you've been rumbled!
-
Can we use FINDSTR
Let us see your code.
Findstr with a counter might work? Findstr will usually only print the line where the string is found. When the string appears twice in the same line, only one string is counted. Therefore the final count is wrong.
I will try findstr again. Sidewinder and the other experts stated findstr will not work.
Let us see your code.
-
for /f "delims=" %%i in (id.txt) do (
echo i = %%i
)
vishu
C:\test>type vis812.bat
@echo off
set /a c=0
setlocal enabledelayedexpansion
for /f "tokens=1-5" %%i in (id.txt) do (
echo %%i %%j %%k
if "%%i"=="the" set /a c=!c! + 1
if "%%j"=="the" set /a c=!c! + 1
if "%%k"=="the" set /a c=!c! + 1
)
echo count=%c%
echo.
echo Display id.txt
echo.
type id.txt
Output.
C:\test>vis812.bat
the
the
the
the
the the the
the the the
count=10
Display id.txt
the
the
the
the
the the the
the the the
p.s. The above code uses only batch code, not findstr.
The code will only count the string in id.txt. There are 10 occurrences of "the" in id.txt.
The code will most likely not work with other text files.
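Why the token approach is fragile can be seen in a short Python sketch (an illustration only; the function name and the second sample are hypothetical). Splitting each line on whitespace and comparing whole tokens only finds the string when it stands alone as a word, and only within the tokens that were captured.

```python
def count_tokens(text: str, word: str, max_tokens: int = 26) -> int:
    """Count whitespace-separated tokens equal to word, looking only at the
    first max_tokens tokens of each line (as for /f "tokens=1-26" does)."""
    total = 0
    for line in text.splitlines():
        tokens = line.split()[:max_tokens]
        total += sum(1 for token in tokens if token == word)
    return total

prose = "the the the\nother, the.\n"
print(count_tokens(prose, "the"))  # 3: "other," and "the." are not exact tokens
print(prose.count("the"))          # 5: substring counting also finds those two

record = "arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~"
print(count_tokens(record, "INS*"))  # 0: the line has no spaces, so it is one token
print(record.count("INS*"))          # 2
```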
-
Can we use FINDSTR
C:\test>findstr 22 yr2010.doc
17 18 19 20 21 22 23 21 22 23 24 25 26 27 21 22 23 24 25 26 27
18 19 20 21 22 23 24 16 17 18 19 20 21 22 20 21 22 23 24 25 26
18 19 20 21 22 23 24 22 23 24 25 26 27 28 19 20 21 22 23 24 25
17 18 19 20 21 22 23 21 22 23 24 25 26 27 19 20 21 22 23 24 25
C:\test>findstr 22 yr2010.doc | find /c /v ""
4
C:\test>
Vis,
Even though each line above has three 22s, findstr counts only one per line.
I do not know how to modify this so that findstr counts every occurrence.
-
Can we use FINDSTR
C:\test>type yr812.bat
@echo off
set /a c=0
setlocal enabledelayedexpansion
for /f "tokens=1-26" %%a in (yr2010.doc) do (
if "%%a"=="22" set /a c=!c! + 1
if "%%b"=="22" set /a c=!c! + 1
if "%%c"=="22" set /a c=!c! + 1
if "%%d"=="22" set /a c=!c! + 1
if "%%e"=="22" set /a c=!c! + 1
if "%%f"=="22" set /a c=!c! + 1
if "%%g"=="22" set /a c=!c! + 1
if "%%h"=="22" set /a c=!c! + 1
if "%%i"=="22" set /a c=!c! + 1
if "%%j"=="22" set /a c=!c! + 1
if "%%k"=="22" set /a c=!c! + 1
if "%%l"=="22" set /a c=!c! + 1
if "%%m"=="22" set /a c=!c! + 1
if "%%n"=="22" set /a c=!c! + 1
if "%%o"=="22" set /a c=!c! + 1
if "%%p"=="22" set /a c=!c! + 1
if "%%q"=="22" set /a c=!c! + 1
if "%%r"=="22" set /a c=!c! + 1
if "%%s"=="22" set /a c=!c! + 1
if "%%t"=="22" set /a c=!c! + 1
if "%%u"=="22" set /a c=!c! + 1
if "%%v"=="22" set /a c=!c! + 1
if "%%w"=="22" set /a c=!c! + 1
if "%%x"=="22" set /a c=!c! + 1
if "%%y"=="22" set /a c=!c! + 1
if "%%z"=="22" set /a c=!c! + 1
)
echo count=%c%
echo.
echo Display yr2010.doc
echo.
Output:
C:\test>yr812.bat
count=12
Display yr2010.doc
C:\test>
-
Can we use FINDSTR
Vis,
( Code has not been fully tested but my price is right.)
REM This generic batch string counter should work for most files and strings
Rem Usage: cnt812.bat string file.txt
REM Usage: cnt812.bat the id.txt
C:\test>type cnt812.bat
@echo off
set /a c=0
setlocal enabledelayedexpansion
for /f "tokens=1-26" %%a in (%2) do (
if "%%a"=="%1" set /a c=!c! + 1
if "%%b"=="%1" set /a c=!c! + 1
if "%%c"=="%1" set /a c=!c! + 1
if "%%d"=="%1" set /a c=!c! + 1
if "%%e"=="%1" set /a c=!c! + 1
if "%%f"=="%1" set /a c=!c! + 1
if "%%g"=="%1" set /a c=!c! + 1
if "%%h"=="%1" set /a c=!c! + 1
if "%%i"=="%1" set /a c=!c! + 1
if "%%j"=="%1" set /a c=!c! + 1
if "%%k"=="%1" set /a c=!c! + 1
if "%%l"=="%1" set /a c=!c! + 1
if "%%m"=="%1" set /a c=!c! + 1
if "%%n"=="%1" set /a c=!c! + 1
if "%%o"=="%1" set /a c=!c! + 1
if "%%p"=="%1" set /a c=!c! + 1
if "%%q"=="%1" set /a c=!c! + 1
if "%%r"=="%1" set /a c=!c! + 1
if "%%s"=="%1" set /a c=!c! + 1
if "%%t"=="%1" set /a c=!c! + 1
if "%%u"=="%1" set /a c=!c! + 1
if "%%v"=="%1" set /a c=!c! + 1
if "%%w"=="%1" set /a c=!c! + 1
if "%%x"=="%1" set /a c=!c! + 1
if "%%y"=="%1" set /a c=!c! + 1
if "%%z"=="%1" set /a c=!c! + 1
)
echo count=%c%
echo.
echo Display %2
echo.
type %2
Output:
C:\test>cnt812.bat the id.txt
count=10
Display id.txt
the
the
the
the
the the the
the the the
C:\test>
-
Can we use FINDSTR
sed s/the/the\n/g id.txt | findstr the | find /c /v ""
count=10
sed for windows is an easy download
sed means stream editor
-
http://thesystemguard.com/NTCmdLib/Functions/SCOUNT.htm
-
@echo off
>substringcount.vbs echo substring = wscript.arguments(0)
>>substringcount.vbs echo longstring = wscript.arguments(1)
>>substringcount.vbs echo Subslen = Len(Substring)
>>substringcount.vbs echo longlen = Len(longstring)
>>substringcount.vbs echo Subcount = 0
>>substringcount.vbs echo Substart = InStr ( longstring, Substring )
>>substringcount.vbs echo If Substart ^> 0 Then
>>substringcount.vbs echo Do
>>substringcount.vbs echo Subcount = Subcount + 1
>>substringcount.vbs echo longstring = Mid( longstring, ( Substart + Subslen ) )
>>substringcount.vbs echo Substart = InStr ( longstring,Substring )
>>substringcount.vbs echo If Substart = 0 Then Exit Do
>>substringcount.vbs echo Loop
>>substringcount.vbs echo End If
>>substringcount.vbs echo wscript.echo Subcount
set mainstring="arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~"
set substring="INS*"
for /f "delims=" %%C in ('cscript //nologo substringcount.vbs %substring% %mainstring%') do set count=%%C
echo Found string %substring% %count% times in string %mainstring%
S:\>test.bat
Found string "INS*" 2 times in string "arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~"
-
@echo off
S:>test.bat
Found string \"INS*\" 2 times in \"string \"arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~\"
(http://i7.photobucket.com/albums/y268/billrich/ststr.jpg)
cntstr.bat INS* st813.txt
rem @echo off
sed s/INS*/INS*\n/g st813.txt | findstr INS* | find /c /v ""
count=2
echo Display string
Display string
type st813.txt
arun*America*MSC~INS*dfffs*dfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
-
Thanks for all the research work. I appreciate it.
You are really hard working.
I consider myself a beginner.
I think I must keep my mouth shut.
:P :P :P :-X :-X :-X :P :P :P
Thanks Victoria.
-
vishuvishal: just out of curiosity, where are you from? :)
-
Found string \"INS*\" 2 times in \"string \"arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~\"
Where are those back slashes coming from?
-
vishuvishal: just out of curiousity, where are you from? :)
Not too far from Bill's trailer, I daresay.
-
Re: how to get the Count of string in file
vishuvishal: just out of curiousity, where are you from? Smiley
Not too far from Bill's trailer, I daresay.
I am far from there.
I am somewhere from eastern side.
I hope you will hate me for this.
-
I am far from there.
And yet in the same time zone?
I am somewhere from eastern side.
Eastern side of what?
I hope you will hate me for this.
um, ok.
-
Where are those back slashes coming from?
I connect to Computerhope.com through a Proxy Server
The following is a guess about the origin of the random backslashes:
The Editor at the Proxy Server posts my post here at Computerhope.com?
The Editor at the Proxy Server inserts the random backslashes?
Or the staff here at computerhope.com sets their editor to insert random backslashes?
I do not know how to correct the problem.
Thanks for your help.
-
I do not know how to correct the problem.
Go away. That'll fix it.
-
Go away. That will fix it.
What have I done to hurt anything?
Why should I leave?
-
Thanks for all the research work. I appreciate it.
You are really hard working.
I enjoy trying to answer questions.
I believe sed is a very useful tool for many problems.
Sed was written by AT&T many years ago for the Unix Operating System.
Sed is now used with many operating systems.
p.s. Ignore the negative comments by some of the other posters.
The people making the negative comments have a ton of good information when they
choose to help.
Why do they have a need to insult people who came to Computerhope looking for help?
Good Luck
-
Thanks for all the research work. I appreciate it.
(http://i7.photobucket.com/albums/y268/billrich/waldo.jpg)
-
And yet in the same time zone?
Eastern side of what?
Billrich's trailer? Billrich's head more like.
-
Why should I leave?
Because you were banned before and forbidden to return.
-
FWIW, vbs tidied up...
@echo off
>substringcount.vbs echo substring = wscript.arguments (0)
>>substringcount.vbs echo longstring = wscript.arguments (1)
>>substringcount.vbs echo Subslen = Len ( Substring )
>>substringcount.vbs echo count = 0
>>substringcount.vbs echo Do
>>substringcount.vbs echo Substart = InStr ( longstring, Substring )
>>substringcount.vbs echo If Substart ^> 0 then count = count + 1
>>substringcount.vbs echo longstring = Mid ( longstring, ( Substart + Subslen ) )
>>substringcount.vbs echo Loop Until Substart = 0
>>substringcount.vbs echo wscript.echo count
set bigstring="arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~"
set substring="INS*"
for /f "delims=" %%C in ('cscript //nologo substringcount.vbs %substring% %bigstring%') do set count=%%C
echo Found string %substring% %count% times in string %bigstring%
del substringcount.vbs
-
I wonder if Billrich ("Victoria") and Vishuvishal are one and the same person? If so he is laughing at us.
-
(http://i7.photobucket.com/albums/y268/billrich/fish.jpg)
C:\test>type cntstr.bat
@echo off
sed s/%1/%1\n/g %2 |findstr %1| find /c /v ""
echo Display string
echo.
type %2
Output:
C:\test>cntstr.bat INS* bigstring.txt
count=2
Display string
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
-
http://i7.photobucket.com/albums/y268/billrich/fish.jpg
-
@echo off
setlocal enabledelayedexpansion
>substringcount.vbs echo substring = wscript.arguments (0)
>>substringcount.vbs echo longstring = wscript.arguments (1)
>>substringcount.vbs echo Subslen = Len ( Substring )
>>substringcount.vbs echo count = 0
>>substringcount.vbs echo Do
>>substringcount.vbs echo Substart = InStr ( longstring, Substring )
>>substringcount.vbs echo If Substart ^> 0 then count = count + 1
>>substringcount.vbs echo longstring = Mid ( longstring, ( Substart + Subslen ) )
>>substringcount.vbs echo Loop Until Substart = 0
>>substringcount.vbs echo wscript.echo count
set substring=INS*
set infile=test.txt
set total=0
for /f "delims=" %%A in (test.txt) do (
set bigstring=%%A
for /f "delims=" %%C in ( ' cscript //nologo substringcount.vbs "%substring%" "!bigstring!" ' ) do set /a total=!total!+%%C
)
del substringcount.vbs
echo Found string %substring% %total% times in file %infile%
arun*America*MSC~INS*egggs*Segse*segse~ssgse*segse~INS*egggs*segseg*segs~
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
arun*America*MSC~INS*egggs*Segse*segse~ssgse*segse~INS*egggs*segseg*segs~
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
arun*America*MSC~INS*egggs*Segse*segse~ssgse*segse~INS*egggs*segseg*segs~
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
arun*America*MSC~INS*egggs*Segse*segse~ssgse*segse~INS*egggs*segseg*segs~
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
Found string INS* 16 times in file test.txt
-
C:\test>type cntstr.bat
@echo off
sed s/%1/%1\n/g %2 |findstr %1| find /c /v ""
echo Display string
echo.
type %2
Output:
C:\test> cntstr.bat INS* test.txt
count=16
Display string
arun*America*MSC~INS*egggs*Segse*segse~ssgse*segse~INS*egggs*segseg*segs~
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
arun*America*MSC~INS*egggs*Segse*segse~ssgse*segse~INS*egggs*segseg*segs~
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
arun*America*MSC~INS*egggs*Segse*segse~ssgse*segse~INS*egggs*segseg*segs~
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
arun*America*MSC~INS*egggs*Segse*segse~ssgse*segse~INS*egggs*segseg*segs~
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
C:\test>
-
http://i7.photobucket.com/albums/y268/billrich/fish.jpg
Use the image tag: [img]
-
Bill, will you please quit trolling my posts? I posted the text of the link, without image tags, in order to display the presence of "billrich" in the link to the image hosted in your photobucket account, to show that "Victoria" is in fact the banned troll Billrich/marvinengland etc.
Unlike Bill's "solution", mine does not rely on 3rd party addons. And I think my code looks prettier. I always think SED commands look like lines from a Martian's shopping list. Maybe that's why Bill likes them so much?
-
c:\test> gawk "{m=gsub(\"INS\",\"\"); total+=m}END{print \"total:\" total}" file
Ghostdog,
Some of the members here at computerhope.com believe we should not use gawk and sed because the vbs script and batch look better.
-
Ghostdog,
Some of the members here at computerhope.com believe we should not use gawk and sed because the vbs script and batch look better.
I sure hope they work out a way of permanently banning you.
-
I sure hope they work out a way of permanently banning you.
Which post is offensive? I have done nothing wrong.
-
Which post is offensive?
All of them.
-
I have done nothing wrong.
Circumventing a ban is enough.
-
here's my Counting function, VBS:
Function GetCountStr(ByVal searchIn, ByVal SearchFor) As Long
GetCountStr = (Len(searchIn) - Len(Replace(searchIn, SearchFor, ""))) / Len(SearchFor)
End Function
dim inputstrm
Dim lookin,lookfor
Set inputstrm = CreateObject("Scripting.FileSystemObject").OpenTextFile(WScript.Arguments(0))
lookfor = WScript.Arguments(1)
lookin=inputstrm.ReadAll()
WScript.Echo GetCountStr(lookin,lookfor)
I've tested the function, but not the code using it.
-
That is great, now test all the code and show the count for how many times a string appears in a file.
Look at Sidewinder's code in Reply #6 for how to do this.
Good Luck
-
It works, I had to remove the accidental Type declaration still left, since the "original" is VB6:
Also, Added case insensitive option "/i".
Function GetCountStr(ByVal searchIn, ByVal SearchFor,Byval CompareText)
CompareText=CBool(CompareText)
GetCountStr = (Len(searchIn) - Len(Replace(searchIn, SearchFor, "",1,-1,abs(CompareText)))) / Len(SearchFor)
End Function
dim inputstrm
Dim lookin,lookfor
'see if /i was specified....
for each looparg in WScript.Arguments
If UCase(looparg)="-I" or UCase(looparg)="/I" Then
ignorecase=true
Exit For
End If
Next
Set inputstrm = CreateObject("Scripting.FileSystemObject").OpenTextFile(WScript.Arguments(0))
lookfor = WScript.Arguments(1)
lookin=inputstrm.ReadAll()
WScript.Echo GetCountStr(lookin,lookfor,ignorecase)
I shall now endeavour to emulate the ridiculous manner in which Bill tests his code. I will refrain from the classic posting of the output from dir /? for no reason though.
test "input" file, "zwicky.txt":
in the 1930s and 1940s, many of Fritz Zwicky's colleagues regarded him as an irritating buffoon. Future generations of astronomers would look back on him as a creative genius.
"By the time I knew Fritz in 1953, he was thoroughly convinced that he had the inside track to ultimate knowledge, and that everyone else was wrong," says William Fowler, then a student at Caltech (The Californian Institute of Technology) where Zwicky taught and did research. Jesse Greenstein, a Caltech colleague of Zwicky's from the late 1940's onward, recalls Zwicky as "a self-proclaimed genius... There's no doubt that he had a mind which was quite extraordinary, But he was also, although he didn't admit it, untutored and not self-controlled.
... HE taught a course in physics for which the admission was at his pleasure. If he thought that a person was sufficiently devoted to his ideas, that person could be admitted... He was very much alone [ among the Caltech physics faculty, and was] not popular with the establishment... His publications often included violent attacks on other people."
Zwicky-- a stocky, cocky man, always ready for a fight -- did not hesitate to proclaim his inside track to ultimate knowledge, or to tout the revelations it brought. In lecture after lecture during the 1930s, and article after published article, he trumpeted the concept of a neutron star-- a concept that he, Zwicky, had invented to explain the origins of the most energetic phenomena seen by astronomers: supernovae, and cosmic rays. He even went on the air in a nationally broadcast radio show to popularize his neutron stars. But under close scrutiny, his articles and lectures were unconvincing. They contained little substantiation for his ideas.
It was rumoured that Robert Millikan (the man who had built Caltech into a powerhouse among science institutions), when asked in the midst of all this hoopla why he kept Zwicky at Caltech, replied that it just might turn out that some of Zwicky's far-out ideas were right. Millikan, unlike some others in the science establishment, must have seen hints of Zwicky's intuitive genius - a genius that became widely recognized only thirty five years later, when observational astronomers discovered real neutron stars in the sky and verified some of Zwicky's extravagant claims about them.
D:\>Cscript /NOLOGO countstr.vbs zwicky.txt caltech /i
5
D:\>Cscript /NOLOGO countstr.vbs zwicky.txt establishment
2
D:\>Cscript /NOLOGO countstr.vbs zwicky.txt Establishment
0
D:\>Cscript /NOLOGO countstr.vbs zwicky.txt Establishment /i
2
D:\>Cscript /NOLOGO countstr.vbs zwicky.txt zwicky /i
10
D:\>
-
C:\test>type cntstr.bat
@echo off
sed s/%1/%1\n/g %2 |findstr %1| find /c /v ""
echo.
rem type %2
Output:
C:\test>cntstr.bat Zwicky zwicky.txt
count=10
C:\test>cntstr.bat Caltech zwicky.txt
count=5
C:\test>cntstr.bat establishment zwicky.txt
count=2
C:\test>
-
( The following batch code with tokens found the right count. But I had to massage the input file. Someone with more token experience might correct the code? Thanks)
C:\test>type try813.bat
@echo off
set /a c=0
setlocal enabledelayedexpansion
for /f "tokens=1-26" %%a in (%2) do (
if "%%a"=="%1" set /a c=!c! + 1
if "%%b"=="%1" set /a c=!c! + 1
if "%%c"=="%1" set /a c=!c! + 1
if "%%d"=="%1" set /a c=!c! + 1
if "%%e"=="%1" set /a c=!c! + 1
if "%%f"=="%1" set /a c=!c! + 1
if "%%g"=="%1" set /a c=!c! + 1
if "%%h"=="%1" set /a c=!c! + 1
if "%%i"=="%1" set /a c=!c! + 1
if "%%j"=="%1" set /a c=!c! + 1
if "%%k"=="%1" set /a c=!c! + 1
if "%%l"=="%1" set /a c=!c! + 1
if "%%m"=="%1" set /a c=!c! + 1
if "%%n"=="%1" set /a c=!c! + 1
if "%%o"=="%1" set /a c=!c! + 1
if "%%p"=="%1" set /a c=!c! + 1
if "%%q"=="%1" set /a c=!c! + 1
if "%%r"=="%1" set /a c=!c! + 1
if "%%s"=="%1" set /a c=!c! + 1
if "%%t"=="%1" set /a c=!c! + 1
if "%%u"=="%1" set /a c=!c! + 1
if "%%v"=="%1" set /a c=!c! + 1
if "%%w"=="%1" set /a c=!c! + 1
if "%%x"=="%1" set /a c=!c! + 1
if "%%y"=="%1" set /a c=!c! + 1
if "%%z"=="%1" set /a c=!c! + 1
)
echo count=%c%
echo Display %2
rem type %2
Output:
C:\test> try813.bat Zwicky zwicky.txt
count=10
-
Ghostdog,
Some of the members here at computerhope.com believe we should not use gawk and sed because the vbs script and batch look better.
This is the biggest joke of the year. awk and sed are excellent for parsing files and modifying them. Awk is also a little programming language capable of replacing cmd.exe. Batch/vbscript look better? Better in what sense? More lines of code means better? My gawk statement takes only 1 line, and it saves me enough time to go on to my other assignments, while you have to crack your head and come up with long and messy batch files like the last one you posted. By the time you have finished, I am already off to bed and enjoying my sleep.
-
better in what sense?
More readable by others.
-
More readable by others.
vbscript maybe, but definitely not batch.
-
vbscript maybe, but definitely not batch.
I have to agree with you there. When I post one of those batch "solutions" where the batch file writes a vbscript on the fly, calls it, and then deletes the vbs, I get an uneasy feeling, like a surgeon advising somebody, when removing a gall stone with a carpenter's saw, to attach a scalpel blade to it with duct tape.
-
I have to agree with you there. When I post one of those batch "solutions" where the batch file writes a vbscript on the fly, calls it, and then deletes the vbs, I get an uneasy feeling, like a surgeon advising somebody, when removing a gall stone with a carpenter's saw, to attach a scalpel blade to it with duct tape.
I always recommend not doing hybrids, i.e. combining batch+vbscript. Mostly due to my own experience, I find them difficult to read and troubleshoot because of the intermixing of different syntaxes, etc. vbscript can do what batch does, so I myself would write it entirely in vbscript. Anyway, this is OT already...so ...
-
Project Gutenberg has some books in text file format. My code seems woefully slow compared to BC_Programmer's. Although either script counted "God" in the King James Bible in less than half a second. but see below...
Salmon-count.vbs
This is how I am going to try to do VBscripts in future...
Option Explicit
'Setup
Dim ObjFSO
Dim ObjTS
Dim StrFileName
Dim StrLookString
Dim StrThisline
Dim SngStartSec
Dim SngEndSec
Dim SngElapsed
Dim SngLineCount
Dim SngTotalCount
Dim SngSubsLen
Dim SngSubStart
Dim SngCaseSensitive
'Input filename
StrFileName=Wscript.Arguments(0)
'String to search for
StrLookString=Wscript.Arguments(1)
'Case type - 1 = case sensitive 0 = case insensitive
SngCaseSensitive = Wscript.Arguments (2)
'Length of string to search for
SngSubsLen = Len (StrLookString)
'if case insensitive search
'convert to lower case
If SngCaseSensitive = 0 Then StrLookString = LCase(StrLookString)
'Initialise File System Object
Set ObjFSO=Createobject("Scripting.Filesystemobject")
'Open input file
Set ObjTS=ObjFSO.Opentextfile(StrFileName)
'Store start time (secs since midnight)
SngStartSec = Timer
'Keep reading lines until all done
Do While Not ObjTS.Atendofstream
'Get line
StrThisLine=ObjTS.Readline
'if case insensitive search
'convert to lower case
If SngCaseSensitive = 0 Then StrThisLine=LCase(StrThisLine)
'Set count to zero
SngLineCount = 0
Do
'Is string in line? If so, get place
SngSubStart = InStr ( StrThisLine, StrLookString )
'If found, add 1 to counter
If SngSubStart > 0 then SngLineCount = SngLineCount + 1
'If found, chop off string before
StrThisLine = Mid ( StrThisLine, ( SngSubstart + SngSubsLen ) )
'Exit when no more found
Loop Until SngSubstart = 0
'Add count from this line to total
SngTotalCount = SngTotalCount + SngLineCount
Loop
'Close input file
ObjTS.Close
'Store end time (secs since midnight)
SngEndSec = Timer
'Subtract to get elapsed
SngElapsed = SngEndsec - SngStartSec
'Show results
wscript.echo SngTotalCount
wscript.echo formatnumber(SngElapsed,3)
BCP_count.vbs
Function GetCountStr(ByVal searchIn, ByVal SearchFor,Byval CompareText)
CompareText=CBool(CompareText)
GetCountStr = (Len(searchIn) - Len(Replace(searchIn, SearchFor, "",1,-1,abs(CompareText)))) / Len(SearchFor)
End Function
dim inputstrm
Dim lookin,lookfor
Dim StartSec, Endsec, Elapsed
'see if /i was specified....
for each looparg in WScript.Arguments
If UCase(looparg)="-I" or UCase(looparg)="/I" Then
ignorecase=true
Exit For
End If
Next
Startsec=Timer
Set inputstrm = CreateObject("Scripting.FileSystemObject").OpenTextFile(WScript.Arguments(0))
lookfor = WScript.Arguments(1)
lookin=inputstrm.ReadAll()
Endsec=Timer
Elapsed=Endsec - Startsec
WScript.Echo GetCountStr(lookin,lookfor,ignorecase)
wscript.echo Formatnumber(Elapsed, 3)
Salmon-count.vbs "H G Wells The War Of The Worlds.txt" "Martians" 1
156
0.043
BCP-count.vbs "H G Wells The War Of The Worlds.txt" "Martians"
156
0.016
Salmon-count.vbs "Complete Works Of Shakespeare.txt" "Hamlet" 1
113
0.688
BCP-count.vbs "Complete Works Of Shakespeare.txt" "Hamlet"
113
0.250
Salmon-count.vbs "Tolstoy War And Peace.txt" "Pierre" 1
1963
0.383
BCP-count.vbs "Tolstoy War And Peace.txt" "Pierre"
1963
0.145
Salmon-count.vbs "King James Bible.txt" "God" 1
4167
0.359
BCP-count.vbs "King James Bible.txt" "God"
4167
0.188
Salmon-count.vbs "Samuel Richardson Clarissa.txt" "she" 1
8861
1.156
bcp-count.vbs "Samuel Richardson Clarissa.txt" "she"
8861
0.234
but...
I downloaded a text file containing 1 million places of pi (1,000,000,002 bytes) with no carriage returns. I figured that my code wouldn't like that, so I used GNU fold to insert cr/lf pairs every 80 columns. However, when I tried BCP's code on it, oh dear! The system got awfully sluggish and I watched my available RAM go down from 3.2 GB to 24 MB before I used Process Explorer to terminate cscript.exe. But my "slow" code just chewed its way through in 1 minute 44 and a bit seconds...
salmon-count.vbs "1 billion places of pi.txt" "567" 0
975498
104.430
351,218 H G Wells The War Of The Worlds.txt
3,288,738 Tolstoy War And Peace.txt
4,397,206 King James Bible.txt
5,582,655 Complete Works Of Shakespeare.txt
5,616,676 Samuel Richardson Clarissa.txt
1,025,000,002 1 billion places of pi.txt
System:
Shuttle SN78SH7, AMD Phenom II 945 (quad core), 4 GB Crucial 800 MHz DDR2 RAM, Windows 7 64 bit, files read from Seagate 320GB external USB 2.0 drive.
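To put the timings above on one scale, here is a rough sketch in Python (not part of either script) that turns the posted Salmon-count.vbs figures into throughput. The sizes and seconds are copied from the post; the function name and the MB = 1,000,000 bytes convention are my own assumptions. Note that the pi run was case-insensitive (argument 0) while the others were case-sensitive, and that BCP_count.vbs as posted stops its timer before GetCountStr is called, so its times cover the file read only and are left out here.

```python
# Figures copied from the post above: (file, size in bytes, Salmon-count.vbs seconds)
RUNS = [
    ("H G Wells The War Of The Worlds.txt", 351_218, 0.043),
    ("Tolstoy War And Peace.txt", 3_288_738, 0.383),
    ("King James Bible.txt", 4_397_206, 0.359),
    ("Complete Works Of Shakespeare.txt", 5_582_655, 0.688),
    ("Samuel Richardson Clarissa.txt", 5_616_676, 1.156),
    ("1 billion places of pi.txt", 1_025_000_002, 104.430),
]


def throughput_mb_per_s(size_bytes, seconds):
    """Bytes handled per second, expressed in MB/s (assuming 1 MB = 1,000,000 bytes)."""
    return size_bytes / seconds / 1_000_000


for name, size, secs in RUNS:
    print(f"{name:40s} {throughput_mb_per_s(size, secs):6.2f} MB/s")
```

With timings this short on the small files, the per-file numbers are noisy, so treat them as a ballpark only.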
-
Most probably due to the ReadAll(). BCP's code reads the whole file into memory. If your 1 billion pi (or is it 1 million?) text file is very big, then that explains the sluggishness of fitting it all into memory.
-
arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~
I need to get a count of INS* in the above file. Am new to DOS Commands.
Arunavlp,
The posts at the end of your thread and in the middle are so far off topic that
the posters must start a new thread for their strange ideas.
I'm pleased that your problem of counting how often a string appears in a file has been answered several times.
The Sed solution is the best solution.
-
The Sed solution is the best solution.
Well, that's certainly the definitive answer.
Actually I still like my response back in post 6. By using the replace function to insert nulls in place of all the occurrences of the search argument, the original string is effectively shortened (nulls have zero length). Using some 3rd grade arithmetic, you can calculate the difference in lengths between the original string and the replacement string. This gives the number of nulls that were added to the file. Divide by the length of the search argument, and the result is the number of occurrences of the substring in the original string.
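The arithmetic described above can be sketched in a few lines. This is an illustration in Python rather than the VBScript from post 6; the function name is made up, and the sample is the line from the original question.

```python
def count_by_replace(text, needle):
    """Count occurrences of needle by removing them and comparing lengths."""
    if not needle:
        raise ValueError("needle must not be empty")
    shortened = text.replace(needle, "")
    # Every removed occurrence shortens the text by len(needle) characters.
    return (len(text) - len(shortened)) // len(needle)


sample = "arun*America*MSC~INS*dfffs*Sdfsd*sdfsd~ssfsd*sdfsd~INS*dfffs*sdfsdf*sdfs~"
print(count_by_replace(sample, "INS*"))  # prints 2
```

Like the VBScript version, this holds the whole text (plus a shortened copy) in memory, so it suits files that fit comfortably in RAM.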
Powershell can do this as a one liner more readable than SED. There is always more than one solution to any coding problem. Makes me wonder why many posters request a specific type solution.
8)
-
If your 1 billion pi (or is it 1 million? ) text file size is very big
One thousand and twenty-five thousand million and two bytes (1,025,000,002) as I posted above.
then that explains the sluggishness of fitting it all into memory.
Did I imply that I did not already realise this?
-
There is always more than one solution to any coding problem. Makes me wonder why many posters request a specific type solution.
Unlike hobbyists at home using their own systems, many people asking for help have already partially completed a script and /or are using employer's computers on which restrictions are in place preventing installation of 3rd party software.
-
I just sort of threw mine together, wasn't too interested in making sure it worked for gigantic files :P
Here's a version that reads in chunks instead:
Function GetCountStr(ByVal searchIn, ByVal SearchFor,Byval CompareText)
CompareText=CBool(CompareText)
GetCountStr = (Len(searchIn) - Len(Replace(searchIn, SearchFor, "",1,-1,abs(CompareText)))) / Len(SearchFor)
End Function
dim inputstrm
Dim lookin,lookfor
'see if /i was specified....
for each looparg in WScript.Arguments
If UCase(looparg)="-I" or UCase(looparg)="/I" Then
ignorecase=true
Exit For
End If
Next
set FSO=CreateObject("Scripting.FileSystemObject")
set FileOpen = FSO.GetFile(WScript.Arguments(0))
'read in chunks of 32K:
chunksize = 32*1024
numchunks = FileOpen.Size \ (chunksize)
remainder = fileopen.Size mod (chunksize)
Set inputstrm = FSO.OpenTextFile(WScript.Arguments(0))
lookfor = WScript.Arguments(1)
Strhangoff=""
Do Until(inputstrm.AtEndOfStream)
readchunk = strhangoff + inputstrm.Read(chunksize)
RunnerCount=RunnerCount + GetCountStr(readchunk,lookfor,ignorecase)
strhangoff = right(readchunk,len(lookfor)-1) '-1 on the length so we don't grab the entire thing
'if it happens to be exactly on the end of the string, so nothing is counted twice.
Loop
WScript.Echo RunnerCount
I don't actually have a super extra large file to test it on, so I made one by duplicating the "zwicky.txt" file over itself several hundred times.
This one is certainly faster than the ReadAll() idea. I've added a small provision so that it doesn't "miss" entries by reading half of the string at the end of one chunk and the rest in the next chunk (thereby not finding it): a "hangoff" from the end of the previous chunk is copied to the start of the next chunk. I make sure the hangoff is shorter than the search string by one character; this prevents finding the string twice in the edge case where it sits at the <very> end of a chunk (which would otherwise be counted twice: once in the first chunk and once in the next chunk it was copied to).
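For anyone following along outside VBScript, here is a hedged sketch of the same chunk-plus-hangoff idea in Python. It is not a port of the script above: the names are hypothetical, and it scans with find() instead of the replace trick so that the carried-over tail never contains text that has already been counted.

```python
import io


def count_in_chunks(stream, needle, chunk_size=32 * 1024):
    """Count non-overlapping occurrences of needle, reading chunk_size characters at a time."""
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1  # longest tail that cannot hold a full match
    carry = ""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = carry + chunk
        pos = 0
        while True:
            hit = window.find(needle, pos)
            if hit < 0:
                break
            total += 1
            pos = hit + len(needle)
        # Keep only the unmatched tail that could begin a match completed by the next chunk.
        carry = window[max(pos, len(window) - overlap):]
    return total


demo = io.StringIO("MSC~INS*dfffs~INS*sdfs~")
print(count_in_chunks(demo, "INS*", chunk_size=5))  # prints 2
```

Because the carried tail is at most one character shorter than the search string, a match can be completed across a chunk boundary but can never be seen twice.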
The main difficulty was getting used to the <Microsoft> FileSystemObjects; I'm used to using my own library. Not sure if there would be much of a speed difference there, but it's what I'm used to (not counting the .NET IO namespace). The interesting thing is that the only differences are method names (I chose "Eof" rather than "AtEndOfStream" to indicate the stream was at the end of the file) and of course ProgIDs; everything else could be pretty much exactly the same.
Actually I still like my response back in post 6. By using the replace function to insert nulls in place of all the occurrences of the search argument, the original string is effectively shortened (nulls have zero length). Using some 3rd grade arithmetic, you can calculate the difference in lengths between the original string and the replacement string. This gives the number of nulls that were added to the file. Divide by the length of the search argument, and the result is the number of occurrences of the substring in the original string.
Actually, your idea is pretty much a slight variant of what my routine does:
Function GetCountStr(ByVal searchIn, ByVal SearchFor,Byval CompareText)
CompareText=CBool(CompareText)
GetCountStr = (Len(searchIn) - Len(Replace(searchIn, SearchFor, "",1,-1,abs(CompareText)))) / Len(SearchFor)
End Function
It replaces the text being searched for with an empty string, and then does the math. It's actually easier to do it this way rather than replacing it with null, since the size difference between the original and the "replaced" version will be an exact multiple of the length of the string to search for. I wrote this a good few years ago, and had to "translate" it from the VB6 it was written in to VBS.
Powershell can do this as a one liner more readable than SED. There is always more than one solution to any coding problem. Makes me wonder why many posters request a specific type solution.
It could be a one-liner in VBScript, but it would be both hard to read and somewhat silly. And it would require ReadAll() again.
Using .NET 4.0 / C# 4.0 it might even be possible to read in a number of chunks at once and then "count" the occurrences in each one in parallel using the Parallel For construct. The same would be possible in 3.5 but would require manually spinning up the threads and a less than enviable use of locks to prevent resource contention. It's pretty interesting that such a simple problem can have such varied solutions, but not at all surprising.
-
He he...
I hope this is a forum for DOS.
Not for VBS or VB or C
Don't mind it.
But, I started liking batch programming.
I really appreciate your knowledge of expertise.
As I think windows functionality can be operated from dos. Cause window itself is dos operated operating system.
So, I think you must count on batch. Rather than other languages.
If I said anything dis-hearting the integrity of any programmer. I really apologize for that.
I didn't mean that way.
But, can you point which is the best IDE for the season.
Like, C is the best language.
comment appreciated.
I know this is going off topic.
Thanks and regards.
Vishu
-
One thousand and twenty-five thousand million and two bytes (1,025,000,002) as I posted above.
so its 1 million (but your filename passed to your vbscript states 1 billion. )
Did I imply that I did not already realise this?
appears to me. You showed a benchmark between BCP and your code, then says BCP's one is sluggish after a while without stating your reasons and conclusion of your findings. Makes one wonder why it happens right?
-
so its 1 million (but your filename passed to your vbscript states 1 billion. )appears to me. You showed a benchmark between BCP and your code, then says BCP's one is sluggish after a while without stating your reasons and conclusion of your findings. Makes one wonder why it happens right?
Really don't know what you talking about.
-
The Sed solution is the best solution.
Not really! If it's a big file, using your method of substituting the word to include newlines (which is expensive compared to pure string counting) and then piping to 2 calls of the find command to get the count is not the best way to go. The best way is to count the number of words found AS YOU ITERATE THE FILE (with whatever tool is processing it) and keep the count in memory. That said, sed is not the best tool to use in this case.
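A minimal sketch of "count as you iterate", in Python purely for illustration (the function and helper names are made up). It keeps only the current line and a running total in memory. Two assumptions to be aware of: a match never spans a line break, and the file has line breaks at all. A single 35 MB line, as in the original question, would still be read in one go.

```python
import os
import tempfile


def count_streaming(path, needle, encoding="utf-8"):
    """Add up the occurrences of needle line by line, keeping only a running total."""
    total = 0
    with open(path, encoding=encoding) as handle:
        for line in handle:
            total += line.count(needle)
    return total


def write_demo_file(text):
    """Helper for the demo only: write text to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


demo_path = write_demo_file("the\nthe\nthe\nthe\nthe the the\nthe the the\n")
print(count_streaming(demo_path, "the"))  # prints 10
os.remove(demo_path)
```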
-
Really don't know what you talking about.
sorry i don't care if you know or not. My words are not for you.
-
One thousand and twenty-five thousand million and two bytes (1,025,000,002)
so its 1 million (but your filename passed to your vbscript states 1 billion. )
a Billion is a thousand millions... (In North America, at least)
-
so its 1 million (but your filename passed to your vbscript states 1 billion. )
a Billion is a thousand millions... (In North America, at least)
ok ok. But i am talking about post #83. where ST said he download "1 million places of pi", then his file name for testing the benchmark is "1 billion places of pi". He is showing a benchmark, and when there are ambiguities, its only natural for the inquisitive mind to ask questions.
-
ok ok. But i am talking about post #83. where ST said he download "1 million places of pi", then his file name for testing the benchmark is "1 billion places of pi". He is showing a benchmark, and when there are ambiguities, someone like me will question.
Doesn't much matter if it's a billion or a million, as long as the same inputs were used to test both- the exact size is more a curiousity (except in some cases).
-
a billion and a million is different.
-
C:\test>type cntstr.bat
rem @echo off
sed s/%1/%1\n/g %2 | egrep -c %1
C:\test>cntstr.bat the yz.txt
C:\test>rem @echo off
C:\test>sed s/the/the\n/g yz.txt | egrep -c the
10
C:\test>type yz.txt
the
the
the
the
the the the
the the the
this example will also count words like thesis, stethescope, etc, which is not exactly the word "the". egrep is also deprecated. Use grep -E
grep -Eo "\bthe\b" file|wc -l
the above does not need to do substitution on the entire file and gets the exact string.
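To make the difference between substring counting and whole-word counting concrete, here is a small sketch in Python (not grep itself; the function names are hypothetical). The \b word boundary plays the same role as in the grep -Eo pattern above.

```python
import re


def count_substring(text, needle):
    """Count every non-overlapping occurrence, even inside longer words."""
    return text.count(needle)


def count_whole_word(text, word):
    """Count occurrences that stand alone as a word, using \\b word boundaries."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text))


sample = "the thesis and the other theory"
print(count_substring(sample, "the"))   # prints 5
print(count_whole_word(sample, "the"))  # prints 2
```

On the yz.txt shown above both functions would give 10, because every "the" in that file already stands alone.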
-
a billion and a million is different.
Not in this case. What difference would it have on the results? Sure, the numbers will be larger for a billion than for a million, but it's not the actual number that's important, it's how the two numbers compare.
ST performed two tests: one with a smaller file, and one with a larger file. the two tests revealed that with a larger amount of data to read, my method causes a large IO bottleneck. Two points of reference is enough for a crude line-chart comparison of the two, and while it may not be entirely accurate, it can reveal specific trends in the two functions. For example, we can determine that my routine seems to run at something like O((n/4)^2), whereas his is a more linear method whose time taken is linearly related to the length of the file. In mine, this is not the case because additional overhead is required for the system to properly manage the larger amount of memory being used to store the entire string.
What is important here is that we are comparing the programs used, As long as the inputs are the same the comparisons are valid.
if you test program A and Program B with Input C, it's a fair comparison between A and B as long as C is the same for both.
It doesn't matter if there was a mixup over the specifics of the size of C. The comparison was between A and B.
If you compare a Quick Sort with a Merge Sort, whether you are testing with a million or a billion elements is largely irrelevant; what's important is the comparison. If there was confusion over the layout of the data (such as how a quicksort takes longer than a merge sort with a nearly sorted array) and it was relevant, then yes, I would agree. But while there is indeed some ambiguity, it's irrelevant.
-
Not in this case. What difference would it have on the results? Sure, the numbers will be larger for a billion than for a million, but it's not the actual number that's important, it's how the two numbers compare.
If it's a larger file, then your method of slurping it all into memory is not a good solution. That's the difference. Why do you say it's not important? If the test files are like 1 thousand vs 100, then of course your method will work. The size of the test samples does matter when doing benchmarks, as it will affect the design of the algorithm being used.
-
this example will also count words like thesis, stethescope, etc, which is not exactly the word "the". egrep is also deprecated. Use grep -E
grep -Eo "\bthe\b" file|wc -l
the above does not need to do substitution on the entire file and gets the exact string.
Ghost,
Your grep works. I had an old 2005 version.
Your skill level has improved. Who is your Tutor?
C:\test>grep -Eo the yz.txt
the
the
the
the
the
the
the
the
the
the
C:\test>grep -Eo the yz.txt | wc -l
10
C:\test>type yz.txt
the
the
the
the
the the the
the the the
C:\test>
-
If it's a larger file, then your method of slurping it all into memory is not a good solution. That's the difference. Why do you say it's not important? If the test files are like 1 thousand vs 100, then of course your method will work. The size of the test samples does matter when doing benchmarks, as it will affect the design of the algorithm being used.
Reread my post.
-
Reread my post.
I reiterate my point. The size of a file does not matter if what you are comparing is the output of 2 pieces of code. That is, you want to make sure the output produced by the 2 pieces of code is the same. The size of the file does matter in a benchmark, when you are concerned about the way the program is written and the algorithm used. That is, whether you have used the most optimized method when dealing with big files.
Because of the size of the file, you have chosen to read the files in chunks. That's a direct consequence of taking size into consideration when designing your program. That's why size does matter in a benchmark. 1 million is way different from 1 billion!
-
Ghost,
Your grep works
of course. Its better than using sed, which you proclaim is the "best".
. I had an old 2005 version.
time to change. We are not living in olden times anymore
Your skill level has improved. Who is your Tutor?
i have been playing with *nix since ancient times. my tutor is greg and bill rich, now its you and vishu...
-
I reiterate my point. The size of a file does not matter if what you are comparing is the output of 2 pieces of code.
That's what I said.
you want to make sure the output produced by the 2 pieces of code are the same.
Agree.
The size of the file does matter in a benchmark, when you are concerned about the way the program is written and the algorithm used. That is, whether you have used the most optimized method when dealing with big files.
Yes, it does, but only if you perform a <single> benchmark. ST did two. Therefore there are two points of reference and, as I noted, a linear formula can be derived from those two data points that roughly approximates a short range of whatever the actual relationship between them is. A third data point will be enough to create a parabola, but that doesn't mean that the performance relationship is a parabola; it's just all you can do with 3 points. It could very well be a cubic function of the size of the input.
The thing is, here, we can <SEE> the code. we can see why, right off, a larger file would make a difference. It doesn't matter if that larger file is larger by a million or a billion bytes, it's still larger and that difference is reflected thusly in the timings, and the reason is rather obvious as partially evidenced by your quick mention of it.
Because of the size of the file, you have chosen to read the files in chunks.
That's a direct consequence of taking size into consideration when designing your program. That's why size does matter in a benchmark. 1 million is way different from 1 billion!
Oh, yes, of course, because everybody knows that you can't read in chunks for both 1 million and 1 billion. I obviously specifically designed it for the exact size that ST gave, I was in no way trying to make it more generic and efficient for smaller files (which it is, even a 128K file will benefit from chunk reading because it causes less stress on the task allocator and also causes less process memory fragmentation).
If you want to get right down to it, all benchmarks are flawed because of the timing code, it changes the results by being there, but you can't get results without it. the difference is that that benchmark code surrounds all the different timed blocks and therefore that fact can be ignored in the results.
I will agree that there are certainly instances where a million and a billion are a significant difference algorithm-wise, but at the same time, is not even the slightest floating point error in an algorithm a huge difference when it comes to the algorithms for trigonometric functions? It's a matter of the goal of the code in question as to exactly what constitutes a significant difference. In this case, essentially ST was testing a large file. That was all I considered; I wasn't making sweeping design changes based on the fact that it was in millions as opposed to billions, but rather generic changes where it won't matter whether it was a million or a billion. Will the timing be different for a billion and a million? Of course it will. And I will agree that in that sense, the results are flawed. But you assume that my changes are based on his results, when in fact they are merely based on the simple premise that it doesn't work properly for large files. I didn't pay very close attention to the specific timings, because all I needed to know was that it was slower with larger files. I didn't need to know how many milliseconds it took to process X characters.
-
You showed a benchmark between BCP and your code, then says BCP's one is sluggish after a while without stating your reasons and conclusion of your findings.
I assumed that it was obvious. Sorry you missed it.
-
ST said he download "1 million places of pi", then his file name for testing the benchmark is "1 billion places of pi".
That is absolutely true, I did, but I did then give the file size immediately after.
(1,000,000,002 bytes) with no carriage returns.
A billion places of decimals? No CRs? One byte per character? One each for the 3 and the decimal point, and 1,000,000,000 for the decimal places.
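The two file sizes quoted in the thread can be cross-checked with a little arithmetic. This Python sketch assumes that the Windows build of GNU fold wrote a two-byte CR LF after every full 80-column line and nothing after the final partial line; that is my assumption, not something stated in the posts.

```python
DIGITS_AFTER_POINT = 1_000_000_000
RAW_SIZE = 1 + 1 + DIGITS_AFTER_POINT  # the "3", the decimal point, then the decimal places

WIDTH = 80            # fold column width given in the post
LINE_BREAK_BYTES = 2  # assumption: CR LF pair per inserted break

lines = -(-RAW_SIZE // WIDTH)  # ceiling division: number of folded lines
breaks = lines - 1             # assumption: no break after the last (partial) line
folded_size = RAW_SIZE + breaks * LINE_BREAK_BYTES

print(RAW_SIZE)     # prints 1000000002
print(folded_size)  # prints 1025000002
```

Under those assumptions the result matches the 1,025,000,002 bytes in the directory listing, which is consistent with a billion decimal places rather than a million.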
-
That is absolutely true, I did, but I did then give the file size immediately after.
so now why don't you go edit your post and correct the typo? change million to billion.
-
so now why don't you go edit your post and correct the typo? change million to billion.
A:) because he can't
and B:) it doesn't really matter.
I mean, come on:
I downloaded a text file containing 1 million places of pi (1,000,000,002 bytes)
It doesn't take a rocket scientist to see that the bracketed value is in fact 1 billion and 2. Just because this confused you doesn't make it ambiguous, especially since it was later referenced as a billion. In fact it is only noted as a million in the single quoted passage. The fact that you are now throwing up a shitestorm because of an obvious typo that is in no way ambiguous (it's clearly a billion, especially, you know, given the file size is a billion)
-
A:) because he can't
why can't?
and B:) it doesn't really matter.
Yes it does, especially when you are proving something. A typo is a typo and it should be corrected. If not, it's not clear:
someone might think he meant a million and all his billions are wrong. Isn't that so?
-
so shall we go through GD74's posts looking for typos? The spirit of Billrich has deeply impregnated this thread.
-
so shall we go through GD74's posts looking for typos?
Go ahead if you are too bored. I don't care. If there is anything I am trying to prove and there are typos, I would be glad to amend it.
The spirit of Billrich has deeply impregnated this thread
don't associate me with that guy. If you want to do that, look at yourself in the mirror and tell me why you are any different
-
Go ahead if you are too bored. I don't care. If there is anything I am trying to prove and there are typos, I would be glad to amend it. don't associate me with that guy. If you want to do that, look at yourself in the mirror and tell me why you are any different
offensive; reported
-
Not quite sure what is so offensive about the comment in question, but this discussion has obviously gone far beyond the original intent of the thread. In the future, a bit more maturity and less arguing would be preferred. Topic locked.