Computer Hope

Topic started by: Freerefill on May 02, 2009, 08:10:50 PM

Title: Finding multiple strings in a text file
Post by: Freerefill on May 02, 2009, 08:10:50 PM
Salutations.

I've asked this question in another forum and sadly got no help.

Basically, I'm trying to find text in a file. Sounds easy enough, but the text I'm trying to find is never the same between files, nor is it in the same location. What IS the same, however, are pieces that come before and after it. The problem is that those individual lines that are the same throughout the multiple files are repeated throughout each individual file.. so I can't find just one line, I need to find the entire set.

I'd give you an example, but let me make one up to make it easier. Suppose I had three text files with this information:

File 1:
1
orange
2
3
4
apple
apple
orange
banana
5

File 2:
1
orange
2
apple
orange
grape
3
4
5
apple
6

File 3:
1
2
apple
3
4
...
12253
12254
apple
orange
kumquat
12255
12256

The challenge is to find the string (or strings, there should be more than one), but since they're always different and in a different location, we must find, not just "apple" and not just "orange" but "apple" followed immediately by "orange" (the pattern is actually about 8 lines long, but I used 2 in this example).

I hacked together some code, which I think is incredibly pitiful.. found the location of every line of text, then compared the numbers in some nested "FOR" loops.. long story short, it's not working and I don't know why. I'm sure there's a better way, and I'd very much like to learn.

Any help would be tremendously appreciated. ^.^
Title: Re: Finding multiple strings in a text file
Post by: Dias de verano on May 03, 2009, 12:48:46 AM
Quote
then compared the numbers in some nested "FOR" loops.. long story short, it's not working and I don't know why.

That's roughly what I would be doing. A common problem with loops, nested or otherwise, is failure to use delayed expansion - or maybe there is some other thing that we might be able to spot.

Title: Re: Finding multiple strings in a text file
Post by: Freerefill on May 03, 2009, 01:13:20 AM
Alright.. well I was hesitant to post this because it seems, to me, to be very embarrassing and noobish.. but, if I'm on the right track and you can spot something, here it is:

@echo off
REM Clear existing files
if exist C:\fr1.txt del C:\fr1.txt
if exist C:\fr2.txt del C:\fr2.txt
if exist C:\fr3.txt del C:\fr3.txt
if exist C:\fr4.txt del C:\fr4.txt
if exist C:\fr1a.txt del C:\fr1a.txt
if exist C:\fr2a.txt del C:\fr2a.txt
if exist C:\fr3a.txt del C:\fr3a.txt
if exist C:\fr4a.txt del C:\fr4a.txt
if exist C:\newfile.txt del C:\newfile.txt
set success=0

REM Export selected data to files
findstr /I /N /X "{ACAD_REACTORS" C:\dwg.dxf > C:\fr1.txt
findstr /I /N /X "330" C:\dwg.dxf > C:\fr2.txt
findstr /I /N /X "C" C:\dwg.dxf > C:\fr3.txt
findstr /I /N /X "102" C:\dwg.dxf > C:\fr4.txt

REM Obtain the first number from the first file. May need a loop.
for /f "eol= tokens=1 delims=:" %%i in (C:\fr1.txt) do @echo ^%%i>>C:\fr1a.txt
for /f "eol= tokens=1 delims=:" %%i in (C:\fr2.txt) do @echo ^%%i>>C:\fr2a.txt
for /f "eol= tokens=1 delims=:" %%i in (C:\fr3.txt) do @echo ^%%i>>C:\fr3a.txt
for /f "eol= tokens=1 delims=:" %%i in (C:\fr4.txt) do @echo ^%%i>>C:\fr4a.txt

REM Grab the first string in the first file
for /f "eol= tokens=1 delims=" %%a in (c:\fr1a.txt) do (
set /A you=%%a+1
for /f "eol= tokens=1 delims="d %%b in (c:\fr2a.txt) do (
set /A lost=%%b
if "%you%"=="%lost%" set you >> c:\newfile.txt
)
)

set success
pause


... right at that "IF" statement is where it seems to screw up. The output -should- be every individual instance in which the two are the same.. but instead, the output is every individual instance in which the two are the same repeated for every instance that they're the same....

If I can get the nested "FOR" loops to work, as well as that "IF" statement, I can nest a few more loops and check the totaled output.
Title: Re: Finding multiple strings in a text file
Post by: Dias de verano on May 03, 2009, 01:46:00 AM
That code does not look n00bish at all! You should see some of the efforts we get on here!

At first glance, a couple of points...

1.

I presume the 'd' is a typo in this line?

Code: [Select]
for /f "eol= tokens=1 delims="d %%b in (c:\fr2a.txt) do (
2.

Cmd.exe, by default, expands ALL percent-sign variables at parse time, when a line (or a whole parenthesised block) is read, not when it runs. So percent-sign variables set inside a parenthetical expression (e.g. extended IF structure or FOR loop) will show whatever value they had before the block started, which is usually blank. This can trip up the unwary. The cmd language in Windows 2000 and later includes "delayed expansion". You can use it within a batch file by doing 2 things:

(a) Include this line somewhere before the code in question. At the beginning, straight after @echo off, is a good place.

Code: [Select]
setlocal enabledelayedexpansion
(b) To be expanded correctly, the variables in question must be written with ! exclamation points and not % percent signs. Thus:

Code: [Select]
if "!you!"=="!lost!"
... Google for "delayed expansion" for a fuller description.

3.

Did you mean 'echo' (not 'set') here? And did you mean literal "you" or the contents of the variable !you! ... ?

Code: [Select]
if "%you%"=="%lost%" set you >> c:\newfile.txt
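
To see point 2 in action, here is a minimal sketch. The variable name count and the loop values are made up for illustration and are not from the script above. It prints the same variable with percent signs and with exclamation points on each pass through a FOR loop:

```bat
@echo off
setlocal enabledelayedexpansion
set count=0
for %%n in (1 2 3) do (
    set /a count+=1
    rem The percent form was filled in when the block was read, so it stays 0.
    rem The exclamation form is looked up again on every pass.
    echo percent: %count%   delayed: !count!
)
endlocal
```

Run as a .bat file, this should print "percent: 0" on all three lines, while the delayed value counts 1, 2, 3.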
Title: Re: Finding multiple strings in a text file
Post by: gh0std0g74 on May 03, 2009, 01:59:30 AM
what should your final output be?
Title: Re: Finding multiple strings in a text file
Post by: Freerefill on May 03, 2009, 02:17:11 AM
Nice! It's starting to work! Thank you very much, Dias!

So the 4-deep nested "FOR" loop proved to take a wee bit too long to process (I had enough time to wait, rethink, and edit the code into something that would work faster before it finished) so I re-worked it to three 2-deep nested loops:

REM Grab the first string in the first file
for /f "eol= tokens=1 delims=" %%a in (c:\fr1a.txt) do (
    set /A var1=%%a+3
    for /f "eol= tokens=1 delims=" %%b in (c:\fr2a.txt) do (
        set /A var2=%%b+2
        if "!var1!"=="!var2!" echo !var1!>> c:\for1.txt
    )
)

for /f "eol= tokens=1 delims=" %%a in (c:\fr3a.txt) do (
    set /A var1=%%a+1
    for /f "eol= tokens=1 delims=" %%b in (c:\for1.txt) do (
        set /A var2=%%b
        if "!var1!"=="!var2!" echo !var1!>> c:\for2.txt
    )
)

for /f "eol= tokens=1 delims=" %%a in (c:\fr4a.txt) do (
    set /A var1=%%a
    for /f "eol= tokens=1 delims=" %%b in (c:\for2.txt) do (
        set /A var2=%%b
        if "!var1!"=="!var2!" echo !var1!>> c:\for3.txt
    )
)

Yes, the "d" was a typo.. still scratching my head over that one.

Enabling delayed expansion did the trick, the output is becoming what I want it to be now. Unfortunately, the repetition of the pattern throughout the files has provided me with 11 places in which the pattern repeats, and I'm searching for 1...

Which brings me to my next problem. Right now, I have 10 "FOR" loops (three of them nested), 11 files being created and the file I'm searching through is a slim 45KB instead of the 10,000KB that it could get up to. And, I'm only searching for a pattern which is 4 lines of text. If I need to expand that into 8 or more, that's going to start consuming time. Is there any way to manipulate the flow of data so that it works more efficiently?

As for output, what I'm looking for is a single instance of a known pattern within the file, which comes immediately before the data I'm searching for. In order to find its location, I grabbed the line numbers from each text I was searching for, and output them, then compared them. My comparison of the 4 files (one for each string in the pattern) has yielded 11 locations where the 4-string pattern is found. I need to find a pattern which yields 1 single location. Once I can find that, I'm sure I can figure out the rest on my own. Don't want to make it too easy on myself. :P
Title: Re: Finding multiple strings in a text file
Post by: gh0std0g74 on May 03, 2009, 02:25:14 AM
to a newbie, it's a good effort. to a non-newbie, the code is so inefficient. 10 nested for loops??? there are definitely better ways to do what you want, it's just that i can't understand what you are writing. that's why i asked you to show your final output!!  best if you can show samples of your input files also, because they are apples and oranges, and nowhere did i see {ACAD_REACTORS or 330.....
Title: Re: Finding multiple strings in a text file
Post by: Dias de verano on May 03, 2009, 02:28:34 AM
Query...

Source file...

Code: [Select]
Some lines
Of text
That are irrelevant (?)
line containing (or consisting of?) {ACAD_REACTORS
line containing (or consisting of?) 330           
line containing (or consisting of?) C             
line containing (or consisting of?) 102           
The Data
You are
Looking for
(How many lines?)

Title: Re: Finding multiple strings in a text file
Post by: Dias de verano on May 03, 2009, 02:29:41 AM
Quote
to a newbie, it's a good effort. to a non-newbie, the code is so inefficient. 10 nested for loops??? there are definitely better ways to do what you want, it's just that i can't understand what you are writing. that's why i asked you to show your final output!!

The multiple temp files are inefficient too
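
One way to drop some of the temp files, sketched under the assumption that only the line numbers are needed: FOR /F can read the output of a command directly when the command is put in single quotes, so the findstr results never have to be written to disk. The file name and search string are the ones from the posts above. The echo text is just for illustration.

```bat
@echo off
rem findstr /N prefixes each hit with its line number and a colon,
rem so the first colon-delimited token is the line number.
for /f "tokens=1 delims=:" %%i in ('findstr /I /N /X "330" C:\dwg.txt') do (
    echo 330 found on line %%i
)
```

The same shape would replace each findstr/for pair in the script, though it does not by itself remove the nested comparison loops.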
Title: Re: Finding multiple strings in a text file
Post by: gh0std0g74 on May 03, 2009, 02:32:57 AM
Quote
The multiple temp files are inefficient too

true
Title: Re: Finding multiple strings in a text file
Post by: Freerefill on May 03, 2009, 02:33:55 AM
I've attached the file I'm searching through, and here's my code as it stands:

@echo off
REM Clear existing files
if exist C:\fr1.txt del C:\fr1.txt
if exist C:\fr2.txt del C:\fr2.txt
if exist C:\fr3.txt del C:\fr3.txt
if exist C:\fr4.txt del C:\fr4.txt
if exist C:\fr1a.txt del C:\fr1a.txt
if exist C:\fr2a.txt del C:\fr2a.txt
if exist C:\fr3a.txt del C:\fr3a.txt
if exist C:\fr4a.txt del C:\fr4a.txt
if exist C:\for1.txt del C:\for1.txt
if exist C:\for2.txt del C:\for2.txt
if exist C:\for3.txt del C:\for3.txt
setlocal enabledelayedexpansion
set success=0

REM Export selected data to files
findstr /I /N /X "{ACAD_REACTORS" C:\dwg.txt > C:\fr1.txt
findstr /I /N /X "330" C:\dwg.txt > C:\fr2.txt
findstr /I /N /X "C" C:\dwg.txt > C:\fr3.txt
findstr /I /N /X "102" C:\dwg.txt > C:\fr4.txt

REM Obtain the first number from the first file. May need a loop.
for /f "eol= tokens=1 delims=:" %%i in (C:\fr1.txt) do @echo ^%%i>>C:\fr1a.txt
for /f "eol= tokens=1 delims=:" %%i in (C:\fr2.txt) do @echo ^%%i>>C:\fr2a.txt
for /f "eol= tokens=1 delims=:" %%i in (C:\fr3.txt) do @echo ^%%i>>C:\fr3a.txt
for /f "eol= tokens=1 delims=:" %%i in (C:\fr4.txt) do @echo ^%%i>>C:\fr4a.txt

REM Grab the first string in the first file
for /f "eol= tokens=1 delims=" %%a in (c:\fr1a.txt) do (
    set /A var1=%%a+3
    for /f "eol= tokens=1 delims=" %%b in (c:\fr2a.txt) do (
        set /A var2=%%b+2
        if "!var1!"=="!var2!" echo !var1!>> c:\for1.txt
    )
)

for /f "eol= tokens=1 delims=" %%a in (c:\fr3a.txt) do (
    set /A var3=%%a+1
    for /f "eol= tokens=1 delims=" %%b in (c:\for1.txt) do (
        set /A var4=%%b
        if "!var3!"=="!var4!" echo !var3!>> c:\for2.txt
    )
)

for /f "eol= tokens=1 delims=" %%a in (c:\fr4a.txt) do (
    set /A var5=%%a
    for /f "eol= tokens=1 delims=" %%b in (c:\for2.txt) do (
        set /A var6=%%b
        if "!var5!"=="!var6!" echo !var5!>> c:\for3.txt
    )
)

pause

... set up to run as a .bat file. Running it yields the following, in the file C:\for3.txt:

1905
1921
1937
1965
1993
2017
2037
2053
2077
2225
2245
2325
2345

Which is the line number for every instance of the string "102" which immediately follows the strings "{ACAD_REACTORS", "330" and "C"

[attachment deleted by admin]
Title: Re: Finding multiple strings in a text file
Post by: Dias de verano on May 03, 2009, 02:40:43 AM
So what is it that you want to happen? You want to know the location of the first occurrence of the line following your 4 line sequence? Or the location of all of them?

Is it the case that the 4 line sequence will only occur once in a dwg file?


Title: Re: Finding multiple strings in a text file
Post by: Freerefill on May 03, 2009, 02:47:24 AM
In a nutshell, I'm trying to find all of them. Which is what the code is currently doing.

What I want to do is expand the pattern to more strings, from 4 to possibly upwards of 10 or more, because that will narrow down the search, and if I'm correct, yield one single instance of a pattern, yet to be determined, from which I can gather the data I really need, which I'm sure I can do on my own.

If you look through the attached file, search for the string "Layout1". It will appear twice within the file. The string represents a facet in a drawing program, a layout tab to be precise (those of you who have worked with AutoCAD will understand this better). You'll also find, near it, a "Layout2" and a "Model". Those bits of data are what I ultimately want to find. If you look just before the first set in which these three bits of data appear, you'll see the 4 lines that my code currently searches for.

As I said, "Layout1" will not always be "Layout1". It could be "Sheet1" or "Tab1" or "Pecan Butter". It can be anything. Hence the need to find something that IS common.

If I'm still confusing you, let me know... but I don't know how else I can describe it...
Title: Re: Finding multiple strings in a text file
Post by: gh0std0g74 on May 03, 2009, 02:58:44 AM
here's a vbscript
Code: [Select]
Set objFS = CreateObject("Scripting.FileSystemObject")
strFile = "c:\test\file.txt"
Set objFile = objFS.OpenTextFile(strFile)
' first, sec and third hold the 1st, 2nd and 3rd lines before the current one
Dim first
Dim sec
Dim third
first=""
sec =""
third=""
i=0 'line count
Do Until objFile.AtEndOfStream
    i=i+1
    strLine = objFile.ReadLine
    ' A match is "102" preceded by {ACAD_REACTORS, 330 and C, in that order
    If strLine = "102" Then
        If third = "{ACAD_REACTORS" And sec="330" And first="C" Then
            WScript.Echo "line number: " , i
            WScript.Echo "third: " ,third
            WScript.Echo "sec: " ,sec
            WScript.Echo "first: " ,first
        End If
    End If
    ' Shift the history back one line
    third = sec
    sec = first
    first=strLine
Loop
objFile.Close


output:
Code: [Select]
C:\test>cscript /nologo test.vbs |more
line number:  1905
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  1921
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  1937
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  1965
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  1993
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  2017
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  2037
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  2053
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  2077
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  2225
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  2245
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  2325
third:  {ACAD_REACTORS
sec:  330
first:  C
line number:  2345
third:  {ACAD_REACTORS
sec:  330
first:  C
Title: Re: Finding multiple strings in a text file
Post by: Freerefill on May 03, 2009, 11:08:42 AM
Thank you.. but I was really looking for a .bat file..

I'm well aware of the inefficiencies, but since I'm still new to this, I have no clue how to improve. I did the best I could with what I had, and since I still don't even know exactly how a "token" works, I had to produce some sort of visual output to make sure I was on the right track. Hence the multiple output files.
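
On the question of how a "token" works, a small sketch may help. FOR /F cuts each line into pieces at the characters listed in delims, and tokens picks which pieces to keep. The sample text "1905:102" is made up, in the shape that findstr /N produces:

```bat
@echo off
rem "1905:102" is a made-up sample in the shape findstr /N produces.
rem delims=: cuts the text at each colon, tokens=1,2 keeps pieces 1 and 2.
rem Piece 1 lands in the loop variable a and piece 2 in the next letter, b.
for /f "tokens=1,2 delims=:" %%a in ("1905:102") do (
    echo line number: %%a
    echo text: %%b
)
```

This should print "line number: 1905" followed by "text: 102". With tokens=1 alone, as in the script above, only the line number is kept.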

All that said.. if there are so many problems with my code.. how do I fix them? And no, I'm not looking for a quick fix, I want to know the methods and implement them myself.. I do want to learn.
Title: Re: Finding multiple strings in a text file
Post by: Dias de verano on May 04, 2009, 01:11:55 AM
Batch example:

Credit to gh0std0g74 for the scheme of stepping the strings back to keep the previous 3 lines read from the source file.

The FOR line is written this way to avoid the (apparently documented) problem of for /f skipping blank lines.

[Details in alt.msdos.batch.nt thread headed "Bug in For /F" (long Google Groups link shortened)]: http://tinyurl.com/cwr484

I wondered why the line numbers in my output didn't match those from gh0std0g74's VBscript, but then I noticed that my batch was skipping blank lines in the input file. They match now.

Code: [Select]

@echo off
setlocal enabledelayedexpansion

rem string1..string3 hold the three lines read before the current one
set string3=
set string2=
set string1=
rem i = current line number, f = matches found, seq = pattern lines matched
set i=0
set f=0
set seq=0

rem The four-line pattern, oldest line first
set patt3={ACAD_REACTORS
set patt2=330
set patt1=C
set patt0=102

set inputfile=dwg.txt

rem find /n /v "" prefixes every line, blank ones included, with [n]
for /f "tokens=1* delims=]" %%a in ('find /n /v "" ^<%inputfile%') do (
    set /a i+=1
    set string0=%%b
    if "!string0!"=="%patt0%" (
        set seq=1
        if "!string3!"=="%patt3%" set /a seq+=1
        if "!string2!"=="%patt2%" set /a seq+=1
        if "!string1!"=="%patt1%" set /a seq+=1
    )
    if !seq! equ 4 (
        set /a f+=1

        echo !string3!
        echo !string2!
        echo !string1!
        echo !string0! [Line !i!] [!f!]
        echo --------------------------

        set seq=0
    )
    rem Step the history back one line
    set string3=!string2!
    set string2=!string1!
    set string1=!string0!
)
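
The blank-line behaviour mentioned above can be checked with a throwaway file. This is only a sketch: blankdemo.txt is a hypothetical scratch file, created in the current directory and deleted at the end.

```bat
@echo off
rem Build a three-line file whose middle line is blank.
> blankdemo.txt echo first
>> blankdemo.txt echo.
>> blankdemo.txt echo third

echo Plain for /f:
for /f "delims=" %%a in (blankdemo.txt) do echo [%%a]

echo Through find /n /v "":
for /f "delims=" %%a in ('find /n /v "" ^<blankdemo.txt') do echo %%a

del blankdemo.txt
```

The first loop should show only [first] and [third], because for /f drops the empty line. The second should show [1]first, [2] and [3]third, since find /n puts a number on every line and so no line reaches for /f empty.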