
Topic: Finding multiple strings in a text file


Freerefill

    Topic Starter


    Rookie

    Thanked: 2
    Finding multiple strings in a text file
    « on: May 02, 2009, 08:10:50 PM »
    Salutations.

    I've asked this question in another forum and sadly got no help.

    Basically, I'm trying to find text in a file. Sounds easy enough, but the text I'm trying to find is never the same between files, nor is it in the same location. What IS the same, however, are pieces that come before and after it. The problem is that those individual lines that are the same throughout the multiple files are repeated throughout each individual file.. so I can't find just one line, I need to find the entire set.

    I'd give you an example, but let me make one up to make it easier. Suppose I had three text files with this information:

    File 1:
    1
    orange
    2
    3
    4
    apple
    apple
    orange
    banana
    5

    File 2:
    1
    orange
    2
    apple
    orange
    grape
    3
    4
    5
    apple
    6

    File 3:
    1
    2
    apple
    3
    4
    ...
    12253
    12254
    apple
    orange
    kumquat
    12255
    12256

    The challenge is to find the string (or strings, there should be more than one), but since they're always different and in a different location, we must find, not just "apple" and not just "orange" but "apple" followed immediately by "orange" (the pattern is actually about 8 lines long, but I used 2 in this example).

    I hacked together some code, which I think is incredibly pitiful.. found the location of every line of text, then compared the numbers in some nested "FOR" loops.. long story short, it's not working and I don't know why. I'm sure there's a better way, and I'd very much like to learn.

    Any help would be tremendously appreciated. ^.^
    There are two things in this world you should never worry about: that which you can change, and that which you cannot change.
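The requirement described in this post (find "apple" followed immediately by "orange", then read what comes next) can be sketched in Python. This is only an illustration under assumptions: the function name `lines_after_pattern` is made up, each file is modelled as a list of lines, File 3 is shortened, and the two-line pattern stands in for the real, longer one.

```python
def lines_after_pattern(lines, pattern):
    """Return (line_number, text) for each line that directly follows a
    run of consecutive lines equal to `pattern`. Line numbers are 1-based."""
    found = []
    width = len(pattern)
    # Stop early enough that a "following" line always exists.
    for start in range(len(lines) - width):
        if lines[start:start + width] == pattern:
            following = start + width          # 0-based index of the next line
            found.append((following + 1, lines[following]))
    return found

# The three example files from the post (File 3 shortened, so its
# line numbers differ from the original).
file1 = ["1", "orange", "2", "3", "4", "apple", "apple", "orange", "banana", "5"]
file2 = ["1", "orange", "2", "apple", "orange", "grape", "3", "4", "5", "apple", "6"]
file3 = ["1", "2", "apple", "3", "4", "apple", "orange", "kumquat", "12255"]

for name, data in (("File 1", file1), ("File 2", file2), ("File 3", file3)):
    print(name, lines_after_pattern(data, ["apple", "orange"]))
```

A lone "apple" or a lone "orange" is ignored; only the adjacent pair counts, which is the behaviour the post asks for.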

    Dias de verano

    • Guest
    Re: Finding multiple strings in a text file
    « Reply #1 on: May 03, 2009, 12:48:46 AM »
    Quote
    then compared the numbers in some nested "FOR" loops.. long story short, it's not working and I don't know why.

    That's roughly what I would be doing. A common problem with loops, nested or otherwise, is failure to use delayed expansion - or maybe there is some other thing that we might be able to spot.


    Freerefill

      Topic Starter


      Rookie

      Thanked: 2
      Re: Finding multiple strings in a text file
      « Reply #2 on: May 03, 2009, 01:13:20 AM »
      Alright.. well I was hesitant to post this because it seems, to me, to be very embarrassing and noobish.. but, if I'm on the right track and you can spot something, here it is:

      @echo off
      REM Clear existing files
      if exist C:\fr1.txt del C:\fr1.txt
      if exist C:\fr2.txt del C:\fr2.txt
      if exist C:\fr3.txt del C:\fr3.txt
      if exist C:\fr4.txt del C:\fr4.txt
      if exist C:\fr1a.txt del C:\fr1a.txt
      if exist C:\fr2a.txt del C:\fr2a.txt
      if exist C:\fr3a.txt del C:\fr3a.txt
      if exist C:\fr4a.txt del C:\fr4a.txt
      if exist C:\newfile.txt del C:\newfile.txt
      set success=0

      REM Export selected data to files
      findstr /I /N /X "{ACAD_REACTORS" C:\dwg.dxf > C:\fr1.txt
      findstr /I /N /X "330" C:\dwg.dxf > C:\fr2.txt
      findstr /I /N /X "C" C:\dwg.dxf > C:\fr3.txt
      findstr /I /N /X "102" C:\dwg.dxf > C:\fr4.txt

      REM Obtain the first number from the first file. May need a loop.
      for /f "eol= tokens=1 delims=:" %%i in (C:\fr1.txt) do @echo ^%%i>>C:\fr1a.txt
      for /f "eol= tokens=1 delims=:" %%i in (C:\fr2.txt) do @echo ^%%i>>C:\fr2a.txt
      for /f "eol= tokens=1 delims=:" %%i in (C:\fr3.txt) do @echo ^%%i>>C:\fr3a.txt
      for /f "eol= tokens=1 delims=:" %%i in (C:\fr4.txt) do @echo ^%%i>>C:\fr4a.txt

      REM Grab the first string in the first file
      for /f "eol= tokens=1 delims=" %%a in (c:\fr1a.txt) do (
      set /A you=%%a+1
      for /f "eol= tokens=1 delims="d %%b in (c:\fr2a.txt) do (
      set /A lost=%%b
      if "%you%"=="%lost%" set you >> c:\newfile.txt
      )
      )

      set success
      pause


      ... right at that "IF" statement is where it seems to screw up. The output -should- be every individual instance in which the two are the same.. but instead, the output is every individual instance in which the two are the same repeated for every instance that they're the same....

      If I can get the nested "FOR" loops to work, as well as that "IF" statement, I can nest a few more loops and check the totaled output.

      Dias de verano

      • Guest
      Re: Finding multiple strings in a text file
      « Reply #3 on: May 03, 2009, 01:46:00 AM »
      That code does not look n00bish at all! You should see some of the efforts we get on here!

      At first glance, a couple of points...

      1.

      I presume the 'd' is a typo in this line?

      Code:
      for /f "eol= tokens=1 delims="d %%b in (c:\fr2a.txt) do (
      2.

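      Cmd.exe, by default, expands percent-sign variables when a line is parsed, not when it is executed, and a whole parenthetical expression (e.g. extended IF structure or FOR loop) is parsed in one go. A percent-sign variable set inside such a block and read inside the same block therefore shows the value it had before the block started, which is often blank. This can trip up the unwary. The cmd language in Windows 2000 and later includes "delayed expansion". You can use it within a batch file by doing 2 things: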

      (a) including this line somewhere before the code in question. At the beginning straight after @echo off is a good place.

      Code:
      setlocal enabledelayedexpansion
      (b) To be expanded correctly the variables in question have ! exclamation points and not % percent signs. Thus:

      Code:
      if "!you!"=="!lost!"
      ... Google for "delayed expansion" for a fuller description.

      3.

      Did you mean 'echo' (not 'set') here? And did you mean literal "you" or the contents of the variable !you! ... ?

      Code:
      if "%you%"=="%lost%" set you >> c:\newfile.txt
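The difference between early and delayed expansion can be modelled with a toy interpreter in Python. This is a sketch under loud assumptions: `expand_percent` and `run_block` are invented names, the "block" understands only `set` and `echo`, and for simplicity the same `%name%` syntax is used in both modes, whereas real cmd uses `!name!` for delayed expansion.

```python
import re

def expand_percent(text, env):
    """Replace every %name% in text with env's current value ('' if unset)."""
    return re.sub(r"%(\w+)%", lambda m: env.get(m.group(1), ""), text)

def run_block(statements, delayed):
    """Run a toy block of 'set name=value' and 'echo text' statements."""
    env, output = {}, []
    if not delayed:
        # Early expansion: the whole block is expanded once, before it runs.
        statements = [expand_percent(s, env) for s in statements]
    for statement in statements:
        if delayed:
            # Delayed expansion: expand just before each statement executes.
            statement = expand_percent(statement, env)
        command, _, argument = statement.partition(" ")
        if command == "set":
            name, _, value = argument.partition("=")
            env[name] = value
        elif command == "echo":
            output.append(argument)
    return output

block = ["set you=5", "echo you is [%you%]"]
print(run_block(block, delayed=False))  # ['you is []']
print(run_block(block, delayed=True))   # ['you is [5]']
```

With early expansion the `echo` was already rewritten before `set` ran, so it prints an empty value; with delayed expansion it sees the value just assigned.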

      gh0std0g74



        Apprentice

        Thanked: 37
        Re: Finding multiple strings in a text file
        « Reply #4 on: May 03, 2009, 01:59:30 AM »
        What should your final output be?

        Freerefill

          Topic Starter


          Rookie

          Thanked: 2
          Re: Finding multiple strings in a text file
          « Reply #5 on: May 03, 2009, 02:17:11 AM »
          Nice! It's starting to work! Thank you very much, Dias!

          So the 4-deep nested "FOR" loop proved to take a wee bit too long to process (I had enough time to wait, rethink, and edit the code into something that would work faster before it finished) so I re-worked it to three 2-deep nested loops:

          REM Grab the first string in the first file
          for /f "eol= tokens=1 delims=" %%a in (c:\fr1a.txt) do (
          set /A var1=%%a+3
          for /f "eol= tokens=1 delims=" %%b in (c:\fr2a.txt) do (
          set /A var2=%%b+2
          if "!var1!"=="!var2!" echo !var1!>> c:\for1.txt
          )
          )

          for /f "eol= tokens=1 delims=" %%a in (c:\fr3a.txt) do (
          set /A var1=%%a+1
          for /f "eol= tokens=1 delims=" %%b in (c:\for1.txt) do (
          set /A var2=%%b
          if "!var1!"=="!var2!" echo !var1!>> c:\for2.txt
          )
          )

          for /f "eol= tokens=1 delims=" %%a in (c:\fr4a.txt) do (
          set /A var1=%%a
          for /f "eol= tokens=1 delims=" %%b in (c:\for2.txt) do (
          set /A var2=%%b
          if "!var1!"=="!var2!" echo !var1!>> c:\for3.txt
          )
          )

          Yes, the "d" was a typo.. still scratching my head over that one.

          Enabling delayed expansion did the trick, the output is becoming what I want it to be now. Unfortunately, the repetition of the pattern throughout the files has provided me with 11 places in which the pattern repeats, and I'm searching for 1...

          Which brings me to my next problem. Right now, I have 10 "FOR" loops (three of them nested), 11 files being created and the file I'm searching through is a slim 45KB instead of the 10,000KB that it could get up to. And, I'm only searching for a pattern which is 4 lines of text. If I need to expand that into 8 or more, that's going to start consuming time. Is there any way to manipulate the flow of data so that it works more efficiently?

          As for output, what I'm looking for is a single instance of a known pattern within the file, which comes immediately before the data I'm searching for. In order to find its location, I grabbed the line numbers from each text I was searching for, and output them, then compared them. My comparison of the 4 files (one for each string in the pattern) has yielded 11 locations where the 4-string pattern is found. I need to find a pattern which yields 1 single location. Once I can find that, I'm sure I can figure out the rest on my own. Don't want to make it too easy on myself. :P
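One way to avoid comparing every line number against every other in nested loops is to turn each list of line numbers into a set of candidate start lines and intersect the sets. The Python sketch below illustrates that idea only; `pattern_starts` is a made-up name and the sample data is invented, not taken from the attached file.

```python
def pattern_starts(lines, pattern):
    """Return 1-based line numbers where `pattern` begins as consecutive lines."""
    candidates = None
    for offset, wanted in enumerate(pattern):
        # Where would the pattern have to start for `wanted` to sit at this offset?
        starts = {number - offset
                  for number, text in enumerate(lines, start=1)
                  if text == wanted}
        # Keep only the start lines that every pattern string agrees on.
        candidates = starts if candidates is None else candidates & starts
    return sorted(candidates) if candidates else []

sample = ["0", "{ACAD_REACTORS", "330", "C", "102", "data", "330", "C", "102", "330"]
print(pattern_starts(sample, ["{ACAD_REACTORS", "330", "C", "102"]))  # [2]
print(pattern_starts(sample, ["330", "C", "102"]))                    # [3, 7]
```

Each pattern string costs one pass over the file, so lengthening the pattern from 4 to 8 lines roughly doubles the work instead of adding another level of nesting.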

          gh0std0g74



            Apprentice

            Thanked: 37
            Re: Finding multiple strings in a text file
            « Reply #6 on: May 03, 2009, 02:25:14 AM »
            To a newbie, it's a good effort. To a non-newbie, the code is very inefficient. 10 nested FOR loops? There are definitely better ways to do what you want; I just can't understand what you are describing. That's why I asked you to show your final output! It would be best if you could also show samples of your input files, because your examples are apples and oranges, and nowhere did I see {ACAD_REACTORS or 330.

            Dias de verano

            • Guest
            Re: Finding multiple strings in a text file
            « Reply #7 on: May 03, 2009, 02:28:34 AM »
            Query...

            Source file...

            Code:
            Some lines
            Of text
            That are irrelevant (?)
            line containing (or consisting of?) {ACAD_REACTORS
            line containing (or consisting of?) 330           
            line containing (or consisting of?) C             
            line containing (or consisting of?) 102           
            The Data
            You are
            Looking for
            (How many lines?)


            Dias de verano

            • Guest
            Re: Finding multiple strings in a text file
            « Reply #8 on: May 03, 2009, 02:29:41 AM »
            Quote
            To a newbie, it's a good effort. To a non-newbie, the code is very inefficient. 10 nested FOR loops? There are definitely better ways to do what you want; I just can't understand what you are describing. That's why I asked you to show your final output!

            The multiple temp files are inefficient too

            gh0std0g74



              Apprentice

              Thanked: 37
              Re: Finding multiple strings in a text file
              « Reply #9 on: May 03, 2009, 02:32:57 AM »
              Quote
              The multiple temp files are inefficient too

              true

              Freerefill

                Topic Starter


                Rookie

                Thanked: 2
                Re: Finding multiple strings in a text file
                « Reply #10 on: May 03, 2009, 02:33:55 AM »
                I've attached the file I'm searching through, and here's my code as it stands:

                @echo off
                REM Clear existing files
                if exist C:\fr1.txt del C:\fr1.txt
                if exist C:\fr2.txt del C:\fr2.txt
                if exist C:\fr3.txt del C:\fr3.txt
                if exist C:\fr4.txt del C:\fr4.txt
                if exist C:\fr1a.txt del C:\fr1a.txt
                if exist C:\fr2a.txt del C:\fr2a.txt
                if exist C:\fr3a.txt del C:\fr3a.txt
                if exist C:\fr4a.txt del C:\fr4a.txt
                if exist C:\for1.txt del C:\for1.txt
                if exist C:\for2.txt del C:\for2.txt
                if exist C:\for3.txt del C:\for3.txt
                setlocal enabledelayedexpansion
                set success=0

                REM Export selected data to files
                findstr /I /N /X "{ACAD_REACTORS" C:\dwg.txt > C:\fr1.txt
                findstr /I /N /X "330" C:\dwg.txt > C:\fr2.txt
                findstr /I /N /X "C" C:\dwg.txt > C:\fr3.txt
                findstr /I /N /X "102" C:\dwg.txt > C:\fr4.txt

                REM Obtain the first number from the first file. May need a loop.
                for /f "eol= tokens=1 delims=:" %%i in (C:\fr1.txt) do @echo ^%%i>>C:\fr1a.txt
                for /f "eol= tokens=1 delims=:" %%i in (C:\fr2.txt) do @echo ^%%i>>C:\fr2a.txt
                for /f "eol= tokens=1 delims=:" %%i in (C:\fr3.txt) do @echo ^%%i>>C:\fr3a.txt
                for /f "eol= tokens=1 delims=:" %%i in (C:\fr4.txt) do @echo ^%%i>>C:\fr4a.txt

                REM Grab the first string in the first file
                for /f "eol= tokens=1 delims=" %%a in (c:\fr1a.txt) do (
                set /A var1=%%a+3
                for /f "eol= tokens=1 delims=" %%b in (c:\fr2a.txt) do (
                set /A var2=%%b+2
                if "!var1!"=="!var2!" echo !var1!>> c:\for1.txt
                )
                )

                for /f "eol= tokens=1 delims=" %%a in (c:\fr3a.txt) do (
                set /A var3=%%a+1
                for /f "eol= tokens=1 delims=" %%b in (c:\for1.txt) do (
                set /A var4=%%b
                if "!var3!"=="!var4!" echo !var3!>> c:\for2.txt
                )
                )

                for /f "eol= tokens=1 delims=" %%a in (c:\fr4a.txt) do (
                set /A var5=%%a
                for /f "eol= tokens=1 delims=" %%b in (c:\for2.txt) do (
                set /A var6=%%b
                if "!var5!"=="!var6!" echo !var5!>> c:\for3.txt
                )
                )

                pause

                ... set up to run as a .bat file. Running it yields the following, in the file C:\for3.txt:

                1905
                1921
                1937
                1965
                1993
                2017
                2037
                2053
                2077
                2225
                2245
                2325
                2345

                Which is the line number for every instance of the string "102" which immediately follows the strings "{ACAD_REACTORS", "330" and "C"

                [attachment deleted by admin]

                Dias de verano

                • Guest
                Re: Finding multiple strings in a text file
                « Reply #11 on: May 03, 2009, 02:40:43 AM »
                So what is it that you want to happen? Do you want to know the location of the first occurrence of the line following your 4-line sequence? Or the location of all of them?

                Is it the case that the 4-line sequence will only occur once in a dwg file?



                Freerefill

                  Topic Starter


                  Rookie

                  Thanked: 2
                  Re: Finding multiple strings in a text file
                  « Reply #12 on: May 03, 2009, 02:47:24 AM »
                  In a nutshell, I'm trying to find all of them. Which is what the code is currently doing.

                  What I want to do is expand the pattern to more strings, from 4 to possibly upwards of 10 or more, because that will narrow down the search, and if I'm correct, yield one single instance of a pattern, yet to be determined, from which I can gather the data I really need, which I'm sure I can do on my own.

                  If you look through the attached file, search for the string "Layout1". It will appear twice within the file. The string represents a facet in a drawing program, a layout tab to be precise (those of you who have worked with AutoCAD will understand this better). You'll also find, near it, a "Layout2" and a "Model". Those bits of data are what I ultimately want to find. If you look just before the first set in which these three bits of data appear, you'll see the 4 lines that my code currently searches for.

                  As I said, "Layout1" will not always be "Layout1". It could be "Sheet1" or "Tab1" or "Pecan Butter". It can be anything. Hence the need to find something that IS common.

                  If I'm still confusing you, let me know... but I don't know how else I can describe it...
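The idea of lengthening the pattern until it yields a single instance can be tried out mechanically. The Python sketch below assumes you already know, in one sample file, the index of the line you are after; it then grows the run of preceding lines until that run occurs exactly once. The names `count_matches` and `shortest_unique_context` are hypothetical, the data is invented, and a pattern that is unique in one file is not guaranteed to be unique in another.

```python
def count_matches(lines, pattern):
    """Count the places where `pattern` occurs as consecutive lines."""
    width = len(pattern)
    return sum(1 for i in range(len(lines) - width + 1)
               if lines[i:i + width] == pattern)

def shortest_unique_context(lines, target_index, max_len=10):
    """Grow the run of lines just before lines[target_index] until that
    run occurs exactly once in the file; return it, or None if none is found."""
    for width in range(1, max_len + 1):
        start = target_index - width
        if start < 0:
            break
        pattern = lines[start:target_index]
        if count_matches(lines, pattern) == 1:
            return pattern
    return None

sample = ["{ACAD_REACTORS", "330", "C", "102", "first",
          "0", "330", "C", "102", "second"]
print(shortest_unique_context(sample, 9))  # ['0', '330', 'C', '102']
print(shortest_unique_context(sample, 4))  # ['{ACAD_REACTORS', '330', 'C', '102']
```

Here three lines of context are ambiguous (they occur twice) and the fourth line is what separates the two locations.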

                  gh0std0g74



                    Apprentice

                    Thanked: 37
                    Re: Finding multiple strings in a text file
                    « Reply #13 on: May 03, 2009, 02:58:44 AM »
                    here's a vbscript
                    Code:
                    Set objFS = CreateObject("Scripting.FileSystemObject")
                    strFile = "c:\test\file.txt"
                    Set objFile = objFS.OpenTextFile(strFile)
                    ' first, sec and third hold the 1st, 2nd and 3rd most recent lines already read
                    Dim first
                    Dim sec
                    Dim third
                    first=""
                    sec =""
                    third=""
                    i=0 'line count
                    Do Until objFile.AtEndOfStream
                        i=i+1
                        strLine = objFile.ReadLine
                        If strLine = "102" Then
                            ' report only when the three preceding lines complete the pattern
                            If third = "{ACAD_REACTORS" And sec="330" And first="C" Then
                                WScript.Echo "line number: " , i
                                WScript.Echo "third: " ,third
                                WScript.Echo "sec: " ,sec
                                WScript.Echo "first: " ,first
                            End If
                        End If
                        ' shift the history back by one line
                        third = sec
                        sec = first
                        first=strLine
                    Loop
                    objFile.Close


                    output:
                    Code:
                    C:\test>cscript /nologo test.vbs |more
                    line number:  1905
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  1921
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  1937
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  1965
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  1993
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  2017
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  2037
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  2053
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  2077
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  2225
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  2245
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  2325
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
                    line number:  2345
                    third:  {ACAD_REACTORS
                    sec:  330
                    first:  C
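The script in this post remembers the previous three lines in three variables and shifts them along on every read. The same single-pass idea extends to a pattern of any length with a fixed-size queue. The following Python sketch is an illustration, not a translation of the VBScript: `match_line_numbers` is an invented name and the demo text is made up.

```python
from collections import deque
import io

def match_line_numbers(stream, pattern):
    """Read `stream` once and return the 1-based number of the last line
    of every run of consecutive lines equal to `pattern`."""
    window = deque(maxlen=len(pattern))   # oldest line drops out automatically
    hits = []
    for number, raw in enumerate(stream, start=1):
        window.append(raw.rstrip("\r\n"))
        if list(window) == pattern:
            hits.append(number)
    return hits

demo = io.StringIO("a\n{ACAD_REACTORS\n330\nC\n102\n\n330\nC\n102\n")
print(match_line_numbers(demo, ["{ACAD_REACTORS", "330", "C", "102"]))  # [5]
```

Going from 4 pattern lines to 8 or 10 only changes the list passed in; the file is still read once and no temporary files are needed.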

                    Freerefill

                      Topic Starter


                      Rookie

                      Thanked: 2
                      Re: Finding multiple strings in a text file
                      « Reply #14 on: May 03, 2009, 11:08:42 AM »
                      Thank you.. but I was really looking for a .bat file..

                      I'm well aware of the inefficiencies, but since I'm still new to this, I have no clue how to improve. I did the best I could with what I had, and since I still don't even know exactly how a "token" works, I had to produce some sort of visual output to make sure I was on the right track. Hence the multiple output files.

                      All that said.. if there are so many problems with my code.. how do I fix them? And no, I'm not looking for a quick fix, I want to know the methods and implement them myself.. I do want to learn.
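Since "tokens" come up here: FOR /F cuts each line into pieces at the delimiter characters, and `tokens=` picks which pieces land in the loop variables, with a trailing `*` meaning "the rest of the line". The Python function below is a rough model of that splitting, written for illustration. `for_f_split` is a made-up name, and the model ignores other FOR /F options such as `eol`, `skip` and `usebackq`.

```python
def for_f_split(line, delims, ntokens):
    """Rough model of FOR /F "tokens=1-N* delims=...".
    Returns (first N tokens, remainder of the line)."""
    tokens = []
    rest = line
    for _ in range(ntokens):
        rest = rest.lstrip(delims)      # leading and repeated delimiters are skipped
        if not rest:
            break
        # A token runs up to the next delimiter character (or end of line).
        end = next((i for i, ch in enumerate(rest) if ch in delims), len(rest))
        tokens.append(rest[:end])
        rest = rest[end:]
    return tokens, rest.lstrip(delims)

# findstr /N output looks like "1905:102"; tokens=1 delims=: keeps the number.
print(for_f_split("1905:102", ":", 1))    # (['1905'], '102')
# find /n output looks like "[1905]102"; tokens=1* delims=] keeps the text.
print(for_f_split("[1905]102", "]", 1))   # (['[1905'], '102')
```

So in `for /f "tokens=1 delims=:" %%i`, `%%i` receives the line number that `findstr /N` put in front of the colon.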

                      Dias de verano

                      • Guest
                      Re: Finding multiple strings in a text file
                      « Reply #15 on: May 04, 2009, 01:11:55 AM »
                      Batch example:

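                      This borrows gh0std0g74's scheme of shifting the strings back on each pass, so that the previous 3 lines read from the source file are always available.

                      The FOR line is written this way to avoid the (apparently documented) problem of FOR /F skipping blank lines: find /n /v "" prefixes every line, including blank ones, with its line number in square brackets, so no line reaches FOR /F empty.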

                      [Details in alt.msdos.batch.nt thread headed "Bug in For /F" (long Google Groups link shortened)]: http://tinyurl.com/cwr484

                      I wondered why the line numbers in my output didn't match those from gh0std0g74's VBscript, but then I noticed that my batch was skipping blank lines in the input file. They match now.

                      Code:

                      @echo off
                      setlocal enabledelayedexpansion
                      REM string1 to string3 hold the three lines read before the current one
                      set string3=
                      set string2=
                      set string1=
                      REM i counts lines read, f counts matches found, seq counts pattern lines matched
                      set i=0
                      set f=0
                      set seq=0

                      REM The pattern, oldest line first: patt3, patt2, patt1, then patt0
                      set patt3={ACAD_REACTORS
                      set patt2=330
                      set patt1=C
                      set patt0=102

                      set inputfile=dwg.txt

                      REM find /n /v "" numbers every line, blank ones too, so FOR /F skips none
                      for /f "tokens=1* delims=]" %%a in ('find /n /v "" ^<%inputfile%') do (
                          set /a i+=1
                          set string0=%%b
                          if "!string0!"=="%patt0%" (
                              set seq=1
                              if "!string3!"=="%patt3%" set /a seq+=1
                              if "!string2!"=="%patt2%" set /a seq+=1
                              if "!string1!"=="%patt1%" set /a seq+=1
                          )
                          if !seq! equ 4 (
                              set /a f+=1

                              echo !string3!
                              echo !string2!
                              echo !string1!
                              echo !string0! [Line !i!] [!f!]
                              echo --------------------------

                              set seq=0
                          )
                          REM Shift the history back by one line
                          set string3=!string2!
                          set string2=!string1!
                          set string1=!string0!
                      )



                      « Last Edit: May 04, 2009, 01:35:43 AM by Dias de verano »
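The blank-line remark in the last post matters for two reasons, which a few lines of Python can show. This is an illustrative sketch with an invented helper, `number_lines`, and invented data: when blank lines are dropped before counting, every later line number comes out too low, and lines that were separated by a blank line look adjacent.

```python
def number_lines(lines, skip_blank):
    """Number lines from 1, optionally dropping blank lines first
    (which is what a plain FOR /F over a file effectively does)."""
    kept = [text for text in lines if text or not skip_blank]
    return list(enumerate(kept, start=1))

sample = ["330", "", "C", "", "102"]
print(number_lines(sample, skip_blank=True))
# [(1, '330'), (2, 'C'), (3, '102')]
print(number_lines(sample, skip_blank=False))
# [(1, '330'), (2, ''), (3, 'C'), (4, ''), (5, '102')]
```

In the first result "102" appears to be on line 3 and directly after "C", although in the file it is on line 5 with a blank line in between.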