Finding duplicates lines in text files


  • Finding duplicates lines in text files

    Hi guys

    I'll soon be merging 3 sites into one, and I'm trying to get an idea of the duplicates so I can sort through the data.

    I've used DIR /s /AD /b > blah.txt to dump a couple of shares into a text file. What I'm after is a script (or something similar) that can compare the lines and report on duplicates.

    Any ideas? I've tried a couple of shareware apps, but they are awful.

    Thanks

  • #2
    Re: Finding duplicates lines in text files

    When you say "merging 3 sites"... what do you mean by sites? Active Directory sites? Websites?

    Please give full details of what you are trying to achieve.
    Gareth Howells

    BSc (Hons), MBCS, MCP, MCDST, ICCE

    Any advice is given in good faith and without warranty.

    Please give reputation points if somebody has helped you.

    "For by now I could have stretched out my hand and struck you and your people with a plague that would have wiped you off the Earth." (Exodus 9:15) - I could kill you with my thumb.

    "Everything that lives and moves will be food for you." (Genesis 9:3) - For every animal you don't eat, I'm going to eat three.



    • #3
      Re: Finding duplicates lines in text files

      3 offices will be migrating into 1.

      We have plenty of shares, but one in particular (on each site) holds ALL of our clients' data and files (different offices deal with different clients, but there will be some duplicates).

      I have 3 text files with all of the folders listed for each site, and I simply want to print the duplicate folders so I can investigate file dates etc.

      It's basically to make the move a lot smoother, so I can simply copy it all over instead of comparing everything and worrying about overwriting newer data.
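
      One thing to watch when comparing the listings: DIR /s /b writes full paths, so the same client folder on two different servers only matches once the server and share prefix is stripped off. A minimal sketch of that step, assuming Python is available; the function name and the example paths are hypothetical and not taken from the real shares:

```python
def relative_folder(path, root):
    """Return `path` relative to `root`, lower-cased.

    Returns "" when path is the root itself and None when path does not
    live under root. Both arguments are Windows-style paths.
    """
    p = path.strip().lower().rstrip("\\")
    r = root.strip().lower().rstrip("\\")
    if p == r:
        return ""
    if p.startswith(r + "\\"):
        return p[len(r) + 1:]
    return None


if __name__ == "__main__":
    # Hypothetical listing line and share root, for illustration only.
    print(relative_folder(r"\\SRV1\Clients\Acme\2008", r"\\srv1\clients"))
```

      Running each listing through a helper like this before comparing means lines are matched on the client folder rather than on the server name.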



      • #4
        Re: Finding duplicates lines in text files

        Originally posted by ethos:
        I have 3 text files with all of the folders listed for each site and I wanted to simply print the duplicate folders so I can investigate file dates etc.
        If you have access to a database, you could create three tables (one for each office) and insert the file names (and maybe other properties as well) into them. You could then fairly easily query the DB for files that exist in more than one of the tables. I'd pick this approach if you have at least a few thousand files on each site.
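
        A rough sketch of the database idea, assuming Python with its bundled sqlite3 module rather than any particular database server. For brevity it uses a single table with an office column instead of three separate tables; the table, column, and function names are made up for the example:

```python
import sqlite3


def folders_in_multiple_offices(listings):
    """listings maps an office name to an iterable of folder names.

    Returns (folder, office_count) rows for every folder that appears in
    more than one office, compared case-insensitively.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE folders (office TEXT, name TEXT)")
    for office, names in listings.items():
        conn.executemany(
            "INSERT INTO folders (office, name) VALUES (?, ?)",
            [(office, n.strip().lower()) for n in names if n.strip()],
        )
    rows = conn.execute(
        """
        SELECT name, COUNT(DISTINCT office)
        FROM folders
        GROUP BY name
        HAVING COUNT(DISTINCT office) > 1
        ORDER BY name
        """
    ).fetchall()
    conn.close()
    return rows
```

        In practice each value in listings could simply be an open file handle on one of the DIR listings.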

        For the actual file management and comparison, there are tools like Beyond Compare, which is quite useful for comparing and merging files and folders.

        -vP



        • #5
          Re: Finding duplicates lines in text files

          You could Google it:

          http://www.google.com/search?hl=en&q...te+file+finder



          • #6
            Re: Finding duplicates lines in text files

            Merge the text files (you can use "COPY /B" to accomplish this),
            then run this batch to find duplicate lines.
            Code:
            :: "Search for duplicate lines"
            :: forums.petri.com/showthread.php?t=32793
            :: author: Remco Simons [NL] 2009
            
            @echo off
            Setlocal ENABLEDELAYEDEXPANSION
            
            ::# Search for duplicate lines in:
            Set "TXTFile=c:\test\test.txt"
            
            echo/%TXTFile%
            echo/                  (Empty lines are not being counted!) &echo/
            echo/------------------------------------------------------------------------------+
            
            title Find duplicate lines:
            Set "skipLines=,"
            For /f "usebackq delims=" %%! in ("%TXTFile%") do (
              Set/a lnCnt2=0
              Set/a iCnt3=0
              Set "fndLines="
              Set "doubleLine="
              Set/a lnCnt1=!lnCnt1!+1
              Set "readline=%%!"
            
              For /f "usebackq delims=" %%* in ("%TXTFile%") do (
                Set/a lnCnt2=!lnCnt2!+1
                If !lnCnt1! LSS !lnCnt2! (
                  Set/a "l=10000+!lnCnt2!" & Set "l=!l:~1!"
                  (echo/!skipLines! |Find /v ",!l!,")>nul &&(
                    Set "compareline=%%*"
                    If /i "!readline!" EQU "!compareline!" (
                      Set/a iCnt3=!iCnt3!+1
                      Set "skipLines=!skipLines:~0,-1!,!l!,"
                      Set "fndLines=!fndLines!, !l!"
                      Set "doubleLine=!compareline!"
                    )
                  )
                )
              )
            
              If !iCnt3! GTR 0 (
                Set "fndLines=!fndLines:~2!"
                ECHO/Line !lnCnt1!:
                ECHO/"!doubleLine!"
                ECHO/, the same line was found !iCnt3! more times
                ECHO/  at the line(s^): !fndLines!
                echo/------------------------------------------------------------------------------+
              )
            )
            
            echo/&echo/Done & pause>nul
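
            For comparison, the same kind of report can be built in a single pass over the file instead of re-reading it for every line. This is only a sketch, assuming Python is available; the function name is made up for the example. Unlike the batch above, it reports physical line numbers, so blank lines are counted in the numbering but are never reported as duplicates:

```python
from collections import defaultdict


def find_duplicate_lines(lines):
    """Return {line_text: [line numbers]} for lines seen more than once.

    Line numbers are 1-based. Comparison ignores case; the text is
    reported as it was first seen. Blank lines are skipped.
    """
    seen = defaultdict(list)   # lower-cased text -> line numbers
    first_text = {}            # lower-cased text -> text as first seen
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        if not text:
            continue
        key = text.lower()
        first_text.setdefault(key, text)
        seen[key].append(number)
    return {first_text[k]: nums for k, nums in seen.items() if len(nums) > 1}
```

            Passing an open file handle on the merged listing works, since the function only iterates over its argument once.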
            \Rems
            Last edited by Rems; 9th February 2009, 19:26.

            This posting is provided "AS IS" with no warranties, and confers no rights.

            __________________

            ** Remember to give credit where credit's due **
            and leave Reputation Points for meaningful posts



            • #7
              Re: Finding duplicates lines in text files

              Originally posted by joeqwerty:
              I'm comparing lines in a text document, not files, and like I said, I've tried a few shareware apps with no luck.

              A lot of them also just let you select two directories and find duplicate files.

              If I could find it on Google easily, I wouldn't have made a thread.



              • #8
                Re: Finding duplicates lines in text files

                Originally posted by Rems:
                Merge the text files (you can use "COPY /B" to accomplish this).
                then,

                Run this batch to find duplicate lines.
                Excellent, that should do the job!

                Thanks very much Rems



                • #9
                  Re: Finding duplicates lines in text files

                  Or simply use Notepad++
                  Marcel
                  Technical Consultant
                  Netherlands
                  http://www.phetios.com
                  http://blog.nessus.nl

                  MCITP(EA, SA), MCSA/E 2003:Security, CCNA, SNAF, DCUCI, CCSA/E/E+ (R60), VCP4/5, NCDA, NCIE - SAN, NCIE - BR, EMCPE
                  "No matter how secure, there is always the human factor."

                  "Enjoy life today, tomorrow may never come."
                  "If you're going through hell, keep going. ~Winston Churchill"



                  • #10
                    Re: Finding duplicates lines in text files

                    Originally posted by Dumber:
                    Or simply use Notepad++
                    I did have a look at Notepad++ but couldn't find the option, so I assumed it wasn't possible.



                    • #11
                      Re: Finding duplicates lines in text files

                      With Notepad++ you can compare files.
                      I've used it more than once.


                      • #12
                        Re: Finding duplicates lines in text files

                        Originally posted by ethos:
                        I did have a look at Notepad++ but couldn't find the option, assumed it wasn't possible.
                        For Notepad++ you can use the 'Compare Plugin' to show the differences between 2 files (side by side).
                        This plugin has been included since Notepad++ v5.

                        But since you have more than two files, using Notepad++ would be too complicated. Besides that, you are not looking for the differences between the files: you want to find duplicate lines, so you should be looking for the similarities between the files instead. I'm not sure whether the plugin is capable of doing this.

                        The batch in my previous reply is very easy to write and to use.
                        Run it three times; if it gives the same results all three times, then it works for your merged files.

                        If the batch does not work (i.e. when the file is large and contains a lot of duplicate lines),
                        then use a VBScript (oh well, you can use the script anyway since it is already written):
                        Code:
                        '# "Search for duplicate lines"
                        '# forums.petri.com/showthread.php?t=32793
                        '# author: Remco Simons [NL] 2009
                        
                        Const ForReading = 1
                        Const dictKey    = 1
                        Const dictItem   = 2
                        
                        ' Search for duplicate lines in:
                        TXTFile = "c:\test\test.txt"
                        
                        Set objDictionary1 = CreateObject("Scripting.Dictionary")
                        objDictionary1.CompareMode = vbTextCompare
                        
                        Set objDictionary2 = CreateObject("Scripting.Dictionary")
                        objDictionary2.CompareMode = vbTextCompare
                        
                        Set objFSO = CreateObject("Scripting.FileSystemObject")
                        Set objFile = objFSO.OpenTextFile(TXTFile, ForReading)
                        
                        iCnt = 0
                        Do Until objFile.AtEndOfStream
                            strName = objFile.ReadLine
                            If Not strName = "" Then  
                               iCnt = iCnt + 1
                               If (objDictionary1.Exists(strName) = False) Then
                                  objDictionary1.Add strName, iCnt
                               Else
                                  foundfirst = objDictionary1.Item(strName)
                                  ' pad line numbers to 8 digits so the keys stay unique
                                  ' in files with more than 9999 lines
                                  foundfirst = Right("00000000"&CStr(foundfirst),8) _
                                   & " - - - - - - - - - - - - - - - - - - -"
                                  If (objDictionary2.Exists(foundfirst) = False) Then
                                     objDictionary2.Add foundfirst, UCase(strName)
                                  End If
                                  objDictionary2.Add Right("00000000"&CStr(iCnt),8), strName
                               End If
                            End If
                        Loop
                        
                        objFile.Close
                        objDictionary1.RemoveAll
                        SortDictionary objDictionary2, dictItem, dictKey
                        
                        Dim arrOut()
                        Z = objDictionary2.Count: If Z > 0 Then
                          ' create an 2D-array to store dictionary information
                          ReDim arrOut(Z,2)
                          X = 0
                          ' populate the string array
                          For Each objKey In objDictionary2
                             arrOut(X,dictKey)  = CStr(objKey)
                             arrOut(X,dictItem) = CStr(objDictionary2(objKey))
                             X = X + 1
                          Next
                          wscript.echo Join2D(arrOut, vbNewLine)
                        Else
                          wscript.echo "No duplicate entries found"
                        End If
                        
                        objDictionary2.RemoveAll
                        wscript.quit
                        
                        
                        
                        Function SortDictionary(objDict, intSortA, intSortB)
                          ' http://support.microsoft.com/kb/246067
                          ' declare our variables
                          Dim strDict()
                          Dim objKey
                          Dim strKey,strItem
                          Dim X,Y,Z
                        
                          ' get the dictionary count
                          Z = objDict.Count
                        
                          ' we need more than one item to warrant sorting
                          If Z > 1 Then
                            ' create an array to store dictionary information
                            ReDim strDict(Z,2)
                            X = 0
                            ' populate the string array
                            For Each objKey In objDict
                                strDict(X,dictKey)  = CStr(objKey)
                                strDict(X,dictItem) = CStr(objDict(objKey))
                                X = X + 1
                            Next
                        
                            ' perform a simple exchange sort of the string array
                            For X = 0 to (Z - 2)
                              For Y = X to (Z - 1)
                                If StrComp((strDict(X,intSortA) & strDict(X,intSortB)), _
                                 (strDict(Y,intSortA) & strDict(Y,intSortB)), _
                                 vbTextCompare) > 0 Then
                                    strKey  = strDict(X,dictKey)
                                    strItem = strDict(X,dictItem)
                                    strDict(X,dictKey)  = strDict(Y,dictKey)
                                    strDict(X,dictItem) = strDict(Y,dictItem)
                                    strDict(Y,dictKey)  = strKey
                                    strDict(Y,dictItem) = strItem
                                End If
                              Next
                            Next
                        
                            ' erase the contents of the dictionary object
                            objDict.RemoveAll
                        
                            ' repopulate the dictionary with the sorted information
                            For X = 0 to (Z - 1)
                              objDict.Add strDict(X,dictKey), strDict(X,dictItem)
                            Next
                        
                          End If
                        End Function
                        
                        Function Join2D(varData, strDelim)
                           Dim strOutput, i, j
                           For i = LBound(varData, 1) to UBound(varData, 1)
                              For j = LBound(varData, 2) to UBound(varData, 2)
                                 If Len(strOutput) > 0 _
                                   Then strOutput = stroutput & strDelim
                                 strOutput = stroutput & varData(i, j)
                              Next
                           Next
                           Join2D = strOutput
                        End Function
                        If many duplicates are found, it is better to redirect the output of the batch or the VBScript to a text file instead of sending it to the screen.

                        \Rems
                        Last edited by Rems; 10th February 2009, 23:21.



                        • #13
                          Re: Finding duplicates lines in text files

                          Originally posted by Dumber:
                          With Notepad++ you can compare files.
                          I've used it more than once.
                          This 'duplicate % file % finder' in Clone % Remover can help with it. It can compare files.. well, and delete unnecessary copies...
                          Last edited by Rems; 23rd August 2009, 09:33. Reason: Live link REMOVED



                          • #14
                            Re: Finding duplicates lines in text files

                            I actually ended up using a PHP script:

                            To find duplicates from within one text file:
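                            <?php
                            // Read the file into an array of lines.
                            $file = file('textfile.txt');
                            $a = array();

                            foreach($file as $f)
                            {
                                // Normalise each line so case and line endings are ignored.
                                $line = strtolower(trim($f));
                                if($line === '') { continue; } // skip blank lines
                                // Start the counter at 1 on first sight to avoid an undefined-index notice.
                                $a[$line] = isset($a[$line]) ? $a[$line] + 1 : 1;
                            }

                            // Print every line that occurred more than once.
                            foreach($a as $key => $value)
                            {
                                if($value > 1)
                                {
                                    echo $key . PHP_EOL;
                                }
                            }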

                            To compare two text files:

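                            <?php
                            // file() keeps the line endings, so trim each line before comparing.
                            $a = file('textfile1.txt');
                            $b = file('textfile2.txt');

                            foreach($a as $key => $value)
                            {
                                $a[$key] = strtolower(trim($value));
                            }

                            foreach($b as $key => $value)
                            {
                                $b[$key] = strtolower(trim($value));
                            }

                            // Print every line of the first file that also appears in the second.
                            foreach(array_intersect($a, $b) as $x) { echo $x . PHP_EOL; }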

                            To compare three text files:

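                            <?php
                            // file() keeps the line endings, so trim each line before comparing.
                            $a = file('textfile1.txt');
                            $b = file('textfile2.txt');
                            $c = file('textfile3.txt');

                            foreach($a as $key => $value)
                            {
                                $a[$key] = strtolower(trim($value));
                            }

                            foreach($b as $key => $value)
                            {
                                $b[$key] = strtolower(trim($value));
                            }

                            foreach($c as $key => $value)
                            {
                                $c[$key] = strtolower(trim($value));
                            }

                            // Print every line of the first file that appears in both other files.
                            foreach(array_intersect($a, $b, $c) as $x) { echo $x . PHP_EOL; }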
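
                            Note that array_intersect($a, $b, $c) only returns lines that are present in all three files, so a folder that exists at just two of the three sites is not listed. A possible way to catch those too, sketched in Python rather than PHP; the function name and the site labels are hypothetical:

```python
from collections import defaultdict


def shared_entries(listings, minimum=2):
    """listings maps a label (e.g. a site name) to an iterable of lines.

    Returns {entry: sorted list of labels} for every entry that occurs in
    at least `minimum` different listings, compared case-insensitively.
    """
    where = defaultdict(set)   # entry -> labels of the listings containing it
    for label, lines in listings.items():
        for line in lines:
            entry = line.strip().lower()
            if entry:
                where[entry].add(label)
    return {
        entry: sorted(labels)
        for entry, labels in sorted(where.items())
        if len(labels) >= minimum
    }
```

                            Each result shows which sites hold the folder, which is the information needed before checking file dates.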



                            • #15
                              Re: Finding duplicates lines in text files

                              I had a lot of duplicate music files on my PC. To delete them I used a program [hyperlink removed by moderator] called Clone % Remover. It is a useful program for all files, not only music. Maybe it will solve your problem.
                              Good luck!


                              Moderator edit:
                              @Larry81, have you seen the answer two posts up? Do you by any chance know this banned user?
                              Are you aware that you have just signed up and are advertising shareware in your first post? Have you read the forum rules?

                              Last edited by Rems; 23rd August 2009, 09:29.
