
Author Topic: Duplicate file finder?  (Read 3186 times)


comda

    Topic Starter


    Adviser
  • Thanked: 6
    • Yes
  • Experience: Experienced
  • OS: Windows XP
Duplicate file finder?
« on: February 22, 2018, 03:09:31 PM »
Greetings CH!

I have a TON of files I need to delete. The issue is, I have duplicates on duplicates across multiple drives, etc. I've got a great piece of software on Mac to find these files, but when I search for a Windows duplicate file finder, I get a LOT of suggestions. Because this piece of software will be going through all my personal data, is there a legit, safe, proper one you guys can suggest that you have actually used?

I've got these suggestions, but has anyone actually used them? What's the risk that someone's spying through these things, haha?

https://lifehacker.com/the-best-duplicate-file-finder-for-windows-1696492476

Thanks

DaveLembke



    Sage
  • Thanked: 662
  • Certifications: List
  • Computer: Specs
  • Experience: Expert
  • OS: Windows 10
Re: Duplicate file finder?
« Reply #1 on: February 23, 2018, 08:17:11 AM »
I've always attacked this sort of problem with batch scripts.

Do you only want the newest or do you want every different date/time stamped file of same name, and do you need to maintain any specific folder structure for files to adhere to?
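
(For illustration, a minimal sketch of the "keep only the newest copy of each filename" policy -- written in Python rather than batch, since that's the language used later in this thread. The root path is a placeholder, and nothing is deleted; it only reports.)

Code: [Select]
import os
from collections import defaultdict

ROOT = r"C:\path\to\search"            # placeholder root directory

by_name = defaultdict(list)            # filename -> [(mtime, full path), ...]
for dirpath, subdirs, files in os.walk(ROOT):
    for name in files:
        full = os.path.join(dirpath, name)
        by_name[name].append((os.path.getmtime(full), full))

for name, copies in by_name.items():
    if len(copies) > 1:
        copies.sort(reverse=True)          # newest (largest mtime) first
        print("Keep:", copies[0][1])
        for mtime, path in copies[1:]:
            print("  older copy:", path)   # report only; nothing is deleted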

Mark.



    Adviser
  • Forum Regular
  • Thanked: 67
    • Yes
  • Certifications: List
  • Computer: Specs
  • Experience: Experienced
  • OS: Windows 10
Re: Duplicate file finder?
« Reply #2 on: February 23, 2018, 02:04:41 PM »
I've always been as scared as all *censored* with these dup file finders and the trust users put into them.
Using them blindly seems like a recipe for disaster in my books.
All I can suggest is that you have a great backup regime in place before proceeding. :)

patio

  • Moderator


  • Genius
  • Maud' Dib
  • Thanked: 1769
    • Yes
  • Experience: Beginner
  • OS: Windows 7
Re: Duplicate file finder?
« Reply #3 on: February 23, 2018, 02:58:02 PM »
If you write the script properly so it ignores any Win system files, it should be fine...

That being stated, I agree with Mark...dup file finders can bork a perfectly fine PC.
" Anyone who goes to a psychiatrist should have his head examined. "

DaveLembke



    Sage
  • Thanked: 662
  • Certifications: List
  • Computer: Specs
  • Experience: Expert
  • OS: Windows 10
Re: Duplicate file finder?
« Reply #4 on: February 24, 2018, 05:21:38 AM »
I agree with Patio and Mark, and that's another reason why scripting, such as through batch control, is best: you can see exactly how it will behave and have total control. However, as Mark stated, it's good to make sure you have a full backup of your data on a different drive before you run a script that would delete duplicates.
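
As an aside, a bare-bones sketch of that "full backup on a different drive first" step in Python (the paths are placeholders; robocopy or a proper imaging tool would do the same job):

Code: [Select]
import shutil

SOURCE = r"D:\Data"                # placeholder: the folder you are about to dedupe
BACKUP = r"E:\Backup\Data"         # placeholder: destination on a different drive

# dirs_exist_ok=True (Python 3.8+) lets the copy refresh an existing backup folder
shutil.copytree(SOURCE, BACKUP, dirs_exist_ok=True)
print("Backup finished:", BACKUP)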

comda

    Topic Starter


    Adviser
  • Thanked: 6
    • Yes
  • Experience: Experienced
  • OS: Windows XP
Re: Duplicate file finder?
« Reply #5 on: March 01, 2018, 11:04:39 AM »
Alright. So what now then? I'm not talented enough to create my own batch file. The only safe thing I can think of doing is clearing one hard drive the hard way, going through and finding stuff by hand, and then literally creating a dumping ground where I start moving stuff and, each time Windows says the file already exists, simply delete it.
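
(For illustration only, this is roughly what that "dumping ground" approach would look like if automated in Python. The folder paths are placeholders, and note that it treats a matching NAME as a duplicate, not matching contents, which is exactly the kind of blind deletion the replies above warn about.)

Code: [Select]
import os
import shutil

SOURCE = r"D:\OldDrive"            # placeholder: drive/folder being cleared out
DUMP = r"E:\DumpingGround"         # placeholder: single destination folder

os.makedirs(DUMP, exist_ok=True)
for dirpath, subdirs, files in os.walk(SOURCE):
    for name in files:
        src = os.path.join(dirpath, name)
        dst = os.path.join(DUMP, name)
        if os.path.exists(dst):            # a file with this name is already in the dump
            print("Duplicate name, deleting:", src)
            os.remove(src)
        else:
            shutil.move(src, dst)          # first file seen with this name: move it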

nil

  • Global Moderator


  • Intermediate
  • Thanked: 15
    • Experience: Experienced
    • OS: Linux variant
    Re: Duplicate file finder?
    « Reply #6 on: August 11, 2018, 01:35:53 PM »
    I'm a little late to the party here.. but I wanted to share a solution I came up with to deal with this problem myself, in case it can still be useful to you.

    This Python script will compare all files located in or beneath a root directory. If more than one file has the same contents, it reports them as duplicates.

    Here's the code:

    Code: [Select]
    # for https://www.computerhope.com/forum/index.php/topic,164994.msg973770.html#msg973770

    import sys
    import argparse
    import os
    import hashlib

    if sys.version_info[0] != 3:                   
        print("This script requires Python 3.")
        sys.exit(1)

    # Arguments

    parser = argparse.ArgumentParser(description='Compare contents of all files from a root directory, and report duplicates.')
    parser.add_argument('searchroot', help="Path of the directory where the search/comparison should begin.")
    parser.add_argument('-o', '--outputfile', help="Output the report to a file, OUTPUTFILE. If this option is omitted, the report is printed to standard output.", default=None, required=False)
    args = parser.parse_args()

    if not os.path.isdir(args.searchroot):
        print("SEARCHROOT (", args.searchroot, ") is not a directory. Exiting.")
        sys.exit(1)

    # Functions

    def hashfile(path, blocksize = 65536):   # arbitrary sane blocksize (64 KB)
        """ Compute the MD5 hash of a file, located at path.
            Later, the hash digest will be used to quickly compare the contents of files.
        """
        hashobj = hashlib.new('md5')                 # create new md5 hash object
        try:
            with open(path, 'rb') as fd:             # open file in read-binary mode
                try:                                     
                    buf = fd.read(blocksize)         # read the first chunk of the file
                    while len(buf) > 0:              # do until we're out of data:
                        hashobj.update(buf)          # append buf to the md5 hash obj
                        buf = fd.read(blocksize)     # read next chunk and loop
                except:
                    print ("Could not read ", path, ", skipping") # e.g. permission denied
                    return None
        except:
            print ("Could not open ", path, ", skipping")   # e.g. NTFS Junction is walked, but throws an error on open()
            return None
        return hashobj.hexdigest()          # hash the data and return the digest
       
    def build_master_hash(root_folder):     # root_folder is root of the search
        """ Build a master list of md5 hashes of every file encountered.
            Concurrently build a second list containing the duplicates only.
        """
        masterHashList = {}                 # {'hash': ['/path/to/file', ... ], ... }
        dupes_only = {}                     # same, but only the duplicates
        for dirpath, subdirs, file_list in os.walk(root_folder, followlinks=False):           
            print('Processing directory: %s' % dirpath)
            for filename in file_list:
                path = os.path.join(dirpath, filename)     # get full path
                if os.path.isfile(path):                   # operate on regular files only
                    hash = hashfile(path)                  # hash file contents
                    if hash is not None:                   # proceed only if we got a hash
                        if hash in masterHashList:             # if we've seen it before
                            masterHashList[hash].append(path)  # append path info to that hash
                            dupes_only[hash] = masterHashList[hash] # and add to dupes
                        else:                                  # otherwise
                            masterHashList[hash] = [path]      # add new hash & filename
        return masterHashList, dupes_only

    def generate_report(dupes, outputFile=None):
        """ Generate human-readable list of duplicates. """
        report = ""                                # this will be a long text file
        for dupe in dupes:                         # for each entry in dupes
            report += "--\r\n"                     # add a separator line and CRLF
            for filename in dupes[dupe]:           # for each filename associated with a hash
                report += str(filename) + "\r\n"   # add it to the report
        totalString = "\r\nTotal files with duplicates: " + str(len(dupes)) + "\r\n"
        report += totalString                      # finally, include a total
        if outputFile is not None:                 # if the -o/--outputfile option was specified
            with open(outputFile, 'wt') as fd:
                print(totalString, "\r\nWriting report to: ", outputFile)
                fd.write(report)                   # (over)write the report to that file
        else:                                      # otherwise
            print(report)                          # print it to the screen (standard output)

    # Start
       
    print ("hi")                                       

    master, dupes = build_master_hash(args.searchroot) # Recursively hash contents of files starting at searchroot
                                                       # NOTE: omit trailing backslash in DOS paths
    generate_report(dupes, args.outputfile)  # generate report, write to file or screen
                                             # NOTE: if output file exists, it will be overwritten
    sys.exit()

    # End

    To use this program, first install Python 3 if you don't already have it. You can get it at: https://www.python.org/downloads/

    When installing, select the option to add Python to your PATH environment variable so you can run python from any directory.

    Copy the above code and paste it into a text file (using Notepad, for instance), and save it as find-dupes.py.

    Run it with: python find-dupes.py "C:\path\to be\searched" -o outputfile.txt

    It will generate a report and write it to outputfile.txt. If you omit -o outputfile, it will print the report to the screen instead.

    I've tested it on Windows 10, and it appears to work correctly.





    Note that this only identifies duplicates; it doesn't modify any of the files it finds. (Figuring out which file to keep is more difficult to automate, but I'm going to have to figure out a solution for that part -- probably, for each set of duplicates, asking the user which one to keep.)
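
    A rough sketch of that interactive step (this is just an idea, not part of the posted script -- it assumes the dupes dictionary that build_master_hash() above returns, and it DOES delete files):

    Code: [Select]
    import os

    def prompt_and_delete(dupes):
        """ For each set of duplicates, ask which copy to keep and delete the rest. """
        for hash in dupes:
            paths = dupes[hash]
            print("\nDuplicate set:")
            for i, path in enumerate(paths):
                print(" ", i, path)
            choice = input("Number of the file to KEEP (or 's' to skip): ")
            if not choice.isdigit() or int(choice) >= len(paths):
                print("Skipping this set.")
                continue
            keep = int(choice)
            for i, path in enumerate(paths):
                if i != keep:
                    print("Deleting", path)
                    os.remove(path)

    # e.g. call prompt_and_delete(dupes) after build_master_hash()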

    Another limitation is that all the files to be compared have to be under a single search root, which in practice means on the same drive. This suits my purpose, so I left it this way, but if you need to run it across more than one drive (compare all of C:\ to all of D:\, for example), it could be rewritten to take several roots -- searchroot1 searchroot2, and so on. Let me know.
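
    If anyone wants to try that, a minimal sketch of the changes (an untested fragment, meant to replace the single searchroot argument and the two calls at the bottom of the script above):

    Code: [Select]
    # accept one or more roots instead of a single searchroot
    parser.add_argument('searchroots', nargs='+',
                        help="One or more directories where the search/comparison should begin.")

    # ...later, scan each root and merge the results before reporting...
    masterHashList = {}
    dupes_only = {}
    for root in args.searchroots:
        m, d = build_master_hash(root)           # per-root scan, as in the script above
        for h, paths in m.items():
            if h in masterHashList:              # hash already seen under another root
                masterHashList[h].extend(paths)
            else:
                masterHashList[h] = paths
            if len(masterHashList[h]) > 1:       # duplicate within or across roots
                dupes_only[h] = masterHashList[h]
    generate_report(dupes_only, args.outputfile)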

    Hope this helps --
    « Last Edit: August 11, 2018, 01:59:25 PM by nil »
    Do not communicate by sharing memory; instead, share memory by communicating.

    --Effective Go

    patio

    • Moderator


    • Genius
    • Maud' Dib
    • Thanked: 1769
      • Yes
    • Experience: Beginner
    • OS: Windows 7
    Re: Duplicate file finder?
    « Reply #7 on: August 11, 2018, 05:34:14 PM »
    Still dangerous as it doesn't exclude Win system files as stated above...
    " Anyone who goes to a psychiatrist should have his head examined. "

    nil

    • Global Moderator


    • Intermediate
    • Thanked: 15
      • Experience: Experienced
      • OS: Linux variant
      Re: Duplicate file finder?
      « Reply #8 on: August 11, 2018, 06:13:45 PM »
      There is zero danger. No files are modified. Any directory can be excluded from the search.
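
      For example, a small tweak to the os.walk() loop in build_master_hash() would prune anything you never want scanned, such as the Windows folder (the folder names below are just examples):

      Code: [Select]
      EXCLUDE = {r"c:\windows", r"c:\program files"}     # example exclude list: lower-case full paths

      for dirpath, subdirs, file_list in os.walk(root_folder, followlinks=False):
          # prune excluded directories in place so os.walk never descends into them
          subdirs[:] = [d for d in subdirs
                        if os.path.join(dirpath, d).lower() not in EXCLUDE]
          # ... rest of the loop body unchanged ...
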
      Do not communicate by sharing memory; instead, share memory by communicating.

      --Effective Go