
Author Topic: Duplicate file finder?


comda

    Topic Starter


    Adviser
  • Thanked: 6
    • Yes
  • Experience: Experienced
  • OS: Windows XP
Duplicate file finder?
« on: February 22, 2018, 03:09:31 PM »
Greetings CH!

I have a TON of files I need to delete. Issue is, I have duplicates of duplicates across multiple drives. I've got a great piece of software on Mac to find these files, but when I search for a Windows duplicate file finder, I get a LOT of suggestions. Because this software will be going through all my personal data, is there a legit, safe, proper one you can suggest that you guys have actually used?

I've got these suggestions, but has anyone actually used them? What's the risk that someone's spying through these things, haha?

https://lifehacker.com/the-best-duplicate-file-finder-for-windows-1696492476

Thanks

DaveLembke



    Sage
  • Inventor of the Magna-Broom 3000 =)
  • Thanked: 596
  • Certifications: List
  • Computer: Specs
  • Experience: Expert
  • OS: Windows 7
Re: Duplicate file finder?
« Reply #1 on: February 23, 2018, 08:17:11 AM »
I've always attacked this sort of problem with batch scripts.

Do you only want the newest, or every differently date/time-stamped file of the same name? And do you need to maintain any specific folder structure for the files?

Mark.



    Adviser
  • Forum Regular
  • Thanked: 58
    • Yes
  • Certifications: List
  • Computer: Specs
  • Experience: Experienced
  • OS: Windows 10
Re: Duplicate file finder?
« Reply #2 on: February 23, 2018, 02:04:41 PM »
I've always been as scared as all *censored* with these dup file finders and the trust users put into them.
Using them blindly seems like a recipe for disaster in my books.
All I can suggest is to have a great backup regime in place before proceeding. :)

patio

  • Moderator


  • Genius
  • Maud' Dib
  • Thanked: 1682
    • Yes
  • Experience: Beginner
  • OS: Windows 7
Re: Duplicate file finder?
« Reply #3 on: February 23, 2018, 02:58:02 PM »
If you write the script properly so it ignores any Win system files, it should be fine...

That being stated, I agree with Mark... dup file finders can bork a perfectly fine PC.
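For what it's worth, the "ignore Win system files" part can be built into the directory walk itself. A minimal Python sketch (the excluded directory names here are just examples, not a complete list):

```python
import os

# Directories we never want to scan; these names are examples, adjust as needed.
EXCLUDED_DIRS = {'windows', 'program files', 'program files (x86)', '$recycle.bin'}

def safe_walk(root):
    """os.walk, but prune excluded directories so they are never descended into."""
    for dirpath, subdirs, files in os.walk(root):
        # Editing subdirs in place tells os.walk not to recurse into those folders.
        subdirs[:] = [d for d in subdirs if d.lower() not in EXCLUDED_DIRS]
        yield dirpath, subdirs, files
```

Any duplicate-finding loop built on `safe_walk` instead of plain `os.walk` will then never touch the system folders.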
   
 
" Anyone who goes to a psychiatrist should have his head examined. "

DaveLembke



    Sage
  • Inventor of the Magna-Broom 3000 =)
  • Thanked: 596
  • Certifications: List
  • Computer: Specs
  • Experience: Expert
  • OS: Windows 7
Re: Duplicate file finder?
« Reply #4 on: February 24, 2018, 05:21:38 AM »
I agree with Patio and Mark, and that's another reason why scripting, such as through batch, is best: you can see exactly how it will behave and have total control. However, as Mark stated, it's good to make sure you have a full backup of your data on a different drive before you run a script that would delete duplicates.

comda

    Topic Starter


    Adviser
  • Thanked: 6
    • Yes
  • Experience: Experienced
  • OS: Windows XP
Re: Duplicate file finder?
« Reply #5 on: March 01, 2018, 11:04:39 AM »
Alright, so what now then? I'm not talented enough to create my own batch file. The only safe thing I can think of doing is clearing one hard drive the hard way, going through and finding stuff, and then literally creating a dumping ground: I start moving stuff, and each time Windows says the file already exists, I simply delete it.
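That dumping-ground idea can actually be scripted fairly safely. A rough Python sketch (the folder paths are placeholders) that moves everything into one folder and, on a name collision, deletes the source file only when the contents really match:

```python
import os
import shutil
import filecmp

def dump_into(src_dir, dump_dir):
    """Move every file under src_dir into dump_dir. On a name collision,
    delete the source only if its contents match the file already there."""
    os.makedirs(dump_dir, exist_ok=True)
    for dirpath, _, files in os.walk(src_dir):
        for name in files:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dump_dir, name)
            if not os.path.exists(dst):
                shutil.move(src, dst)              # first copy wins
            elif filecmp.cmp(src, dst, shallow=False):
                os.remove(src)                     # true duplicate, safe to drop
            # else: same name, different contents -- leave it for manual review
```

The content check matters: two files named `IMG_0001.jpg` can be completely different photos, so deleting on name alone (as the manual approach would) could lose data. The dump folder must not be inside `src_dir`.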

nil

  • Global Moderator


  • Beginner

    Thanked: 3
    • Experience: Experienced
    • OS: Linux variant
    Re: Duplicate file finder?
    « Reply #6 on: August 11, 2018, 01:35:53 PM »
    I'm a little late to the party here, but I wanted to share a solution I came up with to deal with this problem myself, in case it's still useful to you.

    This Python script will compare all files located in or beneath a root directory. If more than one file has the same contents, it reports them as duplicates.

    Here's the code:

    Code: [Select]
# for https://www.computerhope.com/forum/index.php/topic,164994.msg973770.html#msg973770

import sys
import argparse
import os
import hashlib

if sys.version_info[0] != 3:
    print("This script requires Python 3.")
    sys.exit(1)

# Arguments

parser = argparse.ArgumentParser(description='Compare contents of all files from a root directory, and report duplicates.')
parser.add_argument('searchroot', help="Path of the directory where the search/comparison should begin.")
parser.add_argument('-o', '--outputfile', help="Output the report to a file, OUTPUTFILE. If this option is omitted, the report is printed to standard output.", default=None, required=False)
args = parser.parse_args()

if not os.path.isdir(args.searchroot):
    print("SEARCHROOT (", args.searchroot, ") is not a directory. Exiting.")
    sys.exit(1)

# Functions

def hashfile(path, blocksize=65536):    # 64 KiB chunks: an arbitrary sane blocksize
    """ Compute the MD5 hash of the file located at path.
        The hash digest is used to quickly compare the contents of files.
    """
    hashobj = hashlib.new('md5')                 # create a new md5 hash object
    try:
        with open(path, 'rb') as fd:             # open file in read-binary mode
            try:
                buf = fd.read(blocksize)         # read the first chunk of the file
                while len(buf) > 0:              # until we're out of data:
                    hashobj.update(buf)          # feed buf to the md5 hash object
                    buf = fd.read(blocksize)     # read the next chunk and loop
            except OSError:
                print("Could not read", path, "- skipping")  # e.g. permission denied
                return None
    except OSError:
        print("Could not open", path, "- skipping")  # e.g. an NTFS junction that errors on open()
        return None
    return hashobj.hexdigest()          # return the hex digest of the hashed data

def build_master_hash(root_folder):     # root_folder is the root of the search
    """ Build a master dict of MD5 hashes of every file encountered.
        Concurrently build a second dict containing the duplicates only.
    """
    masterHashList = {}                 # {'hash': ['/path/to/file', ...], ...}
    dupes_only = {}                     # same, but only the duplicates
    for dirpath, subdirs, file_list in os.walk(root_folder, followlinks=False):
        print('Processing directory: %s' % dirpath)
        for filename in file_list:
            path = os.path.join(dirpath, filename)     # get the full path
            if os.path.isfile(path):                   # operate on regular files only
                digest = hashfile(path)                # hash the file contents
                if digest is None:                     # skip unreadable files
                    continue
                if digest in masterHashList:               # if we've seen it before
                    masterHashList[digest].append(path)    # append this path to that hash
                    dupes_only[digest] = masterHashList[digest]  # and add to dupes
                else:                                      # otherwise
                    masterHashList[digest] = [path]        # record the new hash & filename
    return masterHashList, dupes_only

def generate_report(dupes, outputFile=None):
    """ Generate a human-readable list of duplicates. """
    report = ""                                    # this will be a long text file
    for digest in dupes:                           # for each entry in dupes
        report += "--\n"                           # add a separator line
        for filename in dupes[digest]:             # for each filename with this hash
            report += str(filename) + "\n"         # add it to the report
    totalString = "\nTotal files with duplicates: " + str(len(dupes)) + "\n"
    report += totalString                          # finally, include a total
    if outputFile is not None:                     # if -o/--outputfile was specified
        with open(outputFile, 'wt') as fd:         # text mode handles line endings
            print(totalString, "\nWriting report to:", outputFile)
            fd.write(report)                       # (over)write the report to that file
    else:                                          # otherwise
        print(report)                              # print it to the screen (standard output)

# Start

master, dupes = build_master_hash(args.searchroot) # recursively hash file contents starting at searchroot
                                                   # NOTE: omit trailing backslash in DOS paths
generate_report(dupes, args.outputfile)  # generate the report, write to file or screen
                                         # NOTE: if the output file exists, it will be overwritten
sys.exit()

# End

    To use this program, first install Python 3 if you don't have it installed. You can get it at: https://www.python.org/downloads/

    When installing, select the option to add Python to your PATH environment variable so you can run python from any directory.

    Copy the above code and paste it into a text file (using Notepad, for instance), and save it as find-dupes.py.

    Run it with: python find-dupes.py "C:\path\to be\searched" -o outputfile.txt

    It will generate a report and write it to outputfile.txt. If you omit -o outputfile, it will print the report to the screen instead.

    I've tested it on Windows 10 and it appears to work correctly.





    Note that this only identifies duplicates; it doesn't modify any of the files it finds. (Figuring out which file to keep is harder to automate, but I'm going to have to figure out a solution for that part -- probably, for each group, ask the user which one to keep.)
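    A rough sketch of that "ask the user which one to keep" step might look like this (a hypothetical follow-on, not part of the script above; `dupes` is the same {hash: [paths]} dict the script builds):

```python
def choose_keepers(dupes):
    """For each group of duplicate paths, ask the user which one to keep.
    Returns the list of paths the user chose NOT to keep (deletion candidates)."""
    to_delete = []
    for digest, paths in dupes.items():
        print("\nDuplicate group:")
        for i, p in enumerate(paths):
            print("  [%d] %s" % (i, p))
        choice = input("Number of the file to KEEP (blank = keep all): ").strip()
        if choice.isdigit() and int(choice) < len(paths):
            keep = int(choice)
            # everything in the group except the chosen file becomes a candidate
            to_delete.extend(p for i, p in enumerate(paths) if i != keep)
    return to_delete
```

    Returning a list of candidates (rather than deleting inside the loop) keeps the scary part reviewable: the caller can print the list one last time before any os.remove happens.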

    Another limitation is that all the files to be compared have to be under the same search root (so effectively on the same drive). This suits my purpose, so I left it this way, but if you need to run this on more than one drive (compare all of C:\ to all of D:\, for example) it could be rewritten that way, where you provide searchroot1 searchroot2, etc. Let me know.
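    For reference, accepting several roots would mostly be an argparse change (a sketch; `nargs='+'` collects one or more positional arguments into a list):

```python
import argparse

def parse_roots(argv=None):
    """Parse one or more search roots from the command line."""
    parser = argparse.ArgumentParser(
        description='Report duplicate files under one or more root directories.')
    parser.add_argument('searchroots', nargs='+',
                        help="One or more directories to search.")
    return parser.parse_args(argv)
```

    Each root would then be fed through the same walk-and-hash loop, merging into one shared hash table, so duplicates that live on different drives are caught too.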

    Hope this helps --
    « Last Edit: August 11, 2018, 01:59:25 PM by nil »

    patio

    • Moderator


    • Genius
    • Maud' Dib
    • Thanked: 1682
      • Yes
    • Experience: Beginner
    • OS: Windows 7
    Re: Duplicate file finder?
    « Reply #7 on: August 11, 2018, 05:34:14 PM »
    Still dangerous, as it doesn't exclude Win system files, as stated above...
       
     
    " Anyone who goes to a psychiatrist should have his head examined. "

    nil

    • Global Moderator


    • Beginner

      Thanked: 3
      • Experience: Experienced
      • OS: Linux variant
      Re: Duplicate file finder?
      « Reply #8 on: August 11, 2018, 06:13:45 PM »
      There is zero danger. No files are modified. Any directory can be excluded from the search.