
Author Topic: Merging External Hard Drives together batch question  (Read 3538 times)


DaveLembke

    Topic Starter


    Sage
  • Thanked: 662
  • Experience: Expert
  • OS: Windows 10
Merging External Hard Drives together batch question
« on: September 15, 2017, 11:09:17 AM »
Big Mess...

So I got a 4TB external drive, and I was going to just create folders such as 80GB, 160GB, 500GB, 1TB, and 1.5TB and xcopy the contents of those five external hard drives over to the 4TB drive. But then I got thinking that many of these drives hold lots of redundant data that would waste space on the 4TB; together the drives would consume up to 3.24 TB. Maybe there is a way to merge the data into one place and keep only one copy of each file that shares the same name and date/time stamp, while still copying over the older-dated versions of those files to the 4TB as well.

Windows is good, when copying, at detecting duplicates and prompting you when they are found, asking what you want to do, such as replacing the file of one date/time with the other or keeping both copies. But the bigger mess is that I have the same files scattered among various paths, and I really only need one copy of each file name at each date/time stamp; the paths aren't important.

Looking online I found that CCleaner has a duplicate finder, which I never knew about. But it requires a very manual approach, deciding what to do with each and every duplicate it finds at a given date/time stamp. I'm looking for an automatic method.

Checking here to see if there is a nifty batch or other script method someone can point me to that will copy only one copy of each file name at each date/time stamp to a destination drive. So if I have 12 files that all share the same name but only 3 different date/time stamps among them, it would copy just those 3 differently stamped files of the 12 to the destination drive, regardless of path.

To me, if it was to be done in batch, it would maybe involve a dynamic exclusion list file with rules that somehow know that a file of that name and date/time stamp has already been copied, so xcopy ignores copying another instance of it from the source drive to the destination. But I'm not sure how to build an exclusion list that controls xcopy so it only copies a single instance of files that can share a file name across many date/time stamps and non-specific (wildcard) paths. The destination path doesn't need to be kept for duplicates, so it's OK if branches of the tree aren't replicated from the source drives to the destination drive.
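Something along these lines is what I am picturing. It is only a rough sketch, in Python rather than batch since the bookkeeping gets awkward in batch, and the drive letters and folder names are just placeholders for my setup, but the "seen" set plays the role of that dynamic exclusion list:

Code: [Select]
import os
import shutil

# Placeholder paths - substitute the real source drives and the 4TB destination.
SOURCES = ["E:\\", "F:\\"]
DEST = r"T:\merged"

# The "exclusion list": (file name, modified time) pairs that have already been copied.
seen = set()

for src in SOURCES:
    for root, dirs, files in os.walk(src):
        for name in files:
            full = os.path.join(root, name)
            key = (name.lower(), int(os.path.getmtime(full)))
            if key in seen:
                continue                        # same name and date/time stamp already copied once
            seen.add(key)
            rel = os.path.relpath(full, src)    # keep the folder branch of the first instance found
            target = os.path.join(DEST, rel)
            if os.path.exists(target):          # same name and branch but a different date/time stamp
                base, ext = os.path.splitext(target)
                target = f"{base}.{key[1]}{ext}"
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(full, target)          # copy2 preserves the date/time stamp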

Two of my external source drives I know have heavy redundancy, because I wrote a program in a mixture of C and system calls to the command shell that I used for backing up CD-R and DVD-R data discs: hit a key and it would create a folder named 1, 2, 3, 4, and so on, and xcopy the entire contents of the disc to the external drive under whichever folder number it was on. Many backups were copied to the external drives that way, so some projects have many files with the same name and various date/time stamps. Whereas many people would be happy with just the latest version of whatever file it is, I like having every file of the same name at every date/time stamp. That way, if I decide to pick a project back up from an earlier version of the source code and take it in a different direction, I can use that version as the base to build on rather than the newest version, which may be completely reworked code that doesn't apply well to the original approach, since lots would need to be chopped out and declarations removed just to baseline it for a different branch of the code. That is the best explanation I have of why I want all the different date/time stamps of same-named files.

The original source drives I am going to put into storage as read-only archives for future use, and the 4TB will be a single drive used as the one place to find all my data dating back to the late 1980s and early 1990s. Some of it is even data from 5.25" disks, from years ago when I threw away trash bags full of floppies and burned it all to a 650MB CD-R in the late 1990s to save space, and because CD-R discs hold up to age better than magnetic storage. But no form of media is proof against dying of age, so I have constantly rolled my backups forward onto newer means of storage, and kept the data in multiple places, so that if, say, that 80GB external died tomorrow, the data is likely also on the 160GB drive that replaced it as the main drive years ago, back when 160GB used to be a lot of space  ::)

Currently I am better with naming conventions for my folders and projects, trying to avoid identical file names and adding numeric version indicators such as file001, file002, or file09152017. That avoids like file names, and it is also way easier when I want to find a specific version of a file. So I have improved on my past sloppiness with my data, but right now I just have a pile of data in various paths with lots of unneeded redundancy, and I figured I'd check here to see if anyone has a good script or ideas on how to achieve what I am trying to do.

Feeling kind of like there might be an easier method than my current idea, which would require exporting a DIR dump, having a script keep one line for each file name at each unique date/time stamp regardless of path, and then running XCOPY down through that list, executing an individual copy to the destination for each line. That way only one instance gets copied where the file names and date/time stamps match, but every differing date/time stamp still makes it to the destination.
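Roughly, that two-pass idea might look like this sketch (Python again just to show the logic, with made-up paths). Pass one builds the list with one line per unique file name and date/time stamp; pass two runs down the list and copies each entry, tacking the time stamp onto the name so nothing collides in the flat destination:

Code: [Select]
import os
import shutil

SOURCES = ["E:\\", "F:\\"]          # placeholder source drives
DEST = r"T:\merged"
LIST_FILE = r"T:\copy_list.txt"     # the "DIR dump", boiled down to one line per unique file

# Pass 1: keep the first path seen for each (file name, date/time stamp) pair.
unique = {}
for src in SOURCES:
    for root, dirs, files in os.walk(src):
        for name in files:
            full = os.path.join(root, name)
            unique.setdefault((name.lower(), int(os.path.getmtime(full))), full)

with open(LIST_FILE, "w", encoding="utf-8") as out:
    for path in unique.values():
        out.write(path + "\n")

# Pass 2: copy each listed file, flattening the paths.
os.makedirs(DEST, exist_ok=True)
with open(LIST_FILE, encoding="utf-8") as listed:
    for line in listed:
        src_path = line.rstrip("\n")
        base, ext = os.path.splitext(os.path.basename(src_path))
        stamp = int(os.path.getmtime(src_path))
        shutil.copy2(src_path, os.path.join(DEST, f"{base}.{stamp}{ext}"))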

Or maybe there is even a tool out there, other than batch and other than xcopy, that can do this and that I am unaware of. I am thinking there must be others out there who also have a mess of data thrown into one place, or multiple places, with unnecessary redundancy, like having 12 files of the same name and date/time stamp on the same drive when only 1 is needed, etc.  :P

Even if it meant copying all the data to the 4TB drive via xcopy and then running a batch or other script that kills off duplicates with the same file name and the same date/time stamp, that would work too, as a simpler approach than detecting duplicates during the copy. After the copying is done it is basically just redundancy garbage cleanup: keep one copy of each unique date/time stamp of each file name, and if the file is located in multiple paths, only the first instance of that file with that date/time stamp is protected. All the others with the same name and date/time stamp get deleted, or tossed into a folder named duplicates which I can look through quickly and then delete on my own, etc.
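That cleanup pass could be sketched roughly like this (Python, placeholder paths). The first instance of each file name and date/time stamp it meets is protected; everything else gets parked in a duplicates folder for me to eyeball and delete:

Code: [Select]
import os
import shutil

DRIVE = "T:\\"                  # the 4TB drive, after everything has been xcopied over
DUPES = r"T:\duplicates"        # duplicates get moved here instead of deleted outright

seen = set()
moved = 0
os.makedirs(DUPES, exist_ok=True)

for root, dirs, files in os.walk(DRIVE):
    dirs[:] = [d for d in dirs if os.path.join(root, d) != DUPES]   # don't rescan the dupe folder
    for name in files:
        full = os.path.join(root, name)
        key = (name.lower(), int(os.path.getmtime(full)))
        if key not in seen:
            seen.add(key)       # first instance of this name and date/time stamp is protected
        else:
            moved += 1          # later instance: park it under a unique name for manual review
            shutil.move(full, os.path.join(DUPES, f"{moved:06d}_{name}"))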


BC_Programmer


    Mastermind
  • Typing is no substitute for thinking.
  • Thanked: 1140
    • BC-Programming.com
  • Experience: Beginner
  • OS: Windows 11
Re: Merging External Hard Drives together batch question
« Reply #1 on: September 15, 2017, 12:09:00 PM »
I'm sure it's possible in batch (which I have to say because, no matter how impossible I think something is, somebody like Salmon Trout shows up with a batch file), but I think the files being in different paths might complicate things a bit.

Presuming I had a similar task (I don't, but it's fun to pretend) I would probably have written a program to do it.

The program would effectively be given a source directory, a target directory, and a list of other directories. I'm thinking of the pseudocode as such:

if there are other directories specified
    for each file in all of those other directories
        calculate a unique hash for this file (in your case this would be the filename and the last-modified date, but it could of course also be based on the content of the file)
        store that unique value in a HashSet; if it already exists, don't try to add it (an existing entry is a collision, which would mean the file is duplicated within the specified "other" folders)
        (it would also be possible to index additional information, such as the path of that file, by using a HashMap rather than a HashSet)
now, process all the files in the specified drive
    for every file, calculate a hash in the same way as above
    if that hash already exists in the Map/Set
        then we know this file is already in one of those other folders, and we skip it
        otherwise, copy the file to the target directory, preserving the path (e.g. E:\some\folder\place\foundfile.txt gets copied to T:\500GB as T:\500GB\some\folder\place\foundfile.txt)
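To make that a little more concrete, here is a rough sketch of the same idea in Python. It is purely illustrative: the directory names are made up, and I'm using the filename plus last-modified time as the "hash", per your case, though a content hash would slot in the same way:

Code: [Select]
import os
import shutil

def file_key(path):
    # The "hash" used for duplicate detection: file name plus last-modified time.
    return (os.path.basename(path).lower(), int(os.path.getmtime(path)))

def merge(source_dir, target_dir, other_dirs):
    seen = set()
    # Index every file already present in the "other" directories.
    for other in other_dirs:
        for root, dirs, files in os.walk(other):
            for name in files:
                seen.add(file_key(os.path.join(root, name)))
    # Process the source drive: skip anything whose key was already indexed.
    for root, dirs, files in os.walk(source_dir):
        for name in files:
            full = os.path.join(root, name)
            if file_key(full) in seen:
                continue                                   # already in one of the other folders
            rel = os.path.relpath(full, source_dir)        # preserve the path under the target
            dest = os.path.join(target_dir, rel)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copy2(full, dest)

# e.g. after xcopying the first drive: merge("F:\\", r"T:\160GB", [r"T:\80GB"])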



In your case it would mean using xcopy to copy the first drive, then using this imagined program for each subsequent copy, passing the folders already copied so far as the "other" folders. Each new copy would basically ignore files that were already present in those other directories.

It would also be possible to do something fancy: instead of doing nothing, it could create a hard link to the file that was found in the other folders. This would result in the directories all preserving exactly what they looked like on their original drives, but all duplicated files (based on the hash) would share the same data and not take up any additional space. Or it could write up a "report" for each target, listing the file paths that weren't copied and where they already existed.
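The hard-link flavour would be a small change to that sketch: remember where each key's data already lives on the target volume and link duplicates back to it instead of skipping them. Again just illustrative, and it only works because the "other" folders and the target sit on the same NTFS volume; os.link can't span volumes.

Code: [Select]
import os
import shutil

def merge_linking(source_dir, target_dir, other_dirs):
    first_copy = {}     # key -> path of the file on the target volume that already holds the data
    for other in other_dirs:
        for root, dirs, files in os.walk(other):
            for name in files:
                full = os.path.join(root, name)
                first_copy.setdefault((name.lower(), int(os.path.getmtime(full))), full)

    for root, dirs, files in os.walk(source_dir):
        for name in files:
            full = os.path.join(root, name)
            key = (name.lower(), int(os.path.getmtime(full)))
            dest = os.path.join(target_dir, os.path.relpath(full, source_dir))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            if key in first_copy:
                os.link(first_copy[key], dest)    # second name for the same data, no extra space
            else:
                shutil.copy2(full, dest)
                first_copy[key] = dest            # later duplicates can link to this copy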


       
   
I was trying to dereference Null Pointers before it was cool.

DaveLembke

    Topic Starter


    Sage
  • Thanked: 662
  • Experience: Expert
  • OS: Windows 10
Re: Merging External Hard Drives together batch question
« Reply #2 on: September 15, 2017, 01:47:47 PM »
Thanks BC for your insight on this. I also like that you offered several directions different from my hypothetical (but possible) ones. The hard link was an interesting one: one file, with links at the multiple paths that all point to the one file of that name and date/time stamp.

But it got me thinking... aren't hard and soft links mapped on the single system they were created from? If the external drive is plugged into a different computer and the data is accessed directly over USB, would it work out the same? Also, if other computers assign the 4TB drive a different drive letter, I'm not sure whether the links would break, or whether they resolve relative to the root of the external drive regardless of the assigned letter, so they travel with the drive and work on any modern Windows-based computer.  :-\

Quote
It would also be possible to do something fancy: instead of doing nothing, it could create a hard link to the file that was found in the other folders. This would result in the directories all preserving exactly what they looked like on their original drives, but all duplicated files (based on the hash) would share the same data and not take up any additional space. Or it could write up a "report" for each target, listing the file paths that weren't copied and where they already existed.

BC_Programmer


    Mastermind
  • Typing is no substitute for thinking.
  • Thanked: 1140
    • BC-Programming.com
  • Experience: Beginner
  • OS: Windows 11
Re: Merging External Hard Drives together batch question
« Reply #3 on: September 15, 2017, 02:19:28 PM »
Hard links work via NTFS. Basically, an NTFS file has an MFT entry that says what the file is named and points at where the data for the file is. Hard links are when more than one MFT entry points at the same data. Of course there are caveats: changing one changes all the other "files" pointing at that data, for obvious reasons, though that might be a desirable side effect anyway.
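A quick way to see it in action from Python on any NTFS volume (made-up path, just for illustration; run it once, since os.link errors out if the second name already exists):

Code: [Select]
import os

folder = r"T:\link_demo"                  # hypothetical folder on an NTFS-formatted drive
os.makedirs(folder, exist_ok=True)

a = os.path.join(folder, "report.txt")
b = os.path.join(folder, "report_link.txt")

with open(a, "w") as f:
    f.write("original contents\n")

os.link(a, b)                             # add a second MFT entry pointing at the same data

with open(a, "a") as f:
    f.write("appended via the first name\n")

print(open(b).read())                     # the append is visible through the second name too
print(os.stat(a).st_nlink)                # -> 2: two names, one set of data on disk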

I was trying to dereference Null Pointers before it was cool.

DaveLembke

    Topic Starter


    Sage
  • Thanked: 662
  • Certifications: List
  • Computer: Specs
  • Experience: Expert
  • OS: Windows 10
Re: Merging External Hard Drives together batch question
« Reply #4 on: September 16, 2017, 08:11:46 PM »
Thanks for your clarification on how hard links work, BC. I always thought the links were bound to the system that created them and not part of the master file table; while the MFT has an index of all the data, I assumed the computer the drive connects to kept track of hard and soft links.

I think I am going to just manually go through, target the larger redundant files, and work my way down from there using CCleaner's duplicate finder feature. CCleaner can restrict the search to duplicates greater than a size I specify; I set it to greater than 100MB and found duplicates of Linux ISOs and the like. I don't need six copies of Linux Mint 9 LTS.

I was messing around with some ideas and then put a stop to them when I realized that for each duplicate my routine was going to have to search through the entire drive with every pass and its slow and very inefficient. CCleaner has a export to text file for duplicates and I was going to use this as a cheat sheet for my program to know what and where to target duplicates but realized that I probably should just manually pick away at this, as well as there is some data I am finding that I really dont need to keep around anymore so I am trashing some stuff entirely that I dont have a purpose for such as Open Office installer Version 3.2.1 so I guess I will pick away at it.  ;D