Help With WinVis: Finding and eliminating duplicate files

lodger Humble Narrator Registered User regular
edited December 2009 in Games and Technology
My hard drive is clogged, and a big part of the problem is redundant files (documents, pictures, music) in multiple copies spread across different folders, or copies of folders nested Russian-doll style within the same folder. Is there any utility that will let me run one big search to identify and corral all files that have an identical twin elsewhere on the drive?

Posts

  • Kris_xK Registered User regular
    edited December 2009
    Try the Technology Forum, they'll be more useful than us nerds.

  • RSP Registered User regular
    edited December 2009
    I figure this is a generally useful thing, so I went ahead and wrote a primitive implementation in Python. If you have a Windows machine, you'll probably need to install Python 2.6 from http://www.python.org . If you're on Linux or a Mac, you likely already have it.
    import os
    import hashlib
    hashdict={} #maps {hexdigest : path of the first file seen with that digest}
    duplicates=open("results.txt","w") #the log; only written to in 'log' mode
    mode=raw_input("log findings to a text file, or ask about removing duplicates as I go? type either 'log' or 'ask':\n> ")
    def dirsearch(thisdir):
    	#walk thisdir recursively, hashing every file and reporting repeated digests
    	for something in os.listdir(thisdir):
    		something=os.path.join(thisdir,something)
    		print "reading ",something
    		if os.path.isfile(something):
    			#binary mode: text mode on Windows translates newlines and can stop at a Ctrl-Z byte
    			currentfile=open(something,"rb")
    			currentmd5sum=hashlib.md5(currentfile.read()).hexdigest()
    			currentfile.close()
    			if currentmd5sum not in hashdict:
    				hashdict[currentmd5sum]=something
    			else:
    				somethingelse=hashdict[currentmd5sum]
    				if mode=="log":
    					duplicates.write("match found: "+something+" hashes to same digest as "+somethingelse+"\n")
    				elif mode=="ask":
    					whichone=raw_input("\n\n\nidentical hash found:\n\nfile A: "+something+"\n\nfile B: "+somethingelse+"\n\nType 'a' or 'A' to remove A, 'b' or 'B' to remove B, or just hit enter to leave both alone.\n> ")
    					#lower() has to be called; comparing the method itself to a string is always False
    					whichone=whichone.strip().lower()
    					if whichone=="a":
    						os.remove(something)
    					elif whichone=="b":
    						os.remove(somethingelse)
    						hashdict[currentmd5sum]=something #A is now the surviving copy
    		elif os.path.isdir(something):
    			dirsearch(something)
    if __name__=="__main__":
    	dirsearch(os.getcwd())
    	duplicates.close()
    

    Once you've gotten Python, which is a pretty handy thing to have overall, just copy and paste this into a text file, name it "whateveryouwant.py", and double-click to run it. It treats the folder it's located in as the top level of the search and only looks at that folder and everything beneath it, so put the script in the folder you want to clean up. Running it from your desktop, for example, won't do much.
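
    The script above reads every file into memory in one go and hashes all of them, which can be slow on a drive full of music and pictures. As a rough sketch of one way around that, assuming Python 3 rather than the 2.6 mentioned above, files can be grouped by size first and hashed in chunks only when another file shares their size. The names `file_digest` and `find_duplicates` and the 1 MB chunk size are made up for illustration and are not part of the original script.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks to bound memory use."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:  # binary mode, so bytes are hashed unmodified
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def find_duplicates(top):
    """Return a list of groups; each group is a sorted list of paths with identical digests."""
    # Pass 1: bucket regular files by size, which costs no file reads.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with at least one other file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have an identical twin
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups.extend(sorted(group) for group in by_digest.values() if len(group) > 1)
    return groups
```

    Calling `find_duplicates(os.getcwd())` would search the same folder the script above does. It only reports groups of matching files and leaves any deleting to you.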

    edit: a word of caution.

    Be careful with this. If you tell it to delete a file, it will do exactly that: permanently delete it and free up HDD space. It won't move it to the Recycle Bin or anything like that, so you don't really have a recovery option. I don't suggest running it from the root directory, as there are probably identical but important system files there. Keep it within your user folders, e.g. C:\Documents and Settings\Me or C:\Users\Me.
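
    If permanent deletion feels too risky, one option is to move a duplicate into a holding folder instead of calling `os.remove` on it, then empty that folder by hand once you're sure nothing important went missing. This is only a sketch, assuming Python 3; the `quarantine` function and the `quarantine_dir` argument are hypothetical names, not something the script above provides.

```python
import os
import shutil


def quarantine(path, quarantine_dir):
    """Move `path` into `quarantine_dir` instead of deleting it; return the new location."""
    os.makedirs(quarantine_dir, exist_ok=True)
    base = os.path.basename(path)
    stem, ext = os.path.splitext(base)
    target = os.path.join(quarantine_dir, base)
    counter = 1
    # Duplicates often share a file name, so add a numeric suffix rather than overwrite.
    while os.path.exists(target):
        target = os.path.join(quarantine_dir, "%s_%d%s" % (stem, counter, ext))
        counter += 1
    shutil.move(path, target)
    return target
```

    The holding folder should sit outside the folder being searched, otherwise the next run will find the quarantined copies and report them as duplicates again. Moving files to another folder on the same drive also frees no space until that folder is emptied.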
