Wastholm.com


Got Duplicates?

posted Sep '07 by peter

Isn't it annoying when you have a directory (or, worse, an entire directory tree) full of files and you suspect there are duplicates here and there — two or more files with the exact same content — silently wasting your precious disk space, but you don't know which files are identical to which?

Well, I thought so anyway, so I whipped up this little Perl script a couple of years ago. Only recently, it occurred to me that maybe someone else somewhere sometimes has this same problem.

So here it is, for your downloading enjoyment: dups. The script currently uses MD5 hashes to determine whether or not two files are identical, so it requires Digest::MD5.
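To make the hashing idea concrete, here is a minimal sketch in Python rather than Perl. It is not the dups script itself, and the function names (md5_of_file, group_by_digest, duplicate_groups) are made up for illustration. It only shows the general technique: hash every file, then treat files that share a digest as presumed duplicates.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat even for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(paths):
    """Map each MD5 digest to the list of paths that produced it."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(Path(path))
    return dict(groups)


def duplicate_groups(paths):
    """Return only the groups that contain two or more files."""
    return [group for group in group_by_digest(paths).values() if len(group) > 1]
```

Calling duplicate_groups on a list of file paths gives back one list per set of files that share a digest. How the real script stores its hashes internally is not something this sketch claims to know.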

By default, dups scans the current working directory and prints out identical files in pairs, one pair per line, e.g. foo.txt == bar.txt, but you can give an arbitrary number of names of directories and/or files to have it check those instead.

There are also a couple of switches you can play with. -a, for instance, makes it also check files whose names begin with a period (when you're checking a directory; file names given on the command line will always be checked). -r tells it to scan directory trees recursively (i.e., check subdirectories, and their subdirectories, and so on). -u prints out all unique files found, instead of the pairs mentioned earlier. And there are a couple more. If you get confused, just do dups -h and the script will helpfully tell you all the switches it understands.
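For the curious, the file selection rules behind -a and -r could look roughly like the Python sketch below. This is an illustration only, not the script's actual code. The function collect_files and its parameters include_hidden and recursive are hypothetical names, and the choice to also skip hidden subdirectories unless include_hidden is set is my assumption, since the description above only talks about files.

```python
import os


def collect_files(targets, include_hidden=False, recursive=False):
    """Expand a mix of files and directories into a sorted list of file paths."""
    found = []
    for target in targets:
        if os.path.isfile(target):
            # Files named explicitly are always checked, hidden or not.
            found.append(target)
            continue
        if not os.path.isdir(target):
            continue  # silently skip anything that is neither file nor directory
        for root, dirs, files in os.walk(target):
            if not include_hidden:
                # Assumption: hidden subdirectories are skipped along with hidden files.
                dirs[:] = [d for d in dirs if not d.startswith(".")]
                files = [f for f in files if not f.startswith(".")]
            if not recursive:
                dirs[:] = []  # emptying dirs stops os.walk from descending
            found.extend(os.path.join(root, name) for name in files)
    return sorted(found)
```

With no arguments to the flags, collect_files(["."]) would list only the visible files directly inside the current directory, which matches the default behaviour described above.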

An obvious improvement would be to have the script actually do something with the duplicates (or unique files) it finds — soft- or hardlink duplicates to each other, simply remove duplicates, or something along those lines. Another might be a "picky mode," in which the script would use diff to make absolutely sure that two different files didn't just happen to get the same MD5 hash.
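A "picky mode" along those lines might look something like this Python sketch. Note that it swaps in a byte-by-byte comparison from Python's standard filecmp module instead of calling diff, and that confirm_identical and confirmed_pairs are hypothetical names of mine. Nothing like this exists in the script today.

```python
import filecmp


def confirm_identical(first, second):
    """Compare two files byte by byte, regardless of any hash they share."""
    # shallow=False forces a content comparison instead of trusting os.stat() data.
    return filecmp.cmp(first, second, shallow=False)


def confirmed_pairs(candidate_pairs):
    """Keep only the candidate pairs whose contents really are the same."""
    return [(a, b) for a, b in candidate_pairs if confirm_identical(a, b)]
```

The idea would be to run this only on pairs that already share an MD5 hash, so the slower full comparison is limited to files that are very likely identical anyway.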

Send your ideas, bug reports, thanks, complaints etc. this way.
