Kjetil's Information Center: A Blog About My Projects

Duplicate File Remover

Here is a long Bourne shell one-liner I hacked together to (re)move duplicate files from a directory structure:

mkdir "./duplicate" && find . -type f -exec md5sum {} \; | \
sort | uniq -D -w 32 | awk '{ print $1, length, $2 }' | \
sort -k1,1 -k2,2n | awk '($1 == x) { print $1, $3 } ($1 != x) { x = $1 }' | \
cut -b 33- | xargs -I {} mv -v {} "./duplicate"

I could have used Perl or Python, but that would not be as fun or challenging!

The weirdest part is the use of awk to print the length of the line in between the MD5 sum and the filename. This is needed so that the sort can put the files with the longest paths last within each group of identical MD5 sums, which in turn makes sure that the files deeper down in the directory structure are the ones that get moved. The second awk command drops the first line in each series of identical MD5 sums, because one copy of each file must of course remain in the directory structure! Note that both awk commands split their input on whitespace, so filenames containing spaces are not handled correctly.
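
The two ideas above (sort by line length within each checksum, then skip the first line of every group) can also be sketched as a small function that tolerates spaces in filenames. This is only a rough sketch, not the one-liner itself: the name list_extra_copies is made up for illustration, it assumes GNU md5sum's "checksum, two spaces, path" output, and it does not cope with newlines or backslashes in filenames.

```shell
# list_extra_copies (hypothetical helper): read md5sum output on stdin and
# print every path except the shortest one in each group of equal checksums.
list_extra_copies() {
  awk '{ print length($0), $0 }' |   # prepend the line length as a sort key
    sort -k2,2 -k1,1n |              # group by checksum, shortest line first
    awk '{ sum = $2 "" }             # force a string comparison
         sum == prev { sub(/^[0-9]+ [0-9a-f]+  /, ""); print }
         { prev = sum }'
}

# Small demonstration with made-up checksums:
printf '%s\n' \
  'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  ./a/b/my copy.txt' \
  'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  ./orig.txt' \
  'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb  ./unique.txt' \
  'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  ./a/copy.txt' |
  list_extra_copies
# prints ./a/copy.txt and then ./a/b/my copy.txt

# Possible use on a real tree (not run here):
#   find . -type f -exec md5sum {} + | list_extra_copies |
#     while IFS= read -r path; do mv -v -- "$path" ./duplicate/; done
```

Because unique checksums form a group of one line, they are skipped automatically, so the uniq step is not needed in this variant.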

I prefer to move (mv) the files instead of actually removing (rm) them, so everything can be double checked afterwards.
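
One way to do that double check is sketched below, under a few assumptions: the function name check_moved_copies is invented for this example, the duplicate directory is passed the way find prints it (for example ./duplicate), and GNU md5sum and mktemp are available. It lists every moved file whose checksum no longer occurs anywhere else in the tree, so empty output suggests that each moved file still has a twin left behind.

```shell
# check_moved_copies (hypothetical helper): print "checksum  path" for each
# file under the duplicate directory that has no remaining copy in the tree.
check_moved_copies() {
  dup_dir="${1:-./duplicate}"
  kept_sums=$(mktemp) || return 1

  # Checksums of everything still in the tree, skipping the duplicate directory.
  find . -path "$dup_dir" -prune -o -type f -exec md5sum {} + |
    cut -c 1-32 | sort -u > "$kept_sums"

  # Load the kept checksums, then print moved files whose checksum is missing.
  find "$dup_dir" -type f -exec md5sum {} + |
    awk -v sums="$kept_sums" '
      BEGIN { while ((getline s < sums) > 0) kept[s] = 1 }
      !($1 in kept)'

  rm -f "$kept_sums"
}

# Possible use from the top of the tree (not run here):
#   check_moved_copies ./duplicate
```

Only when this prints nothing would I consider deleting the duplicate directory for good.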

By the way, this one-liner can be used to remove any remaining empty directories in the structure:

find . -mindepth 1 -type d -print0 | sort -rz | xargs -0 -r rmdir --ignore-fail-on-non-empty


Topic: Scripts and Code, by Kjetil @ 22/02-2010