![]() ESN 15375-090413-708850-86 |
|
Document Name: Removing duplicate files Document Description: Removing duplicate files2009/04/13 This is a Perl script to remove duplicate files. It considers a file to be a duplicate if it has the same name and the same number of bytes. THAT COULD BE A VERY BAD ASSUMPTION. You could do an MD5 sum to be sure the files really are dupes, you could even compare byte by byte, but THIS SCRIPT doesn't do any of that. It's designed to be quick, simple and easy for you to modify if you need something a little different. This script keeps the OLDEST instance of the files. It's simple enough to change it to keep the newest; just reverse the sense of the test. It's possible for two files to have the same age. In that case, the script won't delete either of them unless some other matching file is older. I just dashed this off quickly this morning so CHECK THE RESULTS carefully before you uncomment the "unlink" line. Or leave it in and just redirect this to a list for manual review and removal. That's the safest way if you see anything odd in the list, don't remove it until you are sure it's OK to do so. It would be easy enough to modify this to have it ignore files older than a certain age; you might do that to avoid deleting important system files. You could exclude files owned by root, or add any other conditions that make sense to you. This works from the current directory down. I'd warn you to be very careful running this from "/" without adding extra restrictions. A SCRIPT LIKE THIS COULD BE VERY DANGEROUS. Don't use it without understanding it or without testing it. Note: the use of the "sprintf" is to avoid the possibility of a file name that happens to be numeric and somehow matches the the size and name of another file. That's extremely unlikely, but that's why it's there. If you Google for "delete duplicate files" you will find lots of scripts and programs for this purpose. One of them may be exactly what you need. See Remove Duplicate Files for a version that does use MD5 Digest.
#!/usr/bin/perl
# dupekill - Tony Lawrence, http://aplawrence.com/Unixart/remove_duplicate_files.html
# Purpose: kill duplicate files
# Keep oldest version, check name and file size
#
# Feel free to copy this, modify it, use it - with or without credit
# WARNING: this is potentially very dangerous. Test, understand, test.
# See the web page above for enhancements and more warnings
use File::Find;
use File::Basename;
sub walking {
$size= -s _;
$age=(stat(_))[9];
$name=basename($File::Find::name);
$key=sprintf("%.12d%s",$size,$name);
$myfiles{$key}=$age if not $myfiles{$key};
$myfiles{$key}=$age if ($myfiles{$key} > $age);
$counts{$key}++;
}
sub killing {
$size= -s _;
$age=(stat(_))[9];
$name=basename($File::Find::name);
$realname=$File::Find::name;
$key=sprintf("%.12d%s",$size,$name);
return if $counts{$key} < 2;
$date=scalar localtime($age);
if (($myfiles{$key} == $age)) {
#print "Keeping $realname $age $date\n";
# uncomment above for testing
return;
}
# you could do more tests here like MD5::Digest or even
# a byte by byte comparison
# or exclude files owned by root, over a certain age, whatever
#
push @killem, $realname;
# if you waqnt to actually reference $realname here, use /$realname
# because File::Find changes directories as it walks
}
find (\&walking, '.');
find (\&killing, '.');
foreach(@killem) {
print "$_\n";
#unlink($_);
# uncomment above line to actually remove files.
}
Author: Anthony Lawrence - Contact Author Publisher: Anthony Lawrence Licensee Name: Anthony Lawrence Reference URL: http://aplawrence.com/Unixart/remove_duplicate_files.html Copyright: All Rights Reserved Registration Date: 4/13/2009 10:25:56 AM UTC Views: 1027 |
