Fancy file systems such as ZFS or WAFL have dedupe options to reduce the disk space taken up by recurring files. These file systems detect identical data and keep only one copy of it, even if the same file is present in multiple folders. Most of these file systems keep huge tables in memory to speed up this process.
For example, the backup I take from my desktop and the backup from my wife's desktop will contain 90% the same family photos. This takes up extra space on the NAS device that I would rather use for more useful stuff.
I prefer not to use WAFL (not free) or ZFS (not in the kernel) because of these memory requirements and/or license issues. Instead I use rdfind on a regular XFS filesystem on RHEL7, and I run it from a cronjob every week. True, rdfind is not as good as ZFS, but I'm not running a fast-access file server holding many copies of the same documents.
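The weekly run could be sketched as a small script dropped into /etc/cron.weekly/. This is only a sketch under assumptions: the script name, the /usr/local/bin install path (the usual default for `make install`) and the choice of `-makehardlinks true` as the action are mine, not taken from the setup described here.

```shell
#!/bin/sh
# Hypothetical /etc/cron.weekly/rdfind-backup (the file name is an assumption).
# Assumed defaults; override through the environment if your layout differs.
RDFIND="${RDFIND:-/usr/local/bin/rdfind}"
BACKUP_DIR="${BACKUP_DIR:-/srv/Backup}"

weekly_dedupe() {
    if [ ! -x "$RDFIND" ]; then
        echo "rdfind not found at $RDFIND, skipping" >&2
        return 1
    fi
    if [ ! -d "$BACKUP_DIR" ]; then
        echo "backup directory $BACKUP_DIR missing, skipping" >&2
        return 1
    fi
    # Replace duplicates with hard links (assumed action). rdfind also
    # writes results.txt into the current working directory.
    "$RDFIND" -makehardlinks true "$BACKUP_DIR"
}

weekly_dedupe || echo "weekly dedupe did not run" >&2
```

The script has to be executable (chmod +x) before cron will pick it up. Hard links only work within one filesystem, which is the case for a single backup volume.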
No rdfind packages are available for CentOS, so I built it from source with these instructions:
    yum install gcc-c++ nettle-devel
    wget http://rdfind.pauldreik.se/rdfind-1.3.4.tar.gz
    tar -xzvf rdfind-1.3.4.tar.gz
    cd rdfind-1.3.4
    ./configure
    make
    make install
Once complete you can give it a test run like this:
    [kevin@storage rdfind-1.3.4]$ rdfind -dryrun true /srv/Backup/
    (DRYRUN MODE) Now scanning "/srv/Backup", found 75016 files.
    (DRYRUN MODE) Now have 75016 files in total.
    (DRYRUN MODE) Removed 0 files due to nonunique device and inode.
    (DRYRUN MODE) Now removing files with zero size from list...removed 560 files
    (DRYRUN MODE) Total size is 83971044262 bytes or 78 Gib
    (DRYRUN MODE) Now sorting on size:removed 14087 files due to unique sizes from list.60369 files left.
    (DRYRUN MODE) Now eliminating candidates based on first bytes:removed 21542 files from list.38827 files left.
    (DRYRUN MODE) Now eliminating candidates based on last bytes:removed 1839 files from list.36988 files left.
    (DRYRUN MODE) Now eliminating candidates based on md5 checksum:removed 1532 files from list.35456 files left.
    (DRYRUN MODE) It seems like you have 35456 files that are not unique
    (DRYRUN MODE) Totally, 6 Gib can be reduced.
    (DRYRUN MODE) Now making results file results.txt
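Remove the -dryrun true option to run it for real. Note that by default rdfind only writes results.txt; add an action such as -makehardlinks true or -deleteduplicates true to actually reclaim the space.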
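The dry-run output shows how rdfind narrows the list down: drop empty files, drop files with a unique size, then compare what is left. The idea can be sketched in a few lines of shell. This is a simplified illustration, not rdfind's actual implementation: it skips the device/inode check and the first-bytes and last-bytes steps and goes straight from size to MD5. The function name `list_duplicates` is made up, and the sketch assumes GNU find, awk, xargs, md5sum and uniq, plus file names without tabs or newlines.

```shell
#!/bin/bash
# Sketch only: print "md5  path" for files that share both size and checksum.
list_duplicates() {
    local dir="$1"
    # Step 1: emit "size<TAB>path" for every non-empty regular file.
    find "$dir" -type f ! -empty -printf '%s\t%p\n' |
    # Step 2: keep only the paths whose size occurs more than once.
    awk -F'\t' '{ size[NR] = $1; path[NR] = $2; count[$1]++ }
                END { for (i = 1; i <= NR; i++)
                          if (count[size[i]] > 1) print path[i] }' |
    # Step 3: checksum the remaining candidates (-r: do nothing on empty input).
    tr '\n' '\0' | xargs -0 -r md5sum |
    # Step 4: sort by checksum and show only lines whose first 32
    # characters (the MD5 digest) repeat, one blank line between groups.
    sort | uniq -w32 --all-repeated=separate
}
```

Calling `list_duplicates /srv/Backup` only prints the groups and changes nothing on disk. Because the inode check is skipped, files that are already hard-linked to each other will still show up as duplicates.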