Recently I was trying to clean up my backup disks and realized I had a mess. At first I started to upload everything to Dropbox, but soon I realized I had lots of files in 2, 3, sometimes even 4 copies. I looked for a tool that would let me locate the duplicate files and found rdfind.

It is nice. It first walks the directory structure I give it on the command line and then starts to filter the files.

  • First it will eliminate all the files that have a unique size
  • Then it will compare the first few bytes of the files that have the same size
  • Then it will compare the last few bytes of the files that have the same size
  • Finally, among the remaining candidates it will compare their checksums using md5, sha1, or sha256
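
The staged filtering described above can be sketched in a few lines of C++. This is only an illustration, not rdfind's actual code: the FakeFile struct holds the content in memory instead of reading from disk, the probe length of 4 bytes is an assumption, and std::hash stands in for a real checksum such as sha1. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a file on disk.
struct FakeFile {
  std::string name;
  std::string content;
};

// One filtering stage: maps a file to the value compared in that stage.
using Stage = std::function<std::string(const FakeFile&)>;

const std::size_t kProbe = 4; // assumed length of the "first/last few bytes"

std::string sizeOf(const FakeFile& f) { return std::to_string(f.content.size()); }
std::string firstBytes(const FakeFile& f) { return f.content.substr(0, kProbe); }
std::string lastBytes(const FakeFile& f) {
  const std::size_t n = f.content.size();
  return f.content.substr(n > kProbe ? n - kProbe : 0);
}
// Stand-in for md5/sha1/sha256: std::hash is NOT a cryptographic checksum.
std::string fullHash(const FakeFile& f) {
  return std::to_string(std::hash<std::string>{}(f.content));
}

// Run the stages in order. After each stage, drop every file whose
// accumulated key (all stage values so far) is shared with no other file.
std::vector<std::string> duplicateCandidates(std::vector<FakeFile> files,
                                             const std::vector<Stage>& stages) {
  std::vector<std::string> keys(files.size());
  for (const Stage& stage : stages) {
    assert(keys.size() == files.size()); // one accumulated key per file
    std::map<std::string, int> seen;
    for (std::size_t i = 0; i < files.size(); ++i) {
      keys[i] += stage(files[i]) + "\n";
      ++seen[keys[i]];
    }
    std::vector<FakeFile> keptFiles;
    std::vector<std::string> keptKeys;
    for (std::size_t i = 0; i < files.size(); ++i) {
      if (seen[keys[i]] > 1) { // at least one other file looks the same so far
        keptFiles.push_back(files[i]);
        keptKeys.push_back(keys[i]);
      }
    }
    files.swap(keptFiles);
    keys.swap(keptKeys);
  }
  std::vector<std::string> names;
  for (const FakeFile& f : files) names.push_back(f.name);
  return names;
}
```

Calling duplicateCandidates with the stages sizeOf, firstBytes, lastBytes and fullHash mirrors the four steps. Leaving out fullHash skips the expensive full read, at the price that two files of equal size that differ only somewhere in the middle would still be reported as candidates.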

The problem I had is that I have many huge files. Quite a few are larger than 1 GB and many are over 10 GB. I have about 2 TB of data.

It takes a lot of time to go over all the files. Computing the checksums is especially slow because every candidate file has to be read in full.

To add insult to injury, many of these files are still located on slow-to-access external disks.

I thought it would be better if I could skip the checksum step, but there was no such flag.

So what could I do?

First I opened a ticket explaining my issue, but then I thought, what if I try to implement it?

I cloned the source code. It seems to be written in C++. I've never written C++, and even with C I have very little experience. Looking for build instructions, I ran:

grep release *

and found the release_new_version.txt that contained the instructions:

git clean -xdf .
./bootstrap.sh
./configure
make dist-gzip

The first problem I encountered was a missing package. I had to install it:

sudo apt-get install nettle-dev

Then I bumped into the following error when running make:

$ make
make  all-am
make[1]: Entering directory '/home/gabor/work/rdfind'
g++ -DHAVE_CONFIG_H -I.     -g -O2 -MT rdfind.o -MD -MP -MF .deps/rdfind.Tpo -c -o rdfind.o rdfind.cc
rdfind.cc: In function ‘Options parseOptions(Parser&)’:
rdfind.cc:225:30: error: ‘numeric_limits’ is not a member of ‘std’
  225 |     o.maximumfilesize = std::numeric_limits<decltype(o.maximumfilesize)>::max();
      |                              ^~~~~~~~~~~~~~
rdfind.cc:225:45: error: expected primary-expression before ‘decltype’
  225 |     o.maximumfilesize = std::numeric_limits<decltype(o.maximumfilesize)>::max();
      |                                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~
make[1]: *** [Makefile:658: rdfind.o] Error 1
make[1]: Leaving directory '/home/gabor/work/rdfind'
make: *** [Makefile:537: all] Error 2

Luckily a quick search for the error message brought me to this page. Not surprisingly, I was not the first one to see this error. Better yet, apparently it had already been fixed in a branch.

So using git branch -a I checked and saw that there is a branch called devel. I switched to that branch and ran the whole process again. This time make worked and it created a file called rdfind in the root directory of the project.

Now that I knew I could compile and build the executable, I started to look at the code.

First I wanted to find where it processes the command-line parameters so I would be able to add a parameter called -nochecksum.

I searched for the string "argv" as that is the name of the variable that holds the command-line parameters in many languages. Quite soon I found that it is in the rdfind.cc file.
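
For readers who, like me, are new to C++: a program receives its command-line parameters as an array of C strings passed to main. The following is a minimal sketch of looking up the value that follows an option such as -checksum. The helper optionValue is hypothetical and is not how rdfind parses its options (it has its own Parser class, visible in the diff below).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not rdfind's parser: return the value that follows
// the option `name` in argv, or `fallback` if the option is not present.
std::string optionValue(int argc, const char* argv[], const std::string& name,
                        const std::string& fallback) {
  assert(argc >= 0);
  // argv[0] is the program name, so the real arguments start at index 1.
  // Stop one short of the end because an option needs a value after it.
  for (int i = 1; i + 1 < argc; ++i) {
    if (name == argv[i]) {
      return argv[i + 1];
    }
  }
  return fallback;
}
```

With the arguments rdfind -checksum none /backup this helper would return "none" for the option -checksum, and the fallback for any option that was not given.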

examples/rdfind-checksum.diff

commit d3c6a523bf531fe4bf46c8b13397b111b3fb6634
Author: Gabor Szabo <gabor@szabgab.com>
Date:   Thu Sep 15 14:37:50 2022 +0300

    allow for '-checksum none' to disable checksum filtering. #118

diff --git a/rdfind.cc b/rdfind.cc
index 64dd8f6..cd35314 100644
--- a/rdfind.cc
+++ b/rdfind.cc
@@ -61,7 +61,7 @@ usage()
     << " -followsymlinks    true |(false) follow symlinks\n"
     << " -removeidentinode (true)| false  ignore files with nonunique "
        "device and inode\n"
-    << " -checksum           md5 |(sha1)| sha256\n"
+    << " -checksum           md5 |(sha1)| sha256 | none\n"
     << "                                  checksum type\n"
     << " -deterministic    (true)| false  makes results independent of order\n"
     << "                                  from listing the filesystem\n"
@@ -103,6 +103,7 @@ struct Options
   bool followsymlinks = false;        // follow symlinks
   bool dryrun = false;                // only dryrun, don't destroy anything
   bool remove_identical_inode = true; // remove files with identical inodes
+  bool checksum = true;      // use some checksum
   bool usemd5 = false;       // use md5 checksum to check for similarity
   bool usesha1 = false;      // use sha1 checksum to check for similarity
   bool usesha256 = false;    // use sha256 checksum to check for similarity
@@ -174,8 +175,10 @@ parseOptions(Parser& parser)
         o.usesha1 = true;
       } else if (parser.parsed_string_is("sha256")) {
         o.usesha256 = true;
+      } else if (parser.parsed_string_is("none")) {
+        o.checksum = false;
       } else {
-        std::cerr << "expected md5/sha1/sha256, not \""
+        std::cerr << "expected md5/sha1/sha256/none, not \""
                   << parser.get_parsed_string() << "\"\n";
         std::exit(EXIT_FAILURE);
       }
@@ -237,8 +240,10 @@ parseOptions(Parser& parser)
   // done with parsing of options. remaining arguments are files and dirs.
 
   // decide what checksum to use - if no checksum is set, force sha1!
-  if (!o.usemd5 && !o.usesha1 && !o.usesha256) {
-    o.usesha1 = true;
+  if (o.checksum) {
+    if (!o.usemd5 && !o.usesha1 && !o.usesha256) {
+      o.usesha1 = true;
+    }
   }
   return o;
 }
@@ -356,17 +361,19 @@ main(int narg, const char* argv[])
     { Fileinfo::readtobuffermode::READ_FIRST_BYTES, "first bytes" },
     { Fileinfo::readtobuffermode::READ_LAST_BYTES, "last bytes" },
   };
-  if (o.usemd5) {
-    modes.emplace_back(Fileinfo::readtobuffermode::CREATE_MD5_CHECKSUM,
-                       "md5 checksum");
-  }
-  if (o.usesha1) {
-    modes.emplace_back(Fileinfo::readtobuffermode::CREATE_SHA1_CHECKSUM,
-                       "sha1 checksum");
-  }
-  if (o.usesha256) {
-    modes.emplace_back(Fileinfo::readtobuffermode::CREATE_SHA256_CHECKSUM,
-                       "sha256 checksum");
+  if (o.checksum) {
+    if (o.usemd5) {
+      modes.emplace_back(Fileinfo::readtobuffermode::CREATE_MD5_CHECKSUM,
+                         "md5 checksum");
+    }
+    if (o.usesha1) {
+      modes.emplace_back(Fileinfo::readtobuffermode::CREATE_SHA1_CHECKSUM,
+                         "sha1 checksum");
+    }
+    if (o.usesha256) {
+      modes.emplace_back(Fileinfo::readtobuffermode::CREATE_SHA256_CHECKSUM,
+                         "sha256 checksum");
+    }
   }
 
   for (auto it = modes.begin() + 1; it != modes.end(); ++it) {

In the end I felt it was better to change the already existing -checksum flag and allow it to receive the value "none" to skip the checksum calculations.
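
Stripped of the surrounding code, the idea of the patch is that the value of -checksum decides whether a final checksum stage is added after the cheap comparisons. Here is a simplified sketch of that decision. The function stagesFor and the plain strings it returns are my own illustration and do not exist in rdfind, which uses bool flags and the Fileinfo::readtobuffermode enum as shown in the diff.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical simplification of the patched logic: list the comparison
// stages that would run for a given -checksum value.
std::vector<std::string> stagesFor(const std::string& checksum) {
  std::vector<std::string> stages = {"size", "first bytes", "last bytes"};
  if (checksum == "none") {
    return stages; // skip the stage that reads every candidate in full
  }
  if (checksum == "md5" || checksum == "sha1" || checksum == "sha256") {
    stages.push_back(checksum + " checksum");
    assert(stages.size() == 4); // three cheap stages plus one checksum
    return stages;
  }
  throw std::invalid_argument("expected md5/sha1/sha256/none, not \"" +
                              checksum + "\"");
}
```

Reusing the existing flag keeps all the checksum choices in one place, and an unknown value is still rejected with the same kind of error message as before.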

I was lazy and did not add any new tests, but it could be done later if the author of the code requests it.

After making the change I pushed the code to my fork of the project. The GitHub Actions were triggered and I saw that all the tests passed.

I was very pleasantly surprised by how easy it was to find the build instructions and the existing tests, and to make the adjustments.

We'll see if the Pull-Request is accepted.