Deanna Thomas


15.11.2022


Cleaning up a dirty Git history

Sometimes we commit things and don’t notice the consequences. If we’re lucky, we recognize what we’ve done and have the chance to undo it right away. Other times, we can go years without noticing our mistakes and it becomes more and more difficult to clean up the skeletons in our closet. In case I wasn’t clear, I’m talking about Git.

Why would you want to rewrite history in the first place?

Imagine this: you are working on a personal project in a private Git repository. You commit an API key, or perhaps some large images or files used for testing. Hundreds of commits later, you decide to open-source your project. But you can’t simply make it public: the API key would be exposed, and the large files would slow down cloning and take up space on contributors’ local machines.

With each commit, this sensitive or troublesome data was copied again and again into the repository’s history. You can’t simply remove the files, because anyone can revisit the repository at any point in time before the deletion and still see the information. So, what must be done is to go back in time and remove the file from every single commit that ever included it.
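To see why a plain deletion isn’t enough, here is a minimal sketch in a throwaway repository (file names and contents are made up for illustration):

```shell
# A throwaway repo demonstrating that deleting a file in a new commit
# does not remove it from history.
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "demo@example.com"
git config user.name "Demo"

echo "super-secret-token" > api.key
git add api.key
git commit -qm "Add config"

git rm -q api.key
git commit -qm "Remove API key"

# The "deleted" key is still one command away for anyone with the repo:
git show HEAD~1:api.key   # prints: super-secret-token
```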

This is similar to what happened at Tractable. We noticed two of our repositories contained large binary files for our various test cases. This was no problem for us in the past, but as new features were added and test suites became more robust, the repositories (and all their branches) began taking up to 10 minutes to clone, and gigabytes of storage were being taken up on our local machines.

Prepare for trouble

Rewriting Git history can be dangerous: files and changes can be lost permanently, and the rewrite can cause conflicts with local copies. So, the first thing we did was create full copies of our repository to have a reliable backup in case things didn’t go according to plan. Depending on how your organization works, a fork would also be a good option.
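One way to take such a backup is a mirror clone, which copies every ref — all branches and tags — so it can fully restore the original. (The URL below is a placeholder, not the actual repository.)

```shell
# A mirror clone is a complete backup, including every branch and tag.
# The remote URL here is a placeholder for illustration.
git clone --mirror git@github.com:your-org/your-repo.git your-repo-backup.git
```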

Secondly, we let our engineering team know that they could lose their unmerged changes if they were working on this repository between the time the rewrite started and the time it completed. When history is rewritten, the entire Git tree changes, leaving your local copy unable to merge and contribute to the older version.

It’s [unwanted data] clobberin’ time

The first method we tried was Git’s native filter-branch command. This command is used to rewrite Git history, which is exactly what we wanted to do. From the docs:

  • filter-branch lets you rewrite Git revision history by rewriting the branches mentioned in the <rev-list options>, applying custom filters on each revision. Those filters can modify each tree (e.g. removing a file or running a perl rewrite on all files) or information about each commit. Otherwise, all information (including original commit times or merge information) will be preserved.

Okay, great! It sounded like we were on the right track. We prepared a small bash script to run which looked something like this:
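A sketch reconstructed from the breakdown that follows — the directory name tests/assets is an assumption, borrowed from the path we filter later:

```shell
#!/usr/bin/env bash
# Sketch: remove every file in a directory from all of history, one file
# at a time. The directory name is an assumption for illustration.
dir="tests/assets"

for entry in "$dir"/*
do
  git filter-branch -f --index-filter \
    "git rm -rf --cached --ignore-unmatch $entry" HEAD
done
```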

Let’s take a look at what’s going on here:

for entry in "$dir"/*

  • This will get us all the files within the directory in the entry variable

git filter-branch

-f

  • Force the rewrite. From the docs, “Git filter-branch refuses to start with an existing temporary directory or when there are already refs starting with refs/original/, unless forced… The original refs, if different from the rewritten ones, will be stored in the namespace refs/original/.”

--index-filter

  • This is the filter used to rewrite the index of the Git repository, without checking files out. There are other options, like --tree-filter, but that one actually checks out each revision, taking substantially more time to run

"git rm -rf --cached --ignore-unmatch $entry"

  • This is the index-filter we are using. Very simply, we use the Git remove command for each file in the bash for-loop: --cached removes the file from the index only, and --ignore-unmatch keeps the command from failing on commits that don’t contain the file.

It seemed like we had a good solution, so we ran the script. However, it ended up taking over 12 hours, and that was only for one branch! Given that we had almost 50 branches in the repository, this solution wasn’t optimal, so we searched for a way to improve the processing time.

Movin’ on up

Even Git’s authors agree that filter-branch is suboptimal. When the filter-branch command is run, a warning appears in the command prompt:
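In recent versions of Git, that warning reads approximately as follows:

```
WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.
```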

So, as was kindly suggested, we had a look at the git-filter-repo tool. The documentation is extremely thorough and easy to work with, so we were able to come up with an equivalent command fairly quickly:

git filter-repo --invert-paths --path tests/assets/

Yep, that’s it. It’s not much, but let’s dig into the command:

--invert-paths

  • This basically says to “keep all files which do not match the following path”. Instead of removing files one at a time as in the loop above, we can filter an entire path at once.

--path

  • This is the path we want to remove. If there are multiple paths, each one can be added to the command (e.g. --path tests/path1 --path tests/path2, etc.)

Running this command took only about 2 seconds. We were much happier with this solution.

The final countdown

After checking the results, there were two more things we needed to do to complete our clean-up. When checking the Git tags of the branch we had applied filter-repo to, we noticed that the removed files still existed; it seemed that tags were unaffected by this command. Therefore, we needed to apply the change to every commit and to every tag in the repository.

Lastly, we needed to apply the script to every branch in the repository to prevent some sort of merge nightmare, just in case a working branch might merge into the main cleaned up branch.

Our final script ended up looking like this:

Here’s what we have:

  • git clone

    • git filter-repo requires a fresh copy of the repository in order to do its magic. If it senses any kind of local change, it refuses to execute the command (for our own good, we can assume).

  • for b in `git branch -r | grep -v -- '->'`

    • This is a simple loop which goes through all the branches inside super-cool-repo.

  • git checkout ${b##origin/}

    • We strip the origin/ from the branch name that we get with git branch -r and then checkout the branch.

  • git filter-repo

    • We use --invert-paths and --path just as we did in the first version of our command, with every path containing unwanted asset files added to the command.

  • --refs

    • This option specifies the refs that will be rewritten. Because we want to rewrite the refs at every Git tag, we can evaluate $(git tag -l) to specify that we also want to apply this to each tag.

And thus, with our repo containing about 1,400 commits, 1,500 tags, and 50 branches, we were able to clean up about 250 files in a whopping 8 seconds!

Fin~

Cleaning Git history can be frightening; it leaves room for exactly the kinds of errors Git exists to prevent. However, with proper backups and the right tools, it was incredibly fast and easy for us to clean up unwanted data in a long history of commits.