Deanna Thomas


15.11.2022


Cleaning up a dirty Git history

Sometimes we commit things and don’t notice the consequences. If we’re lucky, we recognize what we’ve done and have the chance to undo it right away. Other times, we can go years without noticing our mistakes and it becomes more and more difficult to clean up the skeletons in our closet. In case I wasn’t clear, I’m talking about Git.

Why would you want to rewrite history in the first place?

Imagine this: you are working on a personal project in a private Git repository. You commit an API key, or perhaps some large images or files used for testing. Hundreds of commits later, you decide to open-source your project. But you can’t simply make it public: the API key would be exposed, and the large files would slow down cloning and take up space on contributors’ local machines.

With each commit, this sensitive or troublesome data was copied again and again into the repository’s history. You can’t simply remove the files, because anyone can revisit the repository at any point in time before the deletion and still see the information. So, what must be done is to go back in time and remove the file from every single commit that ever included it.
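To see why a plain deletion isn’t enough, here is a minimal sketch in a throwaway repository (file names and contents are made up for illustration):

```shell
# A throwaway repo demonstrating that deleting a file in a new commit
# does not remove it from history.
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "demo@example.com"
git config user.name "Demo"

echo "super-secret-token" > api.key
git add api.key
git commit -qm "Add config"

git rm -q api.key
git commit -qm "Remove API key"

# The "deleted" key is still one command away for anyone with the repo:
git show HEAD~1:api.key   # prints: super-secret-token
```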

This is similar to what happened at Tractable. We noticed two of our repositories contained large binary files for our various test cases. This was no problem for us in the past, but as new features were added and test suites became more robust, the repositories (and all their branches) began taking up to 10 minutes to clone, and gigabytes of storage were being taken up on our local machines.

Prepare for trouble

Rewriting Git history can be dangerous: files and changes can be lost permanently, and the rewrite can cause conflicts with local copies. So, the first thing we did was create full copies of our repository to have a reliable backup in case things didn’t go according to plan. Depending on how your organization works, a fork would also be a good option.
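One way to take such a backup is a mirror clone, which copies every ref — all branches and tags — so it can fully restore the original. (The URL below is a placeholder, not the actual repository.)

```shell
# A mirror clone is a complete backup, including every branch and tag.
# The remote URL here is a placeholder for illustration.
git clone --mirror git@github.com:your-org/your-repo.git your-repo-backup.git
```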

Secondly, we let our engineering team know that they could lose their unmerged changes if they were working on this repository between the time the rewrite started and the time it completed. When history is rewritten, the entire Git tree changes, leaving your local copy unable to merge and contribute to the older version.

It’s [unwanted data] clobberin’ time

The first method we tried was Git’s native filter-branch command. This command is used to rewrite Git history, which is exactly what we wanted to do. From the docs:

  • filter-branch lets you rewrite Git revision history by rewriting the branches mentioned in the <rev-list options>, applying custom filters on each revision. Those filters can modify each tree (e.g. removing a file or running a perl rewrite on all files) or information about each commit. Otherwise, all information (including original commit times or merge information) will be preserved.

Okay, great! It sounded like we were on the right track. We prepared a small bash script to run which looked something like this:
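A sketch reconstructed from the breakdown that follows — the directory name tests/assets is an assumption, borrowed from the path we filter later:

```shell
#!/usr/bin/env bash
# Sketch: remove every file in a directory from all of history, one file
# at a time. The directory name is an assumption for illustration.
dir="tests/assets"

for entry in "$dir"/*
do
  git filter-branch -f --index-filter \
    "git rm -rf --cached --ignore-unmatch $entry" HEAD
done
```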

Let’s take a look at what’s going on here:

for entry in "$dir"/*

  • This will get us all the files within the directory in the entry variable

git filter-branch

-f

  • Force the rewrite. From the docs, “Git filter-branch refuses to start with an existing temporary directory or when there are already refs starting with refs/original/, unless forced… The original refs, if different from the rewritten ones, will be stored in the namespace refs/original/.”

--index-filter

  • This is the filter used to rewrite the index of the Git repository, without checking files out. There are other options, like --tree-filter, but that one actually checks out each revision, taking substantially more time to run

"git rm -rf --cached --ignore-unmatch $entry"

  • This is the index-filter we are using. Very simply, we use the Git remove command for each file in the bash for-loop: --cached removes the file from the index only, and --ignore-unmatch keeps the command from failing on commits that don’t contain the file.

It seemed like we had a good solution, so we ran the script. However, it ended up taking over 12 hours, and that was only for one branch! Given that we had almost 50 branches in the repository, this solution wasn’t optimal, so we searched for a way to improve the processing time.

Movin’ on up

Even Git’s authors agree that filter-branch is suboptimal. When the filter-branch command is run, a warning appears in the command prompt:
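In recent versions of Git, that warning reads approximately as follows:

```
WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.
```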

So, as was kindly suggested, we had a look at the git-filter-repo tool. The documentation is extremely thorough and easy to work with, so we were able to come up with an equivalent command fairly quickly:

git filter-repo --invert-paths --path tests/assets/

Yep, that’s it. It’s not much, but let’s dig into the command:

--invert-paths

  • This basically says to “keep all files which do not match the following path”. Instead of removing files one at a time as in the loop above, we can filter an entire path at once.

--path

  • This is the path we want to remove. If there are multiple paths, each one can be added to the command (e.g. --path tests/path1 --path tests/path2, etc.)

Running this command took only about 2 seconds. We were much happier with this solution.

The final countdown

After checking the results, there were two more things we needed to do to complete our clean-up. When checking the Git tags of the branch we had applied filter-repo to, we noticed that the removed files still existed; it seemed that tags were unaffected by this command. Therefore, we needed to apply the change to every commit and to every tag in the repository.

Lastly, we needed to apply the script to every branch in the repository to prevent some sort of merge nightmare, just in case a working branch might merge into the main cleaned up branch.

Our final script ended up looking like this:

Here’s what we have:

  • git clone

    • git filter-repo requires a fresh copy of the repository in order to do its magic. If it senses any kind of local change, it refuses to execute the command (for our own good, we can assume).

  • for b in `git branch -r | grep -v -- '->'`

    • This is a simple loop which goes through all the branches inside super-cool-repo.

  • git checkout ${b##origin/}

    • We strip the origin/ from the branch name that we get with git branch -r and then checkout the branch.

  • git filter-repo

    • We use --invert-paths and --path just as we did in the first version of our command, with every path containing unwanted asset files added to the command.

  • --refs

    • This option specifies the refs that will be rewritten. Because we want to rewrite the refs at every Git tag, we can evaluate $(git tag -l) to specify that we also want to apply this to each tag.

And thus, with our repo containing about 1,400 commits, 1,500 tags, and 50 branches, we were able to clean up about 250 files in a whopping 8 seconds!

Fin~

Cleaning Git history can be frightening; it leaves room for exactly the kinds of errors Git exists to prevent. However, with proper backups and the right tools, it was incredibly fast and easy for us to clean up unwanted data in a long history of commits.