Anonymize DOCX Comments

July 7, 2017

Word documents are still a popular format in the localization industry. Translators and reviewers use comments and tracked changes to discuss potential translation issues. These people often come from different companies, and their anonymity has to be preserved. Word has built-in functionality to make documents anonymous, but it’s a blunt tool that wipes all personal information, and you still want to know which comment comes from which person.

At some point in my career I was introduced to a workaround that involves saving the document as RTF, editing it in a text editor (other than Word), and saving it as a Word document again. I can understand how that can be useful to non-technical people. But if you have an engineer around, don’t do it this way; it’s wrong.

Every DOCX file is basically a zip archive that contains a series of XML files. Comments are located in word/comments.xml. The XPath to each comment’s author is //w:comment/@w:author, so it’s trivial to write an app (or even an XSL transformation) that changes all these names for you. As zip is quite a standard format, the app can even unzip the DOCX for you, change that one file, and zip it again. And that’s exactly what my app does.
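To make the idea concrete, here is a minimal sketch in Python (not C), using only the standard library. It is not the code from the GitHub repo: the function names and the “Reviewer N” aliases are assumptions made up for illustration. One caveat: ElementTree re-serializes the XML and keeps only the namespace declarations it sees in use, so a real tool is better off with a namespace-preserving library such as libxml2.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix is bound to in DOCX files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
COMMENTS_PATH = "word/comments.xml"

# Keep the conventional "w" prefix when serializing instead of "ns0".
ET.register_namespace("w", W_NS)


def anonymize_comments_xml(xml_bytes, mapping=None):
    """Replace every w:comment/@w:author with a stable alias.

    The same real name always maps to the same alias, so readers can
    still tell which comments come from the same person.
    """
    mapping = {} if mapping is None else mapping
    root = ET.fromstring(xml_bytes)
    author_attr = f"{{{W_NS}}}author"
    for comment in root.iter(f"{{{W_NS}}}comment"):
        author = comment.get(author_attr)
        if author is None:
            continue
        # Hypothetical alias scheme: "Reviewer 1", "Reviewer 2", ...
        alias = mapping.setdefault(author, f"Reviewer {len(mapping) + 1}")
        comment.set(author_attr, alias)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def anonymize_docx(src, dst, mapping=None):
    """Copy the DOCX at src to dst, rewriting only word/comments.xml.

    src and dst may be paths or binary file-like objects.
    Returns the mapping from real author names to aliases.
    """
    mapping = {} if mapping is None else mapping
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == COMMENTS_PATH:
                data = anonymize_comments_xml(data, mapping)
            # Passing the original ZipInfo keeps the entry name and metadata.
            zout.writestr(item, data)
    return mapping
```

Calling anonymize_docx("review.docx", "review-anon.docx") (hypothetical file names) would write an anonymized copy and return the name-to-alias mapping, which you can keep privately. Note that w:comment elements can also carry a w:initials attribute, which this sketch leaves untouched.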

I wrote it to learn more about C, which is my new love. Previously I had written a similar app in Go, but because Microsoft’s XML files are full of namespaces and Go doesn’t like those too much, I used regex and simple text replacement. You should never edit XML like that, so I’ll keep that version as my embarrassing secret.

You can find the full source code on GitHub. You’ll need libarchive and libxml2 on your system to compile it. As it’s written in C, I doubt anybody will ever use it: people who still bother with C can write something better themselves, or they keep away from MS Office to start with. But it was an amazing learning experience. I learnt how to use two libraries and got familiar with Valgrind, which is a great tool. I learnt that C++ isn’t so different from C while adapting a solution I found on Stack Overflow to my needs. I also found out that even experienced programmers make silly mistakes, as I fixed a memory leak in the dictionary structure I borrowed from “21st Century C, 2nd Edition” by Ben Klemens. I highly recommend this book. The mistake is already fixed in the author’s GitHub repo, but I had used the code from the book, so I got a moment of pride once I had corrected my teacher’s code :)

I know at least one person who’ll look at this code and say “you’re retarded”, but hey, I’m learning. If you want to include this solution in your own app, focus on the comments.c file, which has everything related to word/comments.xml. It should be trivial to rewrite it in any other language.
