Why BOM?

· by Peter · Read in about 3 min · (514 words) ·

The BOM (byte order mark) is a Unicode character, U+FEFF. In UTF-8 it is stored as the three bytes EF BB BF at the very start of the file. In the context of UTF-8, the most important thing about it is that it signals that the file is (most probably) UTF-8 encoded, because there’s no other way to be almost certain; you never get 100% confidence with encodings. Of course, there are heuristic methods which can offer high accuracy. I use Mozilla Universal Charset Detector myself, but it’s still guessing, even if it’s an educated guess.

The basic Latin letters, digits and punctuation all fit into 7-bit ASCII (the first half of the 8-bit table), and UTF-8 encodes them with exactly the same bytes as ASCII does. Hence a computer isn’t able to tell whether a file is ASCII or UTF-8 if it contains only English text, as it’s the same in both encodings. If your editor is “stupid” and you start adding non-ASCII characters to this file, it may want to keep the encoding as ASCII (or one of its extended variants), which will turn any extended characters into rubbish.
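To make the ASCII/UTF-8 overlap concrete, here is a minimal sketch of a check for “is this buffer pure ASCII?”. The function name `is_pure_ascii` is made up for this illustration and doesn’t come from any library.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte is in the 7-bit ASCII range
   (0x00-0x7F). Such data is byte-for-byte identical in ASCII and UTF-8,
   so the bytes alone cannot tell you which encoding was intended. */
bool is_pure_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 0x7F)
            return false;   /* first byte outside ASCII */
    }
    return true;
}
```

For “hello” this returns true. Add a single “é” and it returns false, because UTF-8 stores that letter as the two bytes C3 A9. An editor that reads those two bytes as ISO-8859-1 will show “Ã©” instead, which is exactly the kind of rubbish described above.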

Sometimes there’s a problem even if your file is correctly encoded in UTF-8 but has no BOM, especially with apps like Excel. Try an experiment. Write some text into a file, using as many special characters from some foreign language as you can. Save two versions of this file, both encoded in UTF-8, one with a BOM and the other without. Now open Excel and import both files. See? The one without the BOM was imported with broken special characters.
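If your editor doesn’t let you choose whether to save a BOM, the two files for this experiment can be generated with a few lines of C. This is only a sketch: the sample text, the function names and the file names are all arbitrary choices made for the illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Sample CSV with two accented words ("café", "Zürich"), spelled out as
   explicit UTF-8 bytes so the encoding of this source file doesn't matter. */
static const char SAMPLE[] = "word,city\ncaf\xC3\xA9,Z\xC3\xBCrich\n";

/* Writes the sample to `out`, optionally preceded by the UTF-8 BOM
   (bytes EF BB BF). Returns true on success. */
bool write_sample(FILE *out, bool with_bom)
{
    if (with_bom && fwrite("\xEF\xBB\xBF", 1, 3, out) != 3)
        return false;
    size_t n = sizeof SAMPLE - 1;   /* exclude the terminating NUL */
    return fwrite(SAMPLE, 1, n, out) == n;
}

/* Creates both files in the current directory (names are assumptions). */
bool write_experiment_files(void)
{
    const char *names[2] = { "sample_no_bom.csv", "sample_with_bom.csv" };
    for (int i = 0; i < 2; i++) {
        FILE *f = fopen(names[i], "wb");   /* binary mode: no newline translation */
        if (!f)
            return false;
        bool ok = write_sample(f, i == 1);
        if (fclose(f) != 0 || !ok)
            return false;
    }
    return true;
}
```

Call `write_experiment_files()` once and open the two resulting files in Excel. They differ only in the first three bytes.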

A BOM is by no means a solution to all your encoding problems by itself. Sometimes having a BOM can cause a problem. It’s just important to know which of your apps recognise and require it and which don’t. Otherwise you’ll land in trouble even though your file is indeed fine.

I’m writing about it because it’s not obvious to non-technical users, and even technical ones often don’t pay attention. There are a lot of applications which require files without a BOM, which is fine, as they usually know how to handle such files. The problem arises when somebody takes such a file as-is and uses it in another app which expects a BOM to be there and, when there isn’t one, guesses the encoding incorrectly. This leaves the user confused about what’s wrong, as the input file looks good.

I’ve written a simple library in C which handles your UTF-8 BOM-related tasks for you. It can check whether a BOM is present in a file (a BOM isn’t visible when viewing a file as text in most editors), and add or remove it depending on your needs. If you don’t know what to do with it, hire an engineer!
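To give a rough idea of what checking, adding and removing a BOM involves, here is a small sketch. It is not the code of the library mentioned above. The function name `copy_with_bom_policy` and its parameters are made up for this example.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* U+FEFF encoded in UTF-8. */
static const unsigned char BOM[3] = { 0xEF, 0xBB, 0xBF };

/* Hypothetical helper: copies `in` to `out`.
   - want_bom == true:  the output starts with exactly one BOM
   - want_bom == false: the output starts with no BOM
   If had_bom is non-NULL it reports whether the input started with one.
   Returns true on success. */
bool copy_with_bom_policy(FILE *in, FILE *out, bool want_bom, bool *had_bom)
{
    unsigned char head[3];
    size_t got = fread(head, 1, sizeof head, in);
    bool found = (got == 3 && memcmp(head, BOM, 3) == 0);
    if (had_bom)
        *had_bom = found;

    if (want_bom && fwrite(BOM, 1, 3, out) != 3)
        return false;
    /* If the first bytes were not a BOM, they are ordinary content. */
    if (!found && fwrite(head, 1, got, out) != got)
        return false;

    /* Copy the rest of the file unchanged. */
    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return false;
    }
    return !ferror(in);
}
```

Open both streams in binary mode (`"rb"` and `"wb"`) so that nothing touches the bytes on the way. Writing to a separate output file rather than editing in place keeps the original intact if something goes wrong.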

I’ve taken a lot of shortcuts in this text because I wanted to focus on the BOM itself and why it is important. If you’re interested in Unicode and UTF-8, you should definitely read more about them.

Lastly, a friend of mine wrote a BOM removal tool for me quite some time ago, also in C. I’ve lost it, so I created this one from scratch, based on my current knowledge. Any plagiarism is unintentional ;)