Encoding Detection Revised

Thursday, 26 August 2010 | Dominik Haumann

In recent KDE releases up to version 4.4 Kate unfortunately very often selected the wrong encoding. The result is that e.g. german umlauts (öäü) show up as cryptic signs in the text editor. What I’ve seen lots of times is that in this case people start to fix those characters manually for the entire document. In other words: They totally do not get at all that the text document simply was opened with the wrong encoding. In fact, the users usually do not even know what encoding is at all. While this is of course kind of sad, this certainly won’t change…

Given this fact, the only correct “fix” is a very good automatic encoding detection, such that the encoding is usually chosen correctly. In the rewrite of Kate’s text buffer for KDE 4.5, Christoph also rewrote the file loader including the encoding detection. The detection now works as follows:

try selected encoding by the user (through the open-file-dialog or the console)
try encoding detection (some intelligent trial & error method)
use fallback encoding

In step 1, Kate tries to use the encoding specified in the open-file-dialog or the one given when launching Kate from the console. On success, we are done.

The encoding detection in step 2 first tries unicode encoding by looking for a Byte Order Mark (BOM). If found, it is certain that the text document is unicode encoded. If there is no BOM, Kate next uses a tool from KDElibs (KEncodingProber) to detect the correct encoding. This is basically trial & error: Try encoding A, if there are characters in the document the encoding is not able to represent, try encoding B. Then C and so on… Unfortunately, this also doesn’t always work, because a byte sequence might be valid in several encodings and represent different characters. This is why it’s more or less impossible to get the encoding always right. There is simply no way…

If the encoding detection fails, Kate uses a fallback encoding. You can configure this fallback encoding in the editor component settings in the “Open/Save” category. If the fallback encoding fails as well, the document is marked as read-only and a warning is shown.

What about Kile and KDevelop?

One of the applications that heavily suffered of the wrong encoding detection in the past was the LaTeX editor Kile. The same holds probably for KDevelop (although it’s usually less critical with source code). The good news is, that with KDE >= 4.5 the problems with respect to wrong encoding should be gone. So it’s certainly worth to update if you are affected by this issue.