Encoding Detection Revised

    by  • 2010-08-26 • Users • 12 Comments

    In KDE releases up to version 4.4, Kate unfortunately very often selected the wrong encoding. As a result, characters such as German umlauts (öäü) showed up as cryptic symbols in the text editor. I have often seen people respond by fixing those characters manually throughout the entire document. In other words, they do not realize that the document was simply opened with the wrong encoding. In fact, users usually do not know what an encoding is at all. While this is of course kind of sad, it certainly won’t change…

    Given this, the only real “fix” is very good automatic encoding detection, so that the correct encoding is chosen in most cases. In the rewrite of Kate’s text buffer for KDE 4.5, Christoph also rewrote the file loader, including the encoding detection. The detection now works as follows:

    1. try the encoding selected by the user (through the open-file dialog or the console)
    2. try encoding detection (some intelligent trial & error method)
    3. use the fallback encoding

    In step 1, Kate tries to use the encoding specified in the open-file-dialog or the one given when launching Kate from the console. On success, we are done.
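
    To make “on success” concrete: one way to test whether an encoding works for a file is to decode the bytes strictly and treat any decoding error as failure. The following is a minimal Python sketch of that idea, not Kate’s actual C++ code; the helper name `try_decode` is made up for illustration.

```python
from typing import Optional


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if `data` is not valid in `encoding`.

    Hypothetical helper for illustration only.
    """
    try:
        # errors="strict" raises instead of silently substituting characters.
        return data.decode(encoding, errors="strict")
    except (UnicodeDecodeError, LookupError):
        # LookupError covers an unknown encoding name.
        return None


# "für" saved as ISO-8859-1: the byte 0xFC is not valid UTF-8.
latin1_bytes = "für".encode("iso-8859-1")
print(try_decode(latin1_bytes, "utf-8"))       # None
print(try_decode(latin1_bytes, "iso-8859-1"))  # für
```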

    The encoding detection in step 2 first checks for a Unicode encoding by looking for a Byte Order Mark (BOM). If one is found, it is certain that the text document is Unicode encoded. If there is no BOM, Kate next uses a tool from KDElibs (KEncodingProber) to detect the correct encoding. This is basically trial & error: try encoding A; if the document contains byte sequences that are not valid in that encoding, try encoding B, then C, and so on… Unfortunately, this also doesn’t always work, because a byte sequence might be valid in several encodings and represent different characters. This is why it’s more or less impossible to get the encoding always right. There is simply no way…
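
    The BOM check and the ambiguity problem can both be shown in a few lines of Python. This is only a sketch: `detect_bom` and `first_valid` are hypothetical names, and `first_valid` is a plain trial & error stand-in, not what KEncodingProber does internally.

```python
import codecs
from typing import List, Optional

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the Unicode encoding announced by a leading BOM, if any."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def first_valid(data: bytes, candidates: List[str]) -> Optional[str]:
    """Naive trial & error: return the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


print(detect_bom(codecs.BOM_UTF8 + b"hello"))  # utf-8
print(detect_bom(b"hello"))                    # None

# The single byte 0xE4 is valid in both ISO-8859-1 and ISO-8859-5,
# but it stands for a different character in each of them.
ambiguous = b"\xe4"
print(first_valid(ambiguous, ["utf-8", "iso-8859-1", "iso-8859-5"]))  # iso-8859-1
print(ambiguous.decode("iso-8859-1") == ambiguous.decode("iso-8859-5"))  # False
```

    The last two lines show why the order of the candidates decides the result: validity alone cannot tell which of the two encodings the author of the file actually used.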

    If the encoding detection fails, Kate uses a fallback encoding. You can configure this fallback encoding in the editor component settings in the “Open/Save” category. If the fallback encoding fails as well, the document is marked as read-only and a warning is shown.
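
    Put together, the three steps and the read-only case could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the function name `load_text`, its parameters, the `probe` callback (standing in for the BOM check plus KEncodingProber) and the lossy ASCII placeholder used when nothing works. Kate’s real loader is written in C++ and may differ in its details.

```python
from typing import Callable, Optional, Tuple


def _decode(data: bytes, encoding: str) -> Optional[str]:
    """Strictly decode `data`; return None if the encoding does not fit."""
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return None


def load_text(
    data: bytes,
    fallback: str,
    user_encoding: Optional[str] = None,
    probe: Optional[Callable[[bytes], Optional[str]]] = None,
) -> Tuple[str, Optional[str], bool]:
    """Return (text, encoding_used, read_only). Hypothetical sketch."""
    # Step 1: the encoding chosen by the user, if there is one.
    if user_encoding is not None:
        text = _decode(data, user_encoding)
        if text is not None:
            return text, user_encoding, False

    # Step 2: automatic detection (hypothetical `probe` callback).
    if probe is not None:
        guess = probe(data)
        if guess is not None:
            text = _decode(data, guess)
            if text is not None:
                return text, guess, False

    # Step 3: the configured fallback encoding.
    text = _decode(data, fallback)
    if text is not None:
        return text, fallback, False

    # Nothing worked: return a lossy placeholder and flag the document
    # as read-only so that the caller can warn the user.
    return data.decode("ascii", errors="replace"), None, True


latin1_bytes = "für".encode("iso-8859-1")
# The user picked UTF-8, which does not fit, so the fallback is used.
print(load_text(latin1_bytes, fallback="iso-8859-1", user_encoding="utf-8"))
# Here the fallback does not fit either, so read_only is True.
print(load_text(latin1_bytes, fallback="ascii")[2])  # True
```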

    What about Kile and KDevelop?

    One of the applications that suffered heavily from wrong encoding detection in the past was the LaTeX editor Kile. The same probably holds for KDevelop (although it’s usually less critical with source code). The good news is that with KDE >= 4.5 the problems with wrong encodings should be gone. So it’s certainly worth updating if you are affected by this issue.

    About

    Dominik is a PhD student at the Control Theory and Robotics Lab, TU Darmstadt, as part of the Research Training Group GKMM (GRK1362). My research focuses on state estimation in distributed systems. As a hobby, I contribute to the KDE project and work on the Kate application and editor component.

    http://www.kate-editor.org

    12 Responses to Encoding Detection Revised

    1. nobody
      2010-08-26 at 19:42

      There’s also the algorithm Mozilla uses to guess the encoding, I always wondered why you don’t use it.

    2. jkt
      2010-08-26 at 19:49

      Perhaps you could give Enca (http://gitorious.org/enca) a try, it uses some statistics, so it’s better than simply working in a sequence.

      • 2010-08-26 at 20:23

        KEncodingProber uses statistics, therefore we have this already. The sequence is built around this, as we try to honor user settings above probing, if applicable.

    3. uetsah
      2010-08-26 at 20:01

      “Unfortunately, this also doesn’t always work, because a byte sequence might be valid in several encodings and represent different characters. This is why it’s more or less impossible to get the encoding always right. There is simply no way…”

      Well, there’s always the possibility to evaluate the resulting text against a dictionary of all known words of all languages for each encoding, to see which encoding results in the largest number of recognized words… And why stop there? Include a neural network algorithm that trains itself to recognize how much “sense” the contents of opened files make under each encoding… As you can see, there’s always a way!

      (Just kidding. Encoding detection works great for me in KDE 4.5, great work…)

    4. 2010-08-26 at 20:42

      It isn’t really clear to me if Kate has a more intelligent way to detect encoding besides using KEncodingProber or KEncodingDetector. I tried both of the latter, and I even failed to detect UTF-8. Is there a trick/code that you could share or “port” to kdecore? You even say that the fallback encoding could “fail”. I don’t understand how I can actually check if using a specific encoding “works” or “fails”. I would like to fix bug 228172. Many thanks (for the best editor out there, just misses CygnusEd like macro recording ;)

      • haumann
        2010-08-27 at 23:11

        As I understand it, the only additional stuff that would maybe really be of interest is the BOM detection (if not already there in KEncodingProber). BOMs are optionally used at the beginning of a file. So if there is no “file-support” in KEncodingProber (no idea), then it’s not really useful for you.

    5. Locke
      2010-08-26 at 21:12

      I really appreciate that :)
      It was really annoying in the past when you had some iso-8859-1 and some utf-8 files. I upgraded to KDE SC 4.5 yesterday and tried out the encoding detection just a moment ago. Works like a charm, thank you guys :)

    6. Jan Hnila
      2010-08-27 at 08:34

      Users who do not know what an encoding is will very probably have their locale set to something that can be a big hint to the possible encoding of the files they are using:

      e.g. if a user has a German locale, it is more probable that the character really is ü, ö, ä…

      I really think this needs to be taken into account (if it is not already): the list of probable encodings for a language/locale,
      maybe even sorted by the probability of use within such a community…

      • haumann
        2010-08-27 at 23:08

        Yes, that can be done, and I’d assume that KEncodingProber already does something like that — at least that’s the place where it belongs :)
        We could write thousands and thousands of lines for encoding detection and it will still not always work. So better have something simple that works in almost all cases than having to maintain thousands of lines of code just for supporting 1 corner case more. In other words: We’ll most likely not implement it in Kate directly :^)

    7. moltonel
      2010-08-27 at 10:21

      As good as the detection code is, it is not infallible, so it would be great if there was a visible clue when the detection was not fully trustworthy (either the fallback was used, or the statistics show too many possible encodings). A very neat way to do it would be a bar appearing at the top of the edit area saying “This looks like [a href=encoding docs]encoding foo[/a] but I may be wrong. You can select another encoding now [button: list of statistically-possible encodings] or change the encoding at any later time using menu>foo>bar. [button: encoding is fine]”

      • haumann
        2010-08-27 at 23:13

        The term “not fully trustworthy” does not really make sense: If we found a suitable encoding, we’re simply done. That’s it.
        If we did not find a suitable encoding, you get an error message in form of a modal dialog that warns you and the document is marked as read-only.

        It indeed makes sense imo to show a bar on the top, that’s a nice idea. Do you volunteer? :-)
