Monday, October 26, 2009

More Unicode in Python (Lao this time)

In my last post I took on a fairly well-established language for Python identifiers and UTF-8 encoded text: Armenian (with a little Russian thrown in at the end).

This time I'll relate something a bit more challenging: the southeast Asian language Lao.

Before I get started, a quick recap:

1) Python version 3.0 and higher can handle Unicode identifiers and UTF-8 encoded source.

2) Getting non-ASCII characters to show up correctly depends on the medium. From least difficult to most difficult:

  a) Web Browser
  b) Desktop Application
  c) Console

Lao is a bit of an exception to that ordering, at least the way I have things set up. The GUI (IDLE), with some tweaking, works pretty well. The Python Wiki in a web browser, not so much.

Here's the LaoLanguage page on the Python Wiki (in Opera on my laptop):




Yuck! It's supposed to be seven characters wide (not eleven), and what we in the West would call accent marks are supposed to sit directly above the character they modify, not to its upper right.

Well, let's try it in idle. After some messing around with the font I discovered that Courier 10 point size 14 seems to yield the best results:




Not perfect, but not bad either. This is what it's supposed to look like from laoconnection:





From here on out we should be able to dispense with the screenshots. I assigned the text I pasted into IDLE to the variable laostring. Now we can inspect it a little bit:

>>> len(laostring)
11
>>> # only 7 character widths
>>> import unicodedata
>>> for charx in laostring:
        print(unicodedata.name(charx))


LAO LETTER SO TAM
LAO VOWEL SIGN YY
LAO TONE MAI EK
LAO LETTER KHO SUNG
LAO LETTER O
LAO TONE MAI THO
LAO LETTER YO
LAO VOWEL SIGN EI
LAO LETTER MO
LAO TONE MAI EK
LAO LETTER NO

>>> 


The Unicode Standard doesn't handle all scripts the same way. Whereas "ä" can be stored as a single codepoint, the single character (sort of) made up of LAO LETTER SO TAM, LAO VOWEL SIGN YY, and LAO TONE MAI EK takes up three.
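One rough way to reconcile the 11 with the 7 is to leave combining marks out of the count. The following is only a minimal sketch under that assumption, not the full Unicode text segmentation rules. The helper name count_base_characters is made up for illustration, and the string is the same word as above, written with escapes so the source stays ASCII:

```python
import unicodedata

# The Lao word from the session above, one escape per codepoint.
laostring = '\u0e8a\u0eb7\u0ec8\u0e82\u0ead\u0ec9\u0ea2\u0ec1\u0ea1\u0ec8\u0e99'

def count_base_characters(text):
    """Count codepoints whose general category is not a mark (Mn, Mc, Me)."""
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith('M'))

print(len(laostring))                    # 11 codepoints
print(count_base_characters(laostring))  # 7 non-mark codepoints
```

This happens to give seven for this word because the vowel sign above the consonant and the tone marks are all category Mn, while LAO VOWEL SIGN EI is written before its consonant and is categorized as a letter. It should not be relied on as a general width measure.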

Stuff like this is second nature to people working with East Asian languages daily. For me, it takes some getting used to, especially the string length thing. Rendering is an entirely different problem, but I am hoping that with practice and experimentation, I'll get a better handle on what works. Suggestions welcome.

5 comments:

  1. The aggregate of the three characters LAO LETTER SO TAM, LAO VOWEL SIGN YY, LAO TONE MAI EK is merely graphical; calling it a "character" is grossly unfair. The Unicode term for it is "grapheme cluster" (§2.11).

    Unicode does have characters such as ä that are functionally equivalent to a letter with a combining character, but they are exceptions made for some simple and "traditional" (i.e. extended ASCII) cases; most of them have decomposition rules to replace them with their parts.
    In other words, the correct length of that Lao word is 11 characters.
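
The decomposition rules mentioned here can be seen with the standard unicodedata.normalize function. This is a small sketch rather than anything from the original discussion; the variable names are illustrative only:

```python
import unicodedata

composed = '\u00e4'                                  # precomposed a-umlaut
decomposed = unicodedata.normalize('NFD', composed)  # letter a + combining mark

print(len(composed), len(decomposed))                # 1 2
print([unicodedata.name(ch) for ch in decomposed])

# The Lao word has no precomposed forms to fall back on,
# so composing it (NFC) leaves all 11 codepoints in place.
lao = '\u0e8a\u0eb7\u0ec8\u0e82\u0ead\u0ec9\u0ea2\u0ec1\u0ea1\u0ec8\u0e99'
print(len(unicodedata.normalize('NFC', lao)))        # 11
```

The second print shows LATIN SMALL LETTER A followed by COMBINING DIAERESIS, and normalizing back with 'NFC' yields the single codepoint again.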

  2. Lorenzo,
    Thanks for the explanation. I hadn't meant any offense to the Unicode Consortium, but I could have worded it better.
    My understanding is that Unicode has to honor previous encodings, which is why ä is one codepoint.
    Again, thanks.
    Carl T.

  3. For what it's worth, it seems to work as expected in Ubuntu 9.04's console with Python 2.6:

    print(u'\u0E8A\u0EB7\u0EC8\u0E82\u0EAD\u0EC9\u0EA2\u0EC1\u0EA1\u0EC8\u0E99')
    ຊື່ຂອ້ຢແມ່ນ

  4. i think the important point here is exactly as said before: in stone age, we thought that "one character is one byte". then came multibyte encodings and unicode, so for a time we lived in bronze age, thinking that "one character is one codepoint". now in iron age, we've finally found out that we do not have a very clear concept of what a "character" is at all—i've read so many discussions and proposals, it has become quite obvious to me that there cannot be a simple and definite answer.

    as a german, i perceive the "ä" very much as "one character", though i was taught at school it belongs to the small group of german sonderbuchstaben (special letters), äöüÄÖÜß, and though when i write it it is very clearly first an "a", then two dots on the top. as a historical accident, "ä" belongs to those things that you can write in at least two different ways in unicode.

    you may think that latin letters receive a favored treatment in unicode, and you would be right. be it said that unicode's treatment of the different scripts is not of a completely uniform quality. i am doing a lot with the chinese and japanese parts of the standard, and it may surprise people to hear that quality of encoding varies *from codepoint to codepoint* (mainly due to the existence of character variants, which is in fact a difficult problem, with or without computers). this leads to subtle and tremendously vexing problems.

    in your case, maybe one could settle on a terminology like "(horizontal) writing position" or such, and avoid the term "character" altogether in terminology. i am not completely happy with the statement that the string quoted has "11 characters". the same term, "character", is also often used to denote a "full rights" element of the chinese script (in contrast to other elements like punctuation and so on). even in chinese, it can be quite hard to determine whether a glyph is "really a character" or "merely a figure of writing". so it may be better to choose a word less laden with profound problems.

    using this term, "position", one can state that "in latin and indic scripts alike, a single horizontal writing position can be variously encoded using one or more codepoints", and that "there are a number of unicode normalization forms that make it possible to translate any such sequence into one that fulfills a number of formal requirements, and that may change the number of codepoints used for any given writing position".

  5. @schmichael - I can't see it at the computer I'm at, but I look forward to seeing a nice Lao word when I get home tonight. CBT

    @flow - this was a very thoughtful response, and I appreciate the time you put into it.

    What I now realize is that Unicode and script transference to the computer is the life work of many people. Words like "character" (in both senses!) should not be used lightly. Henceforth, I will attempt to handle the subject matter more delicately.

    FWIW, I am grateful for the Unicode standard and the work invested in it.
