Tuesday, December 1, 2009

More Unicode and Python 3.x

Unicode has a number of categories for characters which are exposed in Python through the unicodedata module.  unicodedata has been around since Python 2.0.  The dir function gives a quick view of its scope:

>>> import unicodedata
>>> dir(unicodedata)
['UCD', '__doc__', '__file__', '__name__', '__package__', 'bidirectional', 'category', 'combining', 'decimal', 'decomposition', 'digit', 'east_asian_width', 'lookup', 'mirrored', 'name', 'normalize', 'numeric', 'ucd_3_2_0', 'ucnhash_CAPI', 'unidata_version']
>>>


This post deals with the 'combining' class of some characters, with 'decomposition' covered only as far as necessary.

Combining characters are the characters that, in general, do not stand alone.  Instead they appear in the same character space as another main character to form a grapheme cluster with it (or at least they should, if they render correctly on the display).  In the Unicode charts these characters are the ones that appear with a stippled circle where the main character should be.
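As a small illustration of the idea, here is a sketch that compares a base letter with a combining mark. The pairing of 'e' with COMBINING ACUTE ACCENT is just a convenient example of mine, not something taken from the Lao string.

```python
import unicodedata

base = 'e'
mark = '\u0301'  # COMBINING ACUTE ACCENT (example chosen for illustration)

for ch in (base, mark):
    # code point, name, general category, canonical combining class
    print('U+{:04X}'.format(ord(ch)),
          unicodedata.name(ch),
          unicodedata.category(ch),
          unicodedata.combining(ch))

# Two code points that should render in one character space
cluster = base + mark
print(len(cluster))
```

The base letter reports a combining class of 0, while the mark reports a nonzero class and a general category starting with 'M' (for mark).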

This is a revisiting of the Lao string shown below (courtesy of laoconnection).



We'll take another look at that first character width, a grapheme cluster of three Unicode characters:




I've printed out the hexadecimal Unicode value, the name of the character, the combining class, and the character itself.
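A listing like that can be produced with a few lines of Python. The sketch below is not the original code, and the cluster in it is an assumption: LAO LETTER SO TAM followed by a vowel sign and a tone mark that I picked because they carry combining classes 0 and 122. The helper name describe is likewise hypothetical.

```python
import unicodedata

# Assumed cluster (not copied from the original string):
# U+0E8A base letter, U+0EB4 vowel sign, U+0EC8 tone mark
cluster = '\u0e8a\u0eb4\u0ec8'

def describe(text):
    """Return (hex code point, name, combining class, character) per character."""
    rows = []
    for ch in text:
        rows.append((hex(ord(ch)),
                     unicodedata.name(ch, '<unnamed>'),
                     unicodedata.combining(ch),
                     ch))
    return rows

for row in describe(cluster):
    print(row)
```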

I had seen diagrams in the Unicode publication of combining classes corresponding to specific positions around a base letter (in this case LAO LETTER SO TAM).  It had been my naive hope that I would get combining codes back that would tell me that one combining character should be above the main one, and the tone mark should be above and to the right.

No, that is not how it works for this particular grapheme cluster. These, as the names indicate, are specifically Lao characters in the Unicode range of the Lao character set. To my understanding, all that can be gleaned from the combining codes is that the character with combining class 0 should follow the main character first, because it has the lower number, 0 versus 122.

This archived e-mail from the Unicode mailing list sheds some light on the combining codes for those of us who are uninitiated. All the combining codes between 10 and 132 are specific to a script (and so, loosely, to a language), not to a position. Only the 200 series of combining codes specifies positions around the main character. It actually makes sense to segregate scripts where possible. Mixing and matching glyphs from various character sets would be a mess. Instead of a grapheme cluster you would have another kind of cluster, but we shan't go there.
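The split between script-specific and positional codes can be sketched as a small lookup. The function name combining_kind and the POSITIONS table are my own, and the table lists only a handful of the 200 series values rather than the full set.

```python
import unicodedata

# Partial table of positional combining classes (an illustrative subset)
POSITIONS = {
    202: 'attached below',
    216: 'attached above right',
    220: 'below',
    230: 'above',
    232: 'above right',
}

def combining_kind(ch):
    """Roughly classify a character by its canonical combining class."""
    ccc = unicodedata.combining(ch)
    if ccc == 0:
        return 'not reordered (class 0)'
    if ccc < 10:
        return 'special low class ({})'.format(ccc)
    if ccc < 200:
        return 'script-specific (class {})'.format(ccc)
    return POSITIONS.get(ccc, 'positional (class {})'.format(ccc))

print(combining_kind('\u0ec8'))  # LAO TONE MAI EK
print(combining_kind('\u0301'))  # COMBINING ACUTE ACCENT
```

For the Lao tone mark this only reports that the class is script specific, which matches the experience above: the number says nothing about where the mark sits.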

Let's see if we can find some combining characters with the 200 series code designation.  How about something Vietnamese?

>>> vietstr = 'tiếng Việt'
>>> # characters 2 and 8 look most interesting
>>> vietchar = vietstr[2]
>>> import unicodedata
>>> unicodedata.name(vietchar)
'LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE'
>>> # need to decompose to see combination character
>>> decompx = unicodedata.decomposition(vietchar)
>>> decompx
'00EA 0301'
>>> # will need to decompose recursively
>>> # let's look at char 0X301 first
>>> unicodedata.name(chr(0X301))
'COMBINING ACUTE ACCENT'
>>> unicodedata.combining(chr(0X301))
230
>>> # 230 should be above and so it is
>>> # now decompose 0XEA
>>> unicodedata.name(chr(0XEA))
'LATIN SMALL LETTER E WITH CIRCUMFLEX'
>>> unicodedata.decomposition(chr(0XEA))
'0065 0302'
>>> # 0X65 is lower case E
>>> chr(0X65)
'e'
>>> unicodedata.name(chr(0X302))
'COMBINING CIRCUMFLEX ACCENT'
>>> unicodedata.combining(chr(0X302))
230
>>> # OK both are above as they should be

>>> vietchar = vietstr[8]
>>> vietchar
'ệ'
>>> unicodedata.name(vietchar)
'LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW'
>>> unicodedata.decomposition(vietchar)
'1EB9 0302'
>>> # our friend 0X302 we've seen already
>>> unicodedata.decomposition(chr(0x1EB9))
'0065 0323'
>>> unicodedata.name(chr(0X323))
'COMBINING DOT BELOW'
>>> unicodedata.combining(chr(0X323))
220
>>> # 220 corresponds to below
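
The manual steps above can be folded into a small recursive helper. This is a minimal sketch; full_decomposition is a name I made up, and it stops at compatibility decompositions (the ones whose mapping starts with a tag such as '<compat>'). It does not reorder marks by combining class, although for these two characters the result happens to agree with unicodedata.normalize('NFD', ...).

```python
import unicodedata

def full_decomposition(ch):
    """Recursively expand canonical decompositions into a list of characters."""
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith('<'):
        # No decomposition, or a compatibility one: keep the character as is
        return [ch]
    parts = []
    for code in decomp.split():
        parts.extend(full_decomposition(chr(int(code, 16))))
    return parts

# U+1EBF and U+1EC7 are the two Vietnamese letters examined above
for ch in '\u1ebf\u1ec7':
    for piece in full_decomposition(ch):
        print('U+{:04X}'.format(ord(piece)),
              unicodedata.name(piece),
              unicodedata.combining(piece))
    print()
```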


It's apparent that the positional combining codes apply in large part to characters that are either part of the Latin character set or work with it.  Knowledge of the 200 series combining values is probably useful when working with Latinized writing systems for languages originating outside of Europe (Asia, Africa).

Otherwise, familiarity with the script of the language in question (in this case, Lao), its Unicode chart and code point range, and its specific combining characters is required to work with it effectively at the level of characters and bytes.
