Tuesday, December 22, 2009

Python's unicodedata Module

unicodedata has been part of Python since the 2.0 release.  Some names currently associated with checkins to Python's svn source tree are von Löwis, Lemburg, Forgeot d'Arc, Ruohonen, Lundh.  I probably won't get a chance to meet you folks, so I'll take this quick opportunity to say "Thanks".  unicodedata is a lot of fun and brings everything you need from Unicode into the Python interpreter.

There is plenty of information about Unicode out at unicode.org.  Putting it together with a module and a script is essentially what unicodedata's authors have done.  For those interested, the makeunicodedata.py script (http://svn.python.org/view/python/trunk/Tools/unicode/makeunicodedata.py?view=markup), written by F. Lundh, gives a high-level overview of how some parts of the module are generated.

Rather than just rehash the unicodedata documentation, I'd like to give an example of each method or member of the module and craft a few handy tricks toward the end of this blog entry.

To the interpreter!

Python 3.1.1 (r311:74480, Nov 29 2009, 22:24:25)
[GCC 4.2.1 20070719  [FreeBSD]] on freebsd7
Type "copyright", "credits" or "license()" for more information.
>>> import unicodedata
>>> dir(unicodedata)
['UCD', '__doc__', '__file__', '__name__', '__package__', 'bidirectional', 'category', 'combining', 'decimal', 'decomposition', 'digit', 'east_asian_width', 'lookup', 'mirrored', 'name', 'normalize', 'numeric', 'ucd_3_2_0', 'ucnhash_CAPI', 'unidata_version']
>>> # attempt to deal with UCD and ucd_3_2_0
>>> # ucd_3_2_0 is essentially unicodedata,
>>> # but for an earlier version of Unicode
>>> # (3.2.0)
>>> # Version of Unicode Python 3.1 is using:
>>> unicodedata.unidata_version
>>> unicodedata.UCD

>>> unicodedata.ucd_3_2_0

>>> # UCD isn't meant to be used extensively
>>> # in the Python interpreter.
>>> # Everything I tried with it (including
>>> # inheritance) raised an exception.
>>> type(unicodedata.ucd_3_2_0) == unicodedata.UCD
>>> # except that - let's leave these underlying
>>> # details to the folks who wrote the module.
>>> # Speaking of which - ucnhash_CAPI
>>> help(unicodedata.ucnhash_CAPI)
Help on PyCapsule object:

class PyCapsule(object)
 |  Capsule objects let you wrap a C "void *" pointer in a Python
 |  object.  They're a way of passing data through the Python interpreter
 |  without creating your own custom type.
 |  Capsules are used for communication between extension modules.
 |  They provide a way for an extension module to export a C interface
 |  to other extension modules, so that extension modules can use the
 |  Python import mechanism to link to one another.
 |  Methods defined here:
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)

>>> # Clear as C!  Onto something more tangible . . .
>>> # name and lookup are useful and easy
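Here is a minimal sketch of the round trip between name and lookup.  The choice of TELUGU LETTER KA and the variable names are mine, purely for illustration.

```python
import unicodedata

# lookup() maps an official Unicode character name to the character itself.
telugu_ka = unicodedata.lookup('TELUGU LETTER KA')

# name() goes the other way, from the character back to its name.
round_trip = unicodedata.name(telugu_ka)

# Characters without a name (control characters, for instance) raise
# ValueError unless a default is passed as the second argument.
no_name = unicodedata.name('\x00', 'no name')

print(hex(ord(telugu_ka)), round_trip, no_name)
```

Passing a default to name is handy when looping over a range of code points, since many of them have no name at all.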

normalize and decomposition are two methods I've covered extensively (or at least verbosely) in two previous posts here and here.  We're going to skip them.


UAX #9 outlines the bidirectional algorithm and describes associated codes.

The bidirectional algorithm is a bit involved and I've only scratched the surface here.  There are a number of codes and even invisible characters for overriding the expected direction of a group of characters.  This is the best I can do with this for now.
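As a rough illustration of the codes UAX #9 talks about, the sketch below asks unicodedata.bidirectional for the class of a handful of characters.  The sample characters are my own picks, not anything special to the module.

```python
import unicodedata

# A few characters chosen to show different bidirectional classes.
sample_names = [
    'LATIN SMALL LETTER A',    # ordinary left-to-right letter
    'HEBREW LETTER ALEF',      # right-to-left letter
    'ARABIC LETTER BEH',       # right-to-left, Arabic flavor
    'DIGIT ONE',               # European number
    'SPACE',                   # whitespace
    'RIGHT-TO-LEFT OVERRIDE',  # one of the invisible override characters
]

bidi_classes = {
    charname: unicodedata.bidirectional(unicodedata.lookup(charname))
    for charname in sample_names
}

for charname, code in bidi_classes.items():
    print('{0:<24}{1}'.format(charname, code))
```

The codes come back as short strings such as 'L', 'R', and 'AL'.  What the algorithm then does with them is the involved part, and this sketch does not attempt it.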

>>> # Parentheses are the classic example
>>> # of mirrored characters
>>> parens = '()'
>>> for charx in parens:
        print(charx, unicodedata.mirrored(charx))

( 1
) 1


East Asian Width is covered in UAX #11.  In short, the width describes how much horizontal space a character takes up in East Asian typography (roughly, half a cell or a full cell), which in the legacy East Asian encodings tended to go along with whether the character was stored in one byte or two.  After that, it gets more complicated.  The letter 'a' is an ASCII character and, like the rest of printable ASCII, falls under East Asian Narrow (Na).  A Burmese character, on the other hand, does not belong to the East Asian character sets and takes on the designation Neutral (N).
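A small sketch of east_asian_width in action follows.  The characters are my own choices, picked to hit five of the width designations.

```python
import unicodedata

# Characters picked to show different East Asian Width values.
width_samples = [
    'LATIN SMALL LETTER A',              # ASCII
    'MYANMAR LETTER KA',                 # Burmese
    'FULLWIDTH LATIN CAPITAL LETTER A',  # fullwidth form
    'HALFWIDTH KATAKANA LETTER KA',      # halfwidth form
    'HIRAGANA LETTER A',                 # ordinary Japanese kana
]

widths = {
    charname: unicodedata.east_asian_width(unicodedata.lookup(charname))
    for charname in width_samples
}

for charname, width in widths.items():
    print('{0:<34}{1}'.format(charname, width))
```

There is also an Ambiguous (A) designation that this sketch leaves out.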

Hey!  Wait a second - Myanmar is a lot closer to East Asia than Europe and the Americas.  Folks, just trust me and Unicode on this one.  I'm not an expert; nonetheless, that's the spec.  It's an example of why it's good not to assume too much about Unicode without reading the UAX's and the code charts first.  Let's move on.

I have previously written a blog post that covers combining codes a bit.  We'll still look a little bit at them along with category codes.

Lo is "Letter, other" (lowercase letters are Ll).  Since Devanagari doesn't have case, all the letters that can stand on their own are classified this way.  Mc is combining-spacing - the character cannot stand alone, but when combined with another character it adds some width to the group of characters - the carriage moves to the left for those of us who learned to touch type on a typewriter back in the Jurassic.

A category of Mn is for combining non-spacing.  In this case the virama is below the main character and does not move the cursor forward.
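To make the three categories concrete, here is a sketch using three Devanagari characters of my own choosing: a consonant, a dependent vowel sign, and the virama.

```python
import unicodedata

# One Devanagari character for each category discussed above.
devanagari_samples = [
    'DEVANAGARI LETTER KA',      # stands on its own
    'DEVANAGARI VOWEL SIGN AA',  # combining, takes up space
    'DEVANAGARI SIGN VIRAMA',    # combining, takes up no space
]

categories = {
    charname: unicodedata.category(unicodedata.lookup(charname))
    for charname in devanagari_samples
}

for charname, cat in categories.items():
    print('{0:<26}{1}'.format(charname, cat))
```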

The combining code (Unicode calls it the canonical combining class) is typically zero.  The virama is such a common character in almost all of the languages of India that it gets its own special code of nine.  Actually, there are a number of Unicode characters for the virama - one for each script.
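A quick sketch of the combining code for a few viramas.  The four scripts are just the ones I picked; there are others.

```python
import unicodedata

# Viramas from four Indic scripts, plus a plain letter for contrast.
virama_names = [
    'DEVANAGARI SIGN VIRAMA',
    'BENGALI SIGN VIRAMA',
    'TAMIL SIGN VIRAMA',
    'TELUGU SIGN VIRAMA',
]

virama_codes = {
    charname: unicodedata.combining(unicodedata.lookup(charname))
    for charname in virama_names
}

# A stand-alone letter has a combining code of zero.
plain_letter_code = unicodedata.combining(
    unicodedata.lookup('DEVANAGARI LETTER KA'))

for charname, code in virama_codes.items():
    print('{0:<24}{1}'.format(charname, code))
print('{0:<24}{1}'.format('DEVANAGARI LETTER KA', plain_letter_code))
```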


All that is left are the numeric functions: numeric, digit, and decimal.

>>> # numeric, digit, and decimal
>>> unicodedata.numeric('9')
9.0
>>> unicodedata.decimal('9')
9
>>> unicodedata.digit('9')
9

Wow, I couldn't have picked a more boring example.  Still, the fact that numeric yields a float is significant.  Let's see if Unicode has any interesting characters:

>>> longunicodename = 'VULGAR FRACTION THREE QUARTERS'
>>> thrqtrs = unicodedata.lookup(longunicodename)
>>> longunicodename = 'BENGALI CURRENCY DENOMINATOR SIXTEEN'
>>> bengali16 = unicodedata.lookup(longunicodename)
>>> longunicodename = 'TAMIL NUMBER ONE THOUSAND'
>>> tamil1k = unicodedata.lookup(longunicodename)
>>> longunicodename = 'THAI DIGIT SIX'
>>> thai6 = unicodedata.lookup(longunicodename)
>>> # numeric is the most general of the groupings
>>> thrqtrs
'¾'
>>> unicodedata.numeric(thrqtrs)
0.75
>>> unicodedata.digit(thrqtrs)
Traceback (most recent call last):
  File "", line 1, in
ValueError: not a digit
>>> # whoops!
>>> unicodedata.decimal(thrqtrs)
Traceback (most recent call last):
  File "", line 1, in
ValueError: not a decimal
>>> # no go
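All three functions accept a default as a second argument, which saves catching ValueError every time.  The sketch below uses that to compare the three groupings side by side.  numeric_profile is a helper I made up for this post, and SUPERSCRIPT TWO is an extra character of my own choosing that is a digit but not a decimal.

```python
import unicodedata

def numeric_profile(char):
    """Return (numeric, digit, decimal) for char, with None where undefined."""
    return (unicodedata.numeric(char, None),
            unicodedata.digit(char, None),
            unicodedata.decimal(char, None))

profile_names = [
    'VULGAR FRACTION THREE QUARTERS',
    'TAMIL NUMBER ONE THOUSAND',
    'SUPERSCRIPT TWO',
    'THAI DIGIT SIX',
]

profiles = {
    charname: numeric_profile(unicodedata.lookup(charname))
    for charname in profile_names
}

for charname, profile in profiles.items():
    print('{0:<32}{1}'.format(charname, profile))
```

Every decimal is a digit and every digit is numeric, but not the other way around.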

OK, that's a lot to cover.  One fun thing you can do is look up a character set or language just based upon the inclusion of the word (TELUGU, for instance) for the language in the character description.  It's not the same as a code chart, but it can be handy:

>>> for codeptx in range(40000):
        try:
            if unicodedata.name(chr(codeptx)).find(
                'TELUGU') > -1:
                print(hex(codeptx), unicodedata.name(chr(codeptx)))
        except ValueError:
            pass

And so on . . .
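The same trick can be wrapped up for reuse.  This is only a sketch: chars_named and its stop parameter are names I invented, and it relies on the default argument to unicodedata.name instead of catching ValueError.

```python
import sys
import unicodedata

def chars_named(word, stop=sys.maxunicode + 1):
    """Yield (code point, name) for each character whose name contains word."""
    for codept in range(stop):
        # The empty-string default covers code points that have no name.
        charname = unicodedata.name(chr(codept), '')
        if word in charname:
            yield codept, charname

# Same range as the interpreter loop above.
telugu = list(chars_named('TELUGU', stop=40000))

for codept, charname in telugu[:5]:
    print(hex(codept), charname)
```

Leaving stop at its default scans every code point Python knows about, which takes noticeably longer than the first 40000.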

Well, that's what I think I know for now.  Stay tuned. 
