Sunday, November 8, 2009

Python 3 and UTF-8 - getting a handle on it

Last time I attempted a dive into the world of Unicode.  Judging from the comments, I see I've got a bit to learn.  With that in mind, I've started reading Fonts & Encodings (the deer book) and made a visit to a useful UK site.

The first thing that I might want to understand before proceeding on the odyssey of UTF-8 or Unicode identifiers in Python is UTF-8 itself.  What does UTF-8 encoding mean?  (For those who know this stuff already, I've probably saved you some time; for those who are still kind enough to read further and check for accuracy, thank you!)

Enough waffling, back to the problem!

From last time, the Lao combination of characters:

LAO LETTER SO TAM
LAO VOWEL SIGN YY
LAO TONE MAI EK


[Screenshot of the Lao characters]

The screenshot doesn't do the Lao script justice, but at least we have something to work from.
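
If you don't have the string from last time handy, here is a minimal sketch of one way to rebuild it from the three character names listed above. It assumes those names are spelled exactly as the Unicode database has them; the variable name names is just something I made up for the illustration.

```python
import unicodedata

# Character names copied from the list above; unicodedata.lookup()
# maps an official Unicode name to its one-character string.
names = ["LAO LETTER SO TAM", "LAO VOWEL SIGN YY", "LAO TONE MAI EK"]
strx = "".join(unicodedata.lookup(name) for name in names)

# Show each code point in decimal and in hex.
for glyphx in strx:
    print(ord(glyphx), hex(ord(glyphx)))
```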

>>> # strx is the three-character Lao string from last time
>>> for glyphx in strx:
    print(ord(glyphx))

   
3722
3767
3784


OK, we've got our three Unicode numeric values for our glyphs:  3722, 3767, 3784.  Let's see if we can encode our string as UTF-8, look at it, and get back to those values.

>>> strutf8x = strx.encode('UTF-8')
>>> for bytex in strutf8x:
    print(bin(bytex))

    
0b11100000
0b10111010
0b10001010
0b11100000
0b10111010
0b10110111
0b11100000
0b10111011
0b10001000



At first glance, it looks like a bunch of ones and zeros, but there is a pattern.

In the first byte, 0b11100000, the leading 111 is saying, "We'll need three bytes, including this one, for the integer representation of the codepoint for this glyph."

The 0 after the three ones is saying, "OK, we're finished counting bytes.  I don't count for anything other than a flag that we're finished with that part, but my friend to the right counts as your most significant bit."

As it turns out, the remaining four bits, 0000, are all zero here (3722 is small enough not to need them), so the next two bytes carry everything needed to express the number 3722.

The next two bytes, 0b10111010 and 0b10001010, have flags as well.  They both begin with 10, which says, "I am a continuation byte; your six bits for calculation start after me."  Stripping the 10 from each leaves 111010 and 001010.

So let's see if this holds up:

>>> int('111010001010', 2)
3722


Good deal.  We were able to work backwards from UTF-8 and get the same first ordinal number (3722) for the glyph we started out with.
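
The same flag-stripping can be done with bit masks instead of string slicing. What follows is only a sketch under my own assumptions: decode_three_bytes is a made-up helper name, it handles three-byte sequences only, and it skips the extra validity checks (overlong forms, surrogates) that a real decoder would make.

```python
def decode_three_bytes(lead, second, third):
    """Rebuild one code point from a three-byte UTF-8 sequence (hypothetical helper)."""
    # The lead byte must look like 1110xxxx; the other two like 10xxxxxx.
    if lead >> 4 != 0b1110:
        raise ValueError("not a three-byte lead byte")
    if second >> 6 != 0b10 or third >> 6 != 0b10:
        raise ValueError("not a continuation byte")
    # Mask off the flag bits and shift each payload into place:
    # 4 bits from the lead byte, then 6 bits from each continuation byte.
    return ((lead & 0b00001111) << 12) | ((second & 0b00111111) << 6) | (third & 0b00111111)

# The nine bytes printed earlier, typed in by hand.
encoded = bytes([0b11100000, 0b10111010, 0b10001010,
                 0b11100000, 0b10111010, 0b10110111,
                 0b11100000, 0b10111011, 0b10001000])

codepoints = [decode_three_bytes(*encoded[i:i + 3]) for i in range(0, len(encoded), 3)]
print(codepoints)
```

That should print [3722, 3767, 3784], the same three ordinals we started with.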

OK, fine, but what does ASCII look like encoded this way?  Seven bits gives you 127 as your highest number.  Let's see:

>>> import unicodedata
>>> asciistr = chr(126)
>>> print(asciistr)
~
>>> unicodedata.name(asciistr)
'TILDE'

The venerable tilde . . .

>>> utf8asciix = asciistr.encode('UTF-8')
>>> print(bin(utf8asciix))
Traceback (most recent call last):
  File "", line 1, in
    print(bin(utf8asciix))

TypeError: 'bytes' object cannot be interpreted as an integer

Gaaaahhhh!  What the heck did I do wrong?!  I tried to use the bin() function on a sequence of bytes, that's what I did wrong - duh!  Let's try this again:


>>> for bytex in utf8asciix:
    print(bin(bytex))

   
0b1111110


Much better, but wait a second - there are six bits flipped on right in a row - where is my leading flag like we saw above?  For ASCII, the flag is a single leading 0, which bin() doesn't show.  This might make it clearer:

>>> bytereprsntn = bin(utf8asciix[0])
>>> print(bytereprsntn)
0b1111110

>>> bytereprsntn = bytereprsntn[0:2] + '0' + bytereprsntn[2:]
>>> bytereprsntn
'0b01111110'


OK, we've got all eight bits in our byte now - bin() does not use any more digits than it needs to express a number in binary.  The 0 prior to the 1111110 bits lets a UTF-8 decoder know that it's dealing with a single-byte codepoint that falls within the ASCII range.
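
A less fiddly way to see all eight bits is format() with the "08b" spec, which pads with zeros on the left. The sketch below also reads the flag bits of a first byte to work out how long the sequence is. sequence_length is a hypothetical helper of my own, and it only looks at the prefix pattern; it does not reject first bytes that are never legal in UTF-8.

```python
def sequence_length(first_byte):
    """Return the byte count announced by a UTF-8 first byte (hypothetical helper)."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: plain ASCII
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError("continuation byte or invalid first byte")

tilde = "~".encode("UTF-8")
so_tam = chr(3722).encode("UTF-8")

# format(..., "08b") always shows eight binary digits, leading zeros included.
print(format(tilde[0], "08b"), sequence_length(tilde[0]))
print(format(so_tam[0], "08b"), sequence_length(so_tam[0]))
```

The tilde's byte shows up as 01111110 with a length of 1, and the Lao letter's first byte as 11100000 with a length of 3.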

When the Python 3.1 interpreter reads UTF-8 source code, this is essentially the process it employs to get at your Unicode identifiers (although it's coded far more efficiently than I described it).
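
To pull the pieces together, here is a toy decoder that walks a bytes object and applies the rules described in this post. It is a sketch for illustration, not how the interpreter actually does it: decode_utf8 is my own made-up name, and it assumes the input is well-formed UTF-8 (it does not check for truncated input, overlong forms, or surrogates).

```python
def decode_utf8(data):
    """Decode well-formed UTF-8 bytes into a list of code points (illustrative only)."""
    codepoints = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0b10000000:            # 0xxxxxxx: ASCII, one byte
            extra, value = 0, lead
        elif lead >> 5 == 0b110:         # 110xxxxx: one continuation byte follows
            extra, value = 1, lead & 0b00011111
        elif lead >> 4 == 0b1110:        # 1110xxxx: two continuation bytes follow
            extra, value = 2, lead & 0b00001111
        elif lead >> 3 == 0b11110:       # 11110xxx: three continuation bytes follow
            extra, value = 3, lead & 0b00000111
        else:
            raise ValueError("unexpected byte at position %d" % i)
        for cont in data[i + 1:i + 1 + extra]:
            if cont >> 6 != 0b10:
                raise ValueError("expected a continuation byte")
            # Make room for six more bits and append the payload.
            value = (value << 6) | (cont & 0b00111111)
        codepoints.append(value)
        i += 1 + extra
    return codepoints

sample = "~" + chr(3722) + chr(3767) + chr(3784)
print(decode_utf8(sample.encode("UTF-8")))
```

For the tilde followed by the three Lao characters, this should print [126, 3722, 3767, 3784], which matches what ord() reports for each character of the original string.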
