The first thing I want to understand before proceeding on the odyssey of UTF-8 or Unicode identifiers in Python is UTF-8 itself. What does UTF-8 encoding mean? (If you know this stuff already, feel free to skip ahead; if you're kind enough to read further and check for accuracy, thank you!)
Enough waffling, back to the problem!
From last time, the Lao combination of characters:
LAO LETTER SO TAM
LAO VOWEL SIGN YY
LAO TONE MAI EK
Screenshot:
The screenshot doesn't do the Lao script justice, but at least we have something to work from.
>>> for glyphx in strx:
        print(ord(glyphx))
3722
3767
3784
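If you don't have strx handy from last time, here is one way it might be rebuilt from the three character names listed above. This is only a sketch: it assumes strx is nothing more than those three characters in that order, and the names list is my own illustration. unicodedata.lookup() is in the standard library.

```python
import unicodedata

# Assumption: strx is just the three Lao characters named above, in that order.
names = ['LAO LETTER SO TAM', 'LAO VOWEL SIGN YY', 'LAO TONE MAI EK']
strx = ''.join(unicodedata.lookup(name) for name in names)

# Collect the ordinal of each character, the same values the loop prints.
ordinals = [ord(glyphx) for glyphx in strx]
print(ordinals)
```

The list comprehension does the same job as the for loop, just gathered into one list so the three values are easy to compare.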
OK, we've got our three Unicode numeric values for our glyphs: 3722, 3767, 3784. Let's see if we can encode our string as UTF-8, look at it, and get back to those values.
>>> strutf8x = strx.encode('UTF-8')
>>> for bytex in strutf8x:
        print(bin(bytex))
0b11100000
0b10111010
0b10001010
0b11100000
0b10111010
0b10110111
0b11100000
0b10111011
0b10001000
In the first byte, 0b11100000, the leading 111 is saying, "We'll need three bytes, including this one, for the integer representation of the codepoint for this glyph."
The 0 after the three ones is saying, "OK, we're finished counting bytes. I don't count for anything other than a flag that we're finished with that part, but my friend to the right counts as your most significant bit."
As it turns out, the remaining four bits, 0000, are all zero, so the next two bytes will have to express the number 3722 on their own.
0b10111010, 0b10001010: these two bytes have flags as well. They both begin with 10, which says, "I am a continuation flag; your bits for calculation are the six that come after me." Strip the flags and you are left with 111010 and 001010.
So let's see if this holds up:
>>> int('111010001010', 2)
3722
Good deal. We were able to work backwards from UTF-8 and get the same first ordinal number (3722) for the glyph we started out with.
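The same flag-peeling can be sketched as a small function that works on the whole byte string instead of one glyph at a time. This is my own illustration, not how Python does it internally: the name decode_utf8_by_hand is made up, it assumes well-formed input, and it does no error checking at all.

```python
def decode_utf8_by_hand(data):
    """Turn UTF-8 bytes into a list of code points by stripping flag bits.

    Sketch only: assumes well-formed input and does no error checking.
    """
    codepoints = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0b10000000:        # 0xxxxxxx: one byte (ASCII range)
            count, value = 1, lead
        elif lead >> 5 == 0b110:     # 110xxxxx: two bytes in total
            count, value = 2, lead & 0b00011111
        elif lead >> 4 == 0b1110:    # 1110xxxx: three bytes in total
            count, value = 3, lead & 0b00001111
        else:                        # assumed to be 11110xxx: four bytes
            count, value = 4, lead & 0b00000111
        # Each continuation byte (10xxxxxx) contributes its low six bits.
        for cont in data[i + 1:i + count]:
            value = (value << 6) | (cont & 0b00111111)
        codepoints.append(value)
        i += count
    return codepoints


# The nine bytes printed earlier for the Lao string.
lao_bytes = bytes([0b11100000, 0b10111010, 0b10001010,
                   0b11100000, 0b10111010, 0b10110111,
                   0b11100000, 0b10111011, 0b10001000])
print(decode_utf8_by_hand(lao_bytes))   # [3722, 3767, 3784]
```

Shifting left by six and OR-ing in the next six bits is the arithmetic version of gluing the bit strings together and calling int(..., 2) on the result.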
OK, fine, but what does ASCII look like encoded this way? Seven bits gives you 127 as your highest number. Let's see:
>>> asciistr = chr(126)
>>> print(asciistr)
~
>>> import unicodedata
>>> unicodedata.name(asciistr)
'TILDE'
The venerable tilde . . .
>>> utf8asciix = asciistr.encode('UTF-8')
>>> print(bin(utf8asciix))
Traceback (most recent call last):
File "
print(bin(utf8asciix))
TypeError: 'bytes' object cannot be interpreted as an integer
Gaaaahhhh! What the heck did I do wrong?! I tried to use the bin() function on a sequence of bytes, that's what I did wrong - duh! Let's try this again:
>>> for bytex in utf8asciix:
        print(bin(bytex))
0b1111110
Much better, but wait a second - there are six bits flipped on right in a row - where is my leading flag, like the 1110 and the 10 we saw above? For ASCII, it doesn't show up. This might make it more clear:
>>> bytereprsntn = bin(utf8asciix[0])
>>> print(bytereprsntn)
0b1111110
>>> bytereprsntn = bytereprsntn[0:2] + '0' + bytereprsntn[2:]
>>> bytereprsntn
'0b01111110'
OK, we've got all eight bits in our byte now - bin() does not use any more digits than it needs to express a number in binary. The 0 prior to the 1111110 bits lets UTF-8 know that it's dealing with a codepoint that falls within the ASCII range.
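Rather than splicing the zero in by hand, format() with the '08b' spec always pads to eight digits. Here is a rough sketch that uses it to label each byte by its flag. The function name byte_kind and the wording of the labels are my own, and the classification only looks at the prefix; it makes no attempt to catch every byte value that UTF-8 forbids.

```python
def byte_kind(byte):
    """Return (zero-padded bits, rough description) for one UTF-8 byte."""
    bits = format(byte, '08b')   # always eight digits, unlike bin()
    if bits.startswith('0'):
        kind = 'single byte (ASCII range)'
    elif bits.startswith('10'):
        kind = 'continuation byte'
    elif bits.startswith('110'):
        kind = 'lead byte of 2'
    elif bits.startswith('1110'):
        kind = 'lead byte of 3'
    elif bits.startswith('11110'):
        kind = 'lead byte of 4'
    else:
        kind = 'not valid in UTF-8'
    return bits, kind


# The tilde, followed by the three bytes of the first Lao glyph.
for bytex in b'~' + b'\xe0\xba\x8a':
    print(*byte_kind(bytex))
```

The tilde comes out as 01111110 with its leading 0 in plain sight, next to the 1110 and 10 flags of the three-byte glyph.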
When the Python 3.1 interpreter reads UTF-8 encoded source code, this is the process it employs to get at your Unicode identifiers (although it's coded far more efficiently than I've described it here).