A while back I did two posts, one on Lao UTF-8 and one on Arabic UTF-8, using binary representation of bytes to demonstrate UTF-8 encoding in Python.
I still wasn't satisfied with my knowledge of Python 3.x and string handling, nor was the whole UTF-8 thing second nature to me.
On Monday, I'm giving a talk to our local Linux group about the subject of my Python poster session, Python 3.1, Unicode, and Languages of the Indian Ocean Region. In the immortal words of Dean Wormer, "Fat, drunk and stupid is no way to go through life, son." You know, it's no way to give a presentation, either. Time to get smarter now.
The code in this post is nothing but a verbose rehash of what the Python bytes object does already - decode. But understanding exactly what is going on is helpful. That's the beauty of having an interpreter in Python or LISP - you can check your assumptions and find your mistakes as fast as you can type (it would help if I could type faster).
Basic things I learned in this exercise:
1) whether it holds zero, one, or many items, Python's bytes object is a sequence that must be iterated over or indexed to get integer byte values.
2) the bytes object has a decoding method, the string object has an encoding method; the two methods work together to switch back and forth between bytes and strings, or, less appropriately, between bytes and characters.
3) binary math goes way beyond decimal 255. Being able to read hexadecimal numbers and recognize significant ones is a real help when working with UTF-8 and Unicode.
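A quick check of those three points at the interpreter, as a minimal sketch. The sample string is my own choice for illustration; nothing here depends on the module later in the post.

```python
text = '日本語'
data = text.encode('utf-8')        # str -> bytes

# three characters become nine bytes
print(len(text), len(data))

# indexing a bytes object gives an int, slicing gives bytes
print(data[0])                     # 230
print(data[0:1])                   # b'\xe6'

# iterating gives ints too; hex makes the UTF-8 flag bits easier to spot
print([hex(b) for b in data[:3]])  # ['0xe6', '0x97', '0xa5']

# bytes -> str gets the original text back
print(data.decode('utf-8') == text)

# an empty bytes object is still a sequence, just one with no items
print(len(b''), list(b''))

# numbers worth recognizing: one past ASCII, and the top Unicode code point
print(0x80, 0x10FFFF)              # 128 1114111
```

The int-versus-bytes difference between indexing and slicing is why the functions in this post index into a one byte bytes object to get at the number.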
Enough qualifying and waffling. To the code and the interpreter!
"""
Utilities for dealing with bytes, UTF-8, and Unicode.
"""
# OrderedDict is new in Python 3.1
from collections import OrderedDict
# gates for flag bytes of UTF-8
# 128 - just beyond ASCII
# '0b10000000'
ASCIICUTOFF = 0x80
# 192 - 2 byte designator for UTF-8 character
# '0b11000000'
TWOBYTER = 0xc0
CUTOFFS = OrderedDict()
# 224 - just beyond two byte representation
# '0b11100000'
CUTOFFS['TWOBYTES'] = 0xe0
# 240 - just beyond three byte representation
# '0b11110000'
CUTOFFS['THREEBYTES'] = 0xf0
# 248 - just beyond four byte representation
# '0b11111000'
CUTOFFS['FOURBYTES'] = 0xf8
# 252 - just beyond five byte representation
# '0b11111100'
CUTOFFS['FIVEBYTES'] = 0xfc
# 254 - end of the line
# '0b11111110'
CUTOFFS['SIXBYTES'] = 0xfe
# for binary math
THIRTYTWO = 0x20
THIRTY = 30
SIX = 6
TWO = 0x2
ONE = 0x1
ERRORX = 0, 0
ERRORY = -1
# on subsequent bytes, cannot have second to
# most significant (seventh) bit flipped on
# 63
FORBIDDEN = 0x3f
def interpfirstbyte(firstbyte):
    """
    Interprets the first byte of a
    UTF-8 encoded character.
    firstbyte is a bytes object of
    length one, as returned by read(1).
    Returns 2 tuple of number of
    bytes total in the UTF-8 character
    encoding and value of significant
    bits in first byte of UTF-8
    character encoding.
    interpfirstbyte(bytes([0xf1])) -> (4, 1)
    """
    twopow = THIRTYTWO
    if firstbyte[0] < ASCIICUTOFF:
        return ONE, firstbyte[0]
    elif firstbyte[0] < TWOBYTER:
        print('Not a valid first byte of a UTF-8 character sequence.')
        return ERRORX
    # characters beyond ASCII range
    counter = TWO
    for cutoff in CUTOFFS:
        if firstbyte[0] < CUTOFFS[cutoff]:
            # modulo strips the flag bits, leaving the significant bits
            return counter, firstbyte[0] % (CUTOFFS[cutoff] - twopow)
        # integer division keeps twopow an int in Python 3
        twopow //= TWO
        counter += ONE
    print('Not a valid first byte of a UTF-8 character sequence.')
    return ERRORX
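def interpsubsqntbyte(nextbyte):
    """
    In a multiple byte UTF-8 character
    representation, returns significant
    part of byte as an integer.
    nextbyte is a bytes object of
    length one, as returned by read(1).
    interpsubsqntbyte(bytes([0x81])) -> 1
    """
    retval = nextbyte[0] % ASCIICUTOFF
    # a subsequent byte must look like 0b10xxxxxx: the top bit
    # must be on and the second to top bit must be off
    if nextbyte[0] < ASCIICUTOFF or retval > FORBIDDEN:
        print('Not a valid byte for a UTF-8 multibyte sequence.')
        return ERRORY
    return retval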
class FileByter:
    """
    Attempt to put UTF-8 readable file
    object into a class that processes
    one byte, then one character at a time.
    """
    def __init__(self, filenamex):
        self.filename = filenamex
        self.fle = open(self.filename, 'rb')
        self.currentbyte = None
        self.numbytes = 0
        self.modx = 0
        self.currentcharord = -1
        self.charbytes = b''

    def gimmebyte(self):
        """
        Assigns the next byte in the file
        being read to self.currentbyte.
        Closes file and assigns currentbyte
        value of None at end of file.
        """
        self.currentbyte = self.fle.read(1)
        if len(self.currentbyte) == 0:
            self.fle.close()
            print('Closed file {0}.'.format(self.filename))
            self.currentbyte = None

    def readchar(self):
        """
        Reads the bytes of the next UTF-8
        character into self.charbytes and
        its code point into self.currentcharord.
        """
        counter = 0
        powx = 0
        self.charbytes = b''
        self.gimmebyte()
        if self.currentbyte:
            self.numbytes, self.modx = interpfirstbyte(self.currentbyte)
            if (self.numbytes, self.modx) == ERRORX:
                print('Not on first byte of UTF-8 char sequence. Try again.')
                return None
            self.charbytes += self.currentbyte
            if self.numbytes == ONE:
                # ASCII: the byte value is the code point
                self.currentcharord = self.modx
                return None
            counter = self.numbytes
            # the first byte's bits sit above six bits
            # for each subsequent byte
            powx = THIRTY - (SIX - counter) * SIX
            self.currentcharord = self.modx * TWO ** powx
            powx -= SIX
            while counter > ONE:
                self.gimmebyte()
                if self.currentbyte is None:
                    print('File ended in the middle of a UTF-8 sequence.')
                    return None
                self.charbytes += self.currentbyte
                self.modx = interpsubsqntbyte(self.currentbyte)
                if self.modx == ERRORY:
                    print('Invalid subsequent byte in UTF-8 sequence.')
                    return None
                self.currentcharord += self.modx * TWO ** powx
                counter -= ONE
                powx -= SIX
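For comparison, the same arithmetic can be written with bit shifts and masks instead of modulo and powers of two. This is only a sketch of an alternative, not a replacement for the module above: decode_one is a name I made up for illustration, it stops at four byte sequences, and it does not reject overlong forms or surrogates the way a real decoder would.

```python
def decode_one(seq):
    """
    Return the code point of the single UTF-8
    character at the start of the bytes object seq.
    Hypothetical helper for illustration only.
    """
    first = seq[0]
    if first < 0x80:
        # ASCII: the byte is the code point
        return first
    # the run of leading 1 bits gives the length;
    # the mask keeps the bits after the flag
    if first >> 5 == 0b110:
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, value = 4, first & 0x07
    else:
        raise ValueError('not a valid first byte: {0:#x}'.format(first))
    if len(seq) < length:
        raise ValueError('sequence is too short')
    for byte in seq[1:length]:
        # subsequent bytes must look like 0b10xxxxxx
        if byte >> 6 != 0b10:
            raise ValueError('not a valid subsequent byte: {0:#x}'.format(byte))
        # make room for six more bits, then add them
        value = (value << 6) | (byte & 0x3F)
    return value

print(hex(decode_one(b'\xe6\x97\xa5')))   # 0x65e5
print(decode_one(b'a'))                   # 97
```

Shifting left by six for every subsequent byte does the same job as the powx bookkeeping in readchar, just from the other direction.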
I'll work a little with a file of some foreign language strings in the interpreter session below:
Python 3.1.1 (r311:74480, Nov 29 2009, 22:24:25)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "copyright", "credits" or "license()" for more information.
>>> import sys
>>> sys.path += ['/usr/home/carl/pythonblog']
>>> import handlebytes
>>> fbx = handlebytes.FileByter('/usr/home/carl/pythonblog/foreignbytestest')
>>> fbx.readchar()
>>> fbx.charbytes
b'\xe6\x97\xa5'
>>> bin(0xe6)
'0b11100110'
>>> # three bytes in the character
>>> somejapanesecharacter = fbx.charbytes.decode('UTF-8')
>>> import unicodedata
>>> unicodedata.name(somejapanesecharacter)
'CJK UNIFIED IDEOGRAPH-65E5'
>>> # OK - let's try one byte at a time
>>> fbx.gimmebyte()
>>> fbx.currentbyte
b'\xe6'
>>> bin(0xe6)
'0b11100110'
>>> # 3 bytes again
>>> fbx.gimmebyte()
>>> fbx.currentbyte
b'\x9c'
>>> bin(0x9c)
'0b10011100'
>>> # first two bits are 10 as expected
>>> fbx.readchar()
Not a valid first byte of a UTF-8 character sequence.
>>> # third byte - can't have 10 bits leading UTF-8 first byte
>>> fbx.readchar()
>>> fbx.charbytes
b'\xe8\xaa\x9e'
>>> # OK - on to an ASCII char
>>> fbx.readchar()
>>> fbx.charbytes
b'\n'
>>> fbx.readchar()
>>> fbx.charbytes
b'a'
>>> # ASCII characters show up as themselves, not as hex numbers.
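The failed readchar() call in the middle of that session happened because I had stepped into the middle of a character. A subsequent byte always starts with the bits 10, so finding character boundaries is a one line test. The sketch below assumes a sample string of my own rather than the actual test file, and char_starts is a made up name.

```python
def char_starts(data):
    """Return the offsets in data where a UTF-8 character begins."""
    # subsequent bytes look like 0b10xxxxxx;
    # any other byte starts a character
    return [i for i, byte in enumerate(data) if byte & 0xC0 != 0x80]

# sample data chosen for illustration, not read from the test file
data = '日本語\na'.encode('utf-8')
starts = char_starts(data)
print(starts)                      # [0, 3, 6, 9, 10]

# cut the bytes at the boundaries and decode each piece
ends = starts[1:] + [len(data)]
chars = [data[s:e].decode('utf-8') for s, e in zip(starts, ends)]
print(len(chars))                  # 5
```

After landing on a bad byte, reading forward to the next offset that passes this test puts you back on a first byte.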
That's a lot of activity just to understand a few simple concepts. It helps me. Hopefully someone else can benefit from it too.
Final notes:
1) I like the collections.OrderedDict object; it remembers the order in which keys were inserted, which saves me having to work with a sorting key and an associated list. I believe Ronacher and Hettinger were responsible for getting it into the standard lib. Thanks, here's to people with German names from outside of Germany (Austria and America respectively).
2) The only characters I'm familiar with that have a four byte representation in UTF-8 are musical notes. They don't show up in IDLE, so I let that part of the testing go. Everything here is three bytes or less.
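Even when the glyph won't display, the four byte case can still be checked through bytes and character names. A small sketch using the G clef, a character I picked for illustration:

```python
import unicodedata

clef = '\U0001D11E'
encoded = clef.encode('utf-8')

print(encoded)                 # b'\xf0\x9d\x84\x9e'
print(len(encoded))            # 4
# first byte starts with 11110, the four byte flag
print(bin(encoded[0]))         # 0b11110000
print(unicodedata.name(clef))  # MUSICAL SYMBOL G CLEF
print(hex(ord(encoded.decode('utf-8'))))   # 0x1d11e
```

Nothing here prints the character itself, so it works in a console that has no font for it.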