Saturday, December 5, 2009

Python 3.1.1, Unicode, UTF-8, bytes

A while back I did two posts, one on Lao UTF-8 and one on Arabic UTF-8, using binary representation of bytes to demonstrate UTF-8 encoding in Python.
I still wasn't satisfied with my knowledge of Python 3.x and string handling, nor was the whole UTF-8 thing second nature to me.

On Monday, I'm giving a talk to our local Linux group about the subject of my Python poster session, Python 3.1, Unicode, and Languages of the Indian Ocean Region.  In the immortal words of Dean Wormer, "Fat, drunk and stupid is no way to go through life, son."  You know, it's no way to give a presentation, either.  Time to get smarter now.

The code in this post is nothing but a verbose rehash of what Python's bytes object already does with its decode method.  But understanding exactly what is going on is helpful.  That's the beauty of having an interpreter in Python or LISP - you can check your assumptions and find your mistakes as fast as you can type (it would help if I could type faster).

Basic things I learned in this exercise:

1) whether it is empty, holds one item, or holds many, Python's bytes object is a sequence that must be iterated over or indexed to get byte values (as integers).

2) the bytes object has a decoding method, the string object has an encoding method; the two methods work together to switch back and forth between bytes and strings, or, less appropriately, between bytes and characters.

3) binary math goes way beyond decimal 255.  Being able to read hexadecimal numbers and recognize significant ones is a real help when working with UTF-8 and Unicode.
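
A minimal sketch of the three points above, using only built-ins. The sample string is an arbitrary choice for illustration and does not come from the test file used later.

```python
# Illustrative only: the sample text is an arbitrary choice.
sample = 'caf\u00e9'                 # last character is U+00E9

# 2) str.encode() gives bytes; bytes.decode() gives the str back.
encoded = sample.encode('utf-8')
roundtrip = encoded.decode('utf-8')

# 1) Indexing a bytes object yields an int; slicing yields bytes.
lastbyte = encoded[-1]               # an int
lastslice = encoded[-1:]             # a bytes object of length one
bytevalues = list(encoded)           # iterating yields ints as well

# 3) hex() and bin() make the UTF-8 flag bits visible.
print(encoded)
print(bytevalues)
print(hex(lastbyte), bin(lastbyte))
```

The two-byte encoding of the final character shows up as a first byte starting with bits 110 and a subsequent byte starting with bits 10.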

Enough qualifying and waffling.  To the code and the interpreter!

"""
Utilities for dealing with bytes, UTF-8, and Unicode.
"""

# new in Python 3.1
from collections import OrderedDict

# gates for flag bytes of UTF-8

# 128 - just beyond ASCII
# '0b10000000'
ASCIICUTOFF = 0x80
# 192 - 2 byte designator for UTF-8 character
# '0b11000000'
TWOBYTER = 0xc0
CUTOFFS = OrderedDict()
# 224 - just beyond two byte representation
# '0b11100000'
CUTOFFS['TWOBYTES'] = 0xe0
# 240 - just beyond three byte representation
# '0b11110000'
CUTOFFS['THREEBYTES'] = 0xf0
# 248 - just beyond four byte representation
# '0b11111000'
CUTOFFS['FOURBYTES'] = 0xf8
# 252 - just beyond five byte representation
# '0b11111100'
CUTOFFS['FIVEBYTES'] = 0xfc
# 254 - end of the line
# '0b11111110'
CUTOFFS['SIXBYTES'] = 0xfe

# for binary math
THIRTYTWO = 0x20
THIRTY = 30
SIX = 6
TWO = 0x2
ONE = 0x1

ERRORX = 0, 0
ERRORY = -1

# on subsequent bytes, cannot have second to
# most significant (seventh) bit flipped on
# 63
FORBIDDEN = 0x3f

def interpfirstbyte(firstbyte):
    """
    In a UTF-8 byte sequence,
    function which interprets
    the first byte of a UTF-8
    encoded character.

    Returns 2 tuple of number of
    bytes total in the UTF-8 character
    encoding and value of significant
    bits in first byte of UTF-8
    character encoding.
   
    firstbyte is a bytes object of length one.

    interpfirstbyte(bytes([0xf1])) -> (4, 1)
    """
    twopow = THIRTYTWO
    if firstbyte[0] < ASCIICUTOFF:
        return ONE, firstbyte[0]
    elif firstbyte[0] < TWOBYTER:
        print('Not a valid first byte of a UTF-8 character sequence.')
        return ERRORX
    # characters beyond ASCII range
    counter = TWO
    for cutoff in CUTOFFS:
        if firstbyte[0] < CUTOFFS[cutoff]:
            return (counter, int(firstbyte[0] % (CUTOFFS[cutoff] - twopow)))
        # integer division keeps twopow an int
        twopow //= TWO
        counter += ONE

    print('Not a valid first byte of a UTF-8 character sequence.')
    return ERRORX

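def interpsubsqntbyte(nextbyte):
    """
    In a multiple byte UTF-8 character
    representation, returns significant
    part of byte as an integer.

    nextbyte is a bytes object of length one.

    interpsubsqntbyte(bytes([0x81])) -> 1
    """
    # a subsequent byte must have its most
    # significant bit on (10xxxxxx)
    if nextbyte[0] < ASCIICUTOFF:
        print('Not a valid byte for a UTF-8 multibyte sequence.')
        return ERRORY
    retval = nextbyte[0] % ASCIICUTOFF
    if retval > FORBIDDEN:
        print('Not a valid byte for a UTF-8 multibyte sequence.')
        return ERRORY
    return retval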

class FileByter:
    """
    Attempt to put UTF-8 readable file
    object into a class that processes
    one byte, then one character at a time.
    """
    def __init__(self, filenamex):
        self.filename = filenamex
        self.fle = open(self.filename, 'rb')
        self.currentbyte = None
        self.numbytes = 0
        self.modx = 0
        self.currentcharord = -1
        self.charbytes = b''
    def gimmebyte(self):
        """
        Assigns the next byte in the file
        being read to self.currentbyte.

        Closes file and assigns currentbyte
        value of None at end of file.
        """
        self.currentbyte = self.fle.read(1)
        if len(self.currentbyte) == 0:
            self.fle.close()
            print('Closed file {0}.'.format(self.filename))
            self.currentbyte = None
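    def readchar(self):
        """
        Reads the next UTF-8 character from the file.

        Assigns its bytes to self.charbytes and its
        code point to self.currentcharord.  Always
        returns None.
        """
        counter = 0
        powx = 0
        self.charbytes = b''
        self.gimmebyte()
        if self.currentbyte:
            self.numbytes, self.modx = interpfirstbyte(self.currentbyte)
            if (self.numbytes, self.modx) == ERRORX:
                print('Not on first byte of UTF-8 char sequence.  Try again.')
                return None
            self.charbytes += self.currentbyte
            if self.numbytes == ONE:
                # ASCII - the byte value is the code point
                self.currentcharord = self.modx
                return None
            counter = self.numbytes
            # bits of the first byte get shifted six
            # places for each subsequent byte
            powx = THIRTY - (SIX - counter) * SIX
            self.currentcharord = self.modx * TWO ** powx
            powx -= SIX
            while counter > ONE:
                self.gimmebyte()
                if self.currentbyte is None:
                    print('File ended in the middle of a UTF-8 sequence.')
                    return None
                self.charbytes += self.currentbyte
                self.modx = interpsubsqntbyte(self.currentbyte)
                if self.modx == ERRORY:
                    print('Invalid subsequent byte in UTF-8 sequence.')
                    return None
                self.currentcharord += self.modx * TWO ** powx
                counter -= ONE
                powx -= SIX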
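
For comparison, here is a sketch of the same decoding done with bit masks and shifts instead of modulo and powers of two, cross-checked against the built-in decoder. The function name decode_one and the sample characters are hypothetical, chosen for illustration. It stops at four bytes, the limit in current UTF-8 (RFC 3629), and it makes no attempt to reject overlong encodings or surrogates.

```python
def decode_one(data):
    """
    Decode the first UTF-8 character in a bytes object.

    Returns a 2 tuple of (code point, number of bytes used).
    Hypothetical helper for illustration, not part of the module above.
    """
    first = data[0]
    if first < 0x80:
        return first, 1
    # (mask that isolates the flag bits, expected flag value, total bytes)
    for mask, flag, length in ((0xe0, 0xc0, 2), (0xf0, 0xe0, 3), (0xf8, 0xf0, 4)):
        if first & mask == flag:
            break
    else:
        raise ValueError('invalid first byte: {0:#x}'.format(first))
    if len(data) < length:
        raise ValueError('sequence is cut short')
    # whatever is left of the first byte after the flag bits is payload
    codepoint = first & ~mask & 0xff
    for nextbyte in data[1:length]:
        if nextbyte & 0xc0 != 0x80:
            raise ValueError('invalid subsequent byte: {0:#x}'.format(nextbyte))
        # make room for six more bits, then add them
        codepoint = (codepoint << 6) | (nextbyte & 0x3f)
    return codepoint, length


# cross-check against the built-in decoder; sample characters are arbitrary
for text in ('a', '\u0639', '\u0ea5', '\u65e5', '\U0001d11e'):
    raw = text.encode('utf-8')
    codepoint, length = decode_one(raw)
    assert chr(codepoint) == raw.decode('utf-8')
    print(raw, length, hex(codepoint))
```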



I'll work a little with this file of some foreign language strings in the screenshot below:

Python 3.1.1 (r311:74480, Nov 29 2009, 22:24:25)
[GCC 4.2.1 20070719  [FreeBSD]] on freebsd7
Type "copyright", "credits" or "license()" for more information.
>>> import sys
>>> sys.path += ['/usr/home/carl/pythonblog']
>>> import handlebytes
>>> fbx = handlebytes.FileByter('/usr/home/carl/pythonblog/foreignbytestest')
>>> fbx.readchar()
>>> fbx.charbytes
b'\xe6\x97\xa5'
>>> bin(0xe6)
'0b11100110'
>>> # three bytes in the character
>>> somejapanesecharacter = fbx.charbytes.decode('UTF-8')
>>> import unicodedata
>>> unicodedata.name(somejapanesecharacter)
'CJK UNIFIED IDEOGRAPH-65E5'
>>> # OK - let's try one byte at a time
>>> fbx.gimmebyte()
>>> fbx.currentbyte
b'\xe6'
>>> bin(0xe6)
'0b11100110'
>>> # 3 bytes again
>>> fbx.gimmebyte()
>>> fbx.currentbyte
b'\x9c'
>>> bin(0x9c)
'0b10011100'
>>> # first two bits are 10 as expected
>>> fbx.readchar()
Not a valid first byte of a UTF-8 character sequence.
>>> # third byte - can't have 10 bits leading UTF-8 first byte
>>> fbx.readchar()
>>> fbx.charbytes
b'\xe8\xaa\x9e'
>>> # OK - on to an ASCII char
>>> fbx.readchar()
>>> fbx.charbytes
b'\n'
>>> fbx.readchar()
>>> fbx.charbytes
b'a'
>>> # ASCII characters show up as themselves, not as hex numbers.
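
The 65E5 reported by unicodedata.name for the first character can be rebuilt by hand from the three bytes in the session. A sketch of that arithmetic follows; the bit masks are shorthand added here for illustration, where the module itself uses modulo.

```python
import unicodedata

charbytes = b'\xe6\x97\xa5'          # first character read in the session

# Strip the flag bits: 1110xxxx on the first byte, 10xxxxxx on the rest.
payloads = [charbytes[0] & 0x0f, charbytes[1] & 0x3f, charbytes[2] & 0x3f]

# Each subsequent byte carries six bits, so the weights are 2**12, 2**6, 2**0.
codepoint = payloads[0] * 2 ** 12 + payloads[1] * 2 ** 6 + payloads[2]

print([bin(p) for p in payloads])
print(hex(codepoint))
print(unicodedata.name(chr(codepoint)))
```

The payload values come out as 6, 23 and 37, and 6 * 4096 + 23 * 64 + 37 is 26085, which is 0x65e5.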


That's a lot of activity just to understand a few simple concepts.  It helps me.  Hopefully someone else can benefit from it too.

Final notes:

1) I like the collections.OrderedDict object; it saves me having to work with a sorting key and an associated list.  I believe Ronacher and Hettinger were responsible for getting it into the standard lib.  Thanks, here's to people with German names from outside of Germany (Austria and America respectively).

2) The only characters I'm familiar with that have a four byte representation in UTF-8 are musical notes.  They don't show up in IDLE, so I let that part of the testing go.  Everything here is three bytes or less.
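
The four byte case can still be checked without a font that displays it. A sketch using U+1D11E (MUSICAL SYMBOL G CLEF), which is an assumed example; which musical characters were originally meant is not recorded here.

```python
import unicodedata

clef = '\U0001d11e'                  # assumed example character
clefbytes = clef.encode('utf-8')

print(unicodedata.name(clef))
print(clefbytes)
print(bin(clefbytes[0]))             # leading bits 11110 flag four bytes

# Rebuild the code point: three payload bits from the first byte,
# then six bits from each of the three subsequent bytes.
codepoint = clefbytes[0] & 0x07
for nextbyte in clefbytes[1:]:
    codepoint = codepoint * 2 ** 6 + (nextbyte & 0x3f)
print(hex(codepoint))
```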
