Sunday, February 7, 2010

Handling UnicodeEncodeError in the Console (Python 3.1)

I've been working with a lot of different foreign scripts for the past six months or so. Where possible, I prefer to work in the console. One error comes up again and again:

[carl@pcbsd]/home/carl(139)% python3.1
Python 3.1.1 (r311:74480, Jan 17 2010, 23:15:26)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u0400')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u0400' in position 0: ordinal not in range(128)
>>>


After a while this can get pretty annoying. There are a number of ways to get around the problem. I don't know much about most of the languages I'm dealing with, so I prefer the Unicode code charts' capitalized ASCII descriptions to glyphs or empty boxes. Fortunately the unicodedata module has all this information available.
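
As a small illustration of that last point, the name lookup is a single call to unicodedata.name(). This is only a sketch: the helper name `describe` and the `U+XXXX` fallback for code points that have no name are assumptions made for illustration.

```python
import unicodedata


def describe(text):
    """Return the Unicode name of each character in text.

    Hypothetical helper for illustration. Code points without a
    name (some control characters, for instance) fall back to a
    U+XXXX label instead of raising ValueError.
    """
    return [unicodedata.name(ch, 'U+%04X' % ord(ch)) for ch in text]


for name in describe('\u0401\u0402'):
    print(name)
```

The names come back as plain ASCII, so they print on any console.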

To get the output I wanted I came up with a little script:

# mockprint.py - wrapper around the print()
# function that handles
# UnicodeEncodeError


# python 3.1


import unicodedata


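# Start of the message being parsed; kept for reference only.
ERRORSTR = "'ascii' codec can't encode character "
# Indexes into the error message after splitting it on spaces.
CHARIDX = 5    # the escaped character, single-character message
POSITIDX = 8   # position such as "0:", or "ordinal" for a run
POSITIDX2 = 7  # position range such as "5-8:" for a run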


def mockprint(stringx):
    """
    Wrapper for print() function that
    replaces unprintable characters
    with their Unicode names.
    """
    try:
        print(stringx)
    except UnicodeEncodeError as e:
        # main cases:
        # 1) one character can't be printed
        # 2) multiple characters in a row can't be printed
        # 3) unicode character is first or last in string
        # 4) other ascii characters surround the unicode ones
        reasonx = str(e)
        reasonx = reasonx.split(' ')
        idx = reasonx[POSITIDX]
        # more than 1 char in a row can't be printed;
        # message reads "... characters in position 5-8: ordinal ..."
        if idx == 'ordinal':
            # keep every digit of the start position, not just the first
            idx = int(reasonx[POSITIDX2].rstrip(':').split('-')[0])
            if idx != 0:
                print(stringx[:idx])
            print(unicodedata.name(stringx[idx]))
            mockprint(stringx[(idx + 1):])
        # a single offending character;
        # message reads "... character '\u0401' in position 5: ordinal ..."
        else:
            idx = int(idx.rstrip(':'))
            # print any ascii characters that precede the offender
            if idx != 0:
                print(stringx[:idx])
            charx = int(reasonx[CHARIDX][3:-1], 16)
            charx = chr(charx)
            print(unicodedata.name(charx))
            # recurse only if something follows the offender
            if len(stringx) > idx + 1:
                mockprint(stringx[(idx + 1):])
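
The index constants are easier to follow next to the two shapes the message can take: one for a single character and one for a run of characters. The sketch below splits both messages the same way the script does. The helper name `error_words` is hypothetical, and the message wording is that of CPython's ASCII codec.

```python
def error_words(text):
    """Return the ASCII codec's error message for text, split on spaces.

    Hypothetical helper for illustration; returns an empty list when
    the text encodes cleanly.
    """
    try:
        text.encode('ascii')
    except UnicodeEncodeError as e:
        return str(e).split(' ')
    return []


single = error_words('hello\u0401')
run = error_words('hello\u0401\u0402\u0403\u0404world')

# Single character: word 5 is the escaped character, word 8 the position.
print(single[5], single[8])
# Run of characters: word 7 is the position range, word 8 is "ordinal".
print(run[7], run[8])
```

The word at index 8 is what tells the two cases apart, which is why the script compares it with 'ordinal'.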

A quick demo:

>>> import mockprint
>>> mockprint.mockprint('hello\u0401\u0402\u0403\u0404world')
hello
CYRILLIC CAPITAL LETTER IO
CYRILLIC CAPITAL LETTER DJE
CYRILLIC CAPITAL LETTER GJE
CYRILLIC CAPITAL LETTER UKRAINIAN IE
world

And something a bit more challenging: a file with a few foreign words in a number of different languages.

>>> fle = open('/home/carl/pythonblog/foreignbytestest', 'rt', encoding = 'UTF-8')
>>> import mockprint
>>> for linex in fle.readlines():
...     mockprint.mockprint(linex)
...
CJK UNIFIED IDEOGRAPH-65E5
CJK UNIFIED IDEOGRAPH-672C
CJK UNIFIED IDEOGRAPH-8A9E




abcde


ETHIOPIC SYLLABLE GLOTTAL A
ETHIOPIC SYLLABLE MAA
ETHIOPIC SYLLABLE RE
ETHIOPIC SYLLABLE NYAA




ARMENIAN CAPITAL LETTER HO
ARMENIAN SMALL LETTER AYB
ARMENIAN SMALL LETTER YI
ARMENIAN SMALL LETTER ECH
ARMENIAN SMALL LETTER REH
ARMENIAN SMALL LETTER ECH
ARMENIAN SMALL LETTER NOW




ORIYA LETTER O
ORIYA LETTER DDA
ORIYA SIGN NUKTA
ORIYA VOWEL SIGN I
ORIYA LETTER AA




LAO LETTER PHO TAM
LAO VOWEL SIGN AA
LAO LETTER SO SUNG
LAO VOWEL SIGN AA
LAO LETTER LO LOOT
LAO VOWEL SIGN AA
LAO LETTER WO




CYRILLIC SMALL LETTER ER
CYRILLIC SMALL LETTER U
CYRILLIC SMALL LETTER ES
CYRILLIC SMALL LETTER ES
CYRILLIC SMALL LETTER KA
CYRILLIC SMALL LETTER I
CYRILLIC SMALL LETTER SHORT I


CYRILLIC SMALL LETTER YA
CYRILLIC SMALL LETTER ZE
CYRILLIC SMALL LETTER YERU
CYRILLIC SMALL LETTER KA

Well, if that isn't beautiful, I don't know what is.

Seriously, this is a hack - parsing an error string and working backwards?  I've got to be joking.  Actually, no.  Given how much time I've lost remembering, after the fact, that I can't print Unicode in the console, this is worth it, even if it's only good for Python 3.1.
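
For what it's worth, UnicodeEncodeError also carries the failing slice as attributes (`start`, `end` and `object`), so similar output can be produced without reading the message text. The sketch below is a swapped-in technique, not the script above. The name `nameprint`, the `stream` parameter and the `U+XXXX` fallback are assumptions for illustration, and unlike print() it writes nothing for an empty string.

```python
import io
import sys
import unicodedata


def nameprint(text, stream=None):
    """Hypothetical variant of mockprint() that reads the failing
    slice from the exception's attributes instead of its message."""
    if stream is None:
        stream = sys.stdout
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    while text:
        try:
            text.encode(encoding)
        except UnicodeEncodeError as e:
            # e.start and e.end delimit the unencodable slice of text.
            if e.start:
                print(text[:e.start], file=stream)
            for ch in text[e.start:e.end]:
                print(unicodedata.name(ch, 'U+%04X' % ord(ch)), file=stream)
            text = text[e.end:]
        else:
            print(text, file=stream)
            break


# Demo against an ASCII-only in-memory stream, so it behaves the same
# way on a Unicode-capable terminal.
demo = io.TextIOWrapper(io.BytesIO(), encoding='ascii', newline='\n')
nameprint('hello\u0401\u0402world', demo)
demo.flush()
print(demo.buffer.getvalue().decode('ascii'), end='')
```

Trying the encode up front means nothing is written until the text is known to fit the stream's encoding, which keeps the output in order.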
  

2 comments:

  1. What terminal are you using these days that is not Unicode-capable? You might, through a judicious adjustment in its options, be able to simply see this in your console, as I do in mine:

    guinness$ python3.1
    Python 3.1.1+ (r311:74480, Nov 2 2009, 14:49:22)
    [GCC 4.4.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print('\u0400')
    Ѐ
    >>>

  2. @Brandon Craig Rhodes, thanks. Linux does have better support for this.

    I'm using FreeBSD, which does not have genuine UTF-8 support for the console (raw screen command line outside of X).

    There is a thread here that addresses some workarounds for FreeBSD: http://forums.freebsd.org/showthread.php?t=311. I don't know if it's possible to get Asian characters to show up in the console through any of these, or similar methods.

    I have a Suse 11.2 install on one of my old Dell towers at home. At some point I will give it a try there. In the meantime, I like seeing the names of the characters.

    See you at Pycon!

    Carl T.
