pyright: Handling UnicodeEncodeError in the Console (Python 3.1)

I've been working with a lot of different foreign scripts for the past six months or so. Ideally I like to work in the console where possible. An error that always comes up is the following:

[carl@pcbsd]/home/carl(139)% python3.1
Python 3.1.1 (r311:74480, Jan 17 2010, 23:15:26)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u0400')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u0400' in position 0: ordinal not in range(128)
>>>
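
The 'ascii' in that message is the encoding Python picked for standard output, which it takes from the environment. As a rough sketch (the helper name encode_error is made up for illustration, and an in-memory stream stands in for the real console), the same failure can be reproduced on any machine:

```python
import io

def encode_error(text, encoding):
    """Return the UnicodeEncodeError message from printing text, or None."""
    # An in-memory text stream that encodes the way a console
    # using the given encoding would.
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        print(text, file=stream)
    except UnicodeEncodeError as exc:
        return str(exc)
    return None

print(encode_error('\u0400', 'ascii'))   # the error message from the session above
print(encode_error('\u0400', 'utf-8'))   # None: UTF-8 can encode the character
```

Checking sys.stdout.encoding in the interpreter shows which encoding the real console was given.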

After a while this can get pretty annoying. There are a number of ways to get around the problem. I don't know much about most of the languages I'm dealing with, so I prefer the capitalized ASCII descriptions from the Unicode code charts to glyphs or empty boxes. Fortunately the unicodedata module has all this information available.
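
As a small illustration of what unicodedata offers, here is a sketch that looks up the name of each character in a string. The function name char_names and the 'U+XXXX' fallback for unnamed characters are my own choices for the example, not part of the module:

```python
import unicodedata

def char_names(text):
    """Return the Unicode name of each character in text."""
    # unicodedata.name() raises ValueError for characters without a name
    # (control characters, for example) unless a default is supplied.
    return [unicodedata.name(ch, 'U+%04X' % ord(ch)) for ch in text]

for name in char_names('\u0401\u0402'):
    print(name)
```

The names are plain ASCII, so they print on any console.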

To get the output I wanted I came up with a little script:

# mockprint.py - wrapper around the print
#                function to handle
#                UnicodeEncodeError

# python 3.1

import unicodedata

# Word positions in str(UnicodeEncodeError) after splitting on spaces.
# One character:    "... encode character '\u0400' in position 0: ordinal ..."
# Several in a row: "... encode characters in position 5-8: ordinal ..."
ERRORSTR = "'ascii' codec can't encode character "
CHARIDX = 5      # escaped character (single-character message)
POSITIDX = 8     # position (single-character message)
POSITIDX2 = 7    # position range (multi-character message)

def mockprint(stringx):
    """
    Wrapper for print() function that
    replaces unprintable characters
    with their Unicode names.
    """
    try:
        print(stringx)
    except UnicodeEncodeError as e:
        # main cases:
        #    1) one character can't be printed
        #    2) multiple characters in a row can't be printed
        #    3) unicode character is first or last in string
        #    4) other ascii characters surround the unicode ones
        reasonx = str(e)
        reasonx = reasonx.split(' ')
        idx = reasonx[POSITIDX]
        # more than 1 char in a row can't be printed
        if idx == 'ordinal':
            # position looks like "5-8:"; take the whole start index,
            # not just its first digit
            idx = int(reasonx[POSITIDX2].split('-')[0])
            if idx != 0:
                print(stringx[:idx])
            print(unicodedata.name(stringx[idx]))
            mockprint(stringx[(idx + 1):])
        # offending character shows up after ascii chars
        elif len(stringx) > 1:
            # position looks like "5:"
            idx = int(idx.rstrip(':'))
            # print the ascii chars that come before the offender
            if idx != 0:
                print(stringx[:idx])
            charx = int(reasonx[CHARIDX][3:-1], 16)
            charx = chr(charx)
            print(unicodedata.name(charx))
            mockprint(stringx[(idx + 1):])
        # end of the line
        elif len(stringx) == 1:
            charx = int(reasonx[CHARIDX][3:-1], 16)
            charx = chr(charx)
            print(unicodedata.name(charx))

A quick demo:

>>> import mockprint
>>> mockprint.mockprint('hello\u0401\u0402\u0403\u0404world')
hello
CYRILLIC CAPITAL LETTER IO
CYRILLIC CAPITAL LETTER DJE
CYRILLIC CAPITAL LETTER GJE
CYRILLIC CAPITAL LETTER UKRAINIAN IE
world

And something a bit more challenging: a file with a few foreign words in a number of different languages.

>>> fle = open('/home/carl/pythonblog/foreignbytestest', 'rt', encoding = 'UTF-8')
>>> import mockprint
>>> for linex in fle.readlines():
...     mockprint.mockprint(linex)
...
CJK UNIFIED IDEOGRAPH-65E5
CJK UNIFIED IDEOGRAPH-672C
CJK UNIFIED IDEOGRAPH-8A9E

abcde

ETHIOPIC SYLLABLE GLOTTAL A
ETHIOPIC SYLLABLE MAA
ETHIOPIC SYLLABLE RE
ETHIOPIC SYLLABLE NYAA

ARMENIAN CAPITAL LETTER HO
ARMENIAN SMALL LETTER AYB
ARMENIAN SMALL LETTER YI
ARMENIAN SMALL LETTER ECH
ARMENIAN SMALL LETTER REH
ARMENIAN SMALL LETTER ECH
ARMENIAN SMALL LETTER NOW

ORIYA LETTER O
ORIYA LETTER DDA
ORIYA SIGN NUKTA
ORIYA VOWEL SIGN I
ORIYA LETTER AA

LAO LETTER PHO TAM
LAO VOWEL SIGN AA
LAO LETTER SO SUNG
LAO VOWEL SIGN AA
LAO LETTER LO LOOT
LAO VOWEL SIGN AA
LAO LETTER WO

CYRILLIC SMALL LETTER ER
CYRILLIC SMALL LETTER U
CYRILLIC SMALL LETTER ES
CYRILLIC SMALL LETTER ES
CYRILLIC SMALL LETTER KA
CYRILLIC SMALL LETTER I
CYRILLIC SMALL LETTER SHORT I

CYRILLIC SMALL LETTER YA
CYRILLIC SMALL LETTER ZE
CYRILLIC SMALL LETTER YERU
CYRILLIC SMALL LETTER KA

Well, if that isn't beautiful, I don't know what is.

Seriously, this is a hack - parsing an error string and working backwards? I've got to be joking. Actually, no. For as much time as I've spent remembering after the fact that I can't print Unicode in the console, this is worth it, even if it's only good for Python 3.1.
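
For what it's worth, the exception object also carries the failing span as attributes (start and end), so the message does not strictly have to be parsed. The following is a minimal sketch of that alternative, not the script above: the function name name_lines, the explicit encoding argument, and the choice to return a list of lines instead of printing are all assumptions made for illustration.

```python
import unicodedata

def name_lines(text, encoding='ascii'):
    """Return output lines: encodable runs as-is, other characters by name."""
    lines = []
    while text:
        try:
            text.encode(encoding)
        except UnicodeEncodeError as exc:
            # exc.start and exc.end bound the span that failed to encode.
            if exc.start > 0:
                lines.append(text[:exc.start])
            for ch in text[exc.start:exc.end]:
                # Fall back to a code point label for unnamed characters.
                lines.append(unicodedata.name(ch, 'U+%04X' % ord(ch)))
            text = text[exc.end:]
        else:
            # Everything that is left can be encoded.
            lines.append(text)
            break
    return lines

for line in name_lines('hello\u0401\u0402\u0403\u0404world'):
    print(line)
```

Passing sys.stdout.encoding instead of 'ascii' would follow whatever the console is actually using.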

2 comments:

Brandon Rhodes, February 8, 2010 at 6:28 AM
What terminal are you using these days that is not Unicode-capable? You might, through a judicious adjustment in its options, be able to simply see this in your console, as I do in mine:

guinness$ python3.1
Python 3.1.1+ (r311:74480, Nov 2 2009, 14:49:22)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u0400')
Ѐ
>>>

Carl Trachte, February 8, 2010 at 7:04 AM
@Brandon Craig Rhodes, thanks. Linux does have better support for this.

I'm using FreeBSD, which does not have genuine UTF-8 support for the console (raw screen command line outside of X).

There is a thread here that addresses some workarounds for FreeBSD: http://forums.freebsd.org/showthread.php?t=311. I don't know if it's possible to get Asian characters to show up in the console through any of these, or similar methods.

I have a Suse 11.2 install on one of my old Dell towers at home. At some point I will give it a try there. In the meantime, I like seeing the names of the characters.

See you at Pycon!

Carl T.

Sunday, February 7, 2010