Wednesday, February 24, 2010

Unicode Poster From Pycon 2010 Up

I posted my poster to Slideshare in OpenOffice presentation format.  The file is about 13 or 14 MB in size.  The embedded poster is in PNG format to preserve the shape of the foreign glyphs.

A number of people asked me at the poster session for sources.  Here are the main ones:

1) Wikipedia - perhaps not the ultimate authority on all things, but a good place to research foreign languages and scripts.

2) the O'Reilly book Fonts and Encodings by Haralambous.  If you know little about Unicode and fonts, this is the next best thing to Knuth.

3) the Python 3.1 interpreter and the unicodedata module.  Once you get the basics of Unicode down, the unicodedata module has most of what you'll need.

4) Google searches and language promotion websites - laoconnection is a site that comes to mind.  Most people are proud of their languages and culture and want to share them.
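As a small illustration of source 3, here is a minimal sketch of the kind of lookup the unicodedata module makes possible.  The helper name describe is made up for this example; unicodedata.name, unicodedata.category and unicodedata.lookup are the standard library calls.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    # describe() is a hypothetical helper, not part of unicodedata.
    return ('U+%04X' % ord(ch), unicodedata.name(ch), unicodedata.category(ch))


print(describe('\u0401'))
# lookup() goes the other way: from the name back to the character.
print(unicodedata.lookup('CYRILLIC CAPITAL LETTER IO') == '\u0401')
```

Note that unicodedata.name raises ValueError for characters with no name unless a default is passed as the second argument.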

Thanks to everyone who stopped by the poster.  That was fun.

Tuesday, February 23, 2010

FOSS Conference Economics

Just got back from Pycon - great show.

I've had some time to reflect on how to make conference going affordable, and where my money goes.  This year I was partially funded by my employer.  I was quite grateful, as I wasn't expecting anything.

What has concerned me in the past is the amount of money put out on travel and hotels.  If you're inside the US, the biggest chunk of the money you spend on Pycon(US) will go to the hotel.  This is where people (or at least I) say, "Hey, wait a second, all my monetary support for the Open Source Software movement is going to the hotel industry!"  Not so fast - although your money doesn't support FOSS directly, it does keep the conference from *losing* money.  To secure a hotel/convention facility for more than 1000 people, the organizers have to commit to a certain number of rooms.  I've seen other devs stay at cheaper hotels for conferences - this is a good approach, if it's done out of necessity.  I generally try to stay at the conference hotel in order to support the continued success of the conference - to make sure it doesn't lose money.

The travel argument goes roughly the same way - you can't have a conference if people don't show.  Even though most of your money is going to the airlines (in the case of Pycon(US) for those outside the United States), your attendance is a plus.

Sunday, February 7, 2010

Handling UnicodeEncodeError in the Console (Python 3.1)

I've been working with a lot of different foreign scripts for the past six months or so.  Ideally I like to work in the console where possible.  An error that always comes up is the following:

[carl@pcbsd]/home/carl(139)% python3.1
Python 3.1.1 (r311:74480, Jan 17 2010, 23:15:26)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u0400')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u0400' in position 0: ordinal not in range(128)
>>>


After a while this can get pretty annoying.  There are a number of ways to get around the problem.  I don't know much about most of the languages I'm dealing with, so I prefer the Unicode code charts' capitalized ASCII descriptions to glyphs or empty boxes.  Fortunately the unicodedata module has all this information available.

To get the output I wanted I came up with a little script:

# mockprint.py - wrapper around print
# function to handle
# UnicodeEncodeError


# python 3.1


import unicodedata


ERRORSTR = "'ascii' codec can't encode character "
CHARIDX = 5
POSITIDX = 8
POSITIDX2 = 7


def mockprint(stringx):
    """
    Wrapper for print() function that
    replaces unprintable characters
    with their Unicode names.
    """
    try:
        print(stringx)
    except UnicodeEncodeError as e:
        # main cases:
        # 1) one character can't be printed
        # 2) multiple characters in a row can't be printed
        # 3) unicode character is first or last in string
        # 4) other ascii characters surround the unicode ones
        reasonx = str(e)
        reasonx = reasonx.split(' ')
        idx = reasonx[POSITIDX]
        # more than 1 char in a row can't be printed
        if idx == 'ordinal':
            # position field looks like '5-8:' - keep the start index
            idx = int(reasonx[POSITIDX2].split('-')[0])
            if idx != 0:
                print(stringx[:idx])
            print(unicodedata.name(stringx[idx]))
            mockprint(stringx[(idx + 1):])
        # single offending character with more string around it
        elif len(stringx) > 1:
            # position field looks like '5:'
            idx = int(idx.rstrip(':'))
            # print any ascii chars that come before the offender
            if idx != 0:
                print(stringx[:idx])
            charx = int(reasonx[CHARIDX][3:-1], 16)
            charx = chr(charx)
            print(unicodedata.name(charx))
            mockprint(stringx[(idx + 1):])
        # end of the line
        elif len(stringx) == 1:
            charx = int(reasonx[CHARIDX][3:-1], 16)
            charx = chr(charx)
            print(unicodedata.name(charx))

A quick demo:

>>> import mockprint
>>> mockprint.mockprint('hello\u0401\u0402\u0403\u0404world')
hello
CYRILLIC CAPITAL LETTER IO
CYRILLIC CAPITAL LETTER DJE
CYRILLIC CAPITAL LETTER GJE
CYRILLIC CAPITAL LETTER UKRAINIAN IE
world

And something a bit more challenging:

 

A few foreign words in a number of different languages.

>>> fle = open('/home/carl/pythonblog/foreignbytestest', 'rt', encoding = 'UTF-8')
>>> import mockprint
>>> for linex in fle.readlines():
...     mockprint.mockprint(linex)
...
CJK UNIFIED IDEOGRAPH-65E5
CJK UNIFIED IDEOGRAPH-672C
CJK UNIFIED IDEOGRAPH-8A9E




abcde


ETHIOPIC SYLLABLE GLOTTAL A
ETHIOPIC SYLLABLE MAA
ETHIOPIC SYLLABLE RE
ETHIOPIC SYLLABLE NYAA




ARMENIAN CAPITAL LETTER HO
ARMENIAN SMALL LETTER AYB
ARMENIAN SMALL LETTER YI
ARMENIAN SMALL LETTER ECH
ARMENIAN SMALL LETTER REH
ARMENIAN SMALL LETTER ECH
ARMENIAN SMALL LETTER NOW




ORIYA LETTER O
ORIYA LETTER DDA
ORIYA SIGN NUKTA
ORIYA VOWEL SIGN I
ORIYA LETTER AA




LAO LETTER PHO TAM
LAO VOWEL SIGN AA
LAO LETTER SO SUNG
LAO VOWEL SIGN AA
LAO LETTER LO LOOT
LAO VOWEL SIGN AA
LAO LETTER WO




CYRILLIC SMALL LETTER ER
CYRILLIC SMALL LETTER U
CYRILLIC SMALL LETTER ES
CYRILLIC SMALL LETTER ES
CYRILLIC SMALL LETTER KA
CYRILLIC SMALL LETTER I
CYRILLIC SMALL LETTER SHORT I


CYRILLIC SMALL LETTER YA
CYRILLIC SMALL LETTER ZE
CYRILLIC SMALL LETTER YERU
CYRILLIC SMALL LETTER KA

Well, if that isn't beautiful, I don't know what is.

Seriously, this is a hack - parsing an error string and working backwards?  I've got to be joking.  Actually, no.  For as much time as I've spent remembering after the fact that I can't print Unicode in the console, this is worth it, even if it's only good for Python 3.1.
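For comparison, here is a sketch of a less brittle variation that reads the start and end attributes of the UnicodeEncodeError instead of splitting its message.  The function name names_for_unencodable and the default 'ascii' encoding are assumptions made for this example, and it returns the pieces rather than printing them, so it is not a drop-in replacement for mockprint.

```python
import unicodedata


def names_for_unencodable(text, encoding='ascii'):
    """Split text into pieces, swapping unencodable chars for their names."""
    pieces = []
    while text:
        try:
            text.encode(encoding)
        except UnicodeEncodeError as e:
            # e.start and e.end bracket the offending run of characters.
            if e.start:
                pieces.append(text[:e.start])
            for ch in text[e.start:e.end]:
                # Fall back to the code point if the character has no name.
                pieces.append(unicodedata.name(ch, 'U+%04X' % ord(ch)))
            text = text[e.end:]
        else:
            # Everything left encodes cleanly.
            pieces.append(text)
            text = ''
    return pieces


for piece in names_for_unencodable('hello\u0401\u0402world'):
    print(piece)
```

Passing sys.stdout.encoding as the encoding argument would tie the check to whatever the console actually claims to support.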
  

Tuesday, February 2, 2010

py-openbsd's DoubleAssociation

I briefly covered this structure last time, but didn't do it justice.  The idea of a two-way dictionary structure (keys and values are both keys) intrigued me. I wanted to give it a spin with a real world example.

I've chosen a simple example with a few domain names (common names) and IP addresses:

# dblassoc.py

import openbsd

# some ip addresses paired with domains
ips = {'google':(0x4a7d1393, 0xd8ef3d68),
       'openbsd':(0x8ef40c2a,),
       'freebsd':(0x45935321,),
       'yahoo':(0xd1bf5d34, 0xd183249e)}

# OK, we can now make the DoubleAssociation
ipsbothways = openbsd.utils.DoubleAssociation(ips)

print "ipsbothways['yahoo'] = " + str(ipsbothways['yahoo'])

# fair enough, but nothing we couldn't get from the dictionary

# try to query on an ip address to get a domain name
print "ipsbothways[(0x8ef40c2a,)] = " + ipsbothways[(0x8ef40c2a,)]

# unlike a normal dictionary, DoubleAssociation gives everything
# back with the keys() method

for keyx in ipsbothways.keys():
    try:
        if 0xd183249e in keyx:
            print "domain is " + ipsbothways[keyx]
            break
    except TypeError:
        print "TypeError:  " + keyx

Python 2.5.4 (r254:67916, Jul  1 2009, 11:37:21)
[GCC 3.3.5 (propolice)] on openbsd4
Type "help", "copyright", "credits" or "license" for more information.
>>> import dblassoc
ipsbothways['yahoo'] = (3518979380L, 3515032734L)
ipsbothways[(0x8ef40c2a,)] = openbsd
TypeError:  google
TypeError:  openbsd
TypeError:  yahoo
domain is yahoo
>>>

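The one thing you have to look out for is that everything in the structure is treated as a key.  The keys() loop hands back the domain name strings as well as the address tuples, and testing an integer for membership in a string raises a TypeError - that's why I had to catch it.  Everything is a value, too.  The values and keys methods yield the same results.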

In real life, if you had 30 or 50 or 1000 IP addresses, this would come in handy.  Likewise for doctor-patient records, etc. (although each grouping of patients would have to be unique, so that may not work after all - best to test both "sides" of the structure for exclusivity).
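The two-way idea and the exclusivity check can be sketched without py-openbsd.  The TwoWayDict class below is a hypothetical stand-in written for this example, not the library's DoubleAssociation implementation, and it uses Python 3 syntax rather than the Python 2.5 of the script above.

```python
class TwoWayDict(dict):
    """Hypothetical stand-in: every key is also a value, and vice versa."""

    def __init__(self, pairs):
        super().__init__()
        for key, value in pairs.items():
            # Both "sides" must be exclusive, or a lookup would be ambiguous.
            if key in self or value in self:
                raise ValueError('duplicate entry: %r / %r' % (key, value))
            dict.__setitem__(self, key, value)
            dict.__setitem__(self, value, key)


ips = {'openbsd': (0x8ef40c2a,),
       'yahoo': (0xd1bf5d34, 0xd183249e)}
both = TwoWayDict(ips)
print(both['yahoo'])
print(both[(0x8ef40c2a,)])
```

Values have to be hashable (tuples rather than lists) because each one is stored as a key as well.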