I posted my poster to Slideshare in OpenOffice presentation format. The file is about 13 or 14 MB. The embedded poster is in PNG format to preserve the shape of the foreign glyphs.
A number of people asked me at the poster session for sources. Here are the main ones:
1) Wikipedia - perhaps not the ultimate authority on all things, but a good place to research foreign languages and scripts.
2) The O'Reilly book Fonts and Encodings by Haralambous. If you know little about Unicode and fonts, this is the next best thing to Knuth.
3) The Python 3.1 interpreter and the unicodedata module. Once you get the basics of Unicode down, the unicodedata module has most of what you'll need.
4) Google searches and language promotion websites - laoconnection is a site that comes to mind. Most people are proud of their languages and culture and want to share them.
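For anyone who hasn't tried the unicodedata module from item 3, here is a small taste. This is only a sketch of three of its lookups (name, category and lookup), using a Cyrillic letter as an arbitrary sample character:

```python
import unicodedata

ch = '\u0401'  # arbitrary sample character

# name() gives the capitalized description from the Unicode code charts
ch_name = unicodedata.name(ch)
# category() gives the two-letter general category ('Lu' = uppercase letter)
ch_category = unicodedata.category(ch)
# lookup() goes the other way, from a name back to the character
ch_again = unicodedata.lookup('CYRILLIC CAPITAL LETTER IO')

print(ch_name)         # CYRILLIC CAPITAL LETTER IO
print(ch_category)     # Lu
print(ch_again == ch)  # True
```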
Thanks to everyone who stopped by the poster. That was fun.
Wednesday, February 24, 2010
Tuesday, February 23, 2010
FOSS Conference Economics
Just got back from PyCon - great show.
I've had some time to reflect on how to make conference-going affordable, and where my money goes. This year I was partially funded by my employer. I was quite grateful, as I wasn't expecting anything.
What has concerned me in the past is the amount of money spent on travel and hotels. If you're inside the US, the biggest chunk of your PyCon (US) money goes to the hotel. This is where people (or at least I) say, "Hey, wait a second, all my monetary support for the Open Source Software movement is going to the hotel industry!" Not so fast - although your money doesn't support FOSS directly, it does keep the conference from *losing* money. To secure a hotel/convention facility for more than 1000 people, the conference has to make a commitment on rooms. I've seen other devs stay at cheaper hotels for conferences - this is a good approach, if it's done out of necessity. I generally try to stay at the conference hotel in order to support the continued success of the conference - to make sure the conference doesn't lose money.
The travel argument goes roughly the same way - you can't have a conference if people don't show. Even though most of your money is going to the airlines (in the case of PyCon (US) for those outside the United States), your attendance is a plus.
Sunday, February 7, 2010
Handling UnicodeEncodeError in the Console (Python 3.1)
I've been working with a lot of different foreign scripts for the past six months or so. Ideally I like to work in the console where possible. An error that always comes up is the following:
[carl@pcbsd]/home/carl(139)% python3.1
Python 3.1.1 (r311:74480, Jan 17 2010, 23:15:26)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u0400')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u0400' in position 0: ordinal not in range(128)
>>>
After a while this can get pretty annoying. There are a number of ways to get around the problem. I don't know much about most of the languages I'm dealing with, so I prefer the Unicode code charts' capitalized ASCII descriptions to glyphs or empty boxes. Fortunately the unicodedata module has all this information available.
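For comparison, one of those other ways is to hand str.encode() an error handler, so the unencodable character is replaced instead of raising. A minimal sketch with three of the standard handlers (the sample string is made up):

```python
text = 'hello\u0400'  # made-up sample: ASCII text plus one Cyrillic letter

# Each handler substitutes something ASCII-safe for the character
# that the ascii codec cannot encode.
replaced = text.encode('ascii', 'replace').decode('ascii')
escaped = text.encode('ascii', 'backslashreplace').decode('ascii')
xmlref = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(replaced)  # hello?
print(escaped)   # hello\u0400
print(xmlref)    # hello&#1024;
```

None of these give the code chart names, which is why the script takes a different route.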
To get the output I wanted I came up with a little script:
# mockprint.py - wrapper around print
#                function to handle
#                UnicodeEncodeError
#                python 3.1
import unicodedata
ERRORSTR = "'ascii' codec can't encode character "
# indices into the error message after splitting on spaces:
#   "'ascii' codec can't encode character '\u0400' in position 0: ..."
#   "'ascii' codec can't encode characters in position 5-8: ..."
CHARIDX = 5      # "'\u0400'" in the single character message
POSITIDX = 8     # "0:" (single) or "ordinal" (multiple)
POSITIDX2 = 7    # "5-8:" in the multiple character message
def mockprint(stringx):
    """
    Wrapper for print() function that
    replaces unprintable characters
    with their Unicode names.
    """
    try:
        print(stringx)
    except UnicodeEncodeError as e:
        # main cases:
        #    1) one character can't be printed
        #    2) multiple characters in a row can't be printed
        #    3) unicode character is first or last in string
        #    4) other ascii characters surround the unicode ones
        reasonx = str(e)
        reasonx = reasonx.split(' ')
        idx = reasonx[POSITIDX]
        # more than 1 char in a row can't be printed
        if idx == 'ordinal':
            # take the whole start index, not just its first digit
            idx = int(reasonx[POSITIDX2].split('-')[0])
            if idx != 0:
                print(stringx[:idx])
            print(unicodedata.name(stringx[idx]))
            mockprint(stringx[(idx + 1):])
        # single offending character, possibly after ascii chars
        elif len(stringx) > 1:
            idx = int(idx.rstrip(':'))
            # print the ascii chars in front of it, if any
            if idx != 0:
                print(stringx[:idx])
            charx = int(reasonx[CHARIDX][3:-1], 16)
            charx = chr(charx)
            print(unicodedata.name(charx))
            mockprint(stringx[(idx + 1):])
        # end of the line
        elif len(stringx) == 1:
            charx = int(reasonx[CHARIDX][3:-1], 16)
            charx = chr(charx)
            print(unicodedata.name(charx))
A quick demo:
>>> import mockprint
>>> mockprint.mockprint('hello\u0401\u0402\u0403\u0404world')
hello
CYRILLIC CAPITAL LETTER IO
CYRILLIC CAPITAL LETTER DJE
CYRILLIC CAPITAL LETTER GJE
CYRILLIC CAPITAL LETTER UKRAINIAN IE
world
And something a bit more challenging, a file with a few foreign words in a number of different languages:
>>> fle = open('/home/carl/pythonblog/foreignbytestest', 'rt', encoding = 'UTF-8')
>>> import mockprint
>>> for linex in fle.readlines():
...     mockprint.mockprint(linex)
...
CJK UNIFIED IDEOGRAPH-65E5
CJK UNIFIED IDEOGRAPH-672C
CJK UNIFIED IDEOGRAPH-8A9E
abcde
ETHIOPIC SYLLABLE GLOTTAL A
ETHIOPIC SYLLABLE MAA
ETHIOPIC SYLLABLE RE
ETHIOPIC SYLLABLE NYAA
ARMENIAN CAPITAL LETTER HO
ARMENIAN SMALL LETTER AYB
ARMENIAN SMALL LETTER YI
ARMENIAN SMALL LETTER ECH
ARMENIAN SMALL LETTER REH
ARMENIAN SMALL LETTER ECH
ARMENIAN SMALL LETTER NOW
ORIYA LETTER O
ORIYA LETTER DDA
ORIYA SIGN NUKTA
ORIYA VOWEL SIGN I
ORIYA LETTER AA
LAO LETTER PHO TAM
LAO VOWEL SIGN AA
LAO LETTER SO SUNG
LAO VOWEL SIGN AA
LAO LETTER LO LOOT
LAO VOWEL SIGN AA
LAO LETTER WO
CYRILLIC SMALL LETTER ER
CYRILLIC SMALL LETTER U
CYRILLIC SMALL LETTER ES
CYRILLIC SMALL LETTER ES
CYRILLIC SMALL LETTER KA
CYRILLIC SMALL LETTER I
CYRILLIC SMALL LETTER SHORT I
CYRILLIC SMALL LETTER YA
CYRILLIC SMALL LETTER ZE
CYRILLIC SMALL LETTER YERU
CYRILLIC SMALL LETTER KA
Well, if that isn't beautiful, I don't know what is.
Seriously, this is a hack - parsing an error string and working backwards? I've got to be joking. Actually, no. For as much time as I've spent remembering after the fact that I can't print Unicode in the console, this is worth it, even if it's only good for Python 3.1.
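For what it's worth, the exception object also carries start and end attributes, so the position doesn't have to be dug out of the message text. A rough sketch along those lines, not a drop-in replacement for mockprint: the name name_lines is made up, it returns a list instead of printing, and it tests against an explicit encoding (assumed here to be ASCII) rather than whatever the console uses.

```python
import unicodedata

def name_lines(text, encoding='ascii'):
    """
    Split text into encodable runs and the Unicode names
    of the characters that can't be encoded, in order.
    """
    lines = []
    while text:
        try:
            text.encode(encoding)
        except UnicodeEncodeError as err:
            # err.start / err.end mark the unencodable slice
            if err.start > 0:
                lines.append(text[:err.start])
            for ch in text[err.start:err.end]:
                # fall back to the code point for unnamed characters
                lines.append(unicodedata.name(ch, 'U+%04X' % ord(ch)))
            text = text[err.end:]
        else:
            lines.append(text)
            text = ''
    return lines

for line in name_lines('hello\u0401\u0402\u0403\u0404world'):
    print(line)
```

Run on the demo string from above, this should print the same six lines as mockprint does.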
Tuesday, February 2, 2010
py-openbsd's DoubleAssociation
I briefly covered this structure last time, but didn't do it justice. The idea of a two-way dictionary structure (keys and values are both keys) intrigued me. I wanted to give it a spin with a real world example.
I've chosen a simple example with a few domain names (common names) and IP addresses:
# dblassoc.py
import openbsd
# some ip addresses paired with domains
ips = {'google':(0x4a7d1393, 0xd8ef3d68),
       'openbsd':(0x8ef40c2a,),
       'freebsd':(0x45935321,),
       'yahoo':(0xd1bf5d34, 0xd183249e)}
# OK, we can now make the DoubleAssociation
ipsbothways = openbsd.utils.DoubleAssociation(ips)
print "ipsbothways['yahoo'] = " + str(ipsbothways['yahoo'])
# fair enough, but nothing we couldn't get from the dictionary
# try to query on an ip address to get a domain name
print "ipsbothways[(0x8ef40c2a,)] = " + ipsbothways[(0x8ef40c2a,)]
# unlike a normal dictionary, DoubleAssociation gives everything
# back with the keys() method
for keyx in ipsbothways.keys():
    try:
        if 0xd183249e in keyx:
            print "domain is " + ipsbothways[keyx]
            break
    # keyx is a domain name (a string) here, and testing
    # whether an integer is "in" a string raises TypeError
    except TypeError:
        print "TypeError: " + keyx
Python 2.5.4 (r254:67916, Jul 1 2009, 11:37:21)
[GCC 3.3.5 (propolice)] on openbsd4
Type "help", "copyright", "credits" or "license" for more information.
>>> import dblassoc
ipsbothways['yahoo'] = (3518979380L, 3515032734L)
ipsbothways[(0x8ef40c2a,)] = openbsd
TypeError: google
TypeError: openbsd
TypeError: yahoo
domain is yahoo
>>>
The one thing you have to look out for is the treatment of everything in the structure as a key - that's why I had to catch the TypeError. Everything is a value, too. The values and keys methods yield the same results.
In real life, if you had 30 or 50 or 1000 IP addresses, this would come in handy. Likewise for doctor-patient records, etc. (although the grouping of patients has to be unique, so that may not work after all - best to test both "sides" of the structure for exclusivity).
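To make the idea concrete without py-openbsd installed, here is a rough stand-in written in Python 3. It is not py-openbsd's implementation: the class TwoWayDict and the helper find_domain are made-up names, and the exclusivity check is only my reading of the caveat above.

```python
class TwoWayDict(dict):
    """Hypothetical two-way mapping: every key is also a value."""

    def __init__(self, pairs):
        super().__init__()
        for left, right in pairs.items():
            # both "sides" must be exclusive, otherwise one
            # pair would silently overwrite another
            if left in self or right in self:
                raise ValueError('duplicate entry: %r / %r' % (left, right))
            self[left] = right
            self[right] = left


def find_domain(table, ip):
    """Return the domain whose address tuple contains ip, or None."""
    for key, value in table.items():
        # only tuple keys hold addresses; string keys are domain names,
        # so checking the type avoids the TypeError from the listing above
        if isinstance(key, tuple) and ip in key:
            return value
    return None


ips = TwoWayDict({'openbsd': (0x8ef40c2a,),
                  'yahoo': (0xd1bf5d34, 0xd183249e)})
print(ips[(0x8ef40c2a,)])            # openbsd
print(find_domain(ips, 0xd183249e))  # yahoo
```

Checking the key type up front is a design choice; catching the TypeError, as in the script above, works just as well.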