Monday, October 26, 2009

More Unicode in Python (Lao this time)

Last post I took on a fairly well established language for Python identifiers and UTF-8 encoded text: Armenian (with a little Russian thrown in at the end).

This time I'll relate something a bit more challenging: the southeast Asian language Lao.

Before I get started, a quick recap:

1) Python 3.0 and higher accept Unicode identifiers and read source files as UTF-8 by default.

2) Getting non-ASCII characters to display correctly, by medium, from least to most difficult:

  a) Web Browser
  b) Desktop Application
  c) Console

Lao is a bit of an exception, at least the way I have things set up. The GUI (idle), with some tweaking, works pretty well. The Python Wiki, not so much.

Here's the LaoLanguage page on the Python Wiki (in Opera on my laptop):

[Screenshot: the LaoLanguage page on the Python Wiki, rendered in Opera]

Yuck! It's supposed to be seven characters wide (not eleven), and what we in the West would call accent marks are supposed to sit above the character they modify, not to the northeast of it.

Well, let's try it in idle. After some messing around with the font I discovered that Courier 10 point size 14 seems to yield the best results:

[Screenshot: the same Lao text in idle]

Not perfect, but not bad either. This is what it's supposed to look like from laoconnection:

[Screenshot: reference rendering of the text from laoconnection]

From here on out we should be able to dispense with the screenshots. I assigned the variable laostring the text value I pasted into idle. Now we can inspect it a little bit:

>>> len(laostring)
11
>>> # only 7 character widths
>>> import unicodedata
>>> for charx in laostring:
        print(unicodedata.name(charx))


LAO LETTER SO TAM
LAO VOWEL SIGN YY
LAO TONE MAI EK
LAO LETTER KHO SUNG
LAO LETTER O
LAO TONE MAI THO
LAO LETTER YO
LAO VOWEL SIGN EI
LAO LETTER MO
LAO TONE MAI EK
LAO LETTER NO

>>> 


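The Unicode Standard doesn't handle all scripts the same way. Whereas "ä" can be written as a single code point (U+00E4), the single character (sort of) made up of LAO LETTER SO TAM, LAO VOWEL SIGN YY, and LAO TONE MAI EK takes three. That is why len() reports 11 for a string that is only seven characters wide.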
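To make the count concrete, here is a minimal sketch. It rebuilds the string from the eleven character names printed in the idle session above (an assumption, since the pasted text itself isn't reproduced here), and it approximates the visible width by skipping characters whose Unicode category is a mark. The helper approx_width is a hypothetical name of my own, and the approach is a rough heuristic rather than proper grapheme cluster segmentation.

```python
import unicodedata

# Reconstructed from the names printed in the idle session above;
# this is an assumption about the original pasted text.
LAO_NAMES = [
    "LAO LETTER SO TAM",
    "LAO VOWEL SIGN YY",
    "LAO TONE MAI EK",
    "LAO LETTER KHO SUNG",
    "LAO LETTER O",
    "LAO TONE MAI THO",
    "LAO LETTER YO",
    "LAO VOWEL SIGN EI",
    "LAO LETTER MO",
    "LAO TONE MAI EK",
    "LAO LETTER NO",
]
laostring = "".join(unicodedata.lookup(name) for name in LAO_NAMES)


def approx_width(text):
    """Count code points that are not marks (Unicode category M*).

    Hypothetical helper; a rough heuristic only.
    """
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))


print(len(laostring))           # number of code points
print(approx_width(laostring))  # approximate number of character cells

# "ä" can be one code point (precomposed) or two (letter + combining mark).
composed = unicodedata.normalize("NFC", "a\u0308")
decomposed = unicodedata.normalize("NFD", "\u00e4")
print(len(composed), len(decomposed))
```

The vowel and tone signs that stack above a consonant are marks, so they add to len() without adding to the width. The last three lines show that even "ä" is not always a single code point: it depends on which normalization form the text is in.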

Stuff like this is second nature to people working with Southeast Asian languages daily. For me, it takes some getting used to, especially the string length thing. Rendering is an entirely different problem, but I am hoping that with practice and experimentation, I'll get a better handle on what works. Suggestions welcome.

Monday, October 5, 2009

Python 3.0/3.1 and Unicode Identifiers

Lately, I've been attempting to update the Python Wiki's non-English language pages with code snippets that highlight Python 3.0 and Python 3.1's ability to host non-ASCII (Unicode/UTF-8 encoded) identifiers. Here's an example from the Armenian page (http://wiki.python.org/moin/ArmenianLanguage):

# python 3.0/3.1
# English
to_say = ['hello',
          'Good morning',
          'how are you']
# common Armenian name
name = ['Ashot', 'Armen', 'Anahit']
for namex in name:
    for greeting in to_say:
        print(greeting + ' ' + namex)

# Հայերեն լեզու
ասել = ['Ողջույն',
        'Բարի օր',
        'Ինչպե՞ս եք']
# տարածված Հայկական անուն
անուն = ['Աշոտ', 'Արմեն', 'Անահիտ']
for անունx in անուն:
    for Ողջույնx in ասել:
        print(Ողջույնx + ' ' + անունx)


I've saved this code snippet in a file called armenian.py, which I'll make use of shortly.
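A quick way to check whether a given name is acceptable to Python 3 before running a whole script is str.isidentifier(). This is just a small sketch using the identifiers from the snippet above; it isn't something the wiki page itself does.

```python
# Identifiers taken from the armenian.py snippet above.
identifiers = ["ասել", "անուն", "անունx", "Ողջույնx"]

# str.isidentifier() reports whether a string is a valid Python 3 name.
results = {name: name.isidentifier() for name in identifiers}

for name, ok in results.items():
    print(name, ok)
```

Mixing scripts in one name (Armenian letters followed by a Latin "x") is fine as far as the interpreter is concerned.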

After getting the Python Wiki set up with a handful of these, it occurred to me that I hadn't actually run any of these scripts. My first attempts didn't work out so well. I was trying to run the scripts out of Konsole on KDE under FreeBSD 7.2. After googling and taking a look at the Konsole handbook, I came up empty (I suspect a good how-to is in both of these sources; I just couldn't find it in a time-efficient manner). I did note, however, that I could get the non-ASCII characters to show up in idle, the default Python editor, on Windows.
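I never pinned down what was wrong in Konsole, so take the following as one possible first diagnostic rather than the fix. It assumes the trouble might be the encoding Python picks for standard output (fonts are another suspect, and this won't tell you anything about those). The helper can_encode is a hypothetical name used only for illustration.

```python
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


sample = "Ողջույն"  # greeting from armenian.py

# sys.stdout.encoding can be None when output is redirected in some setups.
stdout_encoding = sys.stdout.encoding or "ascii"

print("stdout encoding:", stdout_encoding)
print("sample printable:", can_encode(sample, stdout_encoding))
```

If the second line prints False, print() would raise UnicodeEncodeError on the Armenian text no matter what font the terminal uses.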

It could be my ports collection is out of date, but I did not have access to tkinter (required to run idle) for Python 3.0 on the FreeBSD machine. I downloaded the Python 3.1 source from python.org and compiled it outside the ports system. Lo and behold, I had access to the Unicode characters within the Python interpreter.

Inside idle:

Python 3.1.1 (r311:74480, Oct 3 2009, 21:55:31)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "copyright", "credits" or "license()" for more information.
>>> import armenian
hello Ashot
Good morning Ashot
how are you Ashot
hello Armen
Good morning Armen
how are you Armen
hello Anahit
Good morning Anahit
how are you Anahit
Ողջույն Աշոտ
Բարի օր Աշոտ
Ինչպե՞ս եք Աշոտ
Ողջույն Արմեն
Բարի օր Արմեն
Ինչպե՞ս եք Արմեն
Ողջույն Անահիտ
Բարի օր Անահիտ
Ինչպե՞ս եք Անահիտ
>>>
 

Well, if that isn't beautiful, I don't know what is.

There still remains the problem of typing the text in (the mini armenian.py program was copied from a webpage). I found gvim and its help files useful here.

What worked for me in gvim:

:se encoding=utf-8
:se gfn=-misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1


To find available keyboard layouts:

:echo globpath(&rtp, "keymap/*.vim")

I saw one called russian-yawerty (presumably this is similar to qwerty - worth a try):

:se keymap=russian-yawerty

Вот карандаш

:se keymap= 

takes me back to the default (US English).

This was enough to get me started. I hope to get more proficient and learn more as I get more experience with this.

Notes:

1) Hasmik of CalPoly was kind enough to provide me with the Armenian presented here. Many thanks.

2) I think the Russian phrase means "Here is a pencil."

3) Armenian doesn't use the Latin question mark. Instead, the character that appears as a superscript in the first word of the Armenian phrase for "How are you?" serves this purpose.
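For the curious, unicodedata can identify that mark. The sketch below assumes the character in the phrase is U+055E, which is how it appears in the text I copied; the variable names are mine.

```python
import unicodedata

mark = "\u055e"  # assumed to be the mark used in the phrase above

mark_name = unicodedata.name(mark)
mark_category = unicodedata.category(mark)
print(mark_name)
print(mark_category)

# Because it is punctuation, a word containing it is not a valid identifier.
word = "Ինչպե" + mark + "ս"
print(word.isidentifier())
```

So the mark is fine inside a string literal, as in armenian.py, but it can't be part of a variable name.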