Friday, August 28, 2009

Unicode - Arabic

There had been some discussion on the Python Diversity list about foreign languages and about making Python resources that are available in them easier to find. The foreign language page on the Python Wiki shows a number of languages, but some of the more prominent ones are missing. In addition, all the languages are listed in English, in ASCII-compatible characters.

It was my thought that the languages should also be listed in a manner that a native speaker who is not fluent in English could recognize.

I started alphabetically and came across Arabic. Arabic is a language written in a bidi, or bidirectional, manner (letters go right to left; numbers go left to right; not an official definition, but it describes bidi well enough for this post). Letters in the middle of a word can change shape depending on what other letters are around them.

Vim does a good job of documenting this. It was an important tool in how I got around the challenge of letters that are only defined outside of the 128-member ASCII set.
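The direction rules can also be looked at from Python. This is a minimal sketch, not part of the original workflow: it uses the standard library's unicodedata module to report the bidirectional class of a few characters. The helper name bidi_classes and the sample characters are my own choices for illustration.

```python
import unicodedata

def bidi_classes(text):
    """Return (code point, bidi class, character name) for each character."""
    return [('U+%04X' % ord(ch),
             unicodedata.bidirectional(ch),
             unicodedata.name(ch))
            for ch in text]

# An Arabic letter, a European digit, and an Arabic-Indic digit.
for row in bidi_classes('\u06397\u0663'):
    print(row)
```

Arabic letters report the class 'AL' (right-to-left Arabic letter), while the digits report 'EN' (European number) or 'AN' (Arabic number). That is the rough "letters one way, numbers the other" behavior described above.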

How I got the Arabic word for "Arabic" onto the Python Wiki:

1) go to the Wikipedia page for Arabic (language).

2) open gvim (I didn't do this in the console) and set encoding to UTF-8 (:se encoding=UTF-8).

3) copy the Arabic word for "Arabic" from the Wikipedia page for the Arabic language and paste it into gvim

4) navigate across the characters (they may not show up at all, or may blink, or may look like perforated boxes - don't worry about that); on each one type the command 'ga'. The Unicode character code for that letter should appear at the bottom of the screen. Write each one down on paper or type it into another window (this is crude, but for seven characters, it's not too bad).

5) use an adaptation of the code from Andrew Kuchling's Unicode page to find codes suitable for a web page. The characters should be listed first to last, left to right; a web browser should then render them right to left.
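Step 4 can probably be skipped if the pasted word is already available as a Python 3 string. The following is a minimal sketch under that assumption; the names word and code_points are hypothetical, and the word is spelled out with escapes so the example does not depend on the terminal's encoding.

```python
import unicodedata

# The Arabic word for "Arabic", written with escapes.
word = '\u0627\u0644\u0639\u0631\u0628\u064A\u0629'

def code_points(text):
    """Return the code point of each character as a 'U+XXXX' string."""
    return ['U+%04X' % ord(ch) for ch in text]

# Print each code point next to its official Unicode character name,
# roughly the information that 'ga' shows in Vim.
for ch in word:
    print('U+%04X' % ord(ch), unicodedata.name(ch))
```

The characters come out in logical order (first letter first), which is the order the codes should be typed in, regardless of how the word is drawn on screen.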

How to get codes out of Python:

%/usr/local/bin/python3.0
Python 3.0.1 (r301:69556, Jul 16 2009, 21:08:20)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "help", "copyright", "credits" or "license" for more information.

>>> charsx = ['\u0627', '\u0644', '\u0639', '\u0631', '\u0628', '\u064A', '\u0629']

>>> import pprint
>>> XMLCD = 'xmlcharrefreplace'
>>> xmlcodes = [charx.encode('ascii', XMLCD) for charx in charsx]
>>> pprint.pprint(xmlcodes)
[b'&#1575;',
 b'&#1604;',
 b'&#1593;',
 b'&#1585;',
 b'&#1576;',
 b'&#1610;',
 b'&#1577;']
>>> exit()
%


In the MoinMoin-based wiki software that the Python Wiki uses, simply listing the character codes in the following fashion allows them to render correctly on the page:

 &#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;

It may or may not render correctly in your web browser; hopefully, for people who regularly work in Arabic on the computer, it will:

العربية
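One way to double check the codes before pasting them into the wiki is to build the whole reference string in one go and decode it again. This is a minimal sketch, not what I did at the time; it assumes a Python new enough to have html.unescape (3.4 or later, so not the 3.0 interpreter shown above), and the names word and refs are just for illustration.

```python
import html

# The Arabic word for "Arabic", written with escapes.
word = '\u0627\u0644\u0639\u0631\u0628\u064A\u0629'

# Encode the whole word at once; every non-ASCII character becomes
# a decimal numeric character reference.
refs = word.encode('ascii', 'xmlcharrefreplace').decode('ascii')
print(refs)

# Decoding the references should give back the original word.
assert html.unescape(refs) == word
```

If the round trip fails, one of the codes was probably mistyped or listed out of order.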

Discussion:

It's harder to get Unicode to show up in desktop applications than it is in the web browser. I found this out the hard way.

There are definitely more elegant and efficient ways of doing this. This was more of a "Watch Carl learn UTF-8 and Unicode" post.

Code:

I'm a list comprehension freak. I probably go overboard at times.

'xmlcharrefreplace' really doesn't need its own constant. It's for me. The word 'xmlcharrefreplace' is just too alliterative and tongue twisting for me not to misspell or mistype it. Your mileage may vary.

I have a bias towards using pretty print (pprint) whenever I can.  It just makes viewing numbers and data easier for me.  Again, your mileage may vary.

The only thing harder than getting XML character codes to render correctly is getting them to show up as plain codes.  I'm pretty much an HTML newb.  If you need to display a literal &#, you can write &amp;# instead (in order to show just that, I had to substitute twice :-)
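The substitution can also be left to Python. A minimal sketch, assuming Python 3.2 or later for html.escape; the variable names are hypothetical.

```python
import html

ref = '&#1575;'

# One round of escaping turns the ampersand into &amp; so that a browser
# displays the code itself instead of the Arabic letter.
shown_as_code = html.escape(ref)
print(shown_as_code)

# A second round is what it takes to display the escaped form literally.
shown_as_escaped = html.escape(shown_as_code)
print(shown_as_escaped)
```

Each round of escaping buys one more level of "show me the text, not what it stands for", which is why writing about the substitution itself takes two rounds.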

I used vim for this.  EMACS people, I don't know, but I suspect you have something similar for bidi and Arabic. Feel free to share that in the comments (no flaming, please.)

Carl T.

Saturday, August 22, 2009

First Post

The Python programming community (www.python.org) frequently suggests that people using Python have a blog and post it to the aggregator at http://www.planetpython.org/.  I'm a big fan of the planet and read it daily.

This post will not be linked to Planet Python, but I hope to write something worth linking soon.  (Update:  I was mistaken regarding the Planet - it's an all or nothing proposition - all posts are fed to the Planet.  Michael Foord was kind enough to set me up in accordance with Jack Dietrich's instructions in the comments.  I am working on a mini project and hope to have some code up by the second week of September - thanks for your patience, those of you waiting with bated breath).

The title of the blog is a play on words.  I'm a geologist by trade - pyrite is iron sulfide, or fool's gold.  Ideally, I would have all kinds of great software testing experience (I don't).

Still, my goal as someone who writes scripts and, occasionally, programs, is to produce clean, maintainable code.  Python's development community puts a huge emphasis on this, so it's a good fit.  

Hopefully my posts will reflect at a minimum that I'm trying to write clean code.  When I fail, or make something more complicated than it needs to be, I'll be relying on my readers (assuming there are a few) to correct me.  "Teach me clean coding", as it were.

Thanks for stopping by.