Friday, August 28, 2009

Unicode - Arabic

There has been some discussion on the Python Diversity list about foreign languages and how to make Python resources available in those languages easier to find. The foreign language page on the Python Wiki shows a number of languages, but some of the more prominent ones are missing. In addition, all the languages are listed in English in ASCII-compatible characters.

It was my thought that the languages should also be listed in a manner that a native speaker who is not fluent in English could recognize.

I started alphabetically and came across Arabic. Arabic is a language written in a bidi, or bidirectional, manner (letters go right to left; numbers go left to right - not an official definition, but it describes bidi well enough for this post). Letters in the middle of a word can change shape depending on what other letters are around them.
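A minimal sketch of the "letters one way, numbers the other" idea, assuming a reasonably recent Python 3 with the standard unicodedata module. It only looks up the bidirectional class that the Unicode Character Database assigns to a few sample characters; the sample labels are made up for this illustration, and the full bidi algorithm is much more involved than this.

```python
import unicodedata

# Hypothetical sample characters chosen for illustration only.
samples = {
    'Arabic letter alef': '\u0627',
    'Latin letter a': 'a',
    'European digit 1': '1',
    'Arabic-Indic digit 1': '\u0661',
}

# bidirectional() returns the bidi class as a short string:
# 'AL' = Arabic letter (right to left), 'L' = left to right,
# 'EN' = European number, 'AN' = Arabic number.
bidi_classes = {label: unicodedata.bidirectional(ch)
                for label, ch in samples.items()}

for label, bidi_class in bidi_classes.items():
    print(label, bidi_class)
```

The letters and the digits land in different classes, which is what lets a renderer lay them out in different directions.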

Vim does a good job of documenting this. It was an important tool in how I got around the challenge of letters that are defined only outside the 128-member ASCII set.

How I got the Arabic word for "Arabic" onto the Python Wiki:

1) go to the Wikipedia page for Arabic (language).

2) open gvim (I didn't do this in the console) and set encoding to UTF-8 (:se encoding=UTF-8).

3) copy the Arabic word for "Arabic" from the Wikipedia page for the Arabic language and paste it into gvim

4) navigate across the characters (they may not show up at all, or may blink, or may look like perforated boxes - don't worry about that); on each one, type the command 'ga'. The Unicode character code for that letter should appear at the bottom of the screen. Write each one down on paper or type it into another window (this is crude, but for seven characters, it's not too bad).

5) use an adaptation of the code from Andrew Kuchling's Unicode page to find codes suitable for a web page. The characters should be listed first to last, left to right; they should still render right to left in a web browser.

how to get codes out of Python:

%/usr/local/bin/python3.0
Python 3.0.1 (r301:69556, Jul 16 2009, 21:08:20)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "help", "copyright", "credits" or "license" for more information.

>>> charsx = ['\u0627', '\u0644', '\u0639', '\u0631', '\u0628', '\u064A','\u0629']

>>> import pprint
>>> XMLCD = 'xmlcharrefreplace'
>>> xmlcodes = [charx.encode('ascii', XMLCD) for charx in charsx]
>>> pprint.pprint(xmlcodes)
[b'&#1575;',
 b'&#1604;',
 b'&#1593;',
 b'&#1585;',
 b'&#1576;',
 b'&#1610;',
 b'&#1577;']
>>> exit()
%
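As a cross-check on the codes written down from Vim, the same information can be pulled out of Python. This is only a sketch, assuming the standard unicodedata module; describe() is a helper name invented here, not something from Andrew Kuchling's page. It prints ASCII only, so it should work even on a console that can't display Arabic.

```python
import unicodedata

# The word for "Arabic", written with escapes so the source stays ASCII.
word = '\u0627\u0644\u0639\u0631\u0628\u064A\u0629'

def describe(text):
    """Return (hex code point, decimal code point, Unicode name) per character."""
    return [('U+%04X' % ord(ch), ord(ch), unicodedata.name(ch)) for ch in text]

# One line per character, in logical (first to last) order.
for hex_code, decimal, name in describe(word):
    print(hex_code, decimal, name)
```

The names (ARABIC LETTER ALEF, ARABIC LETTER LAM, ARABIC LETTER AIN, and so on) make it possible to confirm the letters without being able to read them, and the decimal column is the number that ends up in each web page code.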


In the MoinMoin-based wiki that the Python Wiki runs on, simply listing the character codes in the following fashion allows them to render correctly on the page:

 &#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;

It may or may not render correctly in your web browser; hopefully, for people who regularly work in Arabic on the computer, it will:

العربية
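The per-character codes from the interpreter session can also be produced as one string and checked by decoding them again. A rough sketch, assuming a newer Python 3 than the 3.0.1 used in this post (html.unescape is not available that far back):

```python
import html

# The word for "Arabic", written with escapes so the source stays ASCII.
word = '\u0627\u0644\u0639\u0631\u0628\u064A\u0629'

# Encode the whole word at once; the result is ASCII bytes,
# so decode it back to an ordinary string of character references.
refs = word.encode('ascii', 'xmlcharrefreplace').decode('ascii')
print(refs)

# html.unescape turns the references back into characters,
# which confirms that no letter was mistyped or dropped.
assert html.unescape(refs) == word
```

If the round trip fails, one of the codes is wrong, which is easier to catch here than by eye on the rendered page.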

Discussion:

It's harder to get Unicode to show up in desktop applications than it is in the web browser. I found this out the hard way.

There are definitely more elegant and efficient ways of doing this. This was more of a "Watch Carl learn UTF-8 and Unicode" post.

Code:

I'm a list comprehension freak. I probably go overboard at times.

'xmlcharrefreplace' really doesn't need its own constant. It's for me. The word 'xmlcharrefreplace' is just too alliterative and tongue twisting for me not to misspell or mistype it. Your mileage may vary.

I have a bias towards using pretty print (pprint) whenever I can.  It just makes viewing numbers and data easier for me.  Again, your mileage may vary.

The only thing harder than getting XML character codes to render correctly is getting them to show up as plain codes.  I'm pretty much an HTML newb.  If you need to represent &#, you can substitute &amp;# (in order to just do that I had to substitute twice :-)
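The substitution can be left to Python instead of being done by hand. A small sketch, assuming a Python 3 recent enough to have html.escape and html.unescape; the variable names are just for illustration:

```python
import html

ref = '&#1575;'                # a numeric character reference

# Escaping once gives the text to type so a page displays the code itself.
shown = html.escape(ref)
print(shown)

# Escaping twice gives what is needed to display the escaped form,
# which is the "substitute twice" situation.
shown_twice = html.escape(shown)
print(shown_twice)
```

Each pass replaces only the ampersand, so unescaping once per pass gets back to the original reference.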

I used vim for this.  EMACS people, I don't know, but I suspect you have something similar for bidi and Arabic. Feel free to share that in the comments (no flaming, please.)

Carl T.

5 comments:

  1. I didn't follow your reasoning about doing a round trip via Vim. I managed to copy the text in question in Firefox and to paste it into the URL to create a new MoinMoin page:

    http://wiki.python.org/moin/Languages/العربية

    I've also managed to add that new page to the language category, and it appears to show up in the list of language-related pages.

    Perversely, when trying to compose a comment in Firefox on this page, I lose control over my clipboard selections and can't paste anything in. I'm not sure whether there are some bad feelings shared between Firefox (perhaps with JavaScript), your page, and right-to-left scripts, but I had no such problems with MoinMoin.

  2. Paul,

    Thanks. Actually, I could paste it in too.

    The reason why I created more work for myself was to make it explicit for me. I can't read Arabic, and I wanted to make sure I was getting the right letters.

    Also, if I need to troubleshoot this at a later date, I have a knowledge base to work from. The Arabic characters are pretty, but they mean nothing to me, because I can't effectively recognize them or read them.

    Carl T.

  3. Paul,

    Thanks for creating the link and the page. I copied the stuff from the old page into the new one. I understand the organization of the Languages section a little better now. Appreciate your patience with me.

    Carl T.

  4. At the risk of having a conversation with myself, one more point.

    I'm using Opera. Something looked funny at the top of the screen when I went to the Arabic page on the Python Wiki (now addressed with the Arabic word for "Arabic"). The letters in Opera's browser tab are backwards: Languages/<some backwards Arabic>. This proves the old adage, "Sometimes, you just can't win."

    August 29, 2009 10:28 AM

  5. Epilogue: I had some problems with the Wiki page (I did actions that caused data to be lost) and got some guidance:

    - page names need to be in English so that the Wiki maintainers can read them.

    Live and learn.
