It was my thought that the languages should also be listed in a manner that a native speaker who is not fluent in English could recognize.
I started alphabetically and came across Arabic. Arabic is a language written in a bidi, or bidirectional, manner (letters go right to left; numbers go left to right - not an official definition, but it describes bidi well enough for this post). Letters in the middle of a word can change shape depending on what other letters are around them.
Vim does a good job of documenting this. It was an important tool in how I got around the challenge of letters that are only defined outside the 128-character ASCII set.
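For anyone who wants to poke at the bidi idea from Python, here is a minimal sketch using the standard library's unicodedata module. The sample characters and the dictionary names are my own choices for illustration; the class codes come from the Unicode database ('AL' is an Arabic letter, 'EN' a European number, 'L' a left-to-right letter).

```python
import unicodedata

# Sample characters: an Arabic letter, a European digit, a Latin letter.
# (These samples are illustrative picks, not from any official list.)
samples = {
    "alef": "\u0627",
    "digit one": "1",
    "latin a": "a",
}

# unicodedata.bidirectional() returns the Unicode bidi class as a short string.
bidi_classes = {label: unicodedata.bidirectional(ch)
                for label, ch in samples.items()}

for label, bidi_class in bidi_classes.items():
    print(label, bidi_class)
```

The Arabic letter and the digit land in different classes, which is roughly what the "letters one way, numbers the other" description is getting at.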
How I got the Arabic word for "Arabic" onto the Python Wiki:
1) go to the Wikipedia page for Arabic (language).
2) open gvim (I didn't do this in the console) and set encoding to UTF-8 (:se encoding=UTF-8).
3) copy the Arabic word for "Arabic" from the Wikipedia page for the Arabic language and paste it into gvim.
4) navigate across the characters (they may not show up at all, or may blink, or may look like perforated boxes - don't worry about that); on each one, type the command 'ga'. The Unicode character code for that letter should appear at the bottom of the screen. Write each one down on paper or type them into another window (this is crude, but for seven characters, it's not too bad).
5) use an adaptation of the code from Andrew Kuchling's Unicode page to find codes suitable for a web page. The characters should be listed first to last, left to right, but they should render right to left in a web browser.
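Steps 2 through 4 can probably be skipped if the pasted word is already in a Python string. What follows is only a sketch of that shortcut, not what I actually did: it assumes a Python recent enough for f-strings (3.6 or later), and the variable names are made up for the example.

```python
import unicodedata

# The Arabic word for "Arabic", written with escapes so the source stays ASCII.
word = "\u0627\u0644\u0639\u0631\u0628\u064A\u0629"

# One (code point, official name) pair per character, in logical (typing) order.
codepoints = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]

for code, name in codepoints:
    print(code, name)
```

The characters come out in the order they are stored, first letter first, which is the same order 'ga' reports them when you walk across the word in Vim.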
How to get the codes out of Python:
%/usr/local/bin/python3.0
Python 3.0.1 (r301:69556, Jul 16 2009, 21:08:20)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "help", "copyright", "credits" or "license" for more information.
>>> charsx = ['\u0627', '\u0644',
...           '\u0639', '\u0631',
...           '\u0628', '\u064A',
...           '\u0629']
>>> import pprint
>>> XMLCD = 'xmlcharrefreplace'
>>> xmlcodes = [charx.encode('ascii', XMLCD) for charx in charsx]
>>> pprint.pprint(xmlcodes)
[b'&#1575;',
 b'&#1604;',
 b'&#1593;',
 b'&#1585;',
 b'&#1576;',
 b'&#1610;',
 b'&#1577;']
>>> exit()
%
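The same encode call also works on the whole word at once, which saves gluing the pieces together by hand. This is a sketch under a couple of assumptions: html.unescape needs Python 3.4 or later (it is not in the 3.0 used above), and the names entity_text and round_trip are just for illustration.

```python
import html

# The seven characters from the session above, as a single string.
word = "\u0627\u0644\u0639\u0631\u0628\u064A\u0629"

# Encode the whole string in one go, then decode the ASCII bytes back to str
# so the result can be pasted straight into a wiki page.
entity_text = word.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(entity_text)

# html.unescape() turns the numeric references back into the original
# characters, which makes for a quick sanity check.
round_trip = html.unescape(entity_text)
print(round_trip == word)
```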
In MoinMoin, the wiki software the Python Wiki runs on, simply listing the character codes in the following fashion allows them to render correctly on the page:
&#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;
It may or may not render correctly in your web browser; hopefully, for people who regularly work in Arabic on the computer, it will:
العربية
Discussion:
It's harder to get Unicode to show up in desktop applications than it is in the web browser. I found this out the hard way.
There are definitely more elegant and efficient ways of doing this. This was more of a "Watch Carl learn UTF-8 and Unicode" post.
Code:
I'm a list comprehension freak. I probably go overboard at times.
'xmlcharrefreplace' really doesn't need its own constant. It's for me. The word 'xmlcharrefreplace' is just too alliterative and tongue twisting for me not to misspell or mistype it. Your mileage may vary.
I have a bias towards using pretty print (pprint) whenever I can. It just makes viewing numbers and data easier for me. Again, your mileage may vary.
The only thing harder than getting XML character codes to render correctly is getting them to show up as plain codes. I'm pretty much an HTML newb. If you need to represent &#, you can substitute &amp;# (in order to just do that I had to substitute twice :-)
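Python can do that substitution too. A small sketch, assuming Python 3.2 or later for html.escape; the variable names are made up for the example, and I only show three of the seven codes to keep it short.

```python
import html

# A few numeric character references that a browser would normally render.
codes = "&#1575;&#1604;&#1593;"

# Escaping once turns each '&' into '&amp;', so a browser shows the plain codes.
shown_as_codes = html.escape(codes)
print(shown_as_codes)

# Escaping twice is what it takes to show the escaped form itself on a page.
shown_escaped = html.escape(shown_as_codes)
print(shown_escaped)
```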
I used Vim for this. Emacs people, I don't know, but I suspect you have something similar for bidi and Arabic. Feel free to share that in the comments (no flaming, please).
Carl T.