Thursday, November 12, 2009

Python 3.1 and bidi text - what you see isn't (always) what you get

Last time I was working on understanding UTF-8 in Python 3.1.  This time I'll try to be a little more adventurous and take on bidirectional text - Arabic.

(At this point I should mention I got the original idea to start messing around with this stuff from a couple of folks on the python-diversity list.  One individual was particularly interested in Arabic, another in languages of the Indian subcontinent.)

So I made an attempt at some Arabic text and an Arabic Python identifier.  To get output that looked like valid Python, I tried OpenOffice Writer 3.0.1, gedit (in GNOME), and Microsoft Word 2007 on another computer, with very mixed results.  My locale is English US.  Arabic text pasted into Word was left justified, but backwards.  In OpenOffice it was right justified.  Finally I settled on gedit, shown in the screenshot:



Two variables - one is "a", the other is the Arabic word for "Arabic."  What's interesting is that the cursor position on line two shows up as column 13.  To someone used to Latin script, it really is counting backwards.
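
One way to see why an editor lays that second line out from the right is to ask Python for each character's Unicode bidirectional class.  What follows is only a sketch using the standard unicodedata module.  It assumes the identifier is the seven-letter word whose bytes appear in the dump further down, and it writes the letters as escapes so the snippet itself stays plain ASCII.

```python
import unicodedata

# Line two of arabicx.py, rebuilt from escapes.
# Assumption: the identifier is the seven letters recovered from the byte dump.
identifier = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"
line_two = identifier + " = 22"

# Characters are stored in logical (typing) order, whatever the screen shows.
# 'AL' marks an Arabic letter, 'WS' whitespace, 'ON' a neutral, 'EN' a digit.
for index, char in enumerate(line_two):
    print(index, "U+%04X" % ord(char), unicodedata.bidirectional(char))

# Twelve characters in all.
print(len(line_two))
```

Twelve characters would be consistent with the column 13 reading if the cursor was parked after the last character typed, though that is a guess about where the cursor sat.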

"There's no way this will work in Python", I thought.  Oh me of little faith (feel free to groan) - another screenshot:




Wow.  It worked!  The screenshot is from IDLE.  Let's see how well this went (another screenshot):




Well, not perfect by any stretch - the Arabic text has to be reversed to show up correctly.  (Disclaimer:  IDLE has been a great tool while I investigate scripts and Unicode in Python; it doesn't claim to deal with bidirectional text, nor was I expecting it to.)
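
For what it's worth, here is one naive reading of "reversed", sketched under some loud assumptions: the string is a single run of right-to-left letters with no combining marks, and the display paints characters strictly left to right.  This is not a real bidi implementation and does no Arabic shaping; the names logical and visual are just labels for this illustration.

```python
# Logical order: the order the letters were typed and are stored in memory.
logical = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"

# Naive visual order for a display that paints strictly left to right.
# Assumption: one pure right-to-left run, no combining marks, no mixed text.
visual = logical[::-1]

# ascii() shows the escapes, so the result does not depend on the terminal.
print(ascii(logical))
print(ascii(visual))
```

Anything with mixed Latin and Arabic text, digits, or punctuation needs a proper implementation of the Unicode bidirectional algorithm rather than a slice.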

OK, we're finished with the screenshots.  From here on out we can rip apart the bytes that make up the arabicx.py file above and see what's going on:

>>> # open python file as bytes
>>> # (file was saved encoded as UTF-8)
>>> with open('arabicx.py', 'rb') as fle:
    # for each line of bytes in arabicx.py
    for linex in fle:
        print('new line')
        # iterating over a bytes object yields integers
        for bytex in linex:
            print(bin(bytex))

       
new line
0b1100001
0b100000
0b111101
0b100000
0b100111
0b11011000
0b10100111
0b11011001
0b10000100
0b11011000
0b10111001
0b11011000
0b10110001
0b11011000
0b10101000
0b11011001
0b10001010
0b11011000
0b10101001
0b100111
0b1010
new line
0b11011000
0b10100111
0b11011001
0b10000100
0b11011000
0b10111001
0b11011000
0b10110001
0b11011000
0b10101000
0b11011001
0b10001010
0b11011000
0b10101001
0b100000
0b111101
0b100000
0b110010
0b110010
0b1010


Lots of ones and zeros - that's OK - they will be useful:


new line
# the file starts out with
#     ASCII character 97 ("a")
0b1100001

# then a space - ASCII char 32
0b100000

# then an equals sign - ASCII char 61
0b111101

# then another space - ASCII char 32
0b100000

# then a single quote - ASCII char 39
0b100111

# then a bunch of Arabic characters
#     that are entered in order and show up
#     right to left.
# We'll just look at the first one:
# the next two bytes represent the ARABIC LETTER ALEF.
#     Indeed, this is the rightmost letter in the
#     first screenshot above
0b11011000
0b10100111

# the remaining six Arabic letters follow,
#     two bytes each
0b11011001
0b10000100
0b11011000
0b10111001
0b11011000
0b10110001
0b11011000
0b10101000
0b11011001
0b10001010
0b11011000
0b10101001

# then a closing single quote - ASCII char 39
0b100111

# then the return char ('\n') - ASCII char 10
0b1010
new line

# The next two bytes represent 
#     the ARABIC LETTER ALEF (again -
#     this time as part of a Python
#     identifier)

0b11011000
0b10100111
    .
    .
    .
   etc.
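
To make the "two bytes per letter" claim concrete, here is a small sketch that decodes the seven two-byte sequences from line one of the dump by hand and then looks up each character's name.  The helper decode_two_byte is a made-up name for this illustration, not a library function, and it only handles the two-byte form of UTF-8 (110xxxxx 10xxxxxx).

```python
import unicodedata

# The seven two-byte sequences from line one of the dump.
arabic_bytes = bytes([
    0b11011000, 0b10100111,
    0b11011001, 0b10000100,
    0b11011000, 0b10111001,
    0b11011000, 0b10110001,
    0b11011000, 0b10101000,
    0b11011001, 0b10001010,
    0b11011000, 0b10101001,
])


def decode_two_byte(lead, trail):
    """Decode one two-byte UTF-8 sequence by hand (hypothetical helper)."""
    assert lead >> 5 == 0b110, "not a two-byte lead byte"
    assert trail >> 6 == 0b10, "not a continuation byte"
    # Keep the low 5 bits of the lead byte and the low 6 bits of the trail byte.
    return ((lead & 0b00011111) << 6) | (trail & 0b00111111)


code_points = [
    decode_two_byte(arabic_bytes[i], arabic_bytes[i + 1])
    for i in range(0, len(arabic_bytes), 2)
]

for cp in code_points:
    print("U+%04X" % cp, unicodedata.name(chr(cp)))

# Cross-check the hand decoding against Python's own UTF-8 codec.
assert "".join(map(chr, code_points)) == arabic_bytes.decode("utf-8")
```

The first line printed is U+0627 ARABIC LETTER ALEF, which matches the annotation in the dump.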


What's worth noting is that even though the indentation looks all wrong in gedit, the file parsed fine in the Python 3.1 interpreter.  At the risk of stating the obvious, what matters is the arrangement of the bytes of the UTF-8 encoded source on disk and in memory, not how the letters show up on screen in your application of choice, be it IDLE, vim, Emacs (I haven't tried Emacs on this problem), gedit, OpenOffice Writer (saved as UTF-8 encoded text - same for Word on Windows), etc.
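
A quick way to convince yourself of this without any editor in the loop is to hand the raw bytes straight to the compiler.  The sketch below transcribes both lines of the dump into a bytes literal (the transcription is mine, so treat it as an assumption about the file's exact contents) and uses the built-in compile() and exec().

```python
# Bytes of arabicx.py as transcribed from the dump (both lines).
source_bytes = (
    b"a = '\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9'\n"
    b"\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9 = 22\n"
)

# compile() accepts bytes; with no coding declaration Python 3 assumes UTF-8.
code = compile(source_bytes, "arabicx.py", "exec")

# Run the module body in a scratch namespace instead of the current one.
namespace = {}
exec(code, namespace)

# The Arabic identifier, spelled with escapes so this snippet stays ASCII.
identifier = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"
print(namespace[identifier])         # 22
print(namespace["a"] == identifier)  # True
```

No rendering is involved anywhere here, so the result is the same regardless of how a given editor or terminal would draw the text.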

I know people have had some luck getting Arabic string literals to show up correctly in Python 2.x.  If anyone knows of a good editor or setup for the Python 3.x series, I'd be grateful for your help.

If all else fails, the bytes tell the story. 

2 comments:

  1. Hi all,

    To use Arabic within the Python file, it must be UTF-8 and must also have this at the top: "# coding = utf-8".

    And to get Arabic printed to the terminal correctly, you have to use the Pyfribidi2 lib or another Unicode bidi implementation (BiDi + Arabic shaping).

  2. @Sneetsher

    Thanks for the tip on Pyfribidi2. Not sure if it's available for Python 3.x, but I'll check it out.

    Yes, you are correct. I neglected to place the UTF-8 flag atop the Python file, although I did save it as UTF-8 from gedit.

    Carl T.
