Friday, November 20, 2009

Pycon 2010 pre-favorites - the Carl T. edition

Catherine Devlin posted the talks she'd like to attend at Pycon.  It was fun reading her picks - she has a great sense of humor, although I'm bummed that everyone now knows about the submarine robot talk, and I'll have to fight for room in the lecture hall to get a spot.

These are my picks (not in order - the truth is I could spend all day talking about a lot of these talks - they are all good - too short a life - too many good talks):

1) Ha, go to the depths of the sea to your Octopus' Garden with your submarine robot, Catherine.  I'm heading skyward with robots in space.

2) Jython in the Military is near and dear to me.  Besides, the bossman is giving the talk - it never hurts to show up and support the team.

3) While we're on the subject of Jython, Extending Java Applications with Jython by Wierzbicki is one that has the potential to make me less ignorant.


4) My posts on this blog have all been about Python 3.x - Chun's talk, Python 3: The Next Generation, and Smith's talk about Python 3.x string formatting are two I could seriously benefit from.  When Anthony Baxter gave a tutorial on Python 3 at OSCON a couple of years back, I just about died when he said he wasn't going to cover new string formatting.  He gave me this look like, "Wow, who are you, are you nuts?"  And I gave him this look like, "A total Python nobody, and . . . as a matter of fact, yeah."  Anyway, the string formatting one is pretty important to me.


5) Changes to Unittest and Intro to Unittest - yeah, four years after being exposed to testing, I'm still taking baby steps - sue me.


6) The Diversity Suite - one diversity talk and two technical ones:  Diversity as a Dependency, Infrastructure Construction in Africa, and Hackerlab.
Anna R. is giving the diversity one - I attended her women in computing talk a few years back and learned a bit, plus she was cool enough to let me dink around with her brand new minicomputer - a novelty at the time.
There is so little about Africa out there on the web relative to Europe and the Far East - I want to see what's going on.
Hackerlab (actually Think Globally, Hack Locally - Teaching Python in Your Community) is Leigh's talk.  Leigh is a smart security person.  In addition she's upbeat and happy.  I, by contrast, am morose and depressive.  I am hoping to get smarter and more upbeat through the process of osmosis.

7) The Python Language Suite - The Mighty Dictionary, Deconstruction of an Object, Decorators, The Command Line.  I'm not as strong on some of the basics as I'd like to be.  No shortage of places to learn and get up to speed.


8) Dealing with unsightly data in the real world.  Wow, the story of my life - "Singing my life with his words, Killing me softly with his talk . . ."


There's a zillion other talks I'd like to attend; actually, I probably will end up attending only half of these.  I'm not a web programmer - there's a ton of talks on that end of things.  Check it out.

Wednesday, November 18, 2009

Unicode Normalization - Python 3.x Unicode Identifiers

As part of my effort to get up to speed on Unicode as it relates to Python 3.x identifiers (variables), I'm going to step back from my previous posts about the more exotic languages to something more basic in a Latin script.  While I hope to demonstrate the concept of Unicode normalization in Asian scripts in a future post, I need to walk before I run.

(Aside:  this is part of my preparation for a poster submission I've made for Pycon 2010.  Although it's too late to submit a talk, you can still submit a proposal for a poster presentation until, IIRC, the end of this month:  see the Pycon 2010 poster information page.  While I'd like to see my submission accepted, if you've achieved economical nuclear fusion, a cure for cancer or the common cold, or just a series of poster sessions that blow mine away, I'll graciously accept the loss.  Pycon 2010 has a great deal of good momentum on the technical end - there's still time to be part of that.)

To the Python interpreter!
>>> import unicodedata
>>> singlecodepointvar = chr(228)
>>> singlecodepointvar
'ä'
>>> decomposedunicodevar = unicodedata.decomposition(singlecodepointvar)
>>> decomposedunicodevar
'0061 0308'
>>> decomposedunicodevar = chr(0x61) + chr(0x308)
>>> decomposedunicodevar
'ä'

>>> # that may or may not have shown up correctly in your browser
>>> # I've had mixed luck with the decomposed ä
>>> # string literals do not get normalized (made to be equal)
>>> singlecodepointvar == decomposedunicodevar
False
>>> len(singlecodepointvar)
1
>>> len(decomposedunicodevar)
2
>>> unicodedata.name(singlecodepointvar[0])
'LATIN SMALL LETTER A WITH DIAERESIS'
>>> unicodedata.name(decomposedunicodevar[0])
'LATIN SMALL LETTER A'
>>> unicodedata.name(decomposedunicodevar[1])
'COMBINING DIAERESIS'
>>> # copy and paste two same looking string literals for good measure
>>> 'ä' == 'ä'
False
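>>> # (aside:  the single codepoint spelling is the composed (NFC)
>>> # form; the two codepoint spelling is the decomposed (NFD) form)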
>>> # now, demonstrate normalization for identifiers
>>> # using the regular a umlaut as an identifier (chr(228))
>>> ä = 22
>>> # now using length two decomposed a with umlaut (chr(0x61) + chr(0x308))
>>> ä = 44
>>> # eval single codepoint
>>> eval(chr(228))
44
>>> # same - now eval decomposed a with umlaut
>>> eval(chr(0x61) + chr(0x308))
44
>>> # same again
>>> # how is it stored?
>>> globals()
{'ä': 44,
. . . a bunch of other variables . . .}
>>> # copy and paste identifier
>>> len('ä')
1
>>> ord('ä')
228
>>> # OK - it goes with the single codepoint 228
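>>> # (an aside of mine, not part of the original session:  per
>>> # PEP 3131, the parser applies NFKC normalization to identifiers,
>>> # which is why both spellings above name the same variable)
>>> unicodedata.normalize('NFKC', chr(0x61) + chr(0x308)) == chr(228)
True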

>>> # normalization of string literals
>>> a = chr(0x61) + chr(0x308)
>>> a
'ä'
>>> b = unicodedata.normalize('NFKC', a)
>>> b
'ä'
>>> len(b)
1
>>> len(a)
2
>>> ord(b)
228

>>> # normalizes to single code point 228

OK, that's a simple example of Unicode normalization as it applies to Python 3.x identifiers and Python 3.x string literals.  I really beat this point to death, but seeing is believing.  
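
One practical takeaway: if you ever need to compare user-supplied strings without tripping over this, normalize both sides first.  Here's a little sketch of my own (standard library only - the helper name nfcequal is made up):

import unicodedata

def nfcequal(strone, strtwo):
    # normalize both strings to the composed (NFC) form, then compare
    return unicodedata.normalize('NFC', strone) == unicodedata.normalize('NFC', strtwo)

print(nfcequal(chr(228), chr(0x61) + chr(0x308)))    # True
print(chr(228) == chr(0x61) + chr(0x308))            # False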

Next time I delve into this topic I hope to do so with some East Asian characters.  Until then, happy unicoding (if google is a verb, why not?) in Python 3.x.

Thursday, November 12, 2009

Python 3.1 and bidi text - what you see isn't (always) what you get

Last time I was working on understanding UTF-8 in Python 3.1.  This time I'll try to be a little more adventurous and take on bidirectional text - Arabic.

(At this point I should mention I got the original idea to start messing around with this stuff from a couple folks on the python-diversity list.  One individual  was particularly interested in Arabic, another in languages of the Indian subcontinent.)

So I made an attempt at some Arabic text and an Arabic Python identifier.  To get output that looked like valid Python, I tried OpenOffice Writer 3.0.1, gedit (in Gnome), and Microsoft Word 2007 on another computer, with very mixed results.  My locale is English US.  Arabic text pasted into Word was left justified, but backwards.  In OpenOffice it was right justified.  Finally I settled on gedit, shown in the screenshot:



Two variables - one is "a", the other the Arabic word for "Arabic."  What's interesting is that the cursor position on line two shows up as column 13.  To someone used to Latin script, it really is counting backwards.

"There's no way this will work in Python", I thought.  Oh me of little faith (feel free to groan) - another screenshot:




Wow.  It worked!  The screenshot is from idle.  Let's see how well this went (another screenshot):




Well, not perfect by any stretch - the Arabic text has to be reversed to show up correctly.  (disclaimer:  idle has been a great tool while I investigate scripts and Unicode in Python; it doesn't claim to deal with bidirectional text nor was I expecting it to.)

OK, we're finished with the screenshots.  From here on out we can rip apart the bytes that make up the arabicx.py file above and see what's going on:

>>> # open python file as bytes
>>> # (file was saved encoded as UTF-8)
>>> fle = open('arabicx.py', 'rb')
>>> # for line of text in arabicx.py
>>> for linex in fle:
    print('new line')
    for bytex in linex:
        print(bin(bytex))

       
new line
0b1100001
0b100000
0b111101
0b100000
0b100111
0b11011000
0b10100111
0b11011001
0b10000100
0b11011000
0b10111001
0b11011000
0b10110001
0b11011000
0b10101000
0b11011001
0b10001010
0b11011000
0b10101001
0b100111
0b1010
new line
0b11011000
0b10100111
0b11011001
0b10000100
0b11011000
0b10111001
0b11011000
0b10110001
0b11011000
0b10101000
0b11011001
0b10001010
0b11011000
0b10101001
0b100000
0b111101
0b100000
0b110010
0b110010
0b1010


Lots of ones and zeros - that's OK - they will be useful:


new line
# the file starts out with
#     ASCII character 97 ("a")
0b1100001

# then a space - ASCII char 32
0b100000

# then an equals sign - ASCII char 61
0b111101

# then another space - ASCII char 32
0b100000

# then a single quote - ASCII char 39
0b100111

# then a bunch of Arabic characters
#     that are entered in order and show up
#     right to left.
# We'll just look at the first one:
# The next two bytes represent the ARABIC LETTER ALEF
#     Indeed, this is the rightmost letter in the 
#     first screenshot above
0b11011000
0b10100111
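
# (the twelve bytes below are the remaining
#     six Arabic letters, two bytes each)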
0b11011001
0b10000100
0b11011000
0b10111001
0b11011000
0b10110001
0b11011000
0b10101000
0b11011001
0b10001010
0b11011000
0b10101001

# then a closing single quote - ASCII char 39
0b100111

# then the newline char ('\n') - ASCII char 10
0b1010
new line

# The next two bytes represent 
#     the ARABIC LETTER ALEF (again -
#     this time as part of a Python
#     identifier)

0b11011000
0b10100111
    .

    .
    .
   etc.
   etc.
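
As a sanity check on that annotation (my own addition, not part of the original dump), we can hand the first two bytes of the Arabic word back to Python and ask for the character's name:

>>> import unicodedata
>>> unicodedata.name(bytes([0b11011000, 0b10100111]).decode('UTF-8'))
'ARABIC LETTER ALEF'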


What's worth noting is that even though the indentation looks all wrong in gedit, the file parsed fine in the Python 3.1 interpreter.  At the risk of stating the obvious, it's the arrangement of the bytes on disk and in memory in UTF-8 encoded source that matters, not how the letters show up on screen in your application of choice, be it idle, vim, emacs (I haven't tried Emacs on this problem), gedit, OpenOffice Writer (saved as UTF-8 encoded text - same for Word on Windows), etc.

I know people have had some luck getting Arabic string literals to show up correctly in Python 2.x.  If anyone knows of a good editor or set up for the Python 3.x series, I'd be grateful for your help.
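
(For what it's worth, the unicodedata module can at least report each character's bidirectional category - the raw data a bidi-aware display algorithm works from.  A quick sketch of my own, using the ALEF from above:)

>>> import unicodedata
>>> for charx in 'a=' + chr(0x627):
    print(unicodedata.name(charx), unicodedata.bidirectional(charx))

LATIN SMALL LETTER A L
EQUALS SIGN ON
ARABIC LETTER ALEF AL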

If all else fails, the bytes tell the story. 

Sunday, November 8, 2009

Python 3 and UTF-8 - getting a handle on it

Last time I attempted a dive into the world of Unicode.  Judging from the comments, I see I've got a bit to learn.  With that in mind, I've started reading Fonts & Encodings (the deer book) and made a visit to a useful UK site.

The first thing that I might want to understand before proceeding on the odyssey of UTF-8 or Unicode identifiers in Python is UTF-8 itself.  What does UTF-8 encoding mean?  (For those who know this stuff already, I've probably saved you some time; for those who are still kind enough to read further and check for accuracy, thank you!)

Enough waffling, back to the problem!

From last time, the Lao combination of characters:

LAO LETTER SO TAM
LAO VOWEL SIGN YY
LAO TONE MAI EK


Screenshot:




The screenshot doesn't do the Lao script justice, but at least we have something to work from.

>>> # strx is the three-glyph Lao string from last time
>>> strx = chr(3722) + chr(3767) + chr(3784)
>>> for glyphx in strx:
    print(ord(glyphx))

   
3722
3767
3784


OK, we've got our three Unicode codepoints for our glyphs:  3722, 3767, 3784.  Let's see if we can encode our string as UTF-8, look at it, and get back to those values.

>>> strutf8x = strx.encode('UTF-8')
>>> for bytex in strutf8x:
    print(bin(bytex))

    
0b11100000
0b10111010
0b10001010
0b11100000
0b10111010
0b10110111
0b11100000
0b10111011
0b10001000



At first glance, it looks like a bunch of ones and zeros, but there is a pattern.

In the first byte 0b11100000 the 111 is saying, "We'll need three bytes, including this one, for the integer representation of the codepoint for this glyph."

The zero after the three ones 0 is saying, "OK, we're finished counting bytes, I don't count for anything other than a flag that we're finished that part, but my friend to the right counts as your most significant bit."

As it turns out, none of the remaining bits 0000 has any value, so the next two bytes will have to express the number 3722.

0b10111010, 0b10001010 - these two continuation bytes have flags as well.  They both begin with 10, which says, "I am a flag; your bits for calculation start after me."

So let's see if this holds up:

>>> int('111010001010', 2)
3722


Good deal.  We were able to work backwards from UTF-8 and get the same first ordinal number (3722) for the glyph we started out with.
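
Just for kicks, here's a function of my own (a sketch - decodethreebytes isn't anything from the standard library) that does the same unpacking with bit operations: mask off the flag bits, then splice the payloads together:

>>> def decodethreebytes(byte1, byte2, byte3):
    # keep the low 4 bits of the lead byte (masks off the 1110 flag)
    valuex = byte1 & 0b00001111
    # keep the low 6 bits of each continuation byte (masks off the 10 flag)
    valuex = (valuex << 6) | (byte2 & 0b00111111)
    valuex = (valuex << 6) | (byte3 & 0b00111111)
    return valuex

>>> decodethreebytes(0b11100000, 0b10111010, 0b10001010)
3722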

OK, fine, but what does ASCII look like encoded this way?  Seven bits gives you 127 as your highest number.  Let's see:

>>> import unicodedata
>>> asciistr = chr(126)
>>> print(asciistr)
~
>>> unicodedata.name(asciistr)
'TILDE'





The venerable tilde . . .

>>> utf8asciix = asciistr.encode('UTF-8')
>>> print(bin(utf8asciix))
Traceback (most recent call last):
  File "", line 1, in
    print(bin(utf8asciix))

TypeError: 'bytes' object cannot be interpreted as an integer

Gaaaahhhh!  What the heck did I do wrong?!  I tried to use the bin() function on a sequence of bytes - that's what I did wrong, duh!  Let's try this again:


>>> for bytex in utf8asciix:
    print(bin(bytex))

   
0b1111110


Much better, but wait a second - there are six bits flipped on in a row - where is my leading 10 like we saw above?  For ASCII, it's not there.  This might make it clearer:

>>> bytereprsntn = bin(utf8asciix[0])
>>> print(bytereprsntn)
0b1111110

>>> bytereprsntn = bytereprsntn[0:2] + '0' + bytereprsntn[2:]
>>> bytereprsntn
'0b01111110'


OK, we've got all eight bits in our byte now - bin() does not use any more digits than it needs to express a number in binary.  The 0 prior to the  1111110 bits lets UTF-8 know that it's dealing with a codepoint that falls within the ASCII range.
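
As a final sanity check of my own (standard library only), here's how many bytes UTF-8 spends on codepoints from different ranges - one for ASCII, two for Latin letters with diacritics, three for our Lao glyphs, and four for anything outside the Basic Multilingual Plane:

>>> for codepointx in (126, 228, 3722, 0x1D11E):
    print(codepointx, len(chr(codepointx).encode('UTF-8')))

126 1
228 2
3722 3
119070 4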

When the Python 3.1 interpreter reads UTF-8 encoded source, this is the process it employs to get at your Unicode identifiers (although it's implemented far more efficiently than I've described it).