Wednesday, November 18, 2009

Unicode Normalization - Python 3.x Unicode Identifiers

As part of my effort to get up to speed on Unicode as it relates to Python 3.x identifiers (variables), I'm going to step back from my previous posts about the more exotic languages to something more basic in a Latin script.  While I hope to demonstrate the concept of Unicode normalization in Asian scripts in a future post, I need to walk before I run.

(Aside:  this is part of preparation for a poster submission I have made for PyCon 2010.  Although it's too late to submit a talk, you can still submit a proposal for a poster presentation until, IIRC, the end of this month:  PyCon 2010 poster information page.  While I'd like to see my submission accepted, if you've created economic nuclear fission, a cure for cancer or the common cold, or just a series of poster sessions that blow mine away, I'd graciously accept the loss.  PyCon 2010 has a great deal of good momentum on the technical end - there's still time to be part of that.)

To the Python interpreter!
>>> import unicodedata
>>> singlecodepointvar = chr(228)
>>> singlecodepointvar
'ä'
>>> decomposedunicodevar = unicodedata.decomposition(singlecodepointvar)
>>> decomposedunicodevar
'0061 0308'
>>> decomposedunicodevar = chr(0x61) + chr(0x308)
>>> decomposedunicodevar
'ä'

>>> # that may or may not have shown up correctly in your browser
>>> # I've had mixed luck with the decomposed ä
>>> # string literals do not get normalized (made to be equal)
>>> singlecodepointvar == decomposedunicodevar
False
>>> len(singlecodepointvar)
1
>>> len(decomposedunicodevar)
2
>>> unicodedata.name(singlecodepointvar[0])
'LATIN SMALL LETTER A WITH DIAERESIS'
>>> unicodedata.name(decomposedunicodevar[0])
'LATIN SMALL LETTER A'
>>> unicodedata.name(decomposedunicodevar[1])
'COMBINING DIAERESIS'
>>> # copy and paste the two same-looking string literals for good measure
>>> # (one is the single code point, the other is the decomposed pair)
>>> 'ä' == 'ä'
False
>>> # now, demonstrate normalization for identifiers
>>> # using regular a umlaut as identifier (chr(228))
>>> ä = 22
>>> # now using length two decomposed a with umlaut (chr(0x61) + chr(0x308))
>>> ä = 44
>>> # eval single codepoint
>>> eval(chr(228))
44
>>> # same - now eval decomposed a with umlaut
>>> eval(chr(0x61) + chr(0x308))
44
>>> # same again
>>> # how is it stored?
>>> globals()
{'ä': 44,
. . . a bunch of other variables . . .}
>>> # copy and paste identifier
>>> len('ä')
1
>>> ord('ä')
228
>>> # OK - it goes with the single codepoint 228

>>> # normalization of string literals
>>> a = chr(0x61) + chr(0x308)
>>> a
'ä'
>>> b = unicodedata.normalize('NFKC', a)
>>> b
'ä'
>>> len(b)
1
>>> len(a)
2
>>> ord(b)
228

>>> # normalizes to single code point 228
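
Since string literals are left alone, any comparison between the two spellings has to normalize both sides first. Here is a minimal sketch of that idea as a script rather than an interpreter session. The helper name same_text is my own invention for illustration (it is not part of unicodedata), and the default of NFC is just an assumption; any of the four forms works as long as both sides use the same one.

```python
import unicodedata


def same_text(left, right, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


# Escapes are used so the two spellings cannot be mangled by copy and paste.
composed = "\u00e4"     # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"  # LATIN SMALL LETTER A + COMBINING DIAERESIS

# Raw comparison looks at code points, so the two spellings differ.
print(composed == decomposed)

# Normalizing both sides first makes them compare equal.
print(same_text(composed, decomposed))

# The composing forms give one code point, the decomposing forms give two.
lengths = {form: len(unicodedata.normalize(form, decomposed))
           for form in ("NFC", "NFD", "NFKC", "NFKD")}
print(lengths)
```

Writing the characters as escapes also sidesteps the browser rendering problem mentioned earlier in the session.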
 

OK, that's a simple example of Unicode normalization as it applies to Python 3.x identifiers and Python 3.x string literals.  I really beat this point to death, but seeing is believing.  
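
The identifier half of the session can also be checked from a script. This is only a sketch: it builds the two assignments as source strings and runs them with exec in a scratch dictionary (the name namespace is my own choice), so that the decomposed identifier survives copy and paste. It relies on the behavior seen above, namely that the parser normalizes identifiers while string literals and dictionary keys are left alone.

```python
import unicodedata

composed_name = "\u00e4"     # single code point 228
decomposed_name = "a\u0308"  # 'a' followed by COMBINING DIAERESIS

# Scratch dictionary standing in for the interpreter's globals().
namespace = {}
exec(composed_name + " = 22", namespace)
exec(decomposed_name + " = 44", namespace)

# exec adds '__builtins__' to the dictionary; filter it out.
names = [key for key in namespace if key != "__builtins__"]

# Only one variable exists, stored under the single code point.
print(len(names), [ord(ch) for ch in names[0]])

# The second assignment overwrote the first.
print(namespace[composed_name])

# Dictionary keys are plain strings, so the decomposed spelling is not
# found unless it is normalized first.
print(decomposed_name in namespace)
print(unicodedata.normalize("NFKC", decomposed_name) in namespace)
```

The last two lines are the practical catch: code that looks up names by string (globals(), getattr and friends) has to do its own normalization, because only the parser does it for you.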

Next time I delve into this topic I hope to do so with some East Asian characters.  Until then, happy unicoding (if google is a verb, why not?) in Python 3.x.
