(Aside: this is part of the preparation for a poster submission I have made for Pycon 2010. Although it's too late to submit a talk, you can still submit a proposal for a poster presentation until, IIRC, the end of this month; see the Pycon 2010 poster information page. While I'd like to see my submission accepted, if you've created economic nuclear fission, a cure for cancer or the common cold, or just a series of poster sessions that blow mine away, I'd graciously accept the loss. Pycon 2010 has a great deal of good momentum on the technical end - there's still time to be part of that.)
To the Python interpreter!
>>> import unicodedata
>>> singlecodepointvar = chr(228)
>>> singlecodepointvar
'ä'
>>> decomposedunicodevar = unicodedata.decomposition(singlecodepointvar)
>>> decomposedunicodevar
'0061 0308'
>>> decomposedunicodevar = chr(0x61) + chr(0x308)
>>> decomposedunicodevar
'ä'
>>> # that may or may not have shown up correctly in your browser
>>> # I've had mixed luck with the decomposed ä
>>> # string literals do not get normalized (made to be equal)
>>> singlecodepointvar == decomposedunicodevar
False
>>> len(singlecodepointvar)
1
>>> len(decomposedunicodevar)
2
>>> unicodedata.name(singlecodepointvar[0])
'LATIN SMALL LETTER A WITH DIAERESIS'
>>> unicodedata.name(decomposedunicodevar[0])
'LATIN SMALL LETTER A'
>>> unicodedata.name(decomposedunicodevar[1])
'COMBINING DIAERESIS'
>>> # copy and paste two same looking string literals for good measure
>>> 'ä' == 'ä'
False
>>> # now, demonstrate normalization for identifiers
>>> # using regular a umlaut as identifier (chr(228))
>>> ä = 22
>>> # now using length two decomposed a with umlaut (chr(0x61) + chr(0x308))
>>> ä = 44
>>> # eval single codepoint
>>> eval(chr(228))
44
>>> # same - now eval decomposed a with umlaut
>>> eval(chr(0x61) + chr(0x308))
44
>>> # same again
>>> # how is it stored?
>>> globals()
{'ä': 44, . . . a bunch of other variables . . .}
>>> # copy and paste identifier
>>> len('ä')
1
>>> ord('ä')
228
>>> # OK - it goes with the single codepoint 228
>>> # normalization of string literals
>>> a = chr(0x61) + chr(0x308)
>>> a
'ä'
>>> b = unicodedata.normalize('NFKC', a)
>>> b
'ä'
>>> len(b)
1
>>> len(a)
2
>>> ord(b)
228
>>> # normalizes to single code point 228
OK, that's a simple example of Unicode normalization as it applies to Python 3.x identifiers and Python 3.x string literals. I really beat this point to death, but seeing is believing.
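For anyone who would rather run this as a script than retype the session, here is a minimal sketch of the same two observations. The helper name same_text and the scratch dictionary namespace are my own inventions for illustration, and I'm assuming a CPython 3.x interpreter, where identifiers are normalized to NFKC when the source is compiled.

```python
import unicodedata

composed = chr(228)                  # 'ä' as a single code point
decomposed = chr(0x61) + chr(0x308)  # 'a' followed by COMBINING DIAERESIS


def same_text(left, right, form="NFKC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


# A plain comparison sees two different code point sequences.
raw_equal = composed == decomposed

# After explicit normalization the two spellings compare equal.
normalized_equal = same_text(composed, decomposed)

# Identifiers: assign through the decomposed spelling in a scratch namespace,
# then look at the name the compiler actually stored.
namespace = {}
exec(decomposed + " = 44", namespace)
stored_names = [name for name in namespace if not name.startswith("__")]

print(raw_equal, normalized_equal)
print(stored_names == [composed], len(stored_names[0]))
```

The exec call stands in for typing the assignment at the prompt; the filter on double underscores only hides the __builtins__ entry that exec adds to the dictionary. Note that the dictionary key itself is an ordinary string, so looking it up with the decomposed spelling would not find it unless you normalize first.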
Next time I delve into this topic I hope to do so with some East Asian characters. Until then, happy unicoding (if google is a verb, why not?) in Python 3.x.