This project started earlier this year with the goal of making Python more accessible to non-native speakers of English. A personal goal of mine was to get a current survey of what is available in the way of Python information in languages other than English.
The project is by no means original. On the Python Wiki there were at least two other pages upon which this one was built: PythonAroundTheWorld and CategoryLanguages. The latter, to my recollection, was the work of Andrew Kuchling and many others. That base survey of knowledge to build upon made this project easy and fun.
On Christmas Day I received an autogenerated e-mail notifying me of changes to the ArabicLanguage page. Anass Ahmed had built upon the work of Ahmed Youssef (author of Prayertime) and added a Wikipedia-like section of general information on Python to the page. What didn't exist five months ago is now on par with the Thai and Russian pages, both of which are pretty robust.
Update: Ahmed Y. got in touch with me and let me know there were some problems with the Arabic page and the Wikipedia-like entry. @siah on Twitter had previously tweeted that some of the direct translations (Python = Pybon) were a bit comical. Well, all I can say is it's nice having some Arabic speakers covering your back.
I personally am intrigued by the possibilities that Python 3 offers for Unicode identifiers. The coolest thing I've found on this end of things is a couple of PDF links on the Khmer (Cambodian) page. Python 3.0 had hardly come out when an author or authors put together two quality PDF docs almost entirely in Khmer. Python keywords and imported library names are the only English in the docs.
There are a few more nuggets in the language pages. The definition of nugget will vary according to your interests, your skill level, and your programming tastes. Please feel free to add to the pages; they are by no means complete.
Thanks to Rami Chowdhury, vid svashka, Kirby Urner, and a host of others for their encouragement, advice, and ideas on this project. Special thanks to the many bilingual pythonistas and friends of the Python community who contributed code snippets, texts, and links.
Saturday, December 26, 2009
Tuesday, December 22, 2009
Python's unicodedata Module
unicodedata has been part of the standard library since Python 2.0. Some names currently associated with checkins to Python's svn source tree are von Löwis, Lemburg, Forgeot d'Arc, Ruohonen, Lundh. I probably won't get a chance to meet you folks, so I'll take this quick opportunity to say "Thanks". unicodedata is a lot of fun and brings everything you need from Unicode into the Python interpreter.
There is plenty of information about Unicode out at unicode.org. Putting it together with a module and a script is essentially what unicodedata's authors have done. For those interested, the makeunicodedata.py (http://svn.python.org/view/python/trunk/Tools/unicode/makeunicodedata.py?view=markup) script written by F. Lundh gives a high-level overview of how some parts of the module are generated.
Rather than just rehash the unicodedata documentation, I'd like to give an example of each method or member of the module and craft a few handy tricks toward the end of this blog entry.
To the interpreter!
Python 3.1.1 (r311:74480, Nov 29 2009, 22:24:25)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "copyright", "credits" or "license()" for more information.
>>> import unicodedata
>>> dir(unicodedata)
['UCD', '__doc__', '__file__', '__name__', '__package__', 'bidirectional', 'category', 'combining', 'decimal', 'decomposition', 'digit', 'east_asian_width', 'lookup', 'mirrored', 'name', 'normalize', 'numeric', 'ucd_3_2_0', 'ucnhash_CAPI', 'unidata_version']
>>> # attempt to deal with UCD and ucd_3_2_0
>>> # ucd_3_2_0 is essentially unicodedata,
>>> # but for an earlier version of Unicode
>>> # (3.2.0)
>>> # Version of Unicode Python 3.1 is using:
>>> unicodedata.unidata_version
'5.1.0'
>>> unicodedata.UCD
<class 'unicodedata.UCD'>
>>> unicodedata.ucd_3_2_0
<unicodedata.UCD object at 0x...>
>>> # UCD isn't meant to be used extensively
>>> # in the Python interpreter.
>>> # Everything I tried with it (including
>>> # inheritance) raised an exception.
>>> type(unicodedata.ucd_3_2_0) == unicodedata.UCD
True
>>> # except that - let's leave these underlying
>>> # details to the folks who wrote the module.
>>> # Speaking of which - ucnhash_CAPI
>>> help(unicodedata.ucnhash_CAPI)
Help on PyCapsule object:
class PyCapsule(object)
| Capsule objects let you wrap a C "void *" pointer in a Python
| object. They're a way of passing data through the Python interpreter
| without creating your own custom type.
|
| Capsules are used for communication between extension modules.
| They provide a way for an extension module to export a C interface
| to other extension modules, so that extension modules can use the
| Python import mechanism to link to one another.
|
| Methods defined here:
|
| __repr__(...)
| x.__repr__() <==> repr(x)
>>> # Clear as C! Onto something more tangible . . .
>>> # name and lookup are useful and easy
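As a small illustration of name and lookup, here is a minimal sketch. The helper describe is a made-up name for this example; name, lookup, and the optional default argument to name are part of unicodedata.

```python
import unicodedata


def describe(char):
    """Return (code point in hex, Unicode name) for a single character."""
    return hex(ord(char)), unicodedata.name(char)


# lookup() goes from an official Unicode name to the character ...
thai_six = unicodedata.lookup('THAI DIGIT SIX')
# ... and name() goes from the character back to the name.
print(describe(thai_six))
print(describe('\u00e4'))

# name() raises ValueError for a code point without a name (control
# characters, for instance) unless a default is supplied.
print(unicodedata.name('\x00', 'no name'))
```

The two functions are inverses of each other for any character that has a name, which makes them handy for building test strings without having to paste non-ASCII text into the source.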
normalize and decomposition are two methods I've covered extensively (or at least verbosely) in two previous posts here and here. We're going to skip them.
UAX #9 outlines the bidirectional algorithm and describes associated codes.
The bidirectional algorithm is a bit involved and I've only scratched the surface here. There are a number of codes and even invisible characters for overriding the expected direction of a group of characters. This is the best I can do with this for now.
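To make the bidirectional codes a little more concrete, here is a minimal sketch that prints the class unicodedata.bidirectional reports for a handful of characters. The sample characters and their descriptions are my own choices for illustration, not a complete tour of UAX #9.

```python
import unicodedata

# (character, informal description) pairs chosen for illustration
samples = [
    ('a', 'Latin letter'),
    ('\u05d0', 'Hebrew letter alef'),
    ('\u0627', 'Arabic letter alef'),
    ('1', 'European digit'),
    (' ', 'space'),
    ('\u202e', 'right-to-left override (invisible)'),
]

bidi_classes = {}
for char, note in samples:
    # bidirectional() returns the class as a short string such as 'L' or 'R'
    bidi_classes[char] = unicodedata.bidirectional(char)
    print('U+{0:04X} {1:<36}{2}'.format(ord(char), note, bidi_classes[char]))
```

The last sample is one of the invisible override characters mentioned above; its class is the name of the override itself rather than a plain direction.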
>>> # Parentheses are the classic example
>>> # of mirrored characters
>>> parens = '()'
>>> for charx in parens:
        print('{0:<20}{1}'.format(
            unicodedata.name(charx),
            unicodedata.mirrored(charx)))

LEFT PARENTHESIS    1
RIGHT PARENTHESIS   1
East Asian Width is covered in UAX #11. In short, the property describes how wide a character is displayed in East Asian typography: wide characters take up two cells of a fixed-width grid and narrow ones take up one, which in legacy East Asian encodings usually matched the number of bytes the character took up in memory or on disk. After that, it gets more complicated. The 'a' in the screenshot is an ASCII character, and like the rest of printable ASCII, falls under East Asian Narrow (Na). The Burmese character, on the other hand, does not belong to any East Asian character set, and takes on the designation Neutral (N).
Hey! Wait a second - Myanmar is a lot closer to East Asia than Europe and the Americas. Folks, just trust me and Unicode on this one. I'm not an expert; nonetheless, that's the spec. It's an example of why it's good not to assume too much about Unicode without reading the UAX's and the code charts first. Let's move on.
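For anyone who wants to check the width designations without a screenshot, here is a minimal sketch using unicodedata.east_asian_width. The particular characters are my own picks, one for each of the six values the function can return.

```python
import unicodedata

# (character, informal description) pairs chosen for illustration
samples = [
    ('a', 'ASCII letter'),
    ('\u1000', 'Myanmar letter KA'),
    ('\u65e5', 'CJK ideograph'),
    ('\uff21', 'fullwidth Latin capital A'),
    ('\uff76', 'halfwidth katakana KA'),
    ('\u03b1', 'Greek small alpha'),
]

widths = {}
for char, note in samples:
    # east_asian_width() returns 'Na', 'N', 'W', 'F', 'H' or 'A'
    widths[char] = unicodedata.east_asian_width(char)
    print('U+{0:04X} {1:<28}{2}'.format(ord(char), note, widths[char]))
```

The Greek letter comes back as A (ambiguous): its width depends on whether it shows up in an East Asian context or not, which is part of the "more complicated" bit.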
I have previously written a blog post that covers combining codes a bit. We'll still look a little bit at them along with category codes.
Lo is "Letter, other" (a lowercase letter would be Ll). Since Devanagari doesn't have case, all the letters that can stand on their own are classified this way. Mc is combining-spacing - the character cannot stand alone, but when combined with another character it adds some width to the group of characters - the carriage moves to the left for those of us who learned to touch type on a typewriter back in the Jurassic.
A category of Mn is for combining non-spacing. In this case the virama is below the main character and does not move the cursor forward.
The combining code is typically zero. The virama is such a common character in almost all of the languages of India that it gets its own special code of nine. Actually, there are a number of Unicode characters for the virama - one for each language or character set:
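Here is a minimal sketch of the category and combining lookups described above, plus a scan for characters whose names end in SIGN VIRAMA. The helper names properties and find_viramas, and the scan limit of 0x3000, are assumptions of mine for this example; the list it finds depends on the Unicode version your Python ships with.

```python
import unicodedata


def properties(char):
    """Return (general category, canonical combining class) for char."""
    return unicodedata.category(char), unicodedata.combining(char)


ka = '\u0915'        # DEVANAGARI LETTER KA
aa_sign = '\u093e'   # DEVANAGARI VOWEL SIGN AA
virama = '\u094d'    # DEVANAGARI SIGN VIRAMA

for char in (ka, aa_sign, virama):
    print('{0:<28}{1}'.format(unicodedata.name(char), properties(char)))


def find_viramas(limit=0x3000):
    """Collect (code point, name, combining class) for '... SIGN VIRAMA'."""
    found = []
    for codept in range(limit):
        name = unicodedata.name(chr(codept), '')
        if name.endswith('SIGN VIRAMA'):
            found.append((codept, name,
                          unicodedata.combining(chr(codept))))
    return found


for codept, name, combining_class in find_viramas():
    print('U+{0:04X} {1:<28}{2}'.format(codept, name, combining_class))
```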
All that is left are the numeric categories numeric, digit, and decimal.
>>> # numeric, digit, and decimal
>>> unicodedata.numeric('9')
9.0
>>> unicodedata.decimal('9')
9
>>> unicodedata.digit('9')
9
Wow, I couldn't have picked a more boring example. Still, the fact that numeric yields a float is significant. Let's see if Unicode has any interesting characters:
>>> longunicodename = 'VULGAR FRACTION THREE QUARTERS'
>>> thrqtrs = unicodedata.lookup(longunicodename)
>>> longunicodename = 'BENGALI CURRENCY DENOMINATOR SIXTEEN'
>>> bengali16 = unicodedata.lookup(longunicodename)
>>> longunicodename = 'TAMIL NUMBER ONE THOUSAND'
>>> tamil1k = unicodedata.lookup(longunicodename)
>>> longunicodename = 'THAI DIGIT SIX'
>>> thai6 = unicodedata.lookup(longunicodename)
>>> # numeric is the most general of the groupings
>>> thrqtrs
'¾'
>>> unicodedata.numeric(thrqtrs)
0.75
>>> unicodedata.digit(thrqtrs)
Traceback (most recent call last):
  File "<pyshell#...>", line 1, in <module>
    unicodedata.digit(thrqtrs)
ValueError: not a digit
>>> # whoops!
>>> unicodedata.decimal(thrqtrs)
Traceback (most recent call last):
  File "<pyshell#...>", line 1, in <module>
    unicodedata.decimal(thrqtrs)
ValueError: not a decimal
>>> # no go
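The tracebacks can be avoided: numeric, digit, and decimal each accept an optional default that is returned instead of raising ValueError. A minimal sketch follows; the helper name numeric_views is made up for this example.

```python
import unicodedata


def numeric_views(char):
    """Return (decimal, digit, numeric) for char, None where undefined."""
    return (unicodedata.decimal(char, None),
            unicodedata.digit(char, None),
            unicodedata.numeric(char, None))


samples = [
    '9',
    unicodedata.lookup('THAI DIGIT SIX'),
    unicodedata.lookup('SUPERSCRIPT TWO'),
    unicodedata.lookup('VULGAR FRACTION THREE QUARTERS'),
    unicodedata.lookup('TAMIL NUMBER ONE THOUSAND'),
    unicodedata.lookup('BENGALI CURRENCY DENOMINATOR SIXTEEN'),
    'a',
]

for char in samples:
    print('{0:<40}{1}'.format(unicodedata.name(char), numeric_views(char)))
```

Reading the tuples left to right shows the nesting: anything with a decimal value also has a digit value, and anything with a digit value also has a numeric value, but not the other way around.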
OK, that's a lot to cover. One fun thing you can do is look up a character set or language just based upon the inclusion of the word (TELUGU, for instance) for the language in the character description. It's not the same as a code chart, but it can be handy:
>>> for codeptx in range(40000):
        try:
            if unicodedata.name(chr(codeptx)).find(
                    'TELUGU') > -1:
                print('{0:<40}{1:>4}'.format(
                    unicodedata.name(chr(codeptx)),
                    chr(codeptx)))
        except ValueError:
            pass
And so on . . .
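The same search can be wrapped in a reusable generator. This is a sketch under my own naming (chars_named and its limit parameter are hypothetical); it uses the default argument of unicodedata.name in place of the try/except.

```python
import unicodedata


def chars_named(word, limit=0x110000):
    """Yield (character, name) for named code points containing word."""
    for codept in range(limit):
        char = chr(codept)
        # a default of None avoids the ValueError for unnamed code points
        name = unicodedata.name(char, None)
        if name is not None and word in name:
            yield char, name


telugu = list(chars_named('TELUGU', 40000))
for char, name in telugu[:5]:
    print('{0:<40}{1:>4}'.format(name, char))
```

Leaving limit at its default scans every code point, which is slower but also catches scripts encoded beyond the Basic Multilingual Plane.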
Well, that's what I think I know for now. Stay tuned.
Monday, December 21, 2009
More About Python 3.1 Unicode Identifiers
In a previous post, I did a simple demonstration of Unicode decomposition and normalization with the Latin character ä. This time I will do the same demonstration with non-Latin characters beyond the range of 255.
To the interpreter!
We'll need to insert a couple screenshots to make sure the Malayalam characters I'm using show up.
>>> import unicodedata
>>> for ltr in malayalamword[:4]:
        print('{0:<30} {1:>#6x} {2}'.format(
            unicodedata.name(ltr),
            ord(ltr), ltr))
>>> # decompose fourth letter (vowel sign O)
>>> unicodedata.decomposition(chr(0xd4a))
'0D46 0D3E'
>>> # make identifier
>>> # use letter RA to prevent error
>>> exec(chr(0xd30) + chr(0xd4a) +
         ' = 22')
>>> eval(chr(0xd30) + chr(0xd4a))
22
>>> # attempt to use eval with decomposition
>>> eval(chr(0xd30) + chr(0xd46) +
         chr(0xd3e))
22
>>> # same
>>> # now set identifier with decomposed char
>>> exec(chr(0xd30) + chr(0xd46) +
         chr(0xd3e) + ' = 44')
>>> eval(chr(0xd30) + chr(0xd46) +
         chr(0xd3e))
44
>>> eval(chr(0xd30) + chr(0xd4a))
44
>>> # find representation of identifier
>>> localsx = locals()
>>> localidentifiers = [idx for idx
                        in localsx]
>>> localidentifiers.sort()
>>> localidentifiers.reverse()
>>> ord(localidentifiers[0][1])
3402
>>> hex(3402)
'0xd4a'
>>> # normalizes to single character
Notes:
1) I had to use the letter RA to start the identifier; the vowel sign is a combining character and cannot be used to start an identifier.
2) Armed with a little knowledge of Unicode, I find the unicodedata module quite handy. I owe a lengthy post to this module and its authors, but I don't quite have a handle on its full scope yet.
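The behavior in the session matches what PEP 3131 specifies: identifiers are converted to normalization form NFKC when the source is parsed. Here is a minimal sketch that checks this without relying on locals(); the names ra, composed, decomposed, and namespace are mine, chosen for this example.

```python
import unicodedata

ra = chr(0xd30)                             # MALAYALAM LETTER RA
composed = ra + chr(0xd4a)                  # RA + VOWEL SIGN O
decomposed = ra + chr(0xd46) + chr(0xd3e)   # RA + VOWEL SIGN E + VOWEL SIGN AA

# The decomposed spelling normalizes to the composed one under NFKC.
nfkc_matches = unicodedata.normalize('NFKC', decomposed) == composed

# Assign through the decomposed spelling, then see which key was stored.
namespace = {}
exec(decomposed + ' = 44', namespace)
stored_names = [key for key in namespace if not key.startswith('__')]

print(nfkc_matches)
print([hex(ord(char)) for char in stored_names[0]])

# A combining vowel sign alone is not a valid identifier; with RA it is.
print(chr(0xd4a).isidentifier(), composed.isidentifier())
```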
Sunday, December 6, 2009
Testing - baby steps (unittest)
A couple days ago I posted some code that dealt with reading bytes from a file and manually interpreting them (as opposed to using the Python3.x bytes type's decode() method).
It occurred to me afterwards that I really should have written some kind of test for the code, even if it was a fairly crude snippet. I've heard everything from "Not testing is evil" within the Python community to "Look, that's developer stuff; you're not a developer; you're a geologist . . .", etc. in the workplace. My personal thought is that if I'm writing code, I should be testing it no matter what anyone says.
The code below is my attempt to test what I wrote the other day - it did reveal a couple errors and omissions, so it ended up being a good use of time. I've probably abused unittest's setUp() method a bit (making sure I was at the start of my test file). Otherwise, I hope it's an acceptable start.
To the code!
import unittest
import handlebytes

class TestBytes(unittest.TestCase):
    def setUp(self):
        self.filenamex = '/usr/home/carl/pythonblog/foreignbytestest'
        self.fbx = handlebytes.FileByter(self.filenamex)

    def testreadchar(self):
        self.setUp()
        self.fbx.readchar()
        self.assertEqual(self.fbx.currentcharord,
                         0x65e5)
        self.assertEqual(self.fbx.charbytes,
                         b'\xe6\x97\xa5')

    def testgimmebyte(self):
        self.setUp()
        self.fbx.gimmebyte()
        self.assertEqual(self.fbx.currentbyte,
                         b'\xe6')

    def testinterpfirstbyte(self):
        # one byte ASCII
        self.assertEqual(handlebytes.interpfirstbyte(b'\x7f'),
                         (1, 0x7f))
        # forbidden zone between ASCII and UTF-8 first byte
        self.assertEqual(
            handlebytes.interpfirstbyte(b'\xbf'),
            handlebytes.ERRORX)
        # two bytes
        self.assertEqual(handlebytes.interpfirstbyte(b'\xd7'),
                         (2, 0x17))
        # three bytes
        self.assertEqual(handlebytes.interpfirstbyte(b'\xeb'),
                         (3, 0xb))
        # four bytes
        self.assertEqual(handlebytes.interpfirstbyte(
            b'\xf4'), (4, 0x4))
        # five bytes
        self.assertEqual(handlebytes.interpfirstbyte(
            b'\xf9'), (5, 0x1))
        # six bytes
        self.assertEqual(handlebytes.interpfirstbyte(
            b'\xfd'), (6, 0x1))
        # beyond range
        self.assertEqual(handlebytes.interpfirstbyte(
            b'\xfe'), handlebytes.ERRORX)

    def testinterpsubsqntbyte(self):
        self.assertEqual(handlebytes.interpsubsqntbyte(
            b'\x9b'), 0x1b)
        # beyond range
        self.assertEqual(handlebytes.interpsubsqntbyte(
            b'\xc0'), handlebytes.ERRORY)

if __name__ == '__main__':
    unittest.main()
And to the command line:
$ python3.1 unittest_handlebytes.py
.Not a valid first byte of a UTF-8 character sequence.
Not a valid first byte of a UTF-8 character sequence.
.Not a valid byte for a UTF-8 multibyte sequence.
..
----------------------------------------------------------------------
Ran 4 tests in 0.001s
OK
$
Ahhh . . . nirvana!
Update 7DEC09: I took the advice of one of the commenters and clarified the setUp method use. In addition, I added a tearDown method and made separate tests for each initial UTF-8 byte.
Because a file is being read sequentially, it seemed safest to have two separate classes for the tests that read the file.
import unittest
import handlebytes

class TestChar(unittest.TestCase):
    def setUp(self):
        self.filenamex = '/usr/home/carl/pythonblog/foreignbytestest'
        self.fbx = handlebytes.FileByter(self.filenamex)

    def testreadchar(self):
        self.fbx.readchar()
        self.assertEqual(self.fbx.currentcharord, 0x65e5)
        self.assertEqual(self.fbx.charbytes, b'\xe6\x97\xa5')

    def tearDown(self):
        # read to the end of the file so that FileByter closes it;
        # reading from the closed file then raises ValueError
        while 1:
            try:
                self.fbx.gimmebyte()
            except ValueError:
                break

class TestByte(unittest.TestCase):
    def setUp(self):
        self.filenamex = '/usr/home/carl/pythonblog/foreignbytestest'
        self.fbx = handlebytes.FileByter(self.filenamex)

    def testgimmebyte(self):
        self.fbx.gimmebyte()
        self.assertEqual(self.fbx.currentbyte, b'\xe6')

    def tearDown(self):
        while 1:
            try:
                self.fbx.gimmebyte()
            except ValueError:
                break

class TestFirstByte(unittest.TestCase):
    def testascii(self):
        # one byte ASCII
        self.assertEqual(handlebytes.interpfirstbyte(b'\x7f'),
                         (1, 0x7f))

    def testbadascii(self):
        # forbidden zone between ASCII and UTF-8 first byte
        self.assertEqual(handlebytes.interpfirstbyte(b'\xbf'),
                         handlebytes.ERRORX)

    def testtwobytes(self):
        # two bytes
        self.assertEqual(handlebytes.interpfirstbyte(b'\xd7'),
                         (2, 0x17))

    def testthreebytes(self):
        # three bytes
        self.assertEqual(handlebytes.interpfirstbyte(b'\xeb'),
                         (3, 0xb))

    def testfourbytes(self):
        # four bytes
        self.assertEqual(handlebytes.interpfirstbyte(b'\xf4'),
                         (4, 0x4))

    def testfivebytes(self):
        # five bytes
        self.assertEqual(handlebytes.interpfirstbyte(b'\xf9'),
                         (5, 0x1))

    def testsixbytes(self):
        # six bytes
        self.assertEqual(handlebytes.interpfirstbyte(b'\xfd'),
                         (6, 0x1))

    def testbeyondrange(self):
        # beyond range
        self.assertEqual(handlebytes.interpfirstbyte(b'\xfe'),
                         handlebytes.ERRORX)

class TestSubsqntByte(unittest.TestCase):
    def testgoodsubqbyte(self):
        self.assertEqual(handlebytes.interpsubsqntbyte(b'\x9b'), 0x1b)

    def testbadsubqbyte(self):
        # beyond range
        self.assertEqual(handlebytes.interpsubsqntbyte(b'\xc0'),
                         handlebytes.ERRORY)

if __name__ == '__main__':
    unittest.main()
The new run yields this output:
$ python3.1 unittest_handlebytesii.py
Closed file /usr/home/carl/pythonblog/foreignbytestest.
.Closed file /usr/home/carl/pythonblog/foreignbytestest.
..Not a valid first byte of a UTF-8 character sequence.
.Not a valid first byte of a UTF-8 character sequence.
......Not a valid byte for a UTF-8 multibyte sequence.
..
----------------------------------------------------------------------
Ran 12 tests in 0.002s
OK
$
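One way to avoid depending on a hard-coded path is to have setUp write the sample bytes to a temporary file and have tearDown remove it. The sketch below is self-contained and reads the file with plain open() rather than handlebytes.FileByter; the class name TestSampleFile, the run_tests helper, and the sample contents (assumed from the bytes seen in these posts) are all mine.

```python
import os
import tempfile
import unittest

# Assumed sample contents: three CJK characters, a newline, and an 'a'.
SAMPLE = '\u65e5\u672c\u8a9e\na'.encode('utf-8')


class TestSampleFile(unittest.TestCase):
    def setUp(self):
        # a fresh file for every test, so each test starts at byte 0
        handle, self.path = tempfile.mkstemp()
        with os.fdopen(handle, 'wb') as fle:
            fle.write(SAMPLE)
        self.fle = open(self.path, 'rb')

    def tearDown(self):
        self.fle.close()
        os.remove(self.path)

    def test_first_byte(self):
        self.assertEqual(self.fle.read(1), b'\xe6')

    def test_first_char(self):
        self.assertEqual(self.fle.read(3).decode('utf-8'), '\u65e5')


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSampleFile)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() runs both tests; because setUp and tearDown bracket every test method, the order in which unittest picks the tests no longer matters.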
Saturday, December 5, 2009
Python 3.1.1, Unicode, UTF-8, bytes
A while back I did two posts, one on Lao UTF-8 and one on Arabic UTF-8, using binary representation of bytes to demonstrate UTF-8 encoding in Python.
I still wasn't satisfied with my knowledge of Python 3.x and string handling, nor was the whole UTF-8 thing second nature to me.
On Monday, I'm giving a talk to our local Linux group about the subject of my Python poster session, Python 3.1, Unicode, and Languages of the Indian Ocean Region. In the immortal words of Dean Wormer, "Fat, drunk and stupid is no way to go through life, son." You know, it's no way to give a presentation, either. Time to get smarter now.
The code in this post is nothing but a verbose rehash of what Python's bytes object does already - decode. But understanding exactly what is going on is helpful. That's the beauty of having an interpreter in Python or LISP - you can check your assumptions and find out your mistakes as fast as you can type (it would help if I could type faster).
Basic things I learned in this exercise:
1) empty, one item, or many items, Python's bytes object is a sequence that must be iterated over or indexed to get byte values.
2) the bytes object has a decoding method, the string object has an encoding method; the two methods work together to switch back and forth between bytes and strings, or, less appropriately, between bytes and characters.
3) binary math goes way beyond decimal 255. Being able to read hexadecimal numbers and recognize significant ones is a real help when working with UTF-8 and Unicode.
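The first two points can be checked in a few lines. This is a minimal sketch; the variable names are mine, and the sample text is three CJK characters written with escapes so the source stays ASCII.

```python
text = '\u65e5\u672c\u8a9e'

# str.encode() goes from characters to bytes ...
data = text.encode('utf-8')
# ... and bytes.decode() goes back again.
round_trip = data.decode('utf-8')

# Indexing a bytes object yields an int; slicing yields bytes.
first_value = data[0]
first_slice = data[:1]

# Iterating yields ints as well, which makes hex dumps easy.
hex_dump = ' '.join('{0:02x}'.format(value) for value in data)

print(len(text), len(data))
print(first_value, first_slice)
print(hex_dump)
```

The int-versus-bytes difference between indexing and slicing is the reason the module below writes firstbyte[0] everywhere: read(1) on a binary file hands back a one-item bytes object, not a number.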
Enough qualifying and waffling. To the code and the interpreter!
"""
Utilities for dealing with bytes, UTF-8, and Unicode.
"""
# new in Python 3.1
from collections import OrderedDict
# gates for flag bytes of UTF-8
# 128 - just beyond ASCII
# '0b10000000'
ASCIICUTOFF = 0x80
# 192 - 2 byte designator for UTF-8 character
# '0b11000000'
TWOBYTER = 0xc0
CUTOFFS = OrderedDict()
# 224 - just beyond two byte representation
# '0b11100000'
CUTOFFS['TWOBYTES'] = 0xe0
# 240 - just beyond three byte representation
# '0b11110000'
CUTOFFS['THREEBYTES'] = 0xf0
# 248 - just beyond four byte representation
# '0b11111000'
CUTOFFS['FOURBYTES'] = 0xf8
# 252 - just beyond five byte representation
# '0b11111100'
CUTOFFS['FIVEBYTES'] = 0xfc
# 254 - end of the line
# '0b11111110'
CUTOFFS['SIXBYTES'] = 0xfe
# for binary math
THIRTYTWO = 0x20
THIRTY = 30
SIX = 6
TWO = 0x2
ONE = 0x1
ERRORX = 0, 0
ERRORY = -1
# on subsequent bytes, cannot have second to
# most significant (seventh) bit flipped on
# 63
FORBIDDEN = 0x3f
def interpfirstbyte(firstbyte):
    """
    In a UTF-8 byte sequence,
    function which interprets
    the first byte of a UTF-8
    encoded character.

    firstbyte is a bytes object of length one.

    Returns 2 tuple of number of
    bytes total in the UTF-8 character
    encoding and value of significant
    bits in first byte of UTF-8
    character encoding.

    interpfirstbyte(b'\\xf1') -> (4, 1)
    """
    twopow = THIRTYTWO
    if firstbyte[0] < ASCIICUTOFF:
        return ONE, firstbyte[0]
    elif firstbyte[0] < TWOBYTER:
        print('Not a valid first byte of a UTF-8 character sequence.')
        return ERRORX
    # characters beyond ASCII range
    counter = TWO
    for cutoff in CUTOFFS:
        if firstbyte[0] < CUTOFFS[cutoff]:
            # strip the flag bits, keep the payload bits
            return (counter, int(firstbyte[0] % (CUTOFFS[cutoff] - twopow)))
        # integer division keeps twopow an int
        twopow //= TWO
        counter += ONE
    print('Not a valid first byte of a UTF-8 character sequence.')
    return ERRORX
def interpsubsqntbyte(nextbyte):
    """
    In a multiple byte UTF-8 character
    representation, returns significant
    part of byte as an integer.

    nextbyte is a bytes object of length one.

    interpsubsqntbyte(b'\\x81') -> 1
    """
    retval = int(nextbyte[0] % ASCIICUTOFF)
    if retval > FORBIDDEN:
        print('Not a valid byte for a UTF-8 multibyte sequence.')
        return ERRORY
    return retval
class FileByter:
    """
    Attempt to put UTF-8 readable file
    object into a class that processes
    one byte, then one character at a time.
    """
    def __init__(self, filenamex):
        self.filename = filenamex
        self.fle = open(self.filename, 'rb')
        self.currentbyte = None
        self.numbytes = 0
        self.modx = 0
        self.currentcharord = -1
        self.charbytes = b''

    def gimmebyte(self):
        """
        Assigns the next byte in the file
        being read to self.currentbyte.
        Closes file and assigns currentbyte
        value of None at end of file.
        """
        self.currentbyte = self.fle.read(1)
        if len(self.currentbyte) == 0:
            self.fle.close()
            print('Closed file {0}.'.format(self.filename))
            self.currentbyte = None

    def readchar(self):
        """
        Reads the bytes of the next character
        into self.charbytes and its code point
        into self.currentcharord.
        """
        counter = 0
        powx = 0
        self.charbytes = b''
        self.gimmebyte()
        if self.currentbyte:
            self.numbytes, self.modx = interpfirstbyte(self.currentbyte)
            if (self.numbytes, self.modx) == ERRORX:
                print('Not on first byte of UTF-8 char sequence. Try again.')
                return None
            self.charbytes += self.currentbyte
            if self.numbytes == ONE:
                # single byte (ASCII): the byte value is the code point
                self.currentcharord = self.modx
                return None
            counter = self.numbytes
            # bit position of the first byte's payload
            powx = THIRTY - (SIX - counter) * SIX
            self.currentcharord = self.modx * TWO ** powx
            powx -= SIX
            while counter > ONE:
                self.gimmebyte()
                self.charbytes += self.currentbyte
                self.modx = int(interpsubsqntbyte(self.currentbyte))
                if self.modx == ERRORY:
                    print('Invalid subsequent byte in UTF-8 sequence.')
                    return None
                self.currentcharord += int(self.modx * TWO ** powx)
                counter -= ONE
                powx -= SIX
I'll work a little with this file of some foreign language strings in the screenshot below:
Python 3.1.1 (r311:74480, Nov 29 2009, 22:24:25)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "copyright", "credits" or "license()" for more information.
>>> import sys
>>> sys.path += ['/usr/home/carl/pythonblog']
>>> import handlebytes
>>> fbx = handlebytes.FileByter('/usr/home/carl/pythonblog/foreignbytestest')
>>> fbx.readchar()
>>> fbx.charbytes
b'\xe6\x97\xa5'
>>> bin(0xe6)
'0b11100110'
>>> # three bytes in the character
>>> somejapanesecharacter = fbx.charbytes.decode('UTF-8')
>>> import unicodedata
>>> unicodedata.name(somejapanesecharacter)
'CJK UNIFIED IDEOGRAPH-65E5'
>>> # OK - let's try one byte at a time
>>> fbx.gimmebyte()
>>> fbx.currentbyte
b'\xe6'
>>> bin(0xe6)
'0b11100110'
>>> # 3 bytes again
>>> fbx.gimmebyte()
>>> fbx.currentbyte
b'\x9c'
>>> bin(0x9c)
'0b10011100'
>>> # first two bits are 10 as expected
>>> fbx.readchar()
Not a valid first byte of a UTF-8 character sequence.
>>> # third byte - can't have 10 bits leading UTF-8 first byte
>>> fbx.readchar()
>>> fbx.charbytes
b'\xe8\xaa\x9e'
>>> # OK - on to an ASCII char
>>> fbx.readchar()
>>> fbx.charbytes
b'\n'
>>> fbx.readchar()
>>> fbx.charbytes
b'a'
>>> # ASCII characters show up as themselves, not as hex numbers.
That's a lot of activity just to understand a few simple concepts. It helps me. Hopefully someone else can benefit from it too.
Final notes:
1) I like the collections.SortedDict object; it saves me having to work with a sorting key and an associated list. I believe Ronacher and Hettinger were responsible for getting it into the standard lib. Thanks, here's to people with German names from outside of Germany (Austria and America respectively).
2) The only characters I'm familiar with that have a four byte representation in UTF-8 are musical notes. They don't show up in idle, so I left that part of the testing of this go. Everything here is three bytes or less.
I still wasn't satisfied with my knowledge of Python 3.x and string handling, nor was the whole UTF-8 thing second nature to me.
On Monday, I'm giving a talk to our local Linux group on the subject of my Python poster session, "Python 3.1, Unicode, and Languages of the Indian Ocean Region." In the immortal words of Dean Wormer, "Fat, drunk and stupid is no way to go through life, son." You know, it's no way to give a presentation, either. Time to get smarter now.
The code in this post is nothing but a verbose rehash of what Python's bytes object already does with its decode method. But understanding exactly what is going on is helpful. That's the beauty of having an interpreter in Python or LISP - you can check your assumptions and find out your mistakes as fast as you can type (it would help if I could type faster).
Basic things I learned in this exercise:
1) Empty, one item, or many items, Python's bytes object is a sequence; it must be iterated over or indexed to get byte values, and each value comes back as an integer.
2) The bytes object has a decode method and the string object has an encode method; the two methods work together to switch back and forth between bytes and strings, or, less appropriately, between bytes and characters.
3) Binary math goes way beyond decimal 255. Being able to read hexadecimal numbers and recognize significant ones (0x80, 0xc0, 0xe0, 0xf0) is a real help when working with UTF-8 and Unicode.
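A few lines at the interpreter make those three points concrete. This is only a sketch: the sample string is my assumption about the kind of text involved (three kanji, a newline, and an ASCII letter), not a copy of any particular file.

```python
# Sample text is an assumption chosen for illustration.
text = "日本語\na"
data = text.encode("utf-8")       # str -> bytes

first_value = data[0]             # indexing a bytes object gives an int
first_slice = data[0:1]           # slicing gives a bytes object of length one
byte_values = list(data[:3])      # iterating also gives ints

print(first_value, hex(first_value), bin(first_value))
print(first_slice)
print(byte_values)
print(data.decode("utf-8") == text)   # bytes -> str round trip
```

Indexing versus slicing is the part that tripped me up: data[0] is a number, while data[0:1] is still a bytes object.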
Enough qualifying and waffling. To the code and the interpreter!
"""
Utilities for dealing with bytes, UTF-8, and Unicode.
"""

# OrderedDict is new in Python 3.1
from collections import OrderedDict

# gates for flag bytes of UTF-8

# 128 - just beyond ASCII
# '0b10000000'
ASCIICUTOFF = 0x80

# 192 - 2 byte designator for UTF-8 character
# '0b11000000'
TWOBYTER = 0xc0

# upper limits for the first byte of each sequence length;
# the five and six byte forms come from the original UTF-8
# design (RFC 3629 now stops at four bytes)
CUTOFFS = OrderedDict()

# 224 - just beyond two byte representation
# '0b11100000'
CUTOFFS['TWOBYTES'] = 0xe0

# 240 - just beyond three byte representation
# '0b11110000'
CUTOFFS['THREEBYTES'] = 0xf0

# 248 - just beyond four byte representation
# '0b11111000'
CUTOFFS['FOURBYTES'] = 0xf8

# 252 - just beyond five byte representation
# '0b11111100'
CUTOFFS['FIVEBYTES'] = 0xfc

# 254 - end of the line
# '0b11111110'
CUTOFFS['SIXBYTES'] = 0xfe

# for binary math
THIRTYTWO = 0x20
THIRTY = 30
SIX = 6
TWO = 0x2
ONE = 0x1

# error return values
ERRORX = 0, 0
ERRORY = -1

# on subsequent bytes, cannot have second to
# most significant (seventh) bit flipped on
# 63
FORBIDDEN = 0x3f

def interpfirstbyte(firstbyte):
    """
    In a UTF-8 byte sequence,
    function which interprets
    the first byte of a UTF-8
    encoded character.
    Returns 2 tuple of number of
    bytes total in the UTF-8 character
    encoding and value of significant
    bits in first byte of UTF-8
    character encoding.
    firstbyte is a bytes object of length one.
    interpfirstbyte(bytes([0xf1])) -> (4, 1)
    """
    twopow = THIRTYTWO
    if firstbyte[0] < ASCIICUTOFF:
        # plain ASCII - one byte, value is the byte itself
        return ONE, firstbyte[0]
    elif firstbyte[0] < TWOBYTER:
        # 0b10xxxxxx can only be a subsequent byte
        print('Not a valid first byte of a UTF-8 character sequence.')
        return ERRORX
    # characters beyond ASCII range
    counter = TWO
    for cutoff in CUTOFFS:
        if firstbyte[0] < CUTOFFS[cutoff]:
            # strip the flag bits, keep the significant bits
            return counter, firstbyte[0] % (CUTOFFS[cutoff] - twopow)
        # integer division keeps twopow an int in Python 3
        twopow //= TWO
        counter += ONE
    print('Not a valid first byte of a UTF-8 character sequence.')
    return ERRORX

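def interpsubsqntbyte(nextbyte):
    """
    In a multiple byte UTF-8 character
    representation, returns significant
    part of byte as an integer.
    nextbyte is a bytes object of length one.
    interpsubsqntbyte(bytes([0x81])) -> 1
    """
    retval = nextbyte[0] % ASCIICUTOFF
    # a subsequent byte must look like 0b10xxxxxx: reject ASCII
    # bytes (below 0x80) and anything with the seventh bit set
    if nextbyte[0] < ASCIICUTOFF or retval > FORBIDDEN:
        print('Not a valid byte for a UTF-8 multibyte sequence.')
        return ERRORY
    return retval
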
class FileByter:
    """
    Attempt to put UTF-8 readable file
    object into a class that processes
    one byte, then one character at a time.
    """
    def __init__(self, filenamex):
        self.filename = filenamex
        self.fle = open(self.filename, 'rb')
        self.currentbyte = None
        self.numbytes = 0
        self.modx = 0
        self.currentcharord = -1
        self.charbytes = b''

    def gimmebyte(self):
        """
        Assigns the next byte in the file
        being read to self.currentbyte.
        Closes file and assigns currentbyte
        value of None at end of file.
        """
        if self.fle.closed:
            # already hit end of file on an earlier call
            self.currentbyte = None
            return
        self.currentbyte = self.fle.read(1)
        if len(self.currentbyte) == 0:
            self.fle.close()
            print('Closed file {0}.'.format(self.filename))
            self.currentbyte = None

    def readchar(self):
        """
        Reads one UTF-8 character from the file.
        Its bytes end up in self.charbytes and
        its code point in self.currentcharord.
        Always returns None.
        """
        self.charbytes = b''
        self.gimmebyte()
        if not self.currentbyte:
            return None
        self.numbytes, self.modx = interpfirstbyte(self.currentbyte)
        if (self.numbytes, self.modx) == ERRORX:
            # interpfirstbyte has already printed the complaint
            return None
        self.charbytes += self.currentbyte
        if self.numbytes == ONE:
            # ASCII - the byte value is the code point
            self.currentcharord = self.modx
            return None
        counter = self.numbytes
        # each subsequent byte carries six bits; shift the
        # first byte's bits left past all of them
        powx = THIRTY - (SIX - counter) * SIX
        self.currentcharord = self.modx * TWO ** powx
        powx -= SIX
        while counter > ONE:
            self.gimmebyte()
            if self.currentbyte is None:
                print('File ended in the middle of a UTF-8 sequence.')
                return None
            self.charbytes += self.currentbyte
            self.modx = interpsubsqntbyte(self.currentbyte)
            if self.modx == ERRORY:
                print('Invalid subsequent byte in UTF-8 sequence.')
                return None
            self.currentcharord += self.modx * TWO ** powx
            counter -= ONE
            powx -= SIX

I'll work a little with this file of some foreign language strings in the screenshot below:
Python 3.1.1 (r311:74480, Nov 29 2009, 22:24:25)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "copyright", "credits" or "license()" for more information.
>>> import sys
>>> sys.path += ['/usr/home/carl/pythonblog']
>>> import handlebytes
>>> fbx = handlebytes.FileByter('/usr/home/carl/pythonblog/foreignbytestest')
>>> fbx.readchar()
>>> fbx.charbytes
b'\xe6\x97\xa5'
>>> bin(0xe6)
'0b11100110'
>>> # three bytes in the character
>>> somejapanesecharacter = fbx.charbytes.decode('UTF-8')
>>> import unicodedata
>>> unicodedata.name(somejapanesecharacter)
'CJK UNIFIED IDEOGRAPH-65E5'
>>> # OK - let's try one byte at a time
>>> fbx.gimmebyte()
>>> fbx.currentbyte
b'\xe6'
>>> bin(0xe6)
'0b11100110'
>>> # 3 bytes again
>>> fbx.gimmebyte()
>>> fbx.currentbyte
b'\x9c'
>>> bin(0x9c)
'0b10011100'
>>> # first two bits are 10 as expected
>>> fbx.readchar()
Not a valid first byte of a UTF-8 character sequence.
>>> # third byte - can't have 10 bits leading UTF-8 first byte
>>> fbx.readchar()
>>> fbx.charbytes
b'\xe8\xaa\x9e'
>>> # OK - on to an ASCII char
>>> fbx.readchar()
>>> fbx.charbytes
b'\n'
>>> fbx.readchar()
>>> fbx.charbytes
b'a'
>>> # ASCII characters show up as themselves, not as hex numbers.
That's a lot of activity just to understand a few simple concepts. It helps me. Hopefully someone else can benefit from it too.
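For comparison, the same arithmetic can be condensed with bit masks and shifts. This is a rough sketch and not the module above: lead_byte_info and decode_one are names I made up for the illustration, the sketch stops at the four byte limit of modern UTF-8, and it does not reject overlong encodings or surrogates the way the built-in decode does.

```python
def lead_byte_info(byte):
    """Return (sequence length, significant bits) for a UTF-8 first byte."""
    if byte < 0x80:
        return 1, byte
    if 0xC0 <= byte < 0xE0:
        return 2, byte & 0x1F
    if 0xE0 <= byte < 0xF0:
        return 3, byte & 0x0F
    if 0xF0 <= byte < 0xF8:
        return 4, byte & 0x07
    raise ValueError('not a valid first byte: {0:#x}'.format(byte))


def decode_one(data, start=0):
    """Return (code point, bytes consumed) for the character at data[start]."""
    length, codepoint = lead_byte_info(data[start])
    rest = data[start + 1:start + length]
    if len(rest) != length - 1:
        raise ValueError('sequence is cut short')
    for byte in rest:
        # subsequent bytes must look like 0b10xxxxxx
        if byte & 0xC0 != 0x80:
            raise ValueError('not a valid subsequent byte: {0:#x}'.format(byte))
        # make room for six more bits, then add them
        codepoint = (codepoint << 6) | (byte & 0x3F)
    return codepoint, length


codepoint, used = decode_one(b'\xe6\x97\xa5')
print(hex(codepoint), used)
print(chr(codepoint) == b'\xe6\x97\xa5'.decode('utf-8'))
```

Checking the result against decode is the quickest way to catch a mistake in the shifting.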
Final notes:
1) I like the collections.OrderedDict object; it saves me having to work with a sorting key and an associated list. I believe Ronacher and Hettinger were responsible for getting it into the standard lib. Thanks, and here's to people with German names from outside of Germany (Austria and America respectively).
2) The only characters I'm familiar with that have a four byte representation in UTF-8 are musical notes. They don't show up in IDLE, so I let that part of the testing go. Everything here is three bytes or less.
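The four byte case can still be checked without displaying the character. A small sketch, using U+1D11E as the example (my choice of character, not one from the test file); any code point above U+FFFF takes four bytes in UTF-8.

```python
import unicodedata

# U+1D11E is used here only as an example of a four byte character.
clef = "\U0001D11E"
encoded = clef.encode("utf-8")

print(encoded)                  # the raw bytes of the encoding
print(len(encoded))             # number of bytes in the sequence
print(bin(encoded[0]))          # first byte carries the 11110 flag bits
print(unicodedata.name(clef))
```

Printing the bytes and the Unicode name sidesteps the font problem entirely.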