Saturday, October 23, 2010

jython, regular expressions, and unicode

Jython gives access to Java's regular expression classes and methods.  One feature of Java's regular expression library that Python's re module lacks is the ability to match on Unicode general categories (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt).  The categories go by two-letter abbreviations:  Mn = nonspacing mark, Lu = uppercase letter, etc.
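To make the abbreviations concrete, here is a small sketch that asks the standard unicodedata module for the general category of a few characters. It is written for CPython 3 rather than Jython, and the sample characters are my own picks for illustration.

```python
import unicodedata

# A handful of sample characters (chosen only for illustration).
samples = ['A', 'a', '5', '\u0915', '\u0301']

# Map each character to its two-letter general category.
categories = dict((ch, unicodedata.category(ch)) for ch in samples)

for ch in samples:
    # Print the code point, the category abbreviation and the character name.
    print('U+%04X %s %s' % (ord(ch), categories[ch], unicodedata.name(ch)))
```

The first letter of each abbreviation gives the major class (L for letter, M for mark, N for number) and the second letter narrows it down.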

Here is a quick example for Mn (non-spacing).

$ /usr/local/jdk-1.7.0/bin/java -jar jython.jar
Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54)
[OpenJDK Client VM (Sun Microsystems Inc.)] on java1.7.0-internal
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.util import regex
>>> nonspacingx = regex.Pattern.compile(r'\p{Mn}')
>>> import unicodedata
>>> for charcode in range(0x900, 0xA00):
...     mtchx = nonspacingx.matcher(unichr(charcode))
...     if mtchx.matches():
...         print 'match at character %X, %s' % (charcode, unicodedata.name(unichr(charcode)))
...
match at character 901, DEVANAGARI SIGN CANDRABINDU
match at character 902, DEVANAGARI SIGN ANUSVARA
match at character 93C, DEVANAGARI SIGN NUKTA
match at character 941, DEVANAGARI VOWEL SIGN U
match at character 942, DEVANAGARI VOWEL SIGN UU
match at character 943, DEVANAGARI VOWEL SIGN VOCALIC R
match at character 944, DEVANAGARI VOWEL SIGN VOCALIC RR
match at character 945, DEVANAGARI VOWEL SIGN CANDRA E
match at character 946, DEVANAGARI VOWEL SIGN SHORT E
match at character 947, DEVANAGARI VOWEL SIGN E
match at character 948, DEVANAGARI VOWEL SIGN AI
match at character 94D, DEVANAGARI SIGN VIRAMA
match at character 951, DEVANAGARI STRESS SIGN UDATTA
match at character 952, DEVANAGARI STRESS SIGN ANUDATTA
match at character 953, DEVANAGARI GRAVE ACCENT
match at character 954, DEVANAGARI ACUTE ACCENT
match at character 962, DEVANAGARI VOWEL SIGN VOCALIC L
match at character 963, DEVANAGARI VOWEL SIGN VOCALIC LL
match at character 981, BENGALI SIGN CANDRABINDU
match at character 9BC, BENGALI SIGN NUKTA
match at character 9C1, BENGALI VOWEL SIGN U
match at character 9C2, BENGALI VOWEL SIGN UU
match at character 9C3, BENGALI VOWEL SIGN VOCALIC R
match at character 9C4, BENGALI VOWEL SIGN VOCALIC RR
match at character 9CD, BENGALI SIGN VIRAMA
match at character 9E2, BENGALI VOWEL SIGN VOCALIC L
match at character 9E3, BENGALI VOWEL SIGN VOCALIC LL
>>> 

That's nice, but how about a less contrived example?  I got a Bengali word off one of the links on the BengaliLanguage page on the Python Wiki.  The word is saved to a file named bengalisnippet.
All I have to do is open the file, get the line and let my regex rip, right?

>>> fle = open('bengalisnippet', 'r')
>>> linex = fle.readline()
>>> linex = linex.decode('utf-8')
>>> mtchx = nonspacingx.matcher(linex)
>>> mtchx.matches()
False
>>>

Um, no.

Let's investigate and try this again.

>>> linex
u'\u0995\u09bf\u099b\u09c1\n'
>>> unicodedata.category(linex[0])
'Lo'
>>> unicodedata.category(linex[1])
'Mc'
>>> unicodedata.category(linex[2])
'Lo'
>>> unicodedata.category(linex[3])
'Mn'

OK, the character we're looking for is the last one (except for the trailing newline).

>>> mtchx.find()
True
>>> mtchx.start()
3
>>> mtchx.end()
4
>>>

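I was using the wrong method.  matches succeeds only when the entire input matches the pattern; find scans the input for the next subsequence that matches, which makes it analogous to search in Python.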
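The same whole-string versus scanning distinction shows up in CPython's own re module, where fullmatch plays the part of Java's matches and search the part of find. The sketch below is CPython 3, not the Jython session above. Since re has no \p{Mn}, it builds a character class from unicodedata instead; limiting the class to the 0x900-0x9FF range is my assumption to keep the pattern small, and the variable names are made up for the example.

```python
import re
import unicodedata

# Collect the nonspacing marks (category Mn) in the Devanagari/Bengali range.
mn_chars = [chr(cp) for cp in range(0x900, 0xA00)
            if unicodedata.category(chr(cp)) == 'Mn']

# Stand-in for Java's \p{Mn}, valid only for the range scanned above.
nonspacing = re.compile('[' + ''.join(re.escape(ch) for ch in mn_chars) + ']')

# The Bengali word from the post, including its trailing newline.
line = '\u0995\u09bf\u099b\u09c1\n'

# fullmatch requires the whole string to match, so this gives None.
whole = nonspacing.fullmatch(line)

# search scans for the first match anywhere in the string.
found = nonspacing.search(line)

print(whole)
print(found.start(), found.end())
```

As in the Jython session, the whole-string match fails and the scan reports the vowel sign at positions 3 to 4.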

For me the utility of this is being able to determine if characters are rendering correctly.  I can locate the trouble spots in an unfamiliar language's script and investigate them (combining and non-spacing characters don't always show up correctly).
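As a rough illustration of that kind of check, the sketch below lists every mark in a string with its position, code point, category and name. It is CPython 3 using only unicodedata, so it does not depend on Java's regex classes; combining_marks is a hypothetical helper name, and treating Mn, Mc and Me together as the "trouble spots" is my assumption.

```python
import unicodedata

def combining_marks(text):
    """Return (index, code point, category, name) for each mark in text."""
    found = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Mn = nonspacing mark, Mc = spacing combining mark, Me = enclosing mark
        if category in ('Mn', 'Mc', 'Me'):
            name = unicodedata.name(ch, 'UNNAMED')
            found.append((index, ord(ch), category, name))
    return found

# The Bengali word from the post, including its trailing newline.
word = '\u0995\u09bf\u099b\u09c1\n'

for index, codepoint, category, name in combining_marks(word):
    print('%d U+%04X %s %s' % (index, codepoint, category, name))
```

For the word above this reports the two vowel signs at positions 1 and 3, which are the places to look at if the rendering seems off.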

Java regular expressions are a bit more involved than Python's.  This is one case where the extra effort required may be worth the trouble.
                                       
