Monday, October 25, 2010

Regular Expression Unicode Blocks in IronPython and jython

One last thing that's available in jython and IronPython, but not in CPython regular expressions, is Unicode Blocks.  Blocks are similar to Unicode Scripts, but do not correspond one-to-one with them.  Blocks, as the name implies, represent contiguous ranges of Unicode code points.  This page, recommended to me by artisonian on Twitter, has a good synopsis.
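Because a block is just a contiguous range of code points, the idea can be approximated in CPython's re module with an explicit character class. The following is a minimal sketch, not a replacement for real block support: the range U+0980 to U+09FF is the Bengali block from the Unicode charts, the names BENGALI_BLOCK and first_bengali_run are made up for illustration, and the sample string is the one that appears in the sessions below.

```python
import re

# Assumption: the Bengali block spans U+0980 through U+09FF.
# CPython's re module has no \p{...} block syntax, so the range is spelled out.
BENGALI_BLOCK = re.compile(u'[\u0980-\u09FF]+')


def first_bengali_run(text):
    """Return (start, end, matched text) for the first run of Bengali-block
    characters in text, or None if there is no such run."""
    m = BENGALI_BLOCK.search(text)
    if m is None:
        return None
    return (m.start(), m.end(), m.group())


# Same four characters (plus trailing newline) used in the sessions below.
sample = u'\u0995\u09bf\u099b\u09c1\n'
result = first_bengali_run(sample)
print(result)
```

Hand-writing the range works for a single block, but it has to be looked up and maintained for every block of interest, which is the convenience the named blocks in .NET and Java provide.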

Unicode Blocks are most useful where they correspond best with Unicode Scripts, which is in the South Asian languages (India and vicinity).  Here is some code written for the detection of Bengali characters in a string in IronPython and jython.  The syntax is similar, but not identical: .NET names the block \p{IsBengali}, while Java names it \p{InBengali}.

IronPython

/bin/mono /home/carl/IronPython-2.0.3/ipy.exe
IronPython 2.0.3 (2.0.0.0) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
>>> from System.Text import RegularExpressions as regex
>>> fle = open('bengalisnippet', 'r')
>>> linex = fle.readline()
>>> fle.close()
>>> rex = regex.Regex(r'\p{IsBengali}+')
>>> linex = linex.decode('utf-8')
>>> mtchx = rex.Match(linex)
>>> mtchx.ToString()
u'\u0995\u09bf\u099b\u09c1'
>>> mtchx.Success
True
>>>

jython

Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54)
[OpenJDK Client VM (Sun Microsystems Inc.)] on java1.7.0-internal
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.util import regex
>>> rex = regex.Pattern.compile(r'\p{InBengali}+')
>>> fle = open('bengalisnippet', 'r')
>>> linex = fle.readline()
>>> linex = linex.decode('utf-8')
>>> mtchx = rex.matcher(linex)
>>> mtchx
java.util.regex.Matcher[pattern=\p{InBengali}+ region=0,5 lastmatch=]
>>> mtchx.find()
True
>>> mtchx.start()
0
>>> mtchx.end()
4
>>> linex
u'\u0995\u09bf\u099b\u09c1\n'
>>>
