pyright: October 2010

Monday, October 25, 2010

Regular Expression Unicode Blocks in IronPython and jython

One last thing that's available in jython and IronPython, but not in CPython's regular expressions, is Unicode blocks. Blocks are similar to Unicode scripts, but do not correspond one-to-one with them. Blocks, as the name implies, represent contiguous ranges of Unicode code points. This page, recommended to me by artisonian on Twitter, has a good synopsis.

Unicode blocks are most useful where they correspond best with Unicode scripts, as in the South Asian languages (India and vicinity). Here is some code for detecting Bengali characters in a string in IronPython and jython. The syntax is similar: .NET spells the block \p{IsBengali}, while Java spells it \p{InBengali}.

IronPython

/bin/mono /home/carl/IronPython-2.0.3/ipy.exe <
IronPython 2.0.3 (2.0.0.0) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
>>> from System.Text import RegularExpressions as regex
>>> fle = open('bengalisnippet', 'r')
>>> linex = fle.readline()
>>> fle.close()
>>> rex = regex.Regex(r'\p{IsBengali}+')
>>> linex = linex.decode('utf-8')
>>> mtchx = rex.Match(linex)
>>> mtchx.ToString()
u'\u0995\u09bf\u099b\u09c1'
>>> mtchx.Success
True
>>>

jython

Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54)
[OpenJDK Client VM (Sun Microsystems Inc.)] on java1.7.0-internal
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.util import regex
>>> rex = regex.Pattern.compile(r'\p{InBengali}+')
>>> fle = open('bengalisnippet', 'r')
>>> linex = fle.readline()
>>> linex = linex.decode('utf-8')
>>> mtchx = rex.matcher(linex)
>>> mtchx
java.util.regex.Matcher[pattern=\p{InBengali}+ region=0,5 lastmatch=]
>>> mtchx.find()
True
>>> mtchx.start()
0
>>> mtchx.end()
4
>>> linex
u'\u0995\u09bf\u099b\u09c1\n'
>>>
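
For comparison, CPython's re module has no block syntax, but a block is just a contiguous range of code points, so a character class can stand in for it. This is only a rough sketch in Python 3 (not the Python 2 dialect used in the sessions above); the range U+0980 to U+09FF for the Bengali block and the helper name first_bengali_run are my assumptions, not something taken from either library.

```python
import re

# Assumption: the Bengali block covers U+0980..U+09FF in the Unicode block list.
BENGALI_BLOCK = re.compile('[\u0980-\u09FF]+')


def first_bengali_run(text):
    """Return (start, end) of the first run of Bengali-block characters, or None."""
    match = BENGALI_BLOCK.search(text)
    return match.span() if match else None


# Same word as in the sessions above, including the trailing newline.
line = '\u0995\u09bf\u099b\u09c1\n'
print(first_bengali_run(line))               # (0, 4)
print(first_bengali_run('no Bengali here'))  # None
```

The span (0, 4) lines up with the start() and end() values in the jython session.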

Saturday, October 23, 2010

IronPython, unicode, and regular expressions

This is a quick follow-up to my last post on jython.

The basic idea is that .NET can search for characters belonging to Unicode general categories (in this case Mn, nonspacing mark).

IronPython 2.0.3 (2.0.0.0) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> from System.Text import RegularExpressions as regex
>>> nonspacingx = regex.Regex(r'\p{Mn}')
>>> ns = unichr(0x9C1)
>>> ns
u'\u09c1'
>>> nonspacingx.Match(ns)
<System.Text.RegularExpressions.Match object at 0x000000000000002B [?]>
>>> ns = u'a' + ns
>>> ns
u'a\u09c1'
>>> mtchx = nonspacingx.Match(ns)
>>> mtchx.ToString()
u'\u09c1'
>>> mtchx.Index
1
>>> mtchx.Length
1
>>> mtchx.Success
True
>>>

Although the class and method names are different, Java and .NET both provide a means of using general categories in regular expressions. Match in .NET finds the first occurrence anywhere in the string, not just at the start. Success is the Boolean property indicating whether a match was found.

As an aside, the unicodedata module referenced in the jython post is available for IronPython. It is not in the download for either IronPython or FePy, but is available as a separate download from the FePy site.

jython, regular expressions, and unicode

Jython enables access to Java's regular expression classes and methods. One feature of Java's regular expression library that Python's re module does not have is the ability to search on Unicode general categories (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt). These are abbreviations: Mn = nonspacing mark, Lu = uppercase letter, etc.

Here is a quick example for Mn (non-spacing).

$ /usr/local/jdk-1.7.0/bin/java -jar jython.jar
Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54)
[OpenJDK Client VM (Sun Microsystems Inc.)] on java1.7.0-internal
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.util import regex
>>> nonspacingx = regex.Pattern.compile(r'\p{Mn}')
>>> import unicodedata
>>> for charcode in range(0x900, 0xA00):
...     mtchx = nonspacingx.matcher(unichr(charcode))
...     if mtchx.matches():
...         print 'match at character %X, %s' % (charcode, unicodedata.name(unichr(charcode)))
...
match at character 901, DEVANAGARI SIGN CANDRABINDU
match at character 902, DEVANAGARI SIGN ANUSVARA
match at character 93C, DEVANAGARI SIGN NUKTA
match at character 941, DEVANAGARI VOWEL SIGN U
match at character 942, DEVANAGARI VOWEL SIGN UU
match at character 943, DEVANAGARI VOWEL SIGN VOCALIC R
match at character 944, DEVANAGARI VOWEL SIGN VOCALIC RR
match at character 945, DEVANAGARI VOWEL SIGN CANDRA E
match at character 946, DEVANAGARI VOWEL SIGN SHORT E
match at character 947, DEVANAGARI VOWEL SIGN E
match at character 948, DEVANAGARI VOWEL SIGN AI
match at character 94D, DEVANAGARI SIGN VIRAMA
match at character 951, DEVANAGARI STRESS SIGN UDATTA
match at character 952, DEVANAGARI STRESS SIGN ANUDATTA
match at character 953, DEVANAGARI GRAVE ACCENT
match at character 954, DEVANAGARI ACUTE ACCENT
match at character 962, DEVANAGARI VOWEL SIGN VOCALIC L
match at character 963, DEVANAGARI VOWEL SIGN VOCALIC LL
match at character 981, BENGALI SIGN CANDRABINDU
match at character 9BC, BENGALI SIGN NUKTA
match at character 9C1, BENGALI VOWEL SIGN U
match at character 9C2, BENGALI VOWEL SIGN UU
match at character 9C3, BENGALI VOWEL SIGN VOCALIC R
match at character 9C4, BENGALI VOWEL SIGN VOCALIC RR
match at character 9CD, BENGALI SIGN VIRAMA
match at character 9E2, BENGALI VOWEL SIGN VOCALIC L
match at character 9E3, BENGALI VOWEL SIGN VOCALIC LL
>>>
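
The same scan can be approximated in plain CPython without \p{Mn} by asking unicodedata for each character's general category. This is a sketch in Python 3; the helper name nonspacing_marks is made up for illustration, and the exact list printed depends on the Unicode version bundled with the interpreter, so it may differ slightly from the jython output above.

```python
import unicodedata


def nonspacing_marks(start, stop):
    """Yield (code point, name) for each character in range(start, stop)
    whose Unicode general category is Mn (nonspacing mark)."""
    for codepoint in range(start, stop):
        char = chr(codepoint)
        if unicodedata.category(char) == 'Mn':
            # Fall back to a placeholder for characters without a name.
            yield codepoint, unicodedata.name(char, '<unnamed>')


# Same range as the jython loop: the Devanagari and Bengali blocks.
for codepoint, name in nonspacing_marks(0x900, 0xA00):
    print('match at character %X, %s' % (codepoint, name))
```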

That's nice, but how about a less contrived example? I got a Bengali word off one of the links on the BengaliLanguage page on the Python Wiki. The word is saved to a file named bengalisnippet.

All I have to do is open the file, get the line and let my regex rip, right?

>>> fle = open('bengalisnippet', 'r')
>>> linex = fle.readline()
>>> linex = linex.decode('utf-8')
>>> mtchx = nonspacingx.matcher(linex)
>>> mtchx.matches()
False
>>>

Um, no.

Let's investigate and try this again.

>>> linex
u'\u0995\u09bf\u099b\u09c1\n'
>>> unicodedata.category(linex[0])
'Lo'
>>> unicodedata.category(linex[1])
'Mc'
>>> unicodedata.category(linex[2])
'Lo'
>>> unicodedata.category(linex[3])
'Mn'

OK, the character we're looking for is the last one (except for the newline character).

>>> mtchx.find()
True
>>> mtchx.start()
3
>>> mtchx.end()
4
>>>

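I was using the wrong method. matches succeeds only if the entire input matches the pattern, while find scans for the next occurrence anywhere in the input, much like search in Python.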
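
The same distinction exists in CPython, where re.fullmatch behaves roughly like Java's matches and re.search like find. The sketch below is Python 3 and works around the missing \p{Mn} by building a character class from unicodedata; the name MN_BENGALI and the choice to cover only the Bengali range U+0980 to U+09FF are my assumptions for the sake of a small example.

```python
import re
import unicodedata

# Build a character class holding every Mn character in the assumed
# Bengali range U+0980..U+09FF (a stand-in for Java's \p{Mn}).
mn_chars = [chr(cp) for cp in range(0x0980, 0x0A00)
            if unicodedata.category(chr(cp)) == 'Mn']
MN_BENGALI = re.compile('[' + ''.join(mn_chars) + ']')

line = '\u0995\u09bf\u099b\u09c1\n'

# fullmatch, like Java's matches(), needs the whole string to match.
print(MN_BENGALI.fullmatch(line))      # None

# search, like Java's find(), scans for the first occurrence.
print(MN_BENGALI.search(line).span())  # (3, 4)
```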

For me the utility of this is being able to determine if characters are rendering correctly. I can locate the trouble spots in an unfamiliar language's script and investigate them (combining and non-spacing characters don't always show up correctly).

Java regular expressions are a bit more involved than Python's. This is one case where the extra effort required may be worth the trouble.

Monday, October 18, 2010

java.lang.String.matches method

Recently I've been working on learning regular expressions. I found something about the Java implementation (used here from jython) curious.

Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54)
[OpenJDK Client VM (Sun Microsystems Inc.)] on java1.7.0-internal
Type "help", "copyright", "credits" or "license" for more information.
>>> from java.lang import String
>>> teststring = String('def hello():')
>>> teststring.matches(r'\s*def\s+\w*\(\):$')
True
>>>

Python has the re.match and re.search functions, and C# has something similar. String.matches just seemed like a strange, less efficient construct (presumably the regular expression is recompiled on every call instead of being compiled once and reused). Go figure.
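
For comparison, here is roughly how the compile-once approach looks in CPython. This is a sketch in Python 3; the names DEF_LINE and is_def_line are hypothetical, and re.fullmatch is used as the nearest counterpart to a whole-string test like String.matches.

```python
import re

# Compile the pattern once and reuse it for every line tested.
DEF_LINE = re.compile(r'\s*def\s+\w*\(\):$')


def is_def_line(line):
    """True if the whole line is an argument-less def header."""
    return DEF_LINE.fullmatch(line) is not None


print(is_def_line('def hello():'))      # True
print(is_def_line('x = def hello():'))  # False

# search only needs the pattern to occur somewhere in the string.
print(DEF_LINE.search('x = def hello():') is not None)  # True
```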

Sunday, October 3, 2010

Second JavaScript Attempt - the Zen of Python (again)

Last time, in my enthusiasm, I published some not-ready-for-prime-time HTML/JavaScript code. Since then the W3C validator has helped me to see the error of my ways. This is my second shot at making the first part of the Zen of Python magically appear in a web browser:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<HTML>
<HEAD>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<TITLE>
The Zen of Python, by Tim Peters
</TITLE>
<SCRIPT TYPE="text/javascript">

</SCRIPT>
</HEAD>
<BODY ID = "bodyx" ONLOAD = "setTimeout('doall()', INTERVALX);">
<H1 ID = "zen" STYLE = "text-align:center;font-family:sans-serif">
THE ZEN OF PYTHON
</H1>
</BODY>
</HTML>

Friday, October 1, 2010

JavaScript attempt - the Zen of Python

I just completed a JavaScript course and couldn't resist messing with a web page (HTML file). This rotates through the first part of the Zen of Python at five-second intervals (warning - newbish code):

<HTML>
<DOCUMENT>
<HEAD>
<TITLE>
The Zen of Python, by Tim Peters
</TITLE>
<SCRIPT LANGUAGE="JavaScript">

</SCRIPT>
</HEAD>
<BODY ONLOAD = "setTimeout('doall()', 3000);">
<FORM NAME = "formx">
</FORM>
</BODY>
</DOCUMENT>
</HTML>
