Saturday, October 23, 2010

IronPython, Unicode, and regular expressions

This is a quick follow-up to my last post on Jython.

The basic idea is that .NET regular expressions can match characters by Unicode general category (in this case Mn, the nonspacing marks).

IronPython 2.0.3 (2.0.0.0) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> from System.Text import RegularExpressions as regex
>>> nonspacingx = regex.Regex(r'\p{Mn}')
>>> ns = unichr(0x9C1)
>>> ns
u'\u09c1'
>>> nonspacingx.Match(ns)
<System.Text.RegularExpressions.Match object at 0x000000000000002B [?]>
>>> ns = u'a' + ns
>>> ns
u'a\u09c1'
>>> mtchx = nonspacingx.Match(ns)
>>> mtchx
<System.Text.RegularExpressions.Match object at 0x000000000000002C [?]>
>>> mtchx.ToString()
u'\u09c1'
>>> mtchx.Index
1
>>> mtchx.Length
1
>>> mtchx.Success
True
>>> 

Although the names are different, Java and .NET both provide a means of using general categories in regular expressions. Unlike Python's re.match, Regex.Match in .NET finds the first occurrence anywhere in the string, not just at the start. The Success property is the boolean value indicating whether a match was found.
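For comparison, here is a minimal sketch of the same "anywhere in the string" distinction using the standard re module in CPython. It assumes a plain CPython interpreter rather than IronPython. Because the standard re module does not support \p{Mn}, the pattern uses the literal character instead of the general category. The names mark, anchored, and scanned are only illustrative.

```python
import re

text = u'a\u09c1'
mark = re.compile(u'\u09c1')

# re's match() only tries position 0, so the leading 'a' makes it fail.
anchored = mark.match(text)

# search() scans the whole string, which is closer to how .NET's
# Regex.Match behaved in the session above.
scanned = mark.search(text)

print(anchored is None)
print(scanned.start(), scanned.end() - scanned.start())
```

Here scanned.start() plays the role of Index, and the span width plays the role of Length.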

As an aside, the unicodedata module referenced in the Jython post is available for IronPython. It is not bundled with either the IronPython or the FePy download, but it is available as a separate download from the FePy site.
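Where unicodedata is available, the same check can be done without a regular expression at all. The following is a rough sketch, assuming the module behaves like the standard CPython one; first_nonspacing is a hypothetical helper name, not something from either library.

```python
import unicodedata

def first_nonspacing(text):
    """Return (index, char) for the first character whose general
    category is Mn, or None if the string has no such character."""
    for index, char in enumerate(text):
        if unicodedata.category(char) == 'Mn':
            return index, char
    return None

result = first_nonspacing(u'a\u09c1')
print(result is not None)   # roughly what Success reports
print(result[0])            # roughly what Index reports
```

This only mirrors the single-character case from the session above; a regular expression is still the better fit when the category is part of a larger pattern.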
