Re: [IronPython] differences in IronPython/CPython regular expressions?
George Silva georger.si...@gmail.com wrote:

> If you're on Windows, you can test the native C# behavior with a program called Rad Software Regular Expression Designer. It's very helpful.

Thanks, George. That looks like a useful piece of software.

Bill

___
Users mailing list
Users@lists.ironpython.com
http://lists.ironpython.com/listinfo.cgi/users-ironpython.com
Re: [IronPython] differences in IronPython/CPython regular expressions?
Jeff Hardy jdha...@gmail.com wrote:

> The fact that it works on CPython fairly fast indicates a bug somewhere, I'm just not sure if it's IronPython or Mono.

I just tried it with real MS .NET, on a 64-bit Windows 7 machine with a new download of IronPython 2.7. On that platform, it core-dumps (well, ipy exits with a StackOverflowException).

Bill
Re: [IronPython] differences in IronPython/CPython regular expressions?
On Thu, Jun 2, 2011 at 9:41 AM, Bill Janssen jans...@parc.com wrote:

> Jeff Hardy jdha...@gmail.com wrote:
>
> > The fact that it works on CPython fairly fast indicates a bug somewhere, I'm just not sure if it's IronPython or Mono.
>
> I just tried it with real MS .NET, on a 64-bit Windows 7 machine with a new download of IronPython 2.7. On that platform, it core-dumps (well, ipy exits with a StackOverflowException).

Any chance you could get a debugger on there and figure out where the StackOverflowException is thrown (IronPython or .NET)? If not, I can try to take a look if you send the complete regex, but probably not until the weekend.

- Jeff
Re: [IronPython] differences in IronPython/CPython regular expressions?
Jeff Hardy jdha...@gmail.com wrote:

> On Thu, Jun 2, 2011 at 9:41 AM, Bill Janssen jans...@parc.com wrote:
>
> > Jeff Hardy jdha...@gmail.com wrote:
> >
> > > The fact that it works on CPython fairly fast indicates a bug somewhere, I'm just not sure if it's IronPython or Mono.
> >
> > I just tried it with real MS .NET, on a 64-bit Windows 7 machine with a new download of IronPython 2.7. On that platform, it core-dumps (well, ipy exits with a StackOverflowException).
>
> Any chance you could get a debugger on there and figure out where the SOE is (IronPython or .NET)? If not, I can try to take a look if you send the complete regex, but probably not until the weekend.

Would gdb work? I'll try.

The fact that it's different between .NET and Mono makes me guess it's in the System::Text::RegularExpressions package.

Bill
Re: [IronPython] differences in IronPython/CPython regular expressions?
On Thu, Jun 2, 2011 at 11:58 AM, Bill Janssen jans...@parc.com wrote:

> Would gdb work? I'll try.

Mono's debugger might be better, if their regex engine is managed. It looks like there's some Mono support in gdb, but I've never used it.

On Windows, WinDbg is your friend. It's about as user-friendly as gdb, though (take that how you want).

- Jeff
Re: [IronPython] differences in IronPython/CPython regular expressions?
> The fact that it's different between .NET and Mono makes me guess it's in the System::Text::RegularExpressions package.

If that is the case, it should be easy to test by using C#. Just write a little console app to test your regex on both Mono and MS .NET. If that fails, then the problem is not in IronPython but in the .NET core.

Of course I just may be smoking something.

Lee
Re: [IronPython] differences in IronPython/CPython regular expressions?
On Wed, Jun 1, 2011 at 4:03 PM, Bill Janssen jans...@parc.com wrote:

> I have a large RE (223613 chars) that works fine in CPython 2.6, but

That's truly horrible, but I assume you have a good reason for it.

> seems to produce an endless loop in IronPython (see below). I'm using Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7.
>
> Anyone have pointers to the differences between them? Is System::Text::RegularExpressions in .NET configurable in some fashion that might help?

First off, is there a reason you don't use re.IGNORECASE? That would cut the regex in half, at least.

For the most part, CPython and IronPython regexes should be fairly compatible - IronPython takes the regex and massages it to work with System.Text.RegularExpressions, but the changes are pretty straightforward and small, and I don't think the re you provided hits any of them. It's quite possible that the Mono version of System.Text.RegularExpressions can't handle the expression; you could test this by saving the full regex and building a small C# program that runs it.

The regex template has a lot of potential backtracking in it; are you sure it's not caught in a pathological (exponential) case?

Finally, is one ginormous regex really the best way to do this? Have you tried other approaches?

- Jeff

> I'm a .NET newbie.
>
> TIA,
> Bill
>
> --
>
> import sys, os, re
>
> try:
>     # we use the name lists in nltk to create person-name matching patterns
>     import nltk.data
> except ImportError:
>     sys.stderr.write("Can't import nltk; can't do name lists.\nSee http://www.nltk.org/.\n")
>     sys.exit(1)
> else:
>     __MALE_NAME_EXCLUDES = ("Hill", "Ave", )
>     __FEMALE_NAME_EXCLUDES = ()
>
>     __FEMALE_NAMES = [x for x in nltk.data.load("corpora/names/female.txt", format="raw").split("\n")
>                       if (x and (x not in __FEMALE_NAME_EXCLUDES))]
>     __FEMALE_NAMES += [x.upper() for x in __FEMALE_NAMES]
>     __MALE_NAMES = [x for x in nltk.data.load("corpora/names/male.txt", format="raw").split("\n")
>                     if (x and (x not in __MALE_NAME_EXCLUDES))]
>     __MALE_NAMES += [x.upper() for x in __MALE_NAMES]
>     # note: range() stops before ord('Z'), so this list covers 'A' through 'Y'
>     __INITS = [chr(x) for x in range(ord('A'), ord('Z'))]
>
>     PERSON_PATTERN = re.compile(
>         r"^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"                            # honorific
>         r"(?P<firstname>" + "|".join(__FEMALE_NAMES + __MALE_NAMES + __INITS) +        # first name
>         r")"
>         r"( (?P<middlename>([A-Z]\.)|(" + "|".join(__FEMALE_NAMES + __MALE_NAMES) +    # middle initial or name
>         r")))?"
>         r" +(?P<lastname>[A-Z][A-Za-z]+)",                                             # space then last name
>         re.MULTILINE)
>
>     print(PERSON_PATTERN.match("Mr. John Smith"))
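Editorial sketch, not part of the original exchange: one possible "other approach" of the kind Jeff asks about is to keep the regex small and structural and move the name lists into set lookups. The name set below is a hypothetical stand-in for the NLTK lists, `match_person` and the pattern names are made up for illustration, and the token shape `[A-Z][A-Za-z'-]*` is an assumption about what the name files contain. The behavior is meant to approximate the posted pattern, not to reproduce it exactly.

```python
import re
import string

# Hypothetical stand-ins for the NLTK name lists used in the original post.
FIRST_NAMES = {"John", "Mary", "Bill"}
# Rule from the thread: a name is either capitalized or all upper-case.
KNOWN_NAMES = FIRST_NAMES | {name.upper() for name in FIRST_NAMES}
INITIALS = set(string.ascii_uppercase)

_HONORIFIC = r"^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"
_TOKEN = r"[A-Z][A-Za-z'-]*"      # assumed shape of a name token; no name list inside
_LAST = r" +(?P<lastname>[A-Z][A-Za-z]+)"

# Two small structural patterns replace the single huge alternation.
WITH_MIDDLE = re.compile(
    _HONORIFIC + "(?P<firstname>" + _TOKEN + ")"
    + r" (?P<middlename>[A-Z]\.|" + _TOKEN + ")" + _LAST, re.MULTILINE)
WITHOUT_MIDDLE = re.compile(
    _HONORIFIC + "(?P<firstname>" + _TOKEN + ")" + _LAST, re.MULTILINE)


def _is_valid(fields):
    """Check the captured tokens against the name sets."""
    first = fields["firstname"]
    if first not in KNOWN_NAMES and first not in INITIALS:
        return False
    middle = fields.get("middlename")
    if middle is None or middle.endswith("."):   # absent, or an initial like "Q."
        return True
    return middle in KNOWN_NAMES


def match_person(text):
    """Return a dict of name parts, or None if text does not start with a known name."""
    for pattern in (WITH_MIDDLE, WITHOUT_MIDDLE):
        m = pattern.match(text)
        if m is not None:
            fields = m.groupdict()
            fields.setdefault("middlename", None)
            if _is_valid(fields):
                return fields
    return None


print(match_person("Mr. John Smith"))
```

Each pattern stays a few dozen characters long regardless of how many names are in the lists, so the regex engine never sees the 223613-character alternation. Whether this sidesteps the Mono/.NET behavior reported in the thread is untested.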
Re: [IronPython] differences in IronPython/CPython regular expressions?
Jeff Hardy jdha...@gmail.com wrote:

> On Wed, Jun 1, 2011 at 4:03 PM, Bill Janssen jans...@parc.com wrote:
>
> > I have a large RE (223613 chars) that works fine in CPython 2.6, but
>
> That's truly horrible, but I assume you have a good reason for it.

Hi, Jeff. Yes, I think so.

> > seems to produce an endless loop in IronPython (see below). I'm using Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7.
> >
> > Anyone have pointers to the differences between them? Is System::Text::RegularExpressions in .NET configurable in some fashion that might help?
>
> First off, is there a reason you don't use re.IGNORECASE? That would cut the regex in half, at least.

Sure. Names are sensitive to capitalization; the rule I'm implementing says names are either capitalized or upper-case.

> For the most part, CPython and IronPython regexes should be fairly compatible - IronPython takes the regex and massages it to work with System.Text.RegularExpressions, but the changes are pretty straightforward and small,

Are those changes documented anywhere?

> and I don't think the re you provided hits any of them. It's quite possible that the Mono version of System.Text.RegularExpressions can't handle the expression; you could test this by saving the full regex and building a small C# program that runs it.
>
> The regex template has a lot of potential backtracking in it; are you sure it's not caught in a pathological (exponential) case?

No; all I'm sure of is that this runs in 1.2 seconds in CPython, and takes up a core for 15 minutes (till I kill it) with IronPython/Mono. Something is clearly hitting a bug somewhere... I suppose I should try it on Windows.

> Finally, is one ginormous regex really the best way to do this? Have you tried other approaches?

No need, until I hit .NET. I'm used to working with a full-featured finite-state machine (PARC's xfst; see http://www.cis.upenn.edu/~cis639/docs/xfst.html), and was wondering if we could do similar things with Python's RE machinery. Long lists like this one are often used for companies, cities, and the like. People's names are actually a fairly simple and short example of this :-).

Bill
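Editorial sketch, not part of the original exchange: the finite-state comparison suggests one possible mitigation, which is to factor shared prefixes out of the alternation so that a backtracking engine makes a per-character choice instead of retrying thousands of whole alternatives. `trie_pattern` and `_emit` are hypothetical helpers written for this illustration; nothing in the thread confirms that this resolves the Mono or .NET behavior.

```python
import re


def trie_pattern(words):
    """Return a regex fragment matching exactly the given words,
    with common prefixes factored out."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}                 # empty key marks "a word ends here"
    return _emit(trie)


def _emit(node):
    """Render one trie node (and everything below it) as a regex fragment."""
    ends_here = "" in node
    branches = [re.escape(ch) + _emit(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""                     # leaf: nothing more to match
    if len(branches) == 1 and not ends_here:
        return branches[0]            # single continuation needs no group
    body = "(?:" + "|".join(branches) + ")"
    # If a word also ends at this node, the continuation is optional.
    return body + "?" if ends_here else body


print(trie_pattern(["John", "Jon", "Joan"]))
```

In the posted script, a fragment built this way could stand in for each `"|".join(...)` call, for example `"(?P<firstname>" + trie_pattern(names) + ")"`, where `names` is the combined list.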
Re: [IronPython] differences in IronPython/CPython regular expressions?
If you're on Windows, you can test the native C# behavior with a program called Rad Software Regular Expression Designer. It's very helpful.

On Wed, Jun 1, 2011 at 8:44 PM, Bill Janssen jans...@parc.com wrote:

> Jeff Hardy jdha...@gmail.com wrote:
>
> > On Wed, Jun 1, 2011 at 4:03 PM, Bill Janssen jans...@parc.com wrote:
> >
> > > I have a large RE (223613 chars) that works fine in CPython 2.6, but
> >
> > That's truly horrible, but I assume you have a good reason for it.
>
> Hi, Jeff. Yes, I think so.
>
> > > seems to produce an endless loop in IronPython (see below). I'm using Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7.
> > >
> > > Anyone have pointers to the differences between them? Is System::Text::RegularExpressions in .NET configurable in some fashion that might help?
> >
> > First off, is there a reason you don't use re.IGNORECASE? That would cut the regex in half, at least.
>
> Sure. Names are sensitive to capitalization; the rule I'm implementing says names are either capitalized or upper-case.
>
> > For the most part, CPython and IronPython regexes should be fairly compatible - IronPython takes the regex and massages it to work with System.Text.RegularExpressions, but the changes are pretty straightforward and small,
>
> Are those changes documented anywhere?
>
> > and I don't think the re you provided hits any of them. It's quite possible that the Mono version of System.Text.RegularExpressions can't handle the expression; you could test this by saving the full regex and building a small C# program that runs it.
> >
> > The regex template has a lot of potential backtracking in it; are you sure it's not caught in a pathological (exponential) case?
>
> No; all I'm sure of is that this runs in 1.2 seconds in CPython, and takes up a core for 15 minutes (till I kill it) with IronPython/Mono. Something is clearly hitting a bug somewhere... I suppose I should try it on Windows.
>
> > Finally, is one ginormous regex really the best way to do this? Have you tried other approaches?
>
> No need, until I hit .NET. I'm used to working with a full-featured finite-state machine (PARC's xfst; see http://www.cis.upenn.edu/~cis639/docs/xfst.html), and was wondering if we could do similar things with Python's RE machinery. Long lists like this one are often used for companies, cities, and the like. People's names are actually a fairly simple and short example of this :-).
>
> Bill

--
George R. C. Silva
GIS Development
http://geoprocessamento.net
http://blog.geoprocessamento.net
Re: [IronPython] differences in IronPython/CPython regular expressions?
> Sure. Names are sensitive to capitalization; the rule I'm implementing says names are either capitalized or upper-case.

Ah, I see that now. I assumed the name lists were in lower case.

> > For the most part, CPython and IronPython regexes should be fairly compatible - IronPython takes the regex and massages it to work with System.Text.RegularExpressions, but the changes are pretty straightforward and small,
>
> Are those changes documented anywhere?

The code is in Languages\IronPython\IronPython.Modules\re.cs in the PreParseRegex function; it's pretty straightforward, if a little long. Looking at it again, it's quite possible there's a bug in there, but we'd need a minimal repro to have any hope of finding it.

> No need, until I hit .NET. I'm used to working with a full-featured finite-state machine (PARC's xfst; see http://www.cis.upenn.edu/~cis639/docs/xfst.html), and was wondering if we could do similar things with Python's RE machinery. Long lists like this one are often used for companies, cities, and the like. People's names are actually a fairly simple and short example of this :-).

The fact that it works on CPython fairly fast indicates a bug somewhere, I'm just not sure if it's IronPython or Mono.

- Jeff
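Editorial sketch, not part of the original exchange: one way to hunt for the minimal repro Jeff asks for is to binary-search the name list for the shortest prefix that still triggers the failure. Because a hang or a StackOverflowException takes down the whole interpreter, each trial is assumed to run in a separate process with a timeout. Both function names are hypothetical, the driver is assumed to run under CPython 3 (for `subprocess.run`), and the search assumes the failure is monotonic in the number of names, which the thread does not establish.

```python
import subprocess


def smallest_failing_prefix(items, fails):
    """Binary-search the shortest prefix of items for which fails(prefix) is true.

    Assumes monotonic failure: once a prefix fails, every longer prefix fails.
    Returns None if even the full list passes.
    """
    if not fails(items):
        return None
    lo, hi = 0, len(items)            # invariant: items[:hi] is known to fail
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(items[:mid]):
            hi = mid                  # a shorter failing prefix exists
        else:
            lo = mid + 1              # need more than mid items to fail
    return items[:hi]


def crashes_or_hangs(cmd, names, timeout=60):
    """Run cmd (a list, e.g. an interpreter plus a test script) with the names
    on stdin; treat a timeout or a non-zero exit status as a failure."""
    try:
        proc = subprocess.run(cmd, input="\n".join(names),
                              universal_newlines=True, timeout=timeout,
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except subprocess.TimeoutExpired:
        return True
    return proc.returncode != 0


# Toy predicate standing in for a real subprocess trial.
print(smallest_failing_prefix(list(range(10)), lambda prefix: len(prefix) >= 5))
```

In practice the predicate would wrap `crashes_or_hangs` with a command that launches the interpreter under test on a small script that reads names from stdin, builds the pattern, and runs the match.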