Thanks Torsten, that may help. It actually makes some sense, since an escaped period is what we are looking for: "\." tells the regex processor to match only a literal period, whereas "." tells it to match any single character.
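For what it's worth, here is a minimal Java sketch of that difference. The class name `DotEscapeDemo` and the helper `found` are made up for illustration, and the use of `Matcher.find()` is an assumption about how the URL filter applies its patterns, not something taken from the Nutch source. It only shows the matching semantics; it does not reproduce or explain the hang.

```java
import java.util.regex.Pattern;

public class DotEscapeDemo {
    // Hypothetical helper: true if the regex is found anywhere in the URL.
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String unescaped = "http://www.mydomain.local/";      // "." matches any character
        String escaped = "http://www\\.mydomain\\.local/";    // "\." matches only a period

        String realUrl = "http://www.mydomain.local/index.html";
        String lookalike = "http://wwwXmydomainYlocal/index.html";

        // Both patterns accept the intended host.
        System.out.println(found(unescaped, realUrl));
        System.out.println(found(escaped, realUrl));

        // Only the unescaped pattern also accepts the lookalike host.
        System.out.println(found(unescaped, lookalike));
        System.out.println(found(escaped, lookalike));
    }
}
```

So the unescaped entry is simply a looser pattern than intended: it still matches the real host, but it also lets through anything that has some other character where the periods should be.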
Thanks!
Brad

-----Original Message-----
From: Torsten Krah [mailto:[email protected]]
Sent: Tuesday, August 03, 2010 5:20 AM
To: [email protected]
Cc: brad
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

On Monday, 2 August 2010, at 20:14:32, brad wrote:
> I do have about 10
> entries in the regex-urlfilter.txt file, but they are mainly to
> exclude sites. For Example:

I've got this problem too with 1.1: Nutch often hangs at util.regexp... forever.

It hangs if I just use (in the regex filter property files) something like:

http://www.mydomain.local/

If I change this to:

http://www\.mydomain\.local/

it works. I have no clue why I have to escape the "." to make it a period, as "." should match the period too. However, for me it solved this annoying hang in Java's util pattern matching.

Maybe you can give this a try. Maybe it helps, maybe not :-).

You can get more information on which regex makes Nutch hang if you override the extension point or the plugin code and add a debugging line just before the match call, and then find some other regex which does match and does not hang ;-).

Torsten

--
Please do not send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.de.html

"Really, I'm not out to destroy Microsoft. That will just be a completely unintentional side effect."
-- Linus Torvalds

