Thanks Torsten,
That may help. It actually makes some sense, since the escaped period is
what we are really looking for. "\." tells the regex processor to match
only a literal period, whereas "." tells the regex processor to match any
single character.
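
To make the difference concrete, here is a minimal Java sketch using the two patterns from Torsten's message below. The class name, the helper method, and the look-alike URL are made up for illustration, and it assumes the filter does a find-style match with java.util.regex; it only shows what each pattern accepts, not why the unescaped one hangs.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
public class DotEscapeDemo {
    // Unescaped: every "." matches any single character.
    static final Pattern UNESCAPED = Pattern.compile("http://www.mydomain.local/");
    // Escaped: every "\." matches only a literal period.
    static final Pattern ESCAPED = Pattern.compile("http://www\\.mydomain\\.local/");

    // Assumption: the URL filter looks for the pattern anywhere in the URL.
    static boolean found(Pattern pattern, String url) {
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        String real = "http://www.mydomain.local/index.html";
        String lookalike = "http://wwwXmydomainYlocal/index.html";

        System.out.println(found(UNESCAPED, real));      // true
        System.out.println(found(UNESCAPED, lookalike)); // true, "." accepted X and Y
        System.out.println(found(ESCAPED, real));        // true
        System.out.println(found(ESCAPED, lookalike));   // false, only "." is accepted
    }
}
```

So the escaped form is the stricter of the two and is the one that says what we mean.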

Thanks!
Brad

-----Original Message-----
From: Torsten Krah [mailto:[email protected]] 
Sent: Tuesday, August 03, 2010 5:20 AM
To: [email protected]
Cc: brad
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

On Monday, 2 August 2010, at 20:14:32, brad wrote:
>  I do have about 10
> entries in the regex-urlfilter.txt file, but they are mainly to 
> exclude sites.  For Example:

I've got this problem too with 1.1: Nutch often hangs at util.regexp...
forever.
It does hang if I just use (in the regex filter property files) something like:

http://www.mydomain.local/

If I change this to:

http://www\.mydomain\.local/

it does work - I have no clue why I have to escape the "." to be a period, as
"." should match the period too. However, for me it solved this annoying hang
in the java.util pattern matching. Maybe you can give this a try - maybe it
helps, maybe not :-).

You can get more information on which regex Nutch hangs on if you override
the extension point or the plugin code and add a debugging line just before
the match call. The output then also shows which other regexes do match and
do not hang ;-).
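
A rough sketch of such a debugging line, assuming the plugin ends up calling java.util.regex for each rule. The class and method names here are hypothetical and are not the actual Nutch plugin code; the point is only that the message is printed before the match call, so that if the match never returns, the last line in the log names the regex and URL that hang.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch; the idea would be to route the
// plugin's match call through something like this while debugging.
public class TimedRegexMatch {

    static boolean timedFind(Pattern pattern, String url) {
        // Printed BEFORE the match: if the match hangs, this is the last line logged.
        System.err.println("matching /" + pattern.pattern() + "/ against " + url);

        long start = System.nanoTime();
        boolean result = pattern.matcher(url).find();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Printed only when the match returns, so slow rules show up as well.
        System.err.println("done in " + elapsedMs + " ms, result=" + result);
        return result;
    }

    public static void main(String[] args) {
        Pattern rule = Pattern.compile("http://www\\.mydomain\\.local/");
        timedFind(rule, "http://www.mydomain.local/index.html");
    }
}
```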

Torsten


--
Please do not send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.de.html

"Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect."
        -- Linus Torvalds
