Hi Julien,
I don't mean to sound dumb on this, but what is the difference between
automaton-urlfilter.txt and regex-urlfilter.txt?

When I look at the files they seem like they have the same default content.

A Google search didn't turn up much...

Is there some documentation I missed somewhere?

Thanks
Brad
 

-----Original Message-----
From: Julien Nioche [mailto:[email protected]] 
Sent: Tuesday, August 03, 2010 7:22 AM
To: [email protected]; [email protected]
Subject: Re: For HTML - is parse-html twice as fast as parse-tika

Why not use urlfilter-automaton instead? It is much faster than the regex
one.

On 3 August 2010 13:19, Torsten Krah
<[email protected]> wrote:

> On Monday, 2 August 2010, at 20:14:32, brad wrote:
> >  I do have about 10
> > entries in the regex-urlfilter.txt file, but they are mainly to 
> > exclude sites.  For Example:
>
> I have this problem too with 1.1: Nutch often hangs at util.regexp...
> forever.
> It hangs if I just use (in the regexfilter property files) something like:
>
> http://www.mydomain.local/
>
> If I change this to:
>
> http://www\.mydomain\.local/
>
> it works. I have no clue why I have to escape the "." to make it a
> period, since "." should match the period too. However, for me it solved
> this annoying hang in the Java util pattern matching. Maybe you can give
> this a try; maybe it helps, maybe not :-).
>
> You can get more information on which regex Nutch hangs on if you
> override the extension point or the plugin code and add a
> debugging line just before the match call, then look for some other
> regex that does match and does not hang ;-).
>
> Torsten
>
>
> --
> Please do not send me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.de.html
>
> "Really, I'm not out to destroy Microsoft. That will just be a
> completely unintentional side effect."
>        -- Linus Torvalds
>

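The escaping point in the quoted message can be illustrated with plain java.util.regex. This is only a minimal sketch: the class name DotEscapeDemo and the helper matchesPrefix are hypothetical, and prefix matching via lookingAt() is an assumption that may differ from how Nutch applies its rules. It shows the difference in meaning between "." and "\.", and does not attempt to reproduce the hang.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative and not part of Nutch.
class DotEscapeDemo {
    // Unescaped: each "." matches any single character.
    static final Pattern LOOSE = Pattern.compile("http://www.mydomain.local/");
    // Escaped: each "\\." matches only a literal period.
    static final Pattern STRICT = Pattern.compile("http://www\\.mydomain\\.local/");

    // Returns true if the pattern matches at the start of the URL
    // (assumed prefix matching; Nutch may match differently).
    static boolean matchesPrefix(Pattern p, String url) {
        return p.matcher(url).lookingAt();
    }

    public static void main(String[] args) {
        String real = "http://www.mydomain.local/index.html";
        String lookalike = "http://wwwXmydomainYlocal/index.html";
        System.out.println("loose  / real:      " + matchesPrefix(LOOSE, real));
        System.out.println("loose  / lookalike: " + matchesPrefix(LOOSE, lookalike));
        System.out.println("strict / real:      " + matchesPrefix(STRICT, real));
        System.out.println("strict / lookalike: " + matchesPrefix(STRICT, lookalike));
    }
}
```

Both patterns accept the real URL, but only the unescaped one also accepts the lookalike host, so escaping the periods makes the rule match what was intended.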


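The debugging idea from the quoted message (log each regex just before the match call) could look roughly like the sketch below. It runs outside Nutch: the class RegexTimingDemo, the method timeEach and the sample rules are all hypothetical, and using find() is an assumption about how the rules are applied.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-alone helper; it does not use any Nutch classes.
class RegexTimingDemo {
    // Logs each regex before matching it against the URL and records
    // the elapsed time in nanoseconds, keyed by the regex string.
    static Map<String, Long> timeEach(List<String> regexes, String url) {
        Map<String, Long> elapsed = new LinkedHashMap<>();
        for (String regex : regexes) {
            Pattern p = Pattern.compile(regex);
            // The debugging line: if a match never returns, the last
            // regex printed here is the one that hangs.
            System.err.println("matching '" + regex + "' against " + url);
            long start = System.nanoTime();
            p.matcher(url).find();
            elapsed.put(regex, System.nanoTime() - start);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Sample rules for illustration only.
        List<String> rules = Arrays.asList(
                "http://www.mydomain.local/",
                "http://www\\.mydomain\\.local/");
        Map<String, Long> result = timeEach(rules, "http://www.mydomain.local/index.html");
        for (Map.Entry<String, Long> e : result.entrySet()) {
            System.out.println(e.getKey() + " took " + e.getValue() + " ns");
        }
    }
}
```

Printing before the match call matters here: a regex that never returns leaves no timing entry, but its log line is still the last one written.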
--
DigitalPebble Ltd

Open Source Solutions for Text Engineering http://www.digitalpebble.com
