The syntax is very similar indeed. The automaton filter uses an FSA
(finite-state automaton) library.

See http://weblogs.java.net/blog/2006/03/27/faster-java-regex-package

On 3 August 2010 16:07, brad <[email protected]> wrote:

> Hi Julien,
> I don't mean to sound dumb on this, but what is the difference between
> automaton-urlfilter.txt and regex-urlfilter.txt?
>
> When I look at the files they seem like they have the same default content.
>
> A google search didn't turn up much...
>
> Is there some documentation I missed somewhere?
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Tuesday, August 03, 2010 7:22 AM
> To: [email protected]; [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Why not use urlfilter-automaton instead? It is much faster than the regex
> one.
>
> On 3 August 2010 13:19, Torsten Krah
> <[email protected]> wrote:
>
> > On Monday, 2 August 2010, at 20:14:32, brad wrote:
> > >  I do have about 10
> > > entries in the regex-urlfilter.txt file, but they are mainly to
> > > exclude sites.  For Example:
> >
> > I have this problem too with 1.1: Nutch often hangs at util.regexp...
> > forever.
> > It hangs if I just use (in the regex filter property files) something like:
> >
> > http://www.mydomain.local/
> >
> > If I change this to:
> >
> > http://www\.mydomain\.local/
> >
> > it works. I have no clue why I have to escape the "." to make it a
> > period, since an unescaped "." should match the period too. However, for
> > me it solved this annoying hang in the java.util pattern matching. Maybe
> > you can give this a try; maybe it helps, maybe not :-).
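
A minimal sketch of the escaping point above, using plain java.util.regex
only. It does not reproduce the hang and says nothing about how Nutch applies
its filter rules; it only shows that an unescaped "." accepts any character
while "\." (or Pattern.quote) accepts only a literal period. The class name,
the found() helper and the look-alike URL are made up for illustration.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Hypothetical helper: true if the regex is found anywhere in the URL.
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String unescaped = "http://www.mydomain.local/";
        String escaped = "http://www\\.mydomain\\.local/";
        // Pattern.quote wraps the host so every character is taken literally.
        String quoted = "http://" + Pattern.quote("www.mydomain.local") + "/";

        String real = "http://www.mydomain.local/index.html";
        String lookalike = "http://wwwXmydomainYlocal/index.html";

        System.out.println(found(unescaped, real));      // true
        System.out.println(found(unescaped, lookalike)); // true: "." matches any character
        System.out.println(found(escaped, real));        // true
        System.out.println(found(escaped, lookalike));   // false: "\." needs a literal period
        System.out.println(found(quoted, lookalike));    // false
    }
}
```

So the escaped form is the stricter rule; whether it also avoids the hang is
something only a test against the real crawl can confirm.
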
> >
> > You can find out which regex Nutch hangs on if you override the
> > extension point or the plugin code and add a debugging line just before
> > the match call; that also shows which other regexes match without
> > hanging ;-).
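
The debugging idea above could look roughly like this. It is a standalone
sketch with java.util.regex, not the actual Nutch plugin code: the class name,
the traceMatches() helper and the two sample rules are hypothetical. Each rule
is logged before the match call, so if one rule never returns, the last
"trying rule" line on stderr names it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class RegexTimingDemo {
    // Hypothetical helper: tries every rule against the URL, logging the rule
    // before matching and the elapsed time afterwards.
    static Map<String, Boolean> traceMatches(List<String> rules, String url) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (String rule : rules) {
            System.err.println("trying rule: " + rule + " on " + url);
            long start = System.nanoTime();
            boolean hit = Pattern.compile(rule).matcher(url).find();
            long micros = (System.nanoTime() - start) / 1_000;
            System.err.println("  -> " + hit + " after " + micros + " us");
            results.put(rule, hit);
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample rules are assumptions, not taken from anyone's filter file.
        List<String> rules = Arrays.asList(
                "http://www\\.mydomain\\.local/",
                "\\.(gif|jpg)$");
        System.out.println(traceMatches(rules, "http://www.mydomain.local/a.gif"));
    }
}
```
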
> >
> > Torsten
> >
> >
> > --
> > Please do not send me Word or PowerPoint attachments.
> > See http://www.gnu.org/philosophy/no-word-attachments.de.html
> >
> > "Really, I'm not out to destroy Microsoft. That will just be a
> > completely unintentional side effect."
> >        -- Linus Torvalds
> >
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering http://www.digitalpebble.com
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
