The syntax is indeed very similar. urlfilter-automaton uses an FSA (finite-state automaton) library; see http://weblogs.java.net/blog/2006/03/27/faster-java-regex-package
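
To make "very similar syntax" concrete, here is a minimal sketch of how a rule file of this kind can be read. It is written in Python rather than Nutch's Java, and it is not the plugin code. It assumes the usual layout of the default files ('+' or '-' followed by a pattern, '#' for comments), that the first matching rule decides, and that a URL matching no rule is rejected. The rule lines and the names parse_rules and accept are made up for illustration.

```python
import re

# Hypothetical rule lines in the style of regex-urlfilter.txt
# (assumption: '+' accepts, '-' rejects, '#' starts a comment).
RULES = r"""
# skip some URL schemes
-^(file|ftp|mailto):
# skip one site (dots escaped so they match literally)
-^http://www\.mydomain\.local/
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept_flag, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accept(url, rules):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    for allowed, pattern in rules:
        if pattern.search(url):
            return allowed
    return False


rules = parse_rules(RULES)
print(accept("http://example.org/page.html", rules))
print(accept("ftp://example.org/file", rules))
print(accept("http://www.mydomain.local/x", rules))
```

Only the matching engine behind each pattern would differ between the two filters in this picture, which is why the default files can look the same.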
On 3 August 2010 16:07, brad <[email protected]> wrote:

> Hi Julien,
> I don't mean to sound dumb on this, but what is the difference between
> automaton-urlfilter.txt and regex-urlfilter.txt?
>
> When I look at the files, they seem to have the same default content.
>
> A Google search didn't turn up much...
>
> Is there some documentation I missed somewhere?
>
> Thanks
> Brad
>
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Tuesday, August 03, 2010 7:22 AM
> To: [email protected]; [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Why not use urlfilter-automaton instead? It is much faster than the regex
> one.
>
> On 3 August 2010 13:19, Torsten Krah
> <[email protected]> wrote:
>
> > On Monday, 2 August 2010, at 20:14:32, brad wrote:
> > > I do have about 10
> > > entries in the regex-urlfilter.txt file, but they are mainly to
> > > exclude sites. For example:
> >
> > I have this problem too with 1.1: Nutch often hangs in util.regexp...
> > forever.
> > It hangs if I just use (in the regex filter property files) something
> > like:
> >
> > http://www.mydomain.local/
> >
> > If I change this to:
> >
> > http://www\.mydomain\.local/
> >
> > it works. I have no clue why I have to escape the "." to make it a
> > literal period, as an unescaped "." should match the period too.
> > However, for me it solved this annoying hang in java.util pattern
> > matching. Maybe you can give this a try; maybe it helps, maybe not :-).
> >
> > You can get more information on which regex Nutch hangs on if you
> > override the extension point or the plugin code, add a debugging
> > line just before the match call, and find some other regex
> > which does match and does not hang ;-).
> >
> > Torsten
> >
> >
> > --
> > Please do not send me Word or PowerPoint attachments.
> > See http://www.gnu.org/philosophy/no-word-attachments.de.html
> >
> > "Really, I'm not out to destroy Microsoft. That will just be a
> > completely unintentional side effect."
> > -- Linus Torvalds

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
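
On the escaping point in the quoted message: an unescaped "." does match a literal period, but it also matches any other character, so the two patterns are not equivalent. The sketch below shows only that difference. It does not reproduce or explain the hang. It uses Python's re module rather than java.util.regex (for this basic syntax "." and "\." mean the same in both), and the lookalike URL and the helper name matches are invented for the illustration.

```python
import re

# The two patterns from the quoted message.
unescaped = re.compile("http://www.mydomain.local/")
escaped = re.compile(r"http://www\.mydomain\.local/")

real = "http://www.mydomain.local/index.html"
# Hypothetical URL with other characters where the dots would be.
lookalike = "http://wwwXmydomain-local/index.html"


def matches(pattern, url):
    """True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Both patterns match the real host.
print(matches(unescaped, real), matches(escaped, real))
# Only the unescaped pattern matches the lookalike, because '.' is a wildcard.
print(matches(unescaped, lookalike), matches(escaped, lookalike))

# re.escape builds the literal form without hand-written backslashes.
auto_escaped = re.compile(re.escape("http://www.mydomain.local/"))
print(matches(auto_escaped, real), matches(auto_escaped, lookalike))
```

Escaping the dots is therefore worth doing in exclusion rules regardless of the hang, since the unescaped form can exclude more hosts than intended.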

