Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Ken Krugler
1. Keep the well-known Perl syntax for regexp (and then find a way to simulate it with the automaton's limited syntax)? My vote would be for option 1. It's less work for everyone (except for the person incorporating the new library :) That's my preferred solution too. The first challenge is

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Doug Cutting
Jérôme Charron wrote: So, two solutions: 1. Keep java regexp ... 2. Switch to automaton and provide a java implementation of this regexp (it is more a protection pattern than really a filter pattern, and it could probably be hard-coded). If it were easy to implement all java regex features in

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
Besides that, we should maybe add a kind of timeout to the url filter in general, since it can happen that a user configures a regex for his Nutch setup that runs into the same problem we have just run into. Something like the attachment below. Would you agree? I can create a serious patch and test it

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
If it were easy to implement all java regex features in dk.brics.automaton.RegExp, then they probably would have. Alternately, if they'd implemented all java regex features, it probably wouldn't be so fast. So I worry that attempts to translate are doomed. Better to accept the differences:

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Andrzej Bialecki
Jérôme Charron wrote: 3. Add new plugins that use dk.brics.automaton.RegExp, using different default regex file names. Then folks can, if they choose, configure things to use these faster regex libraries, but only if they're willing to write the simpler regexes that it supports. If, over time,

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Stefan Groschupf
Besides that, we should maybe add a kind of timeout to the url filter in general. I think this is overkill. There is already a Hadoop task timeout. Is that not sufficient? No! What happens is that the url filter hangs and then the complete task is timed out instead of just skipping this
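A minimal sketch of the per-URL timeout idea discussed above, so that a slow filter call rejects one URL instead of stalling the whole task. This is not the patch attached to the thread and not Nutch's URLFilter API: the class TimedUrlFilter, its nested Filter interface, and the choice to treat a timeout as "reject" (return null) are all assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper, for illustration only (not part of Nutch).
class TimedUrlFilter {
    // Assumed contract: return the URL to accept it, null to reject it.
    interface Filter {
        String filter(String url);
    }

    // Daemon worker threads, so a stuck match cannot keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });
    private final Filter delegate;
    private final long timeoutMillis;

    TimedUrlFilter(Filter delegate, long timeoutMillis) {
        this.delegate = delegate;
        this.timeoutMillis = timeoutMillis;
    }

    // Runs the wrapped filter, giving up on this one URL after the timeout.
    String filter(final String url) {
        Callable<String> task = () -> delegate.filter(url);
        Future<String> result = pool.submit(task);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // best effort: only interrupts the worker
            return null;         // assumption: skip the URL on timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            return null;         // the wrapped filter threw; skip the URL
        }
    }

    public static void main(String[] args) {
        TimedUrlFilter f = new TimedUrlFilter(u -> u, 500);
        System.out.println(f.filter("http://example.com/"));
    }
}
```

One caveat with this design: cancel(true) only interrupts the worker thread, and a java.util.regex match does not check for interruption, so a runaway match may keep consuming CPU in the background even though the caller has moved on.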

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Doug Cutting
Stefan Groschupf wrote: Instead I would suggest going a step further by adding a (configurable) timeout mechanism and skipping bad records in reducing in general. Processing such big data and losing all data just because of one bad record is very sad. That's a good suggestion. Ideally we could use

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Stefan Groschupf
Doug, Instead I would suggest going a step further by adding a (configurable) timeout mechanism and skipping bad records in reducing in general. Processing such big data and losing all data just because of one bad record is very sad. That's a good suggestion. Ideally we could use

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Stefan Groschupf
* Change the syntax used in Nutch? +1, my point of view is that we can do that for Nutch 0.8 as long as we document it (see nutch-user). :-) Stefan

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Howie Wang
I have made some quick tests with regex-urlfilter... The major problem is that it doesn't use the Perl syntax... For instance, it doesn't support the boundary matchers ^ and $ (which are used in nutch) Are there other ways to match start/end of string in the other regex library? I use ^http a

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Matt Kangas
I've been watching discussion of faster regex libs with much interest. But if regex speed seems to be a problem, would using fewer regexes be a good answer? Protocol and extension filtering could be done by another URLFilter plugin that is dedicated to this task, and uses more lightweight
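As a rough sketch of what such a dedicated, regex-free protocol and extension filter might look like: the class PrefixSuffixFilter below and its accept/reject convention (return the URL, or null) are hypothetical and are not an existing Nutch plugin. It uses only plain string operations and set lookups.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter, for illustration only (not an existing Nutch plugin).
class PrefixSuffixFilter {
    private final Set<String> allowedSchemes;
    private final Set<String> blockedExtensions;

    PrefixSuffixFilter(String[] schemes, String[] extensions) {
        allowedSchemes = new HashSet<>(Arrays.asList(schemes));
        blockedExtensions = new HashSet<>(Arrays.asList(extensions));
    }

    // Returns the URL if accepted, null if rejected.
    String filter(String url) {
        int colon = url.indexOf(':');
        if (colon <= 0) {
            return null; // no scheme at all
        }
        String scheme = url.substring(0, colon).toLowerCase(Locale.ROOT);
        if (!allowedSchemes.contains(scheme)) {
            return null;
        }
        // Ignore query string and fragment when looking for an extension.
        int end = url.length();
        int q = url.indexOf('?');
        if (q >= 0) end = Math.min(end, q);
        int h = url.indexOf('#');
        if (h >= 0) end = Math.min(end, h);
        // Skip "scheme://host" so that ".com" is not taken for an extension.
        int authorityStart = url.startsWith("://", colon) ? colon + 3 : colon + 1;
        int pathStart = url.indexOf('/', authorityStart);
        if (pathStart < 0 || pathStart >= end) {
            return url; // no path, so no extension to check
        }
        String path = url.substring(pathStart, end);
        int dot = path.lastIndexOf('.');
        if (dot > path.lastIndexOf('/')) {
            String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
            if (blockedExtensions.contains(ext)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        PrefixSuffixFilter f = new PrefixSuffixFilter(
            new String[] {"http", "https"}, new String[] {"gif", "jpg", "zip"});
        System.out.println(f.filter("http://example.com/index.html"));
        System.out.println(f.filter("http://example.com/logo.gif"));
    }
}
```

The scheme and extension lists here are made-up examples; in practice they would presumably come from configuration.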

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Howie Wang
Thanks to everybody for your suggestions. But really, my problem is not technical, but political: What should we do if we switch to the automaton regexp lib? 1. Keep the well-known Perl syntax for regexp (and then find a way to simulate it with the automaton's limited syntax)? 2. Switch to the

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Andrzej Bialecki
Incze Lajos wrote: * simulate ^ and $ operators by prepending and appending special start and end markers to the input string. E.g. String START = "__START__"; String END = "__END__"; inputString = START + inputString + END; What about char START = '^'; char END = '$';
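The marker idea above can be sketched roughly as follows. To stay self-contained, this sketch does not use dk.brics.automaton; it uses java.util.regex with Matcher.matches() as a stand-in for an engine that only matches whole strings and has no anchors. Because ^ and $ are special in java.util.regex, the markers here are two control characters rather than the literal '^' and '$' suggested in the thread. The class AnchorMarkers and its methods are hypothetical, and the translation is deliberately simplified (it understands backslash escapes and simple character classes only).

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class AnchorMarkers {
    // Assumption: these control characters never occur in a real URL.
    static final char START = '\u0001';
    static final char END = '\u0002';

    // Replaces unescaped ^ and $ (outside character classes) with the
    // marker characters and wraps the result in ".*" on both sides, so a
    // whole-string match behaves like a Perl-style "find".
    static String translate(String perlStyle) {
        StringBuilder out = new StringBuilder(".*");
        boolean inClass = false;
        for (int i = 0; i < perlStyle.length(); i++) {
            char c = perlStyle.charAt(i);
            if (c == '\\' && i + 1 < perlStyle.length()) {
                out.append(c).append(perlStyle.charAt(++i)); // keep escaped pair
            } else if (inClass) {
                if (c == ']') inClass = false;
                out.append(c);
            } else if (c == '[') {
                inClass = true;
                out.append(c);
            } else if (c == '^') {
                out.append(START);
            } else if (c == '$') {
                out.append(END);
            } else {
                out.append(c);
            }
        }
        return out.append(".*").toString();
    }

    // Wraps the input in the markers and requires a whole-string match.
    static boolean find(String perlStyle, String input) {
        return Pattern.compile(translate(perlStyle))
                      .matcher(START + input + END)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(find("^http", "http://example.com/"));
        System.out.println(find("\\.gif$", "http://example.com/x.gif?y"));
    }
}
```

With an engine in which ^ and $ are ordinary characters, the translation step could presumably be dropped and the pattern used as written, which seems to be the point of using '^' and '$' themselves as the markers.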

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Andrzej Bialecki
Jack Tang wrote: Hi all, RegExp is widely used in nutch, and I am now wondering whether the jdk/jakarta classes are fast enough. Here are the benchmarks I found on the web. http://tusker.org/regex/regex_benchmark.html It seems dk.brics.automaton.RegExp is the fastest among the libs. It's not only faster,

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
It's not only faster, it also scales better for large and complex expressions, it is also possible to build automata from several expressions with AND/OR operators, which is the use case we have in regexp-urlfilter. It seems awesome! Does somebody plan to switch to this lib in nutch? Does

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Andrzej Bialecki
Jérôme Charron wrote: It's not only faster, it also scales better for large and complex expressions, it is also possible to build automata from several expressions with AND/OR operators, which is the use case we have in regexp-urlfilter. It seems awesome! I forgot to add: it is also

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
Thanks for volunteering, you're welcome ... ;-) Good job Andrzej! ;-) So, it's now on my todo list to check the perl5 compatibility issue and to provide some benchmarks to the community... Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Much faster RegExp lib needed in nutch?

2006-03-11 Thread Jack Tang
Hi all, RegExp is widely used in nutch, and I am now wondering whether the jdk/jakarta classes are fast enough. Here are the benchmarks I found on the web. http://tusker.org/regex/regex_benchmark.html It seems dk.brics.automaton.RegExp is the fastest among the libs. /Jack -- Keep Discovering ... ...