Hi Kirby,
On Thu, May 23, 2013 at 6:36 PM, Kirby Bohling <[email protected]> wrote:

> Not that I think you need them in particular, but it seems like Nutch could
> be doing plenty of benchmarking, and micro benchmarking in particular.

I agree with this. It is not my goal to attack this head on, but I think it is useful for us to know more about the different components of Nutch and how they operate, and micro benchmarking would certainly be a realistic way of getting there.

That said, I am quite keen on the idea of third-party libraries (such as dk.brics automaton [0]) being tested in their own environment, by their own development team. In this case, some comparative *results* (for an older dk.brics library) can be seen here [1]. Anyone is free to infer from these what they wish, but they give a rough idea of the gains that can be achieved.

If you think regex is a bottleneck for your Nutch deployment (and by "you" I mean anyone on the list), try out the automaton plugin and hopefully things get better for you. AFAIK we use the most up-to-date library available here, so things should work well.

Thanks for the post, Kirby.

[0] http://www.brics.dk/automaton/index.html
[1] http://tusker.org/regex/regex_benchmark.html

