Re-reading my e-mail, I realize it might be read poorly. Thanks for giving me the benefit of the doubt.
There's a bunch of good material out on the web, and ultimately, the truth is that micro benchmarks can always be misleading, and the only accurate benchmark is real workload testing. That said, micro benchmarks can be accurate and useful in limited contexts.

There are several good resources if you want to do micro benchmarking w/ Java:

https://code.google.com/p/caliper/
(Full Disclosure: Written by my employer. I've never used it, but the theory/docs are sound)

Peter Lawrey has a good blog that touches on issues of performance and has a couple of posts explicitly on mistakes he's made in extremely high performance benchmarking:
http://mechanical-sympathy.blogspot.com/

Pretty decent explanations here:
http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java

Not that I think you need them in particular, but it seems like Nutch could be doing plenty of benchmarking, and micro benchmarking in particular. Knowing the pitfalls is valuable. Lots of very smart people screw this up regularly and make poorly founded decisions armed with faulty data. Not always sure I qualify as smart, but I've done it more than once.

Anyways, those references have plenty of gory details for folks who are interested in why things like this happen.

Kirby

On Thu, May 23, 2013 at 6:48 PM, Lewis John Mcgibbney <[email protected]> wrote:

> You know, this was my suspicion Kirby.
> Thanks for giving the heads up... automaton rocks.
> Lewis
>
>
> On Thu, May 23, 2013 at 5:06 PM, Kirby Bohling <[email protected]> wrote:
>
> > Standard micro-benchmark issues with Java, run the 50 last and it'll
> > run faster. JVM warmup, and JIT compilation, yadda, yadda, yadda.
> >
> >
> > On Thu, May 23, 2013 at 1:57 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >
> > > Hi All,
> > > A really nice aspect of the regex (urlfilter-automaton and
> > > urlfilter-regex) plugin implementations in Nutch is that there is
> > > a small but very useful RegexURLFilterBaseTest [0] which compares
> > > benchmarks for simple regex parsing.
> > > The results we get are as follows
> > >
> > > urls   automaton   regex
> > > 50     343ms       210ms
> > > 100    48ms        187ms
> > > 200    65ms        363ms
> > > 400    100ms       692ms
> > > 800    165ms       1385ms
> > >
> > > The problem I have here is understanding why the first (50) bench
> > > appears to be more expensive for both implementations?
> > > Additionally, why does this same bench cost much more for automaton?
> > >
> > > Anyone have a clue?
> > > Thanks
> > > Lewis
> > >
> > > [0]
> > > http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java?view=markup
> > >
> > > --
> > > *Lewis*
> > >
> >
>
> --
> *Lewis*
>
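
For anyone who wants to see the warmup effect from the thread above for themselves, here is a minimal sketch. It is not the Nutch test: it assumes a plain java.util.regex pattern as a stand-in for the real filters, and the class name WarmupSketch, the pattern, the synthetic URLs, and the warmup count of 2000 are all made up for illustration. It times the 50-URL batch cold, runs some untimed passes so the JIT has a chance to compile the hot loop, and then times every batch size again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class WarmupSketch {
    // Hypothetical stand-in for a URL filter rule; not taken from Nutch.
    static final Pattern RULE = Pattern.compile("^https?://[a-z0-9.]+/.*\\.html$");

    // Builds n synthetic URLs; every third one ends in .gif and will not match.
    static List<String> makeUrls(int n) {
        List<String> urls = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            String ext = (i % 3 == 0) ? ".gif" : ".html";
            urls.add("http://example.com/page" + i + ext);
        }
        return urls;
    }

    // The workload under test: count how many URLs the rule accepts.
    static int countMatches(List<String> urls) {
        int accepted = 0;
        for (String url : urls) {
            if (RULE.matcher(url).matches()) {
                accepted++;
            }
        }
        return accepted;
    }

    // Times a single pass over the URLs and returns elapsed nanoseconds.
    static long timeOnce(List<String> urls) {
        long start = System.nanoTime();
        int accepted = countMatches(urls);
        long elapsed = System.nanoTime() - start;
        // Consume the result so the loop cannot be discarded as dead code.
        if (accepted < 0) {
            throw new IllegalStateException("unreachable");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int[] sizes = {50, 100, 200, 400, 800};

        // Cold measurement: the first timed pass also pays for class loading
        // and interpreted execution.
        System.out.println("cold 50: " + timeOnce(makeUrls(50)) / 1000 + " us");

        // Untimed warmup passes (the count is an arbitrary assumption).
        List<String> warm = makeUrls(800);
        for (int i = 0; i < 2000; i++) {
            countMatches(warm);
        }

        // Warm measurements for every batch size, including 50 again.
        for (int n : sizes) {
            System.out.println("warm " + n + ": " + timeOnce(makeUrls(n)) / 1000 + " us");
        }
    }
}
```

The exact numbers will vary by machine and JVM, so treat the output as a rough illustration only. For anything you intend to base a decision on, a harness like the ones linked above is a better bet than a hand-rolled loop like this.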

