Hi Brad,

Could you run and measure the parser independently of the fetching? That would remove any possible side effects due to caching, network issues, etc.
All you need to do is remove the subdirectories parse_text, parse_data and crawl_parse, then run:

nutch parse

Thanks

Julien

PS: regarding parse-html being phased out: see Andrzej's JIRA from this morning

On 31 July 2010 22:43, brad <[email protected]> wrote:

> I have been experiencing some performance issues with Tika and general
> parsing (see Parsing Performance - related to Java concurrency issue)
>
> Ken pointed out that both the Tika and Nutch HtmlParser show up in my
> jstack list using the delivered configuration.
>
> Julien suggested checking parsing with only parse-tika (html) and then
> with parse-html.
>
> So here is what I did.
>
> Option 1) parse-tika
> parse-(rss|text|js|tika)
> parse-plugin.xml as delivered
> tika-mimetypes.xml as delivered
>
> Option 2) parse-html
> parse-(rss|text|html|js|tika)
> parse-plugin.xml turned ON <plugin id="parse-html" />
> tika-mimetypes.xml commented out <mime-type type="text/html">
>
> Using the same generated crawl, ran fetch with parse for each of the
> options for 2 hours.
> All other configurations and settings are identical
>
> Results:
> Parse-tika
> INFO mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
>
> Parse-html
> INFO mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s
>
> The results:
> Parse-html is 116% faster than parse-tika for html for the same period of
> time and same URLs
>
> The error rate was about the same: parse-html 3%, parse-tika 3.3%
> Most of the errors are read timeouts
>
> So is parse-html better? It appears to be faster. But is the data as
> good?
> Other considerations? Is parse-html really going to be phased out?
>
> Brad

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
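
For anyone wanting to double-check the figures quoted in Brad's message, here is a rough Python sketch that recomputes the relative speedup and the error rates from the two LocalJobRunner status lines. The status strings are copied from the message; the names STATUS_LINES and summarize are only illustrative and are not part of Nutch. Keep in mind these numbers come from fetch-with-parse runs, so they still include the network and caching effects mentioned above.

```python
import re

# Status lines copied from the quoted message (fetch with parse, 2 hours each).
STATUS_LINES = {
    "parse-tika": "200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s",
    "parse-html": "200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s",
}

# Pull out the page count, error count and pages/s figure.
PATTERN = re.compile(r"(\d+) pages, (\d+) errors, ([\d.]+) pages/s")


def summarize(line):
    """Return pages, errors, pages/s and error percentage for one status line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized status line: " + line)
    pages = int(match.group(1))
    errors = int(match.group(2))
    rate = float(match.group(3))
    return {
        "pages": pages,
        "errors": errors,
        "rate": rate,
        "error_pct": 100.0 * errors / pages,
    }


tika = summarize(STATUS_LINES["parse-tika"])
html = summarize(STATUS_LINES["parse-html"])

# "X% faster" means (new rate / old rate - 1) * 100.
speedup_pct = 100.0 * (html["rate"] / tika["rate"] - 1.0)

print("parse-html vs parse-tika: %.1f%% faster" % speedup_pct)
print("error rate parse-tika: %.2f%%" % tika["error_pct"])
print("error rate parse-html: %.2f%%" % html["error_pct"])
```

The same arithmetic could be applied to the timings of a standalone nutch parse run on each configuration, which would give a comparison free of fetch effects.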

