> I have been experiencing some performance issues with Tika and with
> parsing in general
> (see "Parsing Performance - related to Java concurrency issue").
>
> Ken pointed out that both the Tika and the Nutch HtmlParser show up in my
> jstack list when using the delivered configuration.
>
> Julien suggested checking parsing with only parse-tika (html) and then
> with parse-html.
>
> So here is what I did.
>
> Option 1) parse-tika
> parse-(rss|text|js|tika)
> parse-plugin.xml as delivered
> tika-mimetypes.xml as delivered
> Option 2) parse-html
> parse-(rss|text|html|js|tika)
> parse-plugin.xml turned ON <plugin id="parse-html" />
> tika-mimetypes.xml commented out <mime-type type="text/html">
>
> Using the same generated crawl, I ran fetch with parsing for 2 hours
> under each of the options.
> All other configurations and settings were identical.
>
> Results:
> Parse-tika
> INFO mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8
> pages/s, 12916 kb/s
>
> Parse-html
> INFO mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors,
> 60.1 pages/s, 27980 kb/s,
>
>
> In summary:
> Parse-html processed about 116% more HTML pages than parse-tika
> (60.1 vs. 27.8 pages/s) over the same period of time and the same URLs.
>
> The error rate was about the same: roughly 3% for parse-html and 3.3% for
> parse-tika. Most of the errors are read timeouts.
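
The percentages above can be rechecked from the two LocalJobRunner log lines. The following is only a minimal sketch: the counts are copied from the quoted log output, while the helper names and the definition of error rate as errors divided by pages are assumptions made for illustration, not something Nutch itself reports.

```python
# Figures copied from the two LocalJobRunner log lines quoted above.
runs = {
    "parse-tika": {"pages": 200370, "errors": 6756},
    "parse-html": {"pages": 433738, "errors": 13360},
}


def percent_more(fast, slow):
    """How many percent more work 'fast' completed than 'slow'."""
    return (fast / slow - 1.0) * 100.0


def error_rate_percent(run):
    """Assumed definition: errors as a percentage of pages reported."""
    return run["errors"] / run["pages"] * 100.0


# Both runs lasted the same 2 hours, so the page counts compare directly.
speedup = percent_more(runs["parse-html"]["pages"], runs["parse-tika"]["pages"])
print(f"parse-html handled {speedup:.1f}% more pages than parse-tika")

for name, run in runs.items():
    print(f"{name}: error rate {error_rate_percent(run):.2f}%")
```

Because both runs covered the same 2-hour window, comparing page counts is equivalent to comparing the reported pages/s figures.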
>
>
> So is parse-html better? It appears to be faster, but is the parsed data
> as good?
> Are there other considerations? Is parse-html really going to be phased out?
>
> Brad