> I have been experiencing some performance issues with Tika and with
> parsing in general
> (see "Parsing Performance - related to Java concurrency issue").
> 
> Ken pointed out that both the Tika and the Nutch HtmlParser show up in my
> jstack list when using the delivered configuration.
> 
> Julien suggested checking parsing with only parse-tika (html) and then
> with parse-html.
> 
> So here is what I did.
> 
> Option 1) parse-tika
>           parse-(rss|text|js|tika)
>           parse-plugin.xml as delivered
>           tika-mimetypes.xml as delivered
> 
> Option 2) parse-html
>           parse-(rss|text|html|js|tika)
>           parse-plugin.xml turned ON <plugin id="parse-html" />
>           tika-mimetypes.xml commented out <mime-type type="text/html">
> 
> Using the same generated crawl, I ran fetch with parsing enabled for
> 2 hours under each option.
> All other configuration and settings were identical.
> 
> Results:
> Parse-tika
> INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8
> pages/s, 12916 kb/s
> 
> Parse-html
> INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors,
> 60.1 pages/s, 27980 kb/s, 
> 
> 
> Summary:
> Parse-html is 116% faster than parse-tika for html (60.1 vs. 27.8
> pages/s) over the same period of time and the same URLs.
> 
> The error rate was about the same: parse-html 3.1%, parse-tika 3.4%.
> Most of the errors were read timeouts.
> 
> 
> So is parse-html better?  It appears to be faster, but is the data as
> good?
> Are there other considerations?  Is parse-html really going to be phased out?
> 
> Brad
> 
> 
> 
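
The percentages quoted above can be rechecked directly from the two LocalJobRunner status lines. The following is only a minimal sketch: the regular expression and the helper name `summarize` are illustrative and not part of Nutch, and it assumes the error rate is simply errors divided by pages, which appears to be how the figures in the message were derived.

```python
import re

# Status lines copied from the two runs in the quoted message.
RUNS = {
    "parse-tika": "200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s",
    "parse-html": "200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s",
}

# Illustrative pattern for the fields we need (not a Nutch-provided format spec).
STATUS = re.compile(r"(\d+) pages, (\d+) errors, ([\d.]+) pages/s")


def summarize(line):
    """Return (pages, errors, pages_per_sec, error_rate_percent) for one status line."""
    match = STATUS.search(line)
    if match is None:
        raise ValueError("unrecognized status line: " + line)
    pages = int(match.group(1))
    errors = int(match.group(2))
    pages_per_sec = float(match.group(3))
    # Assumption: error rate = errors / pages, as in the original message.
    return pages, errors, pages_per_sec, 100.0 * errors / pages


stats = {name: summarize(line) for name, line in RUNS.items()}

# Relative gain in pages handled during the same 2-hour window.
speedup_pct = 100.0 * (stats["parse-html"][0] / stats["parse-tika"][0] - 1.0)

for name, (pages, errors, pages_per_sec, err_pct) in stats.items():
    print(f"{name}: {pages} pages, {pages_per_sec} pages/s, {err_pct:.2f}% errors")
print(f"parse-html handled {speedup_pct:.0f}% more pages in the same time")
```

Computing the gain from the page counts rather than from the rounded pages/s values avoids compounding the rounding in the log output; both routes land at roughly 116% here.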
