Hi Brad,

Thanks for sharing this. It would be interesting to profile the parsing and
have a better idea of what makes such a difference. Could it be the detection
of the encoding for instance?
Jul

On 18 August 2010 17:48, brad <[email protected]> wrote:

> I finally had a chance to test the Nutch html parsing this without fetching
> per Julien suggestion. The results were pretty much the same as my
> previous tests:
>
>                     parse-html    Tika-html
> Elapsed Time:       04:21:47      08:55:57
> Parse (Success):    150,634       150,615
> Parse (failed):     3,788         3,807
>
> So, based on this test, parse-html is a little more than twice as fast as
> tika's html parsing.
>
> This was done on Linux Centos 5.5, 8gb ram, Intel Xeon CPU X3220 @ 2.40GHz
> Only Nutch related processes were running on the server
> Nutch 1.2 - which now has the nice timings feature!
>
> The data was retrieved using:
> bin/nutch fetch <segment> -noParsing -threads 200
>
> All data was parsed using:
> bin/nutch parse <segment> -threads 200
>
> Brad
>
>
> -----Original Message-----
> From: Ken Krugler [mailto:[email protected]]
> Sent: Wednesday, August 11, 2010 2:20 PM
> To: [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> On Aug 2, 2010, at 9:26am, brad wrote:
>
> > Hi Julien,
> > I'll see if I can give a try later this week.
>
> [snip]
>
> Were you able to try the parse-only approach that Julien suggested below?
>
> I'm asking because (a) I do a fair amount of work with/on the Tika HTML
> parsing support, and (b) I've also run into surprisingly slow parse
> performance with Tika, though I didn't compare to Nutch's older parser (or
> using NekoHTML instead of TagSoup).
>
> Thanks,
>
> -- Ken
>
>
> > -----Original Message-----
> > From: Julien Nioche [mailto:[email protected]]
> > Sent: Monday, August 02, 2010 5:11 AM
> > To: [email protected]
> > Subject: Re: For HTML - is parse-html twice as fast as parse-tika
> >
> > Hi Brad,
> >
> > Could you run and measure the parser independently of the fetching?
> > That would remove any possible side effect due to caching, network
> > issues etc...
> >
> > All you need to do is remove the subdirectories parse_text, parse_data
> > and crawl_parse then run : nutch parse
> >
> > Thanks
> >
> > Julien
> >
> > PS: regarding parse-html being phased out : see Andrzej's JIRA from
> > this morning
> >
> >
> > On 31 July 2010 22:43, brad <[email protected]> wrote:
> >
> >>> I have been experiencing some performance issues with Tika and
> >>> general parsing (see Parsing Performance - related to Java
> >>> concurrency issue)
> >>>
> >>> Ken pointed out that both the both Tika and Nutch HtmlParser show up
> >>> in my jstack list using the delivered configuration.
> >>>
> >>> Julien suggested checking parsing with only parse-tika (html) and
> >>> then with parse-html.
> >>>
> >>> So here is what I did.
> >>>
> >>> Option 1) parse-tika
> >>>   parse-(rss|text|js|tika)
> >>>   parse-plugin.xml as delivered
> >>>   tika-mimetypes.xml as delivered
> >>>
> >>> Option 2) parse-html
> >>>   parse-(rss|text|html|js|tika)
> >>>   parse-plugin.xml turned ON <plugin id="parse-html" />
> >>>   tika-mimetypes.xml commented out <mime-type type="text/html">
> >>>
> >>> Using the same generated crawl, ran fetch with parse for each of the
> >>> options for 2 hours.
> >>> All other configurations and settings are identical
> >>>
> >>> Results:
> >>> Parse-tika
> >>> INFO mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
> >>>
> >>> Parse-html
> >>> INFO mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
> >>>
> >>>
> >>> The results:
> >>> Parse-html is 116% faster than parse-tika for html for the same
> >>> period of time and same URLs
> >>>
> >>> The error rate was about the same parse-html 3%, parse-tika 3.3%
> >>> Most of the errors are read timeouts
> >>>
> >>>
> >>> So is parse-html better? It appears to be faster. But, is the data
> >>> as good?
> >>> Other considerations? Is parse-html really going to be phased out?
> >>>
> >>> Brad
> >>>
> >
> >
> > --
> > DigitalPebble Ltd
> >
> > Open Source Solutions for Text Engineering
> > http://www.digitalpebble.com
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>

--
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
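
For anyone wanting to re-check the figures quoted in this thread, here is a minimal sketch of the arithmetic in Python. It uses only the numbers Brad posted. The helper name `to_seconds` and the variable names are illustrative and do not come from Nutch or Tika.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS elapsed-time string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Parse-only run (message of 18 August 2010), figures as quoted.
html_secs = to_seconds("04:21:47")
tika_secs = to_seconds("08:55:57")
html_ok, html_failed = 150_634, 3_788
tika_ok, tika_failed = 150_615, 3_807

# How many times longer the Tika run took than the parse-html run.
slowdown = tika_secs / html_secs

# Successful parses per second for each parser.
html_rate = html_ok / html_secs
tika_rate = tika_ok / tika_secs

# Share of documents that failed to parse, in percent.
html_fail_pct = 100 * html_failed / (html_ok + html_failed)
tika_fail_pct = 100 * tika_failed / (tika_ok + tika_failed)

# Fetch-with-parse run (message of 31 July 2010): pages handled in 2 hours.
fetch_gain_pct = 100 * (433_738 / 200_370 - 1)

print(f"Tika elapsed / parse-html elapsed: {slowdown:.2f}x")
print(f"parse-html: {html_rate:.2f} docs/s, {html_fail_pct:.2f}% failed")
print(f"Tika-html:  {tika_rate:.2f} docs/s, {tika_fail_pct:.2f}% failed")
print(f"parse-html handled {fetch_gain_pct:.0f}% more pages in the fetch run")
```

Both parse-only runs cover the same number of documents (successes plus failures), so the elapsed-time ratio is directly comparable between the two parsers.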

