Hi Brad,

Thanks for sharing this. It would be interesting to profile the parsing to
get a better idea of what makes such a difference. Could it be the encoding
detection, for instance?
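
For reference, here is a quick back-of-the-envelope check of the figures
quoted below. It is only a sketch in Python: the function and variable names
are mine, and the numbers are copied as-is from Brad's parse-only test.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# Figures copied from Brad's parse-only test (quoted below).
runs = {
    "parse-html": {"elapsed": "04:21:47", "ok": 150634, "failed": 3788},
    "tika-html": {"elapsed": "08:55:57", "ok": 150615, "failed": 3807},
}

for name, run in runs.items():
    seconds = to_seconds(run["elapsed"])
    docs = run["ok"] + run["failed"]  # every document the parser attempted
    print(f"{name}: {docs} docs in {seconds} s = {docs / seconds:.2f} docs/s")

# How many times longer the Tika run took than the parse-html run.
ratio = to_seconds(runs["tika-html"]["elapsed"]) / to_seconds(runs["parse-html"]["elapsed"])
print(f"tika-html took {ratio:.2f}x as long as parse-html")
```

Both runs attempted the same number of documents (154,422), and the elapsed
times give a ratio of about 2.05, which matches "a little more than twice as
fast".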

Jul


On 18 August 2010 17:48, brad <[email protected]> wrote:

> I finally had a chance to test the Nutch HTML parsing without fetching, per
> Julien's suggestion.  The results were pretty much the same as in my
> previous tests:
>
>                     parse-html    Tika-html
> Elapsed Time:       04:21:47      08:55:57
> Parse (Success):    150,634       150,615
> Parse (failed):     3,788         3,807
>
> So, based on this test, parse-html is a little more than twice as fast as
> tika's html parsing.
>
> This was done on CentOS 5.5 Linux, 8 GB RAM, Intel Xeon CPU X3220 @ 2.40GHz.
> Only Nutch-related processes were running on the server.
> Nutch 1.2, which now has the nice timings feature!
>
> The data was retrieved using:
> bin/nutch fetch <segment> -noParsing -threads 200
>
> All data was parsed using:
> bin/nutch parse <segment> -threads 200
>
> Brad
>
>
> -----Original Message-----
> From: Ken Krugler [mailto:[email protected]]
> Sent: Wednesday, August 11, 2010 2:20 PM
> To: [email protected]
> Subject: Re: For HTML - is parse-html twice as fast as parse-tika
>
> Hi Brad,
>
> On Aug 2, 2010, at 9:26am, brad wrote:
>
> > Hi Julien,
> > I'll see if I can give a try later this week.
>
> [snip]
>
> Were you able to try the parse-only approach that Julien suggested below?
>
> I'm asking because (a) I do a fair amount of work with/on the Tika HTML
> parsing support, and (b) I've also run into surprisingly slow parse
> performance with Tika, though I didn't compare to Nutch's older parser (or
> using NekoHTML instead of TagSoup).
>
> Thanks,
>
> -- Ken
>
>
> > -----Original Message-----
> > From: Julien Nioche [mailto:[email protected]]
> > Sent: Monday, August 02, 2010 5:11 AM
> > To: [email protected]
> > Subject: Re: For HTML - is parse-html twice as fast as parse-tika
> >
> > Hi Brad,
> >
> > Could you run and measure the parser independently of the fetching? That
> > would remove any possible side effect due to caching, network issues,
> > etc.
> >
> > All you need to do is remove the subdirectories parse_text, parse_data
> > and crawl_parse, then run: nutch parse
> >
> > Thanks
> >
> > Julien
> >
> > PS: regarding parse-html being phased out : see Andrzej's JIRA from
> > this morning
> >
> >
> > On 31 July 2010 22:43, brad <[email protected]> wrote:
> >
> >>> I have been experiencing some performance issues with Tika and
> >>> general parsing (see Parsing Performance - related to Java
> >>> concurrency issue)
> >>>
> >>> Ken pointed out that both the Tika and Nutch HtmlParser show up in my
> >>> jstack list using the delivered configuration.
> >>>
> >>> Julien suggested checking parsing with only parse-tika (html) and
> >>> then with parse-html.
> >>>
> >>> So here is what I did.
> >>>
> >>> Option 1) parse-tika
> >>>          parse-(rss|text|js|tika)
> >>>          parse-plugin.xml as delivered
> >>>          tika-mimetypes.xml as delivered
> >>>
> >>> Option 2) parse-html
> >>>          parse-(rss|text|html|js|tika)
> >>>          parse-plugin.xml turned ON <plugin id="parse-html" />
> >>>          tika-mimetypes.xml commented out <mime-type type="text/html">
> >>>
> >>> Using the same generated crawl, ran fetch with parse for each of the
> >>> options for 2 hours.
> >>> All other configurations and settings are identical
> >>>
> >>> Results:
> >>> Parse-tika
> >>> INFO  mapred.LocalJobRunner - 200 threads, 200370 pages, 6756 errors, 27.8 pages/s, 12916 kb/s
> >>>
> >>> Parse-html
> >>> INFO  mapred.LocalJobRunner - 200 threads, 433738 pages, 13360 errors, 60.1 pages/s, 27980 kb/s,
> >>>
> >>>
> >>> The results:
> >>> Parse-html is 116% faster than parse-tika for html for the same
> >>> period of time and same URLs
> >>>
> >>> The error rate was about the same: parse-html 3%, parse-tika 3.3%.
> >>> Most of the errors are read timeouts.
> >>>
> >>>
> >>> So is parse-html better?  It appears to be faster.  But, is the data
> >>> as good?
> >>> Other considerations?  Is parse-html really going to be phased out?
> >>>
> >>> Brad
> >>>
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > DigitalPebble Ltd
> >
> > Open Source Solutions for Text Engineering
> > http://www.digitalpebble.com
> >
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
