Re: Parse benchmark/performance

feng lu Sun, 10 Mar 2013 08:28:18 -0700

Hi

i run nutch 1.6 in my laptop with 4-core cpu and 6Gb memory. Total fetch
1500 pages, each size of each page is average 20kb. the total parse time is
28s. So the average parse time is 53 pages/s.


Can you see the parsed log information like this.

Parsed (1ms):http://www.openspf.org/
Parsed (0ms):http://www.osgi.org/Download/Release4V42
Parsed (0ms):
http://www.packtpub.com/cassandra-apache-high-performance-cookbook/book
Parsed (0ms):http://www.rackspace.com/
Parsed (0ms):http://www.rosenlaw.com/oslbook.htm
Parsed (0ms):http://www.softwarefreedom.org/
Parsed (0ms):http://www.us.apache.org/dist/
Parsed (1ms):http://www.w3.org/2001/sw/
Parsed (0ms):http://www.w3.org/RDF/

Do you see the parsing time of page larger than 1s?



On Sun, Mar 10, 2013 at 6:31 PM, Roland von Herget <
[email protected]> wrote:

> Hi Ye,
>
> just a small note: i'm running 2.1 with combined fetcher/parser job in
> local mode. I fetch about 20 pages/s and parsing was never a bottelneck.
> So, i didn't know anything about 1.x, but this seems to be pretty slow.
>
> --Roland
> Am 10.03.2013 09:53 schrieb "Ye T Thet" <[email protected]>:
>
> > Thanks Feng Lu,
> >
> > I am running in local mode. So no leverage on map reduce model actually.
> My
> > assumption is that running a single box with the 4x computing power is
> more
> > efficient than a cluster with 4 box with 1x computing power. Counting 1
> > map/reduce task has its own over head. I guess my assumption is wrong
> then?
> >
> > My question is leaning a bit towards map reduce fundamental now, I guess,
> > is it true to say what I can gain performance (speed) by splitting tasks
> to
> > multiple Map Reduce while using the same computer power let's say 4x.
> >
> > Example: 8 map 8 reduce on 4x computing power is more efficient with 1
> map
> > 1 reduce with 4x computing power?
> >
> > My guess is that 48hr to parse 100k urls does not sound efficient.
> > Unfortunately 100k is just the beginning for me. :( I am looking at 10
> > Millions per fetch cycle. I am looking for ideas and pointer on how to
> gain
> > speed. May be using/tweaking Map Reduce would the the answer?
> >
> > If you have done similar cases, what is the ideal Map Reduce setting per
> > slave? I can post more details if it would help.
> >
> > Any input would be greatly appreciated.
> >
> > Cheers,
> >
> > Ye
> >
> > On Sun, Mar 10, 2013 at 11:17 AM, feng lu <[email protected]> wrote:
> >
> > > Hi Ye
> > >
> > > Do you run nutch in local mode?
> > >
> > > 48hr to parse 100k urls, maybe each url spend 1.7s/page. i think the
> > mainly
> > > time spend on ParseSegment include parse html and write parse data to
> > DFS,
> > > that include text to parse_text, data to parse_data and links to
> > > crawl_parse.
> > >
> > > Maybe you can run the nutch in cluster using deploy mode. It will make
> > full
> > > use of the MR distributed computing capabilities
> > >
> > >
> > >
> > >
> > > On Sat, Mar 9, 2013 at 11:09 AM, Ye T Thet <[email protected]>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > It is *NOT* about the fetch. It is about the parse.
> > > >
> > > > I am using Nutch 1.2. 500 hosts. The fetcher is doing ok with default
> > > > setting 1 thread per host and 100 concurrent thread.
> > > >
> > > > It bottle neck is at the parse.
> > > >
> > > > Regards,
> > > >
> > > > On Sat, Mar 9, 2013 at 12:27 AM, kiran chitturi
> > > > <[email protected]>wrote:
> > > >
> > > > > Hi!
> > > > >
> > > > > Which version of Nutch are you using ?
> > > > >
> > > > > Can you please give some more details on your configuration like
> > > > >
> > > > > i) How many threads you are using per queue ?
> > > > > ii) Are the URLs you are crawling belong to a single host or
> > different
> > > > > hosts ?
> > > > >
> > > > > There are lot of things that come in to affect AFAIK while trying
> to
> > > find
> > > > > the optimal/decent performance.
> > > > >
> > > > > If you are using one thread per queue and crawling on a single
> host,
> > it
> > > > is
> > > > > gonna take quite some time because of the politeness policy and
> there
> > > is
> > > > a
> > > > > five second gap.
> > > > >
> > > > > The configurations can be changed and there is discussion on the
> > > mailing
> > > > > list.
> > > > >
> > > > > I had a similar problem where fetcher is taking very long with a
> big
> > > > topN,
> > > > > and i have attained stability by changing the configuration. Please
> > > find
> > > > > the discussion at [1].
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
> > > > >
> > > > > Hope this Helps.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Mar 8, 2013 at 11:12 AM, Ye T Thet <[email protected]
> >
> > > > wrote:
> > > > >
> > > > > > Good day folks,
> > > > > >
> > > > > > I have a question about the benchmarking on ParseSegment of the
> > > Nutch.
> > > > > >
> > > > > > I separated the fetch and parse process. Thus the URLs are
> fetched
> > > then
> > > > > the
> > > > > > fetched contents for the current depth is parsed. I believe that
> is
> > > the
> > > > > > recommended approach to gain performance as well.
> > > > > >
> > > > > > I am running on Amazon EC2, Medium Instance, Ubuntu image, spec
> of
> > > 3.7
> > > > > GiB
> > > > > > and 2 core. I am only parsing the HTML contents (using HTML
> > parsers).
> > > > > > Fetched URL is around *100k* and the content size is *3 GiB*. I
> > have
> > > > been
> > > > > > experimenting to get the benchmark on the crawlers (fetch, parse
> > > > > circle). I
> > > > > > have crawled few data sets. According to my observation, to parse
> > the
> > > > > HTML
> > > > > > documents of 100k URL estimate of 3 GiB took me around 32hrs to
> 48
> > > hrs
> > > > > via
> > > > > > ParseSegment. Through out the parsing process CPU and Memory
> > > > utilization
> > > > > is
> > > > > > almost 100%.
> > > > > >
> > > > > > Has anyone got performance benchmark on the ParseSegment? Does 48
> > hrs
> > > > to
> > > > > > parse 3Gib of contents from 100k URLs sounds reasonable? If it is
> > too
> > > > > off,
> > > > > > is there any direction I should explore to gain the performance?
> > Like
> > > > > > tweaking ParseSegment flow?
> > > > > >
> > > > > > Apology for too many questions. Thanks in advance.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Ye
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Kiran Chitturi
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Don't Grow Old, Grow Up... :-)
> > >
> >
>



-- 
Don't Grow Old, Grow Up... :-)

Re: Parse benchmark/performance

Reply via email to