Thanks, Feng Lu,

I am running in local mode, so I am not actually leveraging the MapReduce
model. My assumption was that running a single box with 4x computing power
is more efficient than a cluster of four boxes with 1x computing power
each, given that each map/reduce task has its own overhead. I guess my
assumption is wrong then?

My question is leaning a bit towards MapReduce fundamentals now, I guess:
is it true to say that I can gain performance (speed) by splitting work
across multiple map/reduce tasks while using the same computing power,
let's say 4x?

Example: are 8 maps and 8 reduces on 4x computing power more efficient
than 1 map and 1 reduce on the same 4x computing power?

My guess is that 48 hrs to parse 100k URLs does not sound efficient.
Unfortunately, 100k is just the beginning for me. :( I am looking at 10
million per fetch cycle. I am looking for ideas and pointers on how to
gain speed. Maybe using/tweaking MapReduce would be the answer?
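
For context, here is a minimal sketch of the kind of change I have in
mind: switching the same box from local mode (where the LocalJobRunner
runs everything in a single JVM) to pseudo-distributed mode, so both
cores can run map tasks in parallel. This assumes Hadoop 1.x property
names from the Nutch 1.2 era; the values are illustrative, not tuned
recommendations:

  <!-- mapred-site.xml: pseudo-distributed mode on a single box -->
  <configuration>
    <property>
      <!-- Anything other than "local" enables the JobTracker/TaskTracker path -->
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
    <property>
      <!-- One map slot per core on a 2-core EC2 Medium instance -->
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <!-- Keep per-task heap within the instance's 3.7 GiB -->
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>
  </configuration>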

If you have handled similar cases, what is the ideal MapReduce setting
per slave? I can post more details if that would help.
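
For reference, here is roughly the cycle I am running, sketched with
deploy-mode commands. The command names are Nutch 1.x; the segment path
and -topN value are hypothetical placeholders, not my actual numbers:

  # Generate a fetch list, fetch without parsing, then parse separately
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
  SEGMENT=crawl/segments/20130310120000   # hypothetical segment directory
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT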

Any input would be greatly appreciated.

Cheers,

Ye

On Sun, Mar 10, 2013 at 11:17 AM, feng lu <[email protected]> wrote:

> Hi Ye
>
> Do you run Nutch in local mode?
>
> 48 hrs to parse 100k URLs works out to roughly 1.7 s/page. I think most
> of the time in ParseSegment is spent parsing the HTML and writing the
> parse data to the DFS: text to parse_text, data to parse_data, and
> links to crawl_parse.
>
> Maybe you can run Nutch on a cluster using deploy mode. It will make
> full use of MapReduce's distributed computing capabilities.
>
> On Sat, Mar 9, 2013 at 11:09 AM, Ye T Thet <[email protected]> wrote:
>
> > Hi,
> >
> > It is *NOT* about the fetch. It is about the parse.
> >
> > I am using Nutch 1.2 with 500 hosts. The fetcher is doing OK with the
> > default settings: 1 thread per host and 100 concurrent threads.
> >
> > The bottleneck is at the parse.
> >
> > Regards,
> >
> > On Sat, Mar 9, 2013 at 12:27 AM, kiran chitturi <[email protected]> wrote:
> >
> > > Hi!
> > >
> > > Which version of Nutch are you using ?
> > >
> > > Can you please give some more details on your configuration, like:
> > >
> > > i) How many threads are you using per queue?
> > > ii) Do the URLs you are crawling belong to a single host or to
> > > different hosts?
> > >
> > > There are a lot of things that come into play, AFAIK, when trying
> > > to find optimal/decent performance.
> > >
> > > If you are using one thread per queue and crawling a single host,
> > > it is going to take quite some time because of the politeness
> > > policy and its five-second gap between requests.
> > >
> > > The configurations can be changed, and there is discussion on the
> > > mailing list.
> > >
> > > I had a similar problem where the fetcher was taking very long with
> > > a big topN, and I attained stability by changing the configuration.
> > > Please find the discussion at [1].
> > >
> > >
> > > [1] http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
> > >
> > > Hope this helps.
> > >
> > >
> > > On Fri, Mar 8, 2013 at 11:12 AM, Ye T Thet <[email protected]> wrote:
> > >
> > > > Good day folks,
> > > >
> > > > I have a question about benchmarking the ParseSegment step of
> > > > Nutch.
> > > >
> > > > I separated the fetch and parse processes, so the URLs are
> > > > fetched first and then the fetched contents for the current depth
> > > > are parsed. I believe that is also the recommended approach to
> > > > gain performance.
> > > >
> > > > I am running on an Amazon EC2 Medium Instance (Ubuntu image,
> > > > 3.7 GiB of memory, 2 cores). I am only parsing HTML contents
> > > > (using HTML parsers). The fetched URL count is around *100k* and
> > > > the content size is *3 GiB*. I have been experimenting to
> > > > benchmark the crawler (fetch/parse cycle) and have crawled a few
> > > > data sets. According to my observations, parsing the HTML
> > > > documents of 100k URLs, roughly 3 GiB, took around 32 to 48 hrs
> > > > via ParseSegment. Throughout the parsing process, CPU and memory
> > > > utilization was almost 100%.
> > > >
> > > > Has anyone got a performance benchmark for ParseSegment? Does
> > > > 48 hrs to parse 3 GiB of contents from 100k URLs sound
> > > > reasonable? If it is too far off, is there any direction I should
> > > > explore to gain performance, like tweaking the ParseSegment flow?
> > > >
> > > > Apologies for the many questions. Thanks in advance.
> > > >
> > > > Cheers,
> > > >
> > > > Ye
> > > >
> > >
> > >
> > >
> > > --
> > > Kiran Chitturi
> > >
> >
>
> --
> Don't Grow Old, Grow Up... :-)
>
