Hi Ye,

Just a small note: I'm running Nutch 2.1 with the combined fetcher/parser
job in local mode. I fetch about 20 pages/s, and parsing has never been a
bottleneck. I don't know much about 1.x, but 48 hours for 100k pages seems
pretty slow.
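
In case it is useful, "combined" here just means parsing inside the
fetcher; a sketch of the relevant property, assuming it lives in
conf/nutch-site.xml:

  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>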

--Roland
On 10.03.2013 09:53, "Ye T Thet" <[email protected]> wrote:

> Thanks Feng Lu,
>
> I am running in local mode, so I actually get no leverage from the
> MapReduce model. My assumption was that a single box with 4x computing
> power is more efficient than a cluster of four boxes with 1x computing
> power each, since each map/reduce task has its own overhead. I guess my
> assumption is wrong then?
>
> My question is leaning a bit towards MapReduce fundamentals now, I guess:
> is it true that I can gain performance (speed) by splitting the work
> across multiple map/reduce tasks while using the same computing power,
> say 4x?
>
> For example: are 8 map and 8 reduce tasks on 4x computing power more
> efficient than 1 map and 1 reduce task on the same 4x computing power?
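>
> (A sketch of the Hadoop 1.x knobs I mean, assuming a pseudo-distributed
> setup; these would go in conf/mapred-site.xml:)
>
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>8</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>8</value>
>   </property>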
>
> My guess is that 48 hours to parse 100k URLs does not sound efficient.
> Unfortunately, 100k is just the beginning for me. :( I am looking at 10
> million per fetch cycle, so I am looking for ideas and pointers on how to
> gain speed. Maybe using/tweaking MapReduce would be the answer?
>
> If you have handled similar cases, what is the ideal MapReduce setting
> per slave? I can post more details if that would help.
>
> Any input would be greatly appreciated.
>
> Cheers,
>
> Ye
>
> On Sun, Mar 10, 2013 at 11:17 AM, feng lu <[email protected]> wrote:
>
> > Hi Ye
> >
> > Do you run Nutch in local mode?
> >
> > 48 hours to parse 100k URLs works out to roughly 1.7 s/page
> > (172,800 s / 100,000 pages). I think the time in ParseSegment is mainly
> > spent parsing the HTML and writing the parse data to the DFS: text to
> > parse_text, data to parse_data, and links to crawl_parse.
> >
> > Maybe you can run Nutch on a cluster using deploy mode. It will make
> > full use of MapReduce's distributed computing capabilities.
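> >
> > For example, a rough sketch (assuming the nutch .job file has been
> > built, e.g. with "ant job", and bin/nutch runs against a configured
> > Hadoop cluster; <segment_dir> is a placeholder for the segment path
> > on HDFS):
> >
> >   bin/nutch parse <segment_dir>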
> >
> >
> > On Sat, Mar 9, 2013 at 11:09 AM, Ye T Thet <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > It is *NOT* about the fetch. It is about the parse.
> > >
> > > I am using Nutch 1.2 with 500 hosts. The fetcher is doing OK with the
> > > default setting of 1 thread per host and 100 concurrent threads.
> > >
> > > The bottleneck is at the parse.
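> > >
> > > For reference, my fetcher settings as they would appear in
> > > conf/nutch-site.xml (a sketch; 1 thread per host is already the
> > > default):
> > >
> > >   <property>
> > >     <name>fetcher.threads.fetch</name>
> > >     <value>100</value>
> > >   </property>
> > >   <property>
> > >     <name>fetcher.threads.per.host</name>
> > >     <value>1</value>
> > >   </property>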
> > >
> > > Regards,
> > >
> > > On Sat, Mar 9, 2013 at 12:27 AM, kiran chitturi <[email protected]> wrote:
> > >
> > > > Hi!
> > > >
> > > > Which version of Nutch are you using?
> > > >
> > > > Can you please give some more details on your configuration, like:
> > > >
> > > > i) How many threads are you using per queue?
> > > > ii) Do the URLs you are crawling belong to a single host or to
> > > > different hosts?
> > > >
> > > > There are a lot of factors that come into play, AFAIK, when trying
> > > > to find optimal/decent performance.
> > > >
> > > > If you are using one thread per queue and crawling a single host, it
> > > > is going to take quite some time because of the politeness policy:
> > > > there is a five-second gap between requests to the same host.
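> > > >
> > > > (That gap is fetcher.server.delay, 5.0 seconds by default. If the
> > > > sites allow it, you could lower it in conf/nutch-site.xml; a sketch:)
> > > >
> > > >   <property>
> > > >     <name>fetcher.server.delay</name>
> > > >     <value>1.0</value>
> > > >   </property>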
> > > >
> > > > These configurations can be changed; there has been discussion on
> > > > the mailing list.
> > > >
> > > > I had a similar problem where the fetcher was taking very long with
> > > > a big topN, and I attained stability by changing the configuration.
> > > > Please find the discussion at [1].
> > > >
> > > >
> > > > [1] http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
> > > >
> > > > Hope this helps.
> > > >
> > > >
> > > > On Fri, Mar 8, 2013 at 11:12 AM, Ye T Thet <[email protected]> wrote:
> > > >
> > > > > Good day folks,
> > > > >
> > > > > I have a question about benchmarking Nutch's ParseSegment.
> > > > >
> > > > > I separated the fetch and parse steps: the URLs are fetched first,
> > > > > and then the fetched content for the current depth is parsed, as
> > > > > in the sketch below. I believe that is also the recommended
> > > > > approach for performance.
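> > > > >
> > > > > (A sketch of the sequence; <segment> stands for the newest segment
> > > > > directory, and the topN value is just what I use for this crawl:)
> > > > >
> > > > >   bin/nutch generate crawl/crawldb crawl/segments -topN 100000
> > > > >   bin/nutch fetch <segment> -noParsing
> > > > >   bin/nutch parse <segment>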
> > > > >
> > > > > I am running on an Amazon EC2 medium instance (Ubuntu image, 3.7
> > > > > GiB of memory, 2 cores). I am only parsing HTML content (using the
> > > > > HTML parsers). The fetched URLs number around *100k* and the
> > > > > content size is *3 GiB*. I have been experimenting to benchmark
> > > > > the crawler (fetch/parse cycle) and have crawled a few data sets.
> > > > > In my observation, parsing the HTML documents of 100k URLs (about
> > > > > 3 GiB) took around 32 to 48 hours via ParseSegment. Throughout the
> > > > > parsing process, CPU and memory utilization are almost 100%.
> > > > >
> > > > > Has anyone got performance benchmarks for ParseSegment? Does 48
> > > > > hours to parse 3 GiB of content from 100k URLs sound reasonable?
> > > > > If it is too far off, is there any direction I should explore to
> > > > > gain performance, like tweaking the ParseSegment flow?
> > > > >
> > > > > Apologies for the many questions. Thanks in advance.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Ye
> > > > >
> > > >
> > > > --
> > > > Kiran Chitturi
> > > >
> > >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
> >
>
