Hi Ye, just a small note: I'm running 2.1 with a combined fetcher/parser job in local mode. I fetch about 20 pages/s and parsing has never been a bottleneck. I don't know anything about 1.x, but this seems pretty slow.

--Roland
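For reference, the combined fetcher/parser job Roland describes is normally just a configuration switch. A minimal sketch of conf/nutch-site.xml, assuming the property names shipped in nutch-default.xml (defaults differ between versions, so verify against yours; the thread count below is illustrative):

    <!-- Let the fetcher parse pages as it fetches them,
         instead of running a separate ParseSegment job. -->
    <property>
      <name>fetcher.parse</name>
      <value>true</value>
    </property>

    <!-- Total number of concurrent fetcher threads; tune to
         your bandwidth and CPU. -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>50</value>
    </property>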
On 10.03.2013 09:53, "Ye T Thet" <[email protected]> wrote:
> Thanks Feng Lu,
>
> I am running in local mode, so there is actually no leverage from the
> MapReduce model. My assumption is that a single box with 4x computing
> power is more efficient than a cluster of four boxes with 1x computing
> power each, since each map/reduce task has its own overhead. I guess
> my assumption is wrong then?
>
> My question is leaning a bit towards MapReduce fundamentals now, I
> guess: is it true that I can gain performance (speed) by splitting
> work across multiple map/reduce tasks while using the same computing
> power, say 4x?
>
> Example: are 8 maps and 8 reduces on 4x computing power more efficient
> than 1 map and 1 reduce on 4x computing power?
>
> My guess is that 48 hours to parse 100k URLs does not sound efficient.
> Unfortunately 100k is just the beginning for me. :( I am looking at 10
> million per fetch cycle, so I am looking for ideas and pointers on how
> to gain speed. Maybe using/tweaking MapReduce would be the answer?
>
> If you have handled similar cases, what is the ideal MapReduce setting
> per slave? I can post more details if it would help.
>
> Any input would be greatly appreciated.
>
> Cheers,
>
> Ye
>
> On Sun, Mar 10, 2013 at 11:17 AM, feng lu <[email protected]> wrote:
>
> > Hi Ye,
> >
> > Do you run Nutch in local mode?
> >
> > 48 hours to parse 100k URLs works out to roughly 1.7 s/page. I think
> > the time is mainly spent in ParseSegment, which parses the HTML and
> > writes the parse data to the DFS: text to parse_text, data to
> > parse_data, and links to crawl_parse.
> >
> > Maybe you can run Nutch on a cluster in deploy mode. That will make
> > full use of MapReduce's distributed computing capabilities.
> >
> > On Sat, Mar 9, 2013 at 11:09 AM, Ye T Thet <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > It is *NOT* about the fetch. It is about the parse.
> > >
> > > I am using Nutch 1.2 with 500 hosts. The fetcher is doing OK with
> > > the default setting of 1 thread per host and 100 concurrent
> > > threads.
> > >
> > > The bottleneck is at the parse.
> > >
> > > Regards,
> > >
> > > On Sat, Mar 9, 2013 at 12:27 AM, kiran chitturi
> > > <[email protected]> wrote:
> > >
> > > > Hi!
> > > >
> > > > Which version of Nutch are you using?
> > > >
> > > > Can you please give some more details on your configuration,
> > > > like:
> > > >
> > > > i) How many threads are you using per queue?
> > > > ii) Do the URLs you are crawling belong to a single host or to
> > > > different hosts?
> > > >
> > > > AFAIK there are a lot of things that come into play when trying
> > > > to find optimal/decent performance.
> > > >
> > > > If you are using one thread per queue and crawling a single
> > > > host, it is going to take quite some time because of the
> > > > politeness policy: there is a five-second gap between requests.
> > > >
> > > > The configuration can be changed; there is discussion on the
> > > > mailing list.
> > > >
> > > > I had a similar problem where the fetcher was taking very long
> > > > with a big topN, and I attained stability by changing the
> > > > configuration. Please find the discussion at [1].
> > > >
> > > > [1]
> > > > http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
> > > >
> > > > Hope this helps.
> > > >
> > > > On Fri, Mar 8, 2013 at 11:12 AM, Ye T Thet
> > > > <[email protected]> wrote:
> > > >
> > > > > Good day folks,
> > > > >
> > > > > I have a question about benchmarking the ParseSegment step of
> > > > > Nutch.
> > > > >
> > > > > I have separated the fetch and parse processes, so the URLs
> > > > > are fetched and then the fetched content for the current depth
> > > > > is parsed. I believe that is also the recommended approach to
> > > > > gain performance.
> > > > >
> > > > > I am running on an Amazon EC2 Medium instance with an Ubuntu
> > > > > image, 3.7 GiB of memory and 2 cores. I am only parsing HTML
> > > > > content (using the HTML parsers). Around *100k* URLs are
> > > > > fetched, and the content size is *3 GiB*. I have been
> > > > > experimenting to benchmark the crawler (the fetch/parse cycle)
> > > > > and have crawled a few data sets. By my observation, parsing
> > > > > the HTML documents of 100k URLs (an estimated 3 GiB) took
> > > > > around 32 to 48 hours via ParseSegment. Throughout the parsing
> > > > > process, CPU and memory utilization is almost 100%.
> > > > >
> > > > > Has anyone got performance benchmarks for ParseSegment? Does
> > > > > 48 hours to parse 3 GiB of content from 100k URLs sound
> > > > > reasonable? If it is too far off, is there any direction I
> > > > > should explore to gain performance, like tweaking the
> > > > > ParseSegment flow?
> > > > >
> > > > > Apologies for so many questions. Thanks in advance.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Ye
> > > >
> > > > --
> > > > Kiran Chitturi
> >
> > --
> > Don't Grow Old, Grow Up... :-)
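A footnote on the local-vs-deploy question above: in the Hadoop versions Nutch 1.x was built against, local mode uses LocalJobRunner, which executes map tasks sequentially in a single JVM, so requesting 8 maps on a 4-core box gains nothing there. Task-level parallelism only applies on a real (or pseudo-distributed) cluster, where the per-node slot counts take effect. A hedged sketch of conf/mapred-site.xml, assuming Hadoop 0.20/1.x-era property names (newer Hadoop moves these under mapreduce.*):

    <!-- Task slots per node; a common starting point is one map slot
         per core, e.g. 2 on a 2-core EC2 Medium instance. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>

The separated fetch/parse cycle Ye describes maps onto the stock 1.x commands, roughly as follows (the segment path is illustrative):

    # fetch without parsing, then parse the same segment
    bin/nutch fetch crawl/segments/20130308121200 -noParsing
    bin/nutch parse crawl/segments/20130308121200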

