Hi Ye,

Do you run Nutch in local mode?

48 hours to parse 100k URLs means each URL takes roughly 1.7 s (48 h x 3600 s / 100,000 pages). I think the time in ParseSegment is mainly spent parsing the HTML and writing the parse data to the DFS: text to parse_text, data to parse_data, and links to crawl_parse.

Maybe you can run Nutch on a cluster in deploy mode instead. That will make full use of MapReduce's distributed computing capabilities.
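As a rough sketch (untested here; the job file name and segment path are just examples for a Nutch 1.x source build, adjust them to your layout), deploy mode looks something like:

  $ ant job    # builds the nutch-*.job jar under build/
  $ hadoop jar build/nutch-1.2.job org.apache.nutch.parse.ParseSegment crawl/segments/20130309120000

With HADOOP_HOME and the Hadoop config pointing at a running cluster, the ParseSegment map tasks get distributed across the slave nodes instead of running inside a single local JVM, so the HTML parsing happens in parallel.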
On Sat, Mar 9, 2013 at 11:09 AM, Ye T Thet <[email protected]> wrote:

> Hi,
>
> It is *NOT* about the fetch. It is about the parse.
>
> I am using Nutch 1.2, with 500 hosts. The fetcher is doing OK with the
> default settings of 1 thread per host and 100 concurrent threads.
>
> The bottleneck is at the parse.
>
> Regards,
>
> On Sat, Mar 9, 2013 at 12:27 AM, kiran chitturi
> <[email protected]> wrote:
>
> > Hi!
> >
> > Which version of Nutch are you using?
> >
> > Can you please give some more details on your configuration, like:
> >
> > i) How many threads are you using per queue?
> > ii) Do the URLs you are crawling belong to a single host or to
> > different hosts?
> >
> > There are a lot of things that come into play, AFAIK, when trying to
> > find optimal/decent performance.
> >
> > If you are using one thread per queue and crawling a single host, it
> > is going to take quite some time because of the politeness policy and
> > the five-second gap.
> >
> > The configurations can be changed, and there is discussion on the
> > mailing list.
> >
> > I had a similar problem where the fetcher was taking very long with a
> > big topN, and I attained stability by changing the configuration.
> > Please find the discussion at [1].
> >
> > [1]
> > http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
> >
> > Hope this helps.
> >
> > On Fri, Mar 8, 2013 at 11:12 AM, Ye T Thet <[email protected]> wrote:
> >
> > > Good day folks,
> > >
> > > I have a question about benchmarking the ParseSegment of Nutch.
> > >
> > > I separated the fetch and parse processes, so the URLs are fetched
> > > and then the fetched contents for the current depth are parsed. I
> > > believe that is the recommended approach to gain performance as
> > > well.
> > >
> > > I am running on an Amazon EC2 Medium instance with an Ubuntu image:
> > > 3.7 GiB of memory and 2 cores. I am only parsing the HTML contents
> > > (using the HTML parser). The fetched URLs number around *100k* and
> > > the content size is *3 GiB*. I have been experimenting to benchmark
> > > the crawler (the fetch/parse cycle) and have crawled a few data
> > > sets. According to my observations, parsing the HTML documents of
> > > 100k URLs (about 3 GiB) took around 32 to 48 hours via ParseSegment.
> > > Throughout the parsing process, CPU and memory utilization is
> > > almost 100%.
> > >
> > > Has anyone got a performance benchmark for ParseSegment? Does 48
> > > hours to parse 3 GiB of content from 100k URLs sound reasonable? If
> > > it is too far off, is there any direction I should explore to gain
> > > performance, like tweaking the ParseSegment flow?
> > >
> > > Apologies for too many questions. Thanks in advance.
> > >
> > > Cheers,
> > >
> > > Ye
> >
> > --
> > Kiran Chitturi
>
> --
> Don't Grow Old, Grow Up... :-)

