Hi Kiran,

It is *not* about the fetch. It is about the parse.

I am using Nutch 1.2 with 500 hosts. The fetcher is doing fine with the default settings of 1 thread per host and 100 concurrent threads. The bottleneck is the parse step.
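For what it is worth, this is roughly how I run the two steps separately (the segment path below is just a placeholder; -threads, -noParsing and the parse command are the stock Nutch 1.x ones as far as I know, but please double-check against your own bin/nutch usage output):

  # fetch only, skip parsing during the fetch (placeholder segment path)
  bin/nutch fetch crawl/segments/20130309120000 -threads 100 -noParsing

  # parse the fetched content afterwards (this runs ParseSegment)
  bin/nutch parse crawl/segments/20130309120000

It is this second step that pegs the CPU.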
Regards,

Ye

On Sat, Mar 9, 2013 at 12:27 AM, kiran chitturi <[email protected]> wrote:

> Hi!
>
> Which version of Nutch are you using?
>
> Can you please give some more details on your configuration, like:
>
> i) How many threads are you using per queue?
> ii) Do the URLs you are crawling belong to a single host or to different
> hosts?
>
> There are a lot of things that come into play, AFAIK, when trying to find
> the optimal/decent performance.
>
> If you are using one thread per queue and crawling a single host, it is
> going to take quite some time because of the politeness policy and the
> five-second gap.
>
> The configuration can be changed, and there is a discussion about it on
> the mailing list.
>
> I had a similar problem where the fetcher was taking very long with a big
> topN, and I attained stability by changing the configuration. Please find
> the discussion at [1].
>
> [1]
> http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
>
> Hope this helps.
>
>
> On Fri, Mar 8, 2013 at 11:12 AM, Ye T Thet <[email protected]> wrote:
>
> > Good day folks,
> >
> > I have a question about benchmarking the ParseSegment step of Nutch.
> >
> > I separated the fetch and parse processes, so the URLs are fetched and
> > then the fetched content for the current depth is parsed. I believe
> > that is also the recommended approach for better performance.
> >
> > I am running on an Amazon EC2 medium instance (Ubuntu image, 3.7 GiB
> > RAM, 2 cores). I am only parsing HTML content (using the HTML parser).
> > About *100k* URLs were fetched and the content size is *3 GiB*. I have
> > been experimenting to get a benchmark on the crawler (fetch/parse
> > cycle) and have crawled a few data sets. According to my observations,
> > parsing the HTML documents of 100k URLs (roughly 3 GiB) takes around
> > 32 to 48 hours via ParseSegment. Throughout the parsing process, CPU
> > and memory utilization is almost 100%.
> >
> > Has anyone got a performance benchmark for ParseSegment? Does 48 hours
> > to parse 3 GiB of content from 100k URLs sound reasonable? If that is
> > too far off, is there any direction I should explore to gain
> > performance, like tweaking the ParseSegment flow?
> >
> > Apologies for too many questions. Thanks in advance.
> >
> > Cheers,
> >
> > Ye
>
>
> --
> Kiran Chitturi

