Hey Peter,

My hardware was a cluster of high-end production machines (RAM and CPU specs were 100 times better than a normal desktop PC). I think if you procure EC2 instances of at least type "medium", you can expect better performance.

I have no idea which is faster, Nutch 2.1 or 1.6. I want to know too :)
Can anyone from @dev or @user comment on that?

Thanks,
Tejas Patil

On Thu, Jan 31, 2013 at 12:09 AM, peterbarretto <[email protected]> wrote:

> Hi Tejas,
>
> I am currently running Nutch 1.6 on Windows 7, Pentium dual core 2.8 GHz,
> 2 GB RAM.
> I will be using Amazon EC2 servers later for crawling.
>
> What was your hardware when you ran 4 million URLs with 80 GB of data?
>
> Will Nutch 2.1 give a faster crawl speed than 1.6?
>
>
> Tejas Patil wrote
> > I had run crawls with topN as large as 4 million while having a crawldb
> > of ~80 GB. It worked fine without any such issue.
> > Maybe the hardware / cluster you have is not capable of handling load
> > above 500. Note that if topN is low, then no matter how many fetcher
> > threads you create, you won't be able to increase #crawls. Also, as
> > there is a considerable amount of time spent in the generate and update
> > phases, the overall crawl rate will be low. If you are planning to use
> > the same machine, you will have to work with lower values (and thus
> > expect a lower crawl rate).
> >
> > thanks,
> > Tejas Patil
> >
> >
> > On Wed, Jan 30, 2013 at 8:06 PM, Lewis John Mcgibbney <lewis.mcgibbney@> wrote:
> >
> >> You are not getting very many URLs!
> >>
> >> On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto <peterbarretto08@> wrote:
> >>
> >> > 2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404
> >> >
> >> > 2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 1
> >> > (db_unfetched): 85672
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4037637.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
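
To put rough numbers on the topN point from my earlier mail quoted above, here is a back-of-the-envelope sketch. It is not Nutch code: the function name, the fetch rate and the overhead figure are all made up for illustration, and it assumes the generate + update time per round is roughly fixed (in practice it depends on the crawldb size rather than on topN).

```python
def overall_crawl_rate(top_n, fetch_rate, overhead_seconds):
    """Estimate URLs per second over one generate/fetch/update round.

    top_n            -- URLs selected for the round (hypothetical value)
    fetch_rate       -- URLs/second the fetcher threads sustain (assumed)
    overhead_seconds -- time spent in generate + update (assumed fixed)
    """
    fetch_seconds = top_n / fetch_rate
    return top_n / (fetch_seconds + overhead_seconds)


# Same fetcher capacity and overhead, only topN differs.
small_round = overall_crawl_rate(top_n=1_000, fetch_rate=50, overhead_seconds=180)
large_round = overall_crawl_rate(top_n=100_000, fetch_rate=50, overhead_seconds=180)

print(f"topN=1000:   {small_round:.1f} URLs/s")
print(f"topN=100000: {large_round:.1f} URLs/s")
```

With a small topN the fixed overhead dominates, so extra fetcher threads barely change the overall rate; with a large topN the rate gets closer to what the fetcher can actually sustain.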

