So, I tried this with the generate.max.count property set to 5000, rebuilt (ant; ant jar; ant job) and reran fetch. It still looks the same: the first 79 reducers zip through and the last one is crawling, literally...
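In case it helps, this is roughly the snippet I added to conf/nutch-site.xml before the rebuild (typed from memory, so take it as a sketch of what I set rather than the exact file contents):

<property>
  <name>generate.max.count</name>
  <value>5000</value>
  <description>Cap on the number of URLs in a single fetchlist
  (counted per host or domain, depending on the count mode).
  Set to 5000 for this test.</description>
</property>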
As for the logs, I mentioned in one of my earlier threads that when I run from the deploy directory, I am not getting any logs generated. I looked for the logs directory under local as well as under deploy and, just to make sure, also on the grid. I do not see the logs directory. So I created it manually under deploy before starting fetch, and there is still nothing in that directory. (I have pasted a small sketch of where I plan to look for the task logs below, after the quoted thread.)

On Tue, Jul 2, 2013 at 3:20 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
> Please try
> *http://s.apache.org/mo*
> Specifically the generate.max.count property.
> Many, many URLs are unfetched here... look into the logs and see what is
> going on. This is really quite bad, and there is most likely one or a small
> number of reasons which ultimately determine why so many URLs are
> unfetched.
> Golden rule here: unless you know that large fetches can be left to their
> own devices (unmonitored), try to generate many small fetch lists and
> check on the progress. This really helps you to improve throughput and
> increase the productiveness-to-time ratio for fetching tasks.
> hth
> Lewis
>
>
> On Tue, Jul 2, 2013 at 2:48 PM, h b <[email protected]> wrote:
>
> > Hi,
> > I seeded 4 urls, all in the same domain.
> > I am running fetch with 20 threads and 80 numTasks. The reducer is
> > stuck on the last reduce.
> > I ran a dump of the readdb to see the status, and I see 122K of the
> > total 133K urls are 'status_unfetched'. This is after 12 hours. The
> > delay between fetches is 5s (default).
> >
> > My hadoop cluster has 10 datanodes, each with about 24 cores and 48G RAM.
> > The average size of each page is 150KB. The site I am crawling responds
> > fast enough (it is internal).
> > So I do not understand where the bottleneck is.
> >
> > It is still not complete.
> >
> >
> > On Tue, Jul 2, 2013 at 5:12 AM, Markus Jelsma <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Nutch can easily scale to many, many billions of records; it just
> > > depends on how many and how powerful your nodes are. Crawl speed is
> > > not very relevant as it is always very fast; the problem usually is
> > > updating the databases. If you spread your data over more machines,
> > > you will increase your throughput! We can easily manage 2m records on
> > > a very small 1-core 1GB VPS, but we can also manage dozens of billions
> > > of records on a small cluster of 5 16-core 16GB nodes. It depends on
> > > your cluster!
> > >
> > > Cheers,
> > > Markus
> > >
> > >
> > > -----Original message-----
> > > > From: h b <[email protected]>
> > > > Sent: Tuesday 2nd July 2013 7:35
> > > > To: [email protected]
> > > > Subject: Nutch scalability tests
> > > >
> > > > Hi,
> > > > Does anyone have some stats on scalability, i.e. how many urls you
> > > > crawled and how long it took? These stats certainly depend on the
> > > > environment and the site(s) crawled, but it would be nice to see
> > > > some stats here.
> > > >
> > > > I used Nutch with HBase and Solr and have got a nice working
> > > > environment, and so far have been able to crawl a limited set,
> > > > rather a very, very limited set of urls satisfactorily. Now that I
> > > > have a proof of concept, I want to run it full blown, but before I
> > > > do that, I want to see if my setup can even handle this. If not, I
> > > > want to see how I can throttle my runs. So some stats/test results
> > > > would be nice to have.
> > > >
> > > >
> > > > Regards
> > > > Hemant
> > > >
> > > >
> >
>
>
> --
> *Lewis*
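P.S. Here is the sketch I mentioned above for hunting down the logs. The paths and the .job file name are my assumptions from a fairly stock setup, not something I have verified on this cluster; my understanding is that when the job runs from the deploy directory, the fetcher's log4j output goes to the per-task logs on the tasktracker nodes rather than to a local logs/ directory.

# See which log directory the log4j config packed into the job file points at
# (adjust the path/name to wherever your .job file actually lives):
unzip -p runtime/deploy/apache-nutch-*.job log4j.properties | grep -i log

# On a tasktracker node, look for the per-attempt userlogs
# (the default location is an assumption; it depends on HADOOP_LOG_DIR):
ls ${HADOOP_LOG_DIR:-/var/log/hadoop}/userlogs/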

