So, I tried this with the generate.max.count property set to 5000, rebuilt (ant; ant jar; ant job) and reran fetch. It still looks the same: the first 79 reducers zip through and the last one is crawling, literally...
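In case it helps, this is roughly the snippet I added to conf/nutch-site.xml before the rebuild (typed from memory, so take it as a sketch of what I set rather than the exact file contents):

<property>
  <name>generate.max.count</name>
  <value>5000</value>
  <description>Cap on the number of URLs in a single fetchlist
  (counted per host or domain, depending on the count mode).
  Set to 5000 for this test.</description>
</property>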
As for the logs, I mentioned in one of my earlier threads that when I run from the deploy directory, I am not getting any logs generated. I looked for the logs directory under local as well as under deploy and, just to make sure, also on the grid. I do not see the logs directory. So I created it manually under deploy before starting fetch, and there is still nothing in that directory. (I have pasted a small sketch of where I plan to look for the task logs below, after the quoted thread.)

On Tue, Jul 2, 2013 at 3:20 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
> Please try
> *http://s.apache.org/mo*
> Specifically the generate.max.count property.
> Many, many URLs are unfetched here... look into the logs and see what is
> going on. This is really quite bad, and there is most likely one or a small
> number of reasons which ultimately determine why so many URLs are
> unfetched.
> Golden rule here: unless you know that large fetches can be left to their
> own devices (unmonitored), try to generate many small fetch lists and
> check on the progress. This really helps you to improve throughput and
> increase the productiveness-to-time ratio for fetching tasks.
> hth
> Lewis
>
>
> On Tue, Jul 2, 2013 at 2:48 PM, h b <[email protected]> wrote:
>
> > Hi,
> > I seeded 4 urls, all in the same domain.
> > I am running fetch with 20 threads and 80 numTasks. The reducer is
> > stuck on the last reduce.
> > I ran a dump of the readdb to see the status, and I see 122K of the
> > total 133K urls are 'status_unfetched'. This is after 12 hours. The
> > delay between fetches is 5s (default).
> >
> > My hadoop cluster has 10 datanodes, each with about 24 cores and 48G RAM.
> > The average size of each page is 150KB. The site I am crawling responds
> > fast enough (it is internal).
> > So I do not understand where the bottleneck is.
> >
> > It is still not complete.
> >
> >
> > On Tue, Jul 2, 2013 at 5:12 AM, Markus Jelsma <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Nutch can easily scale to many, many billions of records; it just
> > > depends on how many and how powerful your nodes are. Crawl speed is
> > > not very relevant as it is always very fast; the problem usually is
> > > updating the databases. If you spread your data over more machines,
> > > you will increase your throughput! We can easily manage 2m records on
> > > a very small 1-core 1GB VPS, but we can also manage dozens of billions
> > > of records on a small cluster of 5 16-core 16GB nodes. It depends on
> > > your cluster!
> > >
> > > Cheers,
> > > Markus
> > >
> > >
> > > -----Original message-----
> > > > From: h b <[email protected]>
> > > > Sent: Tuesday 2nd July 2013 7:35
> > > > To: [email protected]
> > > > Subject: Nutch scalability tests
> > > >
> > > > Hi,
> > > > Does anyone have some stats on scalability, i.e. how many urls you
> > > > crawled and how long it took? These stats certainly depend on the
> > > > environment and the site(s) crawled, but it would be nice to see
> > > > some stats here.
> > > >
> > > > I used Nutch with HBase and Solr and have got a nice working
> > > > environment, and so far have been able to crawl a limited set,
> > > > rather a very, very limited set of urls satisfactorily. Now that I
> > > > have a proof of concept, I want to run it full blown, but before I
> > > > do that, I want to see if my setup can even handle this. If not, I
> > > > want to see how I can throttle my runs. So some stats/test results
> > > > would be nice to have.
> > > >
> > > >
> > > > Regards
> > > > Hemant
> > > >
> > > >
> >
>
>
> --
> *Lewis*
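P.S. Here is the sketch I mentioned above for hunting down the logs. The paths and the .job file name are my assumptions from a fairly stock setup, not something I have verified on this cluster; my understanding is that when the job runs from the deploy directory, the fetcher's log4j output goes to the per-task logs on the tasktracker nodes rather than to a local logs/ directory.

# See which log directory the log4j config packed into the job file points at
# (adjust the path/name to wherever your .job file actually lives):
unzip -p runtime/deploy/apache-nutch-*.job log4j.properties | grep -i log

# On a tasktracker node, look for the per-attempt userlogs
# (the default location is an assumption; it depends on HADOOP_LOG_DIR):
ls ${HADOOP_LOG_DIR:-/var/log/hadoop}/userlogs/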

