How many different hosts do you crawl? I see one reducer and only one queue, 
and Nutch queues by domain or host. URLs from the same host will always end up 
in the same queue, so Nutch will only crawl a lot, and fast, if there is a 
large number of queues to process. 

The only thing you can do then is increase the number of threads per queue, 
but that's not very polite and can DoS the server(s).
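
The relevant knobs are the fetcher.queue.mode and fetcher.threads.per.queue 
properties in nutch-site.xml. Roughly, the per-host grouping works like the 
following made-up Java sketch (not Nutch's actual code; the class name and 
hostnames are placeholders):

import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueueSketch {
  public static void main(String[] args) throws Exception {
    // Five seed urls, but only two distinct hosts -> only two fetch queues.
    String[] urls = {
      "http://example.com/a", "http://example.com/b", "http://example.com/c",
      "http://other.org/x", "http://other.org/y"
    };
    Map<String, List<String>> queues = new HashMap<>();
    for (String u : urls) {
      String host = new URL(u).getHost(); // queue id when queueing by host
      queues.computeIfAbsent(host, k -> new ArrayList<>()).add(u);
    }
    // A single host means a single queue, and the per-queue politeness delay
    // serialises the whole fetch no matter how many reducers are available.
    System.out.println(queues.size() + " queue(s): " + queues.keySet());
  }
}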
 
-----Original message-----
> From:h b <[email protected]>
> Sent: Wednesday 3rd July 2013 22:46
> To: [email protected]
> Subject: Re: Nutch scalability tests
> 
> Hi,
> I reran this job again. I had 5 urls in my seed, and the first pass of fetch
> fetched about 230 pages in 20 minutes.
> Then I ran a second pass of fetch, and it has been running for over 3.5 hours.
> Again, it is still the 1 reducer doing all the work, and its log on the
> jobtracker has nothing in it yet.
> 
> 20/20 spinwaiting/active, 2062 pages, 1 errors, 0.2 0 pages/s, 423 673
> kb/s, 1000 URLs in 1 queues > reduce
> 
> I am digging in the code to see why everything is going to just one
> reducer. The fetch mapper key is
> 
> new IntWritable(random.nextInt(65536))
> 
> So it makes sense that these should be distributed across multiple reducers.
> But it looks like all keys are falling into the same reducer.
> Can't think of how to make these keys fall into multiple reducers.
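> 
> For reference, here is a standalone sketch (class name is mine, purely for
> illustration) of how Hadoop's default HashPartitioner would spread such keys
> across 80 reducers, assuming no custom Partitioner is set on the job:
> 
> import java.util.Random;
> import org.apache.hadoop.io.IntWritable;
> 
> public class PartitionCheck {
>   public static void main(String[] args) {
>     int numReducers = 80;              // reducers observed in the job
>     int[] perReducer = new int[numReducers];
>     Random random = new Random();
>     for (int i = 0; i < 100000; i++) {
>       IntWritable key = new IntWritable(random.nextInt(65536));
>       // Default HashPartitioner: (hashCode & Integer.MAX_VALUE) % numReduceTasks.
>       // IntWritable.hashCode() is just the wrapped int, so this spreads evenly.
>       int partition = (key.hashCode() & Integer.MAX_VALUE) % numReducers;
>       perReducer[partition]++;
>     }
>     for (int r = 0; r < numReducers; r++) {
>       System.out.println("reducer " + r + ": " + perReducer[r] + " keys");
>     }
>   }
> }
> 
> So if everything still ends up in one reduce, it is worth checking whether
> the fetch job configures its own Partitioner class instead of the default.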
> 
> 
> 
> On Wed, Jul 3, 2013 at 8:55 AM, Tejas Patil <[email protected]>wrote:
> 
> > The steps you performed are right.
> >
> > Did you get the log for that one "hardworking" reducer? It will hint at
> > why the job took so long. Ideally you should get logs for every job and its
> > attempts. If you cannot get the log for that reducer, then I feel that your
> > cluster has some problem, and this needs to be addressed.
> >
> >
> > On Wed, Jul 3, 2013 at 8:47 AM, h b <[email protected]> wrote:
> >
> > > Hi Tejas, looks like we were typing at the same time.
> > > So anyway, my job ended fine. Just to be sure that what I am doing is
> > > right, I have cleared the db and started another round again. If I
> > > stumble again, I will respond back on this thread.
> > >
> > >
> > > On Wed, Jul 3, 2013 at 8:43 AM, Tejas Patil <[email protected]>
> > > wrote:
> > >
> > > > > The second run still shows 1 reduce running, although it shows as
> > > > > 100% complete, so my thought is that it is writing out to the disk,
> > > > > though it has been about 30+ minutes.
> > > > > This one reducer's log on the jobtracker, however, is empty.
> > > >
> > > > This is weird. There can be an explanation for the first line: the
> > > > data crawled was large, so dumping would take a lot of time, but as
> > > > you said there were very few urls, so it should not take 30+ minutes
> > > > unless you crawled some super large files.
> > > > Have you checked the attempts for the job? If there are no logs there,
> > > > then there is something weird going on with your cluster.
> > > >
> > > >
> > > > On Wed, Jul 3, 2013 at 8:32 AM, h b <[email protected]> wrote:
> > > >
> > > > > oh and yes, generate.max.count is set to 5000
> > > > >
> > > > >
> > > > > On Wed, Jul 3, 2013 at 8:29 AM, h b <[email protected]> wrote:
> > > > >
> > > > > > I dropped my webpage database, restarted with 5 seed urls. First
> > > > > > fetch completed in a few seconds. The second run still shows 1
> > > > > > reduce running, although it shows as 100% complete, so my thought
> > > > > > is that it is writing out to the disk, though it has been about
> > > > > > 30+ minutes.
> > > > > > Again, I had 80 reducers; when I look at the log of these reducers
> > > > > > in the hadoop jobtracker, I see
> > > > > >
> > > > > > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0
> > > > > > kb/s, 0 URLs in 0 queues
> > > > > >
> > > > > > in all of them, which leads me to think that the completed 79
> > > > > > reducers actually fetched nothing, which might explain why this 1
> > > > > > stuck reducer is working so hard.
> > > > > >
> > > > > > This may be expected, since I am crawling a single domain. This
> > > > > > one reducer's log on the jobtracker, however, is empty. Don't know
> > > > > > what to make of that.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > >> Hi,
> > > > > >>
> > > > > >> On Tue, Jul 2, 2013 at 3:53 PM, h b <[email protected]> wrote:
> > > > > >>
> > > > > >> > So, I tried this with the generate.max.count property set to
> > > > > >> > 5000, rebuilt (ant; ant jar; ant job) and reran fetch.
> > > > > >> > It still appears the same, the first 79 reducers zip through and
> > > > > >> > the last one is crawling, literally...
> > > > > >> >
> > > > > >>
> > > > > >> Sorry, I should have been more explicit. This property does not
> > > > > >> directly affect fetching. It is used when GENERATING fetch lists.
> > > > > >> Meaning that it needs to be present and acknowledged at the
> > > > > >> generate phase... before fetching is executed.
> > > > > >> Besides this, is there any progress being made at all on the last
> > > > > >> reduce? If you look at your CPU (and heap) for the box this is
> > > > > >> running on, it is usual to notice high levels for both of these
> > > > > >> respectively. Maybe this output writer is just taking a good while
> > > > > >> to write data down to HDFS... assuming you are using 1.x.
> > > > > >>
> > > > > >>
> > > > > >> >
> > > > > >> > As for the logs, I mentioned on one of my earlier threads that
> > > > > >> > when I run from the deploy directory, I am not getting any logs
> > > > > >> > generated.
> > > > > >> > I looked for the logs directory under local as well as under
> > > > > >> > deploy, and just to make sure, also in the grid. I do not see
> > > > > >> > the logs directory. So I created it manually under deploy before
> > > > > >> > starting fetch, and still there is nothing in this directory.
> > > > > >> >
> > > > > >> >
> > > > > >> OK, so when you run Nutch as a deployed job, your logs are present
> > > > > >> within $HADOOP_LOG_DIR... you can check some logs on the
> > > > > >> JobTracker WebApp, e.g. you will be able to see the reduce tasks
> > > > > >> for the fetch job and you will also be able to see varying
> > > > > >> snippets or all of the log here.
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 
