Oh, and yes, generate.max.count is set to 5000.
On Wed, Jul 3, 2013 at 8:29 AM, h b <[email protected]> wrote:

> I dropped my webpage database and restarted with 5 seed urls. The first fetch
> completed in a few seconds. The second run still shows 1 reduce running,
> although it shows as 100% complete, so my thought is that it is writing out to
> disk, though it has been about 30+ minutes.
> Again, I had 80 reducers. When I look at the log of these reducers in the
> Hadoop jobtracker, I see
>
>   0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>
> in all of them, which leads me to think that the completed 79 reducers
> actually fetched nothing, which might explain why this 1 stuck reducer is
> working so hard.
>
> This may be expected, since I am crawling a single domain. This one reducer's
> log on the jobtracker, however, is empty. I don't know what to make of that.
>
> On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi,
>>
>> On Tue, Jul 2, 2013 at 3:53 PM, h b <[email protected]> wrote:
>>
>> > So, I tried this with the generate.max.count property set to 5000, rebuilt
>> > with ant; ant jar; ant job, and reran fetch.
>> > It still appears the same: the first 79 reducers zip through and the last
>> > one is crawling, literally...
>>
>> Sorry, I should have been more explicit. This property does not directly
>> affect fetching. It is used when GENERATING fetch lists, meaning that it
>> needs to be present and acknowledged at the generate phase... before
>> fetching is executed.
>> Besides this, is there any progress being made at all on the last reduce?
>> If you look at the CPU (and heap) of the box this is running on, it is
>> usual to notice high levels for both. Maybe this
>> output writer is just taking a good while to write data down to HDFS...
>> assuming you are using 1.x.
>> > As for the logs, I mentioned in one of my earlier threads that when I run
>> > from the deploy directory, I am not getting any logs generated.
>> > I looked for the logs directory under local as well as under deploy, and,
>> > just to make sure, also in the grid. I do not see the logs directory. So I
>> > created it manually under deploy before starting fetch, and still there is
>> > nothing in this directory.
>>
>> OK, so when you run Nutch as a deployed job, your logs are present within
>> $HADOOP_LOG_DIR... You can check some logs on the JobTracker WebApp, e.g.
>> you will be able to see the reduce tasks for the fetch job, and you will
>> also be able to see varying snippets or all of the log there.
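For anyone following this thread later: generate.max.count is read by the Generator, not the Fetcher, so it must be in effect (e.g. in conf/nutch-site.xml) when the generate job runs, before fetching. A minimal override might look like the sketch below. Both property names come from Nutch's nutch-default.xml; the value 5000 is the one discussed in this thread, and generate.count.mode is shown only to illustrate the per-host vs. per-domain choice:

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: sketch of a generate-phase override, not a full config. -->
<configuration>
  <property>
    <name>generate.max.count</name>
    <value>5000</value>
    <description>Upper bound on the number of URLs per host or domain
    (depending on generate.count.mode) put into a single fetch list.</description>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
    <description>Count URLs per "host" or per "domain" when applying
    generate.max.count; "domain" may suit a single-domain crawl.</description>
  </property>
</configuration>
```

After changing this, rebuild the job (ant; ant jar; ant job, as above) and rerun generate so the new limit is applied when the fetch lists are created.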
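Regarding $HADOOP_LOG_DIR: on Hadoop 1.x the per-task logs (stdout/stderr/syslog) live under a userlogs directory on each tasktracker node, one subdirectory per task attempt. A rough sketch of where to look; the /var/log/hadoop fallback and the example attempt id are assumptions, so substitute your cluster's actual values:

```shell
#!/bin/sh
# Sketch: locating Hadoop 1.x task logs on a worker node.
# HADOOP_LOG_DIR varies by install; /var/log/hadoop is only an assumed default.
HADOOP_LOG_DIR="${HADOOP_LOG_DIR:-/var/log/hadoop}"
USERLOGS="$HADOOP_LOG_DIR/userlogs"

echo "Task attempt logs should be under: $USERLOGS"

# Each attempt gets its own directory, e.g. (hypothetical id):
#   $USERLOGS/attempt_201307020001_0003_r_000079_0/syslog
if [ -d "$USERLOGS" ]; then
  ls "$USERLOGS"
else
  echo "No userlogs directory on this node; check the tasktracker host itself."
fi
```

Note that these logs are per-node, so the empty log seen on the JobTracker WebApp may still exist on whichever tasktracker ran the stuck reduce attempt.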

