Oh, and yes, generate.max.count is set to 5000.
On Wed, Jul 3, 2013 at 8:29 AM, h b <[email protected]> wrote:

> I dropped my webpage database and restarted with 5 seed urls. The first fetch
> completed in a few seconds. The second run still shows 1 reduce running,
> although it shows as 100% complete, so my thought is that it is writing out to
> disk, though it has been about 30+ minutes.
> Again, I had 80 reducers. When I look at the log of these reducers in the
> Hadoop jobtracker, I see
>
>   0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>
> in all of them, which leads me to think that the completed 79 reducers
> actually fetched nothing, which might explain why this 1 stuck reducer is
> working so hard.
>
> This may be expected, since I am crawling a single domain. This one reducer's
> log on the jobtracker, however, is empty. I don't know what to make of that.
>
> On Tue, Jul 2, 2013 at 4:15 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi,
>>
>> On Tue, Jul 2, 2013 at 3:53 PM, h b <[email protected]> wrote:
>>
>> > So, I tried this with the generate.max.count property set to 5000, rebuilt
>> > with ant; ant jar; ant job, and reran fetch.
>> > It still appears the same: the first 79 reducers zip through and the last
>> > one is crawling, literally...
>>
>> Sorry, I should have been more explicit. This property does not directly
>> affect fetching. It is used when GENERATING fetch lists, meaning that it
>> needs to be present and acknowledged at the generate phase... before
>> fetching is executed.
>> Besides this, is there any progress being made at all on the last reduce?
>> If you look at the CPU (and heap) of the box this is running on, it is
>> usual to notice high levels for both. Maybe this
>> output writer is just taking a good while to write data down to HDFS...
>> assuming you are using 1.x.
>> > As for the logs, I mentioned in one of my earlier threads that when I run
>> > from the deploy directory, I am not getting any logs generated.
>> > I looked for the logs directory under local as well as under deploy, and,
>> > just to make sure, also in the grid. I do not see the logs directory. So I
>> > created it manually under deploy before starting fetch, and still there is
>> > nothing in this directory.
>>
>> OK, so when you run Nutch as a deployed job, your logs are present within
>> $HADOOP_LOG_DIR... You can check some logs on the JobTracker WebApp, e.g.
>> you will be able to see the reduce tasks for the fetch job, and you will
>> also be able to see varying snippets or all of the log there.
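For anyone following this thread later: generate.max.count is read by the Generator, not the Fetcher, so it must be in effect (e.g. in conf/nutch-site.xml) when the generate job runs, before fetching. A minimal override might look like the sketch below. Both property names come from Nutch's nutch-default.xml; the value 5000 is the one discussed in this thread, and generate.count.mode is shown only to illustrate the per-host vs. per-domain choice:

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: sketch of a generate-phase override, not a full config. -->
<configuration>
  <property>
    <name>generate.max.count</name>
    <value>5000</value>
    <description>Upper bound on the number of URLs per host or domain
    (depending on generate.count.mode) put into a single fetch list.</description>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
    <description>Count URLs per "host" or per "domain" when applying
    generate.max.count; "domain" may suit a single-domain crawl.</description>
  </property>
</configuration>
```

After changing this, rebuild the job (ant; ant jar; ant job, as above) and rerun generate so the new limit is applied when the fetch lists are created.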
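Regarding $HADOOP_LOG_DIR: on Hadoop 1.x the per-task logs (stdout/stderr/syslog) live under a userlogs directory on each tasktracker node, one subdirectory per task attempt. A rough sketch of where to look; the /var/log/hadoop fallback and the example attempt id are assumptions, so substitute your cluster's actual values:

```shell
#!/bin/sh
# Sketch: locating Hadoop 1.x task logs on a worker node.
# HADOOP_LOG_DIR varies by install; /var/log/hadoop is only an assumed default.
HADOOP_LOG_DIR="${HADOOP_LOG_DIR:-/var/log/hadoop}"
USERLOGS="$HADOOP_LOG_DIR/userlogs"

echo "Task attempt logs should be under: $USERLOGS"

# Each attempt gets its own directory, e.g. (hypothetical id):
#   $USERLOGS/attempt_201307020001_0003_r_000079_0/syslog
if [ -d "$USERLOGS" ]; then
  ls "$USERLOGS"
else
  echo "No userlogs directory on this node; check the tasktracker host itself."
fi
```

Note that these logs are per-node, so the empty log seen on the JobTracker WebApp may still exist on whichever tasktracker ran the stuck reduce attempt.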

