On Tuesday 29 June 2010 20:50:01 Julien Nioche wrote:
> Markus,
>
> The depth of the queue is simply 50 * number of threads, so I gather that
> you are using 10 threads. There is a JIRA where we discussed making this
> value configurable.
>
> > the threads just wiggle between 490 and 500.
>
> you probably mean 'the total size wiggles...'?

Yes.

> The number of remaining URLs to fetch is not known from the Fetcher as it
> reads the fetchlist as it goes. The fact that you can see the real number
> of remaining URLs when > 500 is simply due to the fact that all the input
> URLs have been read and all the remaining ones are in the queue.
>
> The 50x value could be set in a parameter; however, this is not the issue
> here. The point is that the queue is what's stored in memory whereas the
> total number of URLs is the queue + what's left to be read from HDFS.
>
> I'd suggest using the mapreduce webapp to monitor the progress and not
> simply looking at the logs (i.e. you need to run it in distributed mode).
> There are now details about how many URLs have been fetched successfully or
> not + of course the progress of the map operations which indicates how much
> has been read from HDFS. Since you know how many URLs you put in the
> fetchlist in the first place, it would be trivial to work out what's left.

Thanks, I understand and will try the monitor.

> HTH
>
> Julien

Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
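
For reference, the arithmetic from the thread above can be written out as a minimal sketch. The factor of 50 and the idea of subtracting the fetched counts from the known fetchlist size come from the message; the function names and the sample numbers are hypothetical and are not part of Nutch.

```python
# Illustrative only: helper names and sample figures are assumptions,
# not Nutch APIs or real crawl data.
QUEUE_FACTOR = 50  # per-thread queue depth mentioned in the thread


def queue_capacity(num_threads: int) -> int:
    """Maximum number of URLs held in the in-memory fetch queue."""
    return QUEUE_FACTOR * num_threads


def remaining_urls(fetchlist_size: int, fetched_ok: int, fetched_failed: int) -> int:
    """URLs still to fetch: those queued in memory plus those not yet read from HDFS."""
    return fetchlist_size - (fetched_ok + fetched_failed)


if __name__ == "__main__":
    # 10 threads give the queue size of 500 seen in the logs.
    print(queue_capacity(10))
    # Made-up counter values, as one might read them off the mapreduce webapp.
    print(remaining_urls(20000, 12000, 350))
```

With 10 threads the first call gives 500, which matches the total size reported in the logs while input is still being read.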

