On 2010-09-28 13:55, Markus Jelsma wrote:
> Thanks. Could we modify the code so it will only output the info before the tasks are initialized? If so, how to proceed?
This is a bit tricky, because the code is executed differently depending on whether it runs in local mode (or from a local application) or in distributed mode (or from one of the mapreduce tasks).
In local mode, resources are taken from a classpath determined during the execution of the driver application (the one with main()), and this classpath may include (and often does!) multiple copies of the same file, such as conf/nutch-site.xml and the nutch-site.xml that is packed inside a job jar. Furthermore, plugins in local mode are NOT loaded from nutch.job, but instead from the plugins/ directory, so the set of plugins may differ from the one used by distributed tasks.
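To see which copies of a resource the local classpath actually contains, something like the following could be run from the driver side. This is only a minimal sketch using the standard ClassLoader.getResources() call; the class name ResourceCopies and the helper findAll are made up for illustration and are not part of Nutch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Nutch: lists every copy of a resource
// that a given class loader can see.
public class ResourceCopies {

    /** Returns the URLs of all copies of the named resource, in the order the loader reports them. */
    static List<URL> findAll(String name, ClassLoader loader) {
        List<URL> urls = new ArrayList<>();
        try {
            Enumeration<URL> found = loader.getResources(name);
            while (found.hasMoreElements()) {
                urls.add(found.nextElement());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Default to nutch-site.xml, the file mentioned above.
        String name = args.length > 0 ? args[0] : "nutch-site.xml";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        for (URL url : findAll(name, loader)) {
            System.out.println(url);
        }
    }
}
```

If this prints more than one URL, the classpath holds several copies of the file. The same check run inside a task could give a different list, since tasks use a different classpath.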
Now, the crux of the matter is that in order to print this list only once you would have to do this from the driver application - but when you run Nutch in distributed mode the driver application uses a different classpath than each of the tasks will use, so the list could be different, which would be very confusing...
All in all, I think it's best to either print it from the tasks, possibly many times, or not print it at all. This choice could be implemented as a logging level, or as a config property.
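A rough sketch of what such a switch might look like is below. It is not Nutch code: it uses java.util.Properties and java.util.logging as stand-ins for the real configuration and logging classes so that it runs on its own, and the property name plugin.list.print, the class PluginListReporter and the example plugin ids are all invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: gate the plugin listing behind a config property
// and a logging level. Names here are assumptions, not Nutch's API.
public class PluginListReporter {

    private static final Logger LOG = Logger.getLogger(PluginListReporter.class.getName());

    // Made-up property name; off unless explicitly enabled.
    static final String PRINT_KEY = "plugin.list.print";

    /** True only when the (hypothetical) property is set to "true". */
    static boolean shouldPrint(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(PRINT_KEY, "false"));
    }

    /** Logs the list when both the property and the INFO level allow it; returns whether it logged. */
    static boolean report(Properties conf, List<String> pluginIds) {
        if (!shouldPrint(conf) || !LOG.isLoggable(Level.INFO)) {
            return false;
        }
        LOG.info("Registered plugins: " + String.join(", ", pluginIds));
        return true;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(PRINT_KEY, "true");
        // Example ids only; a real task would pass whatever it loaded.
        report(conf, Arrays.asList("example-plugin-a", "example-plugin-b"));
    }
}
```

Each task would call report() once after loading its plugins, so with the property enabled the list still appears once per task, but it reflects what that task really loaded.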
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

