Hi guys,

I am using Nutch 1.8 to fetch PDF documents from an HTTP server. The jobs had been running OK until recently, when I started getting the following error:

-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
fetching http://server1/doccontrol/DC-10%20Incoming%20Correspondence(IAE-US)/15C_221427_IAE_LTR_IAE_0845%20Letter%20from%20Alvarez.pdf (queue crawl delay=5000ms)
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
QueueFeeder finished: total 4655 records + hit by time limit :1184
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
* queue: http://ws0895 >> dropping!
-finishing thread FetcherThread, activeThreads=49
-finishing thread FetcherThread, activeThreads=48
. . . .
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1376)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1349)

New PDFs are added to the server every night and the whole of the content is then re-fetched, i.e. the content is growing, so I can understand that a limit might be reached.

I have searched on the error, and it seems that this behaviour should be governed by the fetcher.timelimit.mins property. I've checked the nutch-default and nutch-site files and can only find a single entry:

<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>This is the number of minutes allocated to the fetching.
  Once this value is reached, any remaining entry from the input URL list is skipped
  and all active queues are emptied. The default value of -1 deactivates the time limit.
  </description>
</property>

With the value set to -1, should there not be any time limit at all? Yet the log reports "hit by time limit :1184".

Any suggestions on what else might be causing this problem?

Thanks
P

