Thanks Markus, I will try that and see if it fixes things.
The server has 24GB of memory but only about 1GB free without the Nutch process running! Are the PDFBox files in Tika 1.6 (PDFBox 1.8.6) likely to have fixed this, or should I go for 1.8.8 from the PDFBox site?

Thanks again

P

On 9 January 2015 at 13:46, Markus Jelsma <[email protected]> wrote:

> Do you have enough memory? 50 threads and PDFs and an older Tika
> version will get you in trouble. That PDFBox version eats memory! Try
> upgrading to the latest PDFBox; you can drop jars in manually and reference
> them in Tika's plugin.xml.
>
> M
>
>
> -----Original message-----
> > From: Paul Rogers <[email protected]>
> > Sent: Friday 9th January 2015 18:35
> > To: [email protected]
> > Subject: Problem with time out on QueueFeeder
> >
> > Hi Guys
> >
> > I am using Nutch 1.8 to fetch PDF documents from an HTTP server. The jobs
> > have been running OK until recently, when I started getting the following
> > error:
> >
> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
> > fetching
> > http://server1/doccontrol/DC-10%20Incoming%20Correspondence(IAE-US)/15C_221427_IAE_LTR_IAE_0845%20Letter%20from%20Alvarez.pdf
> > (queue crawl delay=5000ms)
> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
> > QueueFeeder finished: total 4655 records + hit by time limit :1184
> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
> > * queue: http://ws0895 >> dropping!
> > -finishing thread FetcherThread, activeThreads=49
> > -finishing thread FetcherThread, activeThreads=48
> > .
> > .
> > .
> > .
> > -finishing thread FetcherThread, activeThreads=3
> > -finishing thread FetcherThread, activeThreads=2
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: java.io.IOException: Job failed!
> >   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> >   at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
> >   at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1376)
> >   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >   at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1349)
> >
> > New PDFs are added to the server every night and the whole of the content is then
> > re-fetched, i.e. the content is growing, so I can understand that a limit
> > might be reached.
> >
> > I have searched on the error, and it seems that this behaviour should be
> > governed by the fetcher.timelimit.mins property.
> >
> > I've checked the nutch-default and nutch-site files and can only find a
> > single entry:
> >
> > <property>
> >   <name>fetcher.timelimit.mins</name>
> >   <value>-1</value>
> >   <description>This is the number of minutes allocated to the fetching.
> >   Once this value is reached, any remaining entry from the input URL list
> >   is skipped and all active queues are emptied. The default value of -1
> >   deactivates the time limit.
> >   </description>
> > </property>
> >
> > Should that not mean there is no time limit?
> >
> > Any suggestions on what else might be causing this problem?
> >
> > Thanks
> >
> > P
> >
>
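[Editor's note: swapping in a newer PDFBox, as Markus suggests, means dropping the replacement jars into the parse-tika plugin directory and pointing the `library` entries in its plugin.xml at them. A minimal sketch of the runtime section only; the jar names and the 1.8.8 versions below are assumptions, so match them to the files you actually install:]

```xml
<!-- Sketch: runtime section of the parse-tika plugin's plugin.xml.      -->
<!-- Jar names assume a PDFBox 1.8.8 upgrade; adjust to your files.      -->
<runtime>
  <library name="parse-tika.jar">
    <export name="*"/>
  </library>
  <!-- Replace the bundled PDFBox jars with the upgraded ones: -->
  <library name="pdfbox-1.8.8.jar"/>
  <library name="fontbox-1.8.8.jar"/>
  <library name="jempbox-1.8.8.jar"/>
</runtime>
```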
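[Editor's note: on the time-limit question, nutch-site.xml overrides nutch-default.xml, so one way to rule out a stray override on the classpath is to pin the property explicitly in nutch-site.xml. A sketch, restating only the property already quoted in the thread:]

```xml
<!-- conf/nutch-site.xml: values here override conf/nutch-default.xml -->
<configuration>
  <property>
    <name>fetcher.timelimit.mins</name>
    <!-- -1 disables the fetcher time limit -->
    <value>-1</value>
  </property>
</configuration>
```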

