Hi Markus

Rebooting the server frees up 23GB of memory.
I have installed PDFBox 1.8.8 and am running the fetch again. Will update you on the results.

Thanks

P

On 9 January 2015 at 14:11, Paul Rogers <[email protected]> wrote:

> Thanks Markus
>
> I will try that and see if it fixes things.
>
> The server has 24GB of memory but only about 1GB free without the nutch
> process running!!
>
> Are the PDFBox files in Tika 1.6 (PDFBox 1.8.6) likely to have fixed this,
> or should I go for 1.8.8 from the PDFBox site?
>
> Thanks again
>
> P
>
> On 9 January 2015 at 13:46, Markus Jelsma <[email protected]>
> wrote:
>
>> Do you have enough memory? 50 threads, PDFs, and an older Tika version
>> will get you in trouble. That PDFBox version eats memory! Try upgrading
>> to the latest PDFBox; you can drop the jars in manually and reference
>> them in Tika's plugin.xml.
>>
>> M
>>
>> -----Original message-----
>> > From: Paul Rogers <[email protected]>
>> > Sent: Friday 9th January 2015 18:35
>> > To: [email protected]
>> > Subject: Problem with time out on QueueFeeder
>> >
>> > Hi Guys
>> >
>> > I am using Nutch 1.8 to fetch PDF documents from an HTTP server. The
>> > jobs have been running OK until recently, when I started getting the
>> > following error:
>> >
>> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
>> > fetching
>> > http://server1/doccontrol/DC-10%20Incoming%20Correspondence(IAE-US)/15C_221427_IAE_LTR_IAE_0845%20Letter%20from%20Alvarez.pdf
>> > (queue crawl delay=5000ms)
>> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
>> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
>> > QueueFeeder finished: total 4655 records + hit by time limit :1184
>> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
>> > * queue: http://ws0895 dropping!
>> > -finishing thread FetcherThread, activeThreads=49
>> > -finishing thread FetcherThread, activeThreads=48
>> > .
>> > .
>> > .
>> > .
>> > -finishing thread FetcherThread, activeThreads=3
>> > -finishing thread FetcherThread, activeThreads=2
>> > -finishing thread FetcherThread, activeThreads=1
>> > -finishing thread FetcherThread, activeThreads=0
>> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> > -activeThreads=0
>> > Fetcher: java.io.IOException: Job failed!
>> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
>> >         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1376)
>> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1349)
>> >
>> > New PDFs are added to the server every night and the whole of the
>> > content is then re-fetched, i.e. the content is growing, so I can
>> > understand that a limit might be reached.
>> >
>> > I have searched on the error and it seems that this behaviour should
>> > be governed by the fetcher.timelimit.mins property.
>> >
>> > I've checked the nutch-default and nutch-site files and can only find
>> > a single entry:
>> >
>> > <property>
>> >   <name>fetcher.timelimit.mins</name>
>> >   <value>-1</value>
>> >   <description>This is the number of minutes allocated to the fetching.
>> >   Once this value is reached, any remaining entry from the input URL
>> >   list is skipped and all active queues are emptied. The default value
>> >   of -1 deactivates the time limit.
>> >   </description>
>> > </property>
>> >
>> > With that setting, shouldn't there be no time limit at all?
>> >
>> > Any suggestions on what else might be causing this problem?
>> >
>> > Thanks
>> >
>> > P
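[Editor's note: Markus's suggestion above, dropping newer PDFBox jars in manually and referencing them in Tika's plugin.xml, would look roughly like the sketch below. The plugin directory path and the exact jar names are assumptions based on a typical Nutch 1.x binary layout (PDFBox 1.8.x splits into pdfbox, fontbox, and jempbox jars), not verified against this installation.]

```xml
<!-- Sketch of the <runtime> section of plugins/parse-tika/plugin.xml,
     assuming a standard Nutch 1.x layout. The upgraded jars are copied
     into the same plugin directory, and the old pdfbox/fontbox/jempbox
     <library> entries are replaced with the new versions. Jar names here
     are illustrative assumptions. -->
<runtime>
   <library name="parse-tika.jar">
      <export name="*"/>
   </library>
   <library name="pdfbox-1.8.8.jar"/>
   <library name="fontbox-1.8.8.jar"/>
   <library name="jempbox-1.8.8.jar"/>
   <!-- ...remaining bundled Tika dependency jars unchanged... -->
</runtime>
```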

