Do you have enough memory? 50 threads fetching PDFs with an older Tika version
will get you into trouble; that PDFBox version eats memory. Try upgrading to
the latest PDFBox: you can drop the jars in manually and reference them in the
Tika plugin's plugin.xml.
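
If you go that route, a rough sketch of the change is below. This is only an illustration: the exact jar names and version numbers depend on the Tika/PDFBox release you drop in, and must match the files you actually place in the parse-tika plugin's lib directory.

```xml
<!-- Hypothetical excerpt from src/plugin/parse-tika/plugin.xml.
     Jar names/versions are examples only; use the ones you actually
     copied into the plugin's lib directory. -->
<runtime>
   <library name="parse-tika.jar">
      <export name="*"/>
   </library>
   <!-- swap the old PDFBox entries for the upgraded jars, e.g.: -->
   <library name="pdfbox-1.8.10.jar"/>
   <library name="fontbox-1.8.10.jar"/>
   <library name="jempbox-1.8.10.jar"/>
</runtime>
```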

M

 
 
-----Original message-----
> From:Paul Rogers <[email protected]>
> Sent: Friday 9th January 2015 18:35
> To: [email protected]
> Subject: Problem with time out on QueueFeeder
> 
> Hi Guys
> 
> I am using nutch 1.8 to fetch pdf documents from an http server.  The jobs
> have been running OK until recently when I started getting the following
> error:
> 
> -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
> fetching
> http://server1/doccontrol/DC-10%20Incoming%20Correspondence(IAE-US)/15C_221427_IAE_LTR_IAE_0845%20Letter%20from%20Alvarez.pdf
> (queue crawl delay=5000ms)
> -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
> -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
> QueueFeeder finished: total 4655 records + hit by time limit :1184
> -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
> * queue: http://ws0895 >> dropping!
> -finishing thread FetcherThread, activeThreads=49
> -finishing thread FetcherThread, activeThreads=48
> .
> .
> .
> .
> -finishing thread FetcherThread, activeThreads=3
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1376)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1349)
> 
> New PDFs are added to the server every night and the whole of the content is
> then re-fetched, i.e. the content is growing, so I can understand that a
> limit might be reached.
> 
> I have searched on the error and it seems that this behaviour should be
> governed by the fetcher.timelimit.mins property.
> 
> I've checked the nutch-default and nutch-site files and can only find a
> single entry:
> 
> <property>
>   <name>fetcher.timelimit.mins</name>
>   <value>-1</value>
>   <description>This is the number of minutes allocated to the fetching.
>   Once this value is reached, any remaining entry from the input URL list
> is skipped
>   and all active queues are emptied. The default value of -1 deactivates
> the time limit.
>   </description>
> </property>
> 
> Shouldn't there therefore be no time limit?
> 
> Any suggestions on what else might be causing this problem?
> 
> Thanks
> 
> P
> 
