Hi Markus

Rebooting the server frees up 23GB of memory.

Have installed PDFBox 1.8.8 and am running the fetch again. Will update you
on the results.

Thanks

P

On 9 January 2015 at 14:11, Paul Rogers <[email protected]> wrote:

> Thanks, Markus
>
> I will try that and see if it fixes things.
>
> The server has 24GB of memory but only about 1GB free even without the
> nutch process running!
>
> Are the PDFBox jars bundled with Tika 1.6 (PDFBox 1.8.6) likely to have
> fixed this, or should I go for 1.8.8 from the PDFBox site?
>
> Thanks again
>
> P
>
> On 9 January 2015 at 13:46, Markus Jelsma <[email protected]>
> wrote:
>
>> Do you have enough memory? 50 threads and PDFs and an older Tika
>> version will get you in trouble. That PDFBox version eats memory! Try
>> upgrading to the latest PDFBox; you can drop the jars in manually and
>> reference them in Tika's plugin.xml.
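>>
>> A sketch of what that looks like in parse-tika's plugin.xml (the exact
>> jar names are assumptions here and depend on the PDFBox release you
>> drop in):
>>
>> <runtime>
>>   <library name="parse-tika.jar">
>>     <export name="*"/>
>>   </library>
>>   <!-- assumed names: jars copied manually into the plugin directory,
>>        replacing the older PDFBox entries -->
>>   <library name="pdfbox-1.8.8.jar"/>
>>   <library name="fontbox-1.8.8.jar"/>
>>   <library name="jempbox-1.8.8.jar"/>
>> </runtime>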
>>
>> M
>>
>>
>>
>> -----Original message-----
>> > From:Paul Rogers <[email protected]>
>> > Sent: Friday 9th January 2015 18:35
>> > To: [email protected]
>> > Subject: Problem with time out on QueueFeeder
>> >
>> > Hi Guys
>> >
>> > I am using Nutch 1.8 to fetch PDF documents from an HTTP server. The
>> > jobs have been running OK until recently, when I started getting the
>> > following error:
>> >
>> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
>> > fetching http://server1/doccontrol/DC-10%20Incoming%20Correspondence(IAE-US)/15C_221427_IAE_LTR_IAE_0845%20Letter%20from%20Alvarez.pdf (queue crawl delay=5000ms)
>> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
>> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
>> > QueueFeeder finished: total 4655 records + hit by time limit :1184
>> > -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
>> > * queue: http://ws0895 >> dropping!
>> > -finishing thread FetcherThread, activeThreads=49
>> > -finishing thread FetcherThread, activeThreads=48
>> > .
>> > .
>> > .
>> > .
>> > -finishing thread FetcherThread, activeThreads=3
>> > -finishing thread FetcherThread, activeThreads=2
>> > -finishing thread FetcherThread, activeThreads=1
>> > -finishing thread FetcherThread, activeThreads=0
>> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> > -activeThreads=0
>> > Fetcher: java.io.IOException: Job failed!
>> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
>> >         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1376)
>> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1349)
>> >
>> > New PDFs are added to the server every night and the whole of the
>> > content is then re-fetched, i.e. the content is growing, so I can
>> > understand that a limit might be reached.
>> >
>> > I have searched on the error and it seems that this behaviour should be
>> > governed by the fetcher.timelimit.mins property.
>> >
>> > I've checked the nutch-default.xml and nutch-site.xml files and can only
>> > find a single entry:
>> >
>> > <property>
>> >   <name>fetcher.timelimit.mins</name>
>> >   <value>-1</value>
>> >   <description>This is the number of minutes allocated to the fetching.
>> >   Once this value is reached, any remaining entry from the input URL list
>> >   is skipped and all active queues are emptied. The default value of -1
>> >   deactivates the time limit.
>> >   </description>
>> > </property>
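>> >
>> > (For completeness, I understand an override in nutch-site.xml would
>> > look something like the following; 180 here is just an arbitrary
>> > example value, not anything I have set:
>> >
>> > <property>
>> >   <name>fetcher.timelimit.mins</name>
>> >   <value>180</value>
>> > </property>
>> >
>> > but there is no such override in my nutch-site.xml.)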
>> >
>> > Shouldn't that mean there is no time limit?
>> >
>> > Any suggestions on what else might be causing this problem?
>> >
>> > Thanks
>> >
>> > P
>> >
>>
>
>
