It is truncated because http.content.limit is not high enough to accommodate the 
PDF. Increase the value for that setting to 20 MB, and you're good to go, for 
that URL at least.
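In nutch-site.xml that could look something like the sketch below (the exact values are just an example; 20 MB is 20971520 bytes, and -1 disables a limit entirely):

```xml
<!-- Example override in conf/nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- bytes; default is 65536, -1 removes the limit -->
  <value>20971520</value>
</property>
<property>
  <name>parser.timeout</name>
  <!-- seconds; example value, raise it if large PDFs need more time -->
  <value>120</value>
</property>
```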

Markus
 
-----Original message-----
> From:André Schild <[email protected]>
> Sent: Wednesday 1st February 2017 12:01
> To: [email protected]
> Subject: AW: Nutch 1.12 get stuck on same document
> 
> >It is probably not the fetcher but the parser that gets stuck on the 
> >document. 
> >The http.content.limit must at least be 18 MB or the parser will die trying 
> >to parse it. 
> 
> It does not seem to die, but just logs this:
> 824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf skipped. Content of size 19375003 
> was truncated to 1409024
> 
> But I then do not find any results for that URL in Solr. So does the 
> truncation also drop the whole document, rather than just indexing the first 
> XY bytes?
> 
> >You might also want to take a look at memory consumption, there is a good 
> >chance the JVM gets stuck because of this PDF.
> Yep, that’s known.
> 
> > Finally, you need parser.timeout also to be high enough, but that depends 
> > on CPU and available heap space.
> Currently this works so far
> 
> Thanks
> André
> 
