AW: Nutch 1.12 get stuck on same document

André Schild Wed, 01 Feb 2017 03:02:10 -0800

>It is probably not the fetcher but the parser that gets stuck on the document. 
>The http.content.limit must at least be 18 MB or the parser will die trying to 
>parse it.


It does not seem to die, it just logs this:
824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf skipped. Content of size 19375003 was 
truncated to 1409024

But I then do not find any results for that URL in Solr. So does the truncation 
drop the whole document, rather than just indexing the first XY bytes?

>You might also want to take a look at memory consumption, there is a good 
>chance the JVM gets stuck because of this PDF.
Yep, that’s known.

> Finally, you need parser.timeout also to be high enough, but that depends on 
> CPU and available heap space.
This works so far with the current setting.
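
For reference, a minimal sketch of the kind of overrides discussed above, as they 
might look in conf/nutch-site.xml. The values are illustrative assumptions, not 
the settings actually used here. The parser.skip.truncated property is not 
mentioned in this thread. As far as I know it defaults to true, which would 
explain why a truncated document is skipped entirely instead of being indexed 
partially.

```xml
<!-- conf/nutch-site.xml: illustrative overrides, values are assumptions -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- 20 MiB, above the 19375003 bytes reported in the log; a negative value disables truncation -->
    <value>20971520</value>
  </property>
  <property>
    <name>parser.timeout</name>
    <!-- seconds allowed per document; 120 is a guess, tune it to CPU and heap -->
    <value>120</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <!-- assumed default is true: truncated content is not parsed at all -->
    <value>true</value>
  </property>
</configuration>
```

Raising http.content.limit above the file size avoids the truncation in the 
first place, so parser.skip.truncated should only need changing if partial 
documents are really wanted in the index.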

Thanks
André
