> It is probably not the fetcher but the parser that gets stuck on the document.
> The http.content.limit must at least be 18 MB or the parser will die trying to
> parse it.

It does not seem to die, but just logs this:

824a6f94-aa5f-4dad-8621-5c59add4e7b6.pdf skipped. Content of size 19375003 was truncated to 1409024

But I then do not find any results for that URL in Solr. So the truncation also drops the whole document, and does not just index the first XY bytes?

> You might also want to take a look at memory consumption, there is a good
> change the JVM gets stuck because of this PDF.

Yep, that's known.

> Finally, you need parser.timeout also to be high enough, but that depends on
> CPU and available heap space.

So far this works.

Thanks
André
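For reference, the settings discussed above could look roughly like this in nutch-site.xml. This is only a sketch: the values are assumptions sized for the 19375003-byte PDF from the log line, not tested settings. parser.skip.truncated is not mentioned in the thread; as far as I understand, it is the setting that makes the parser skip a truncated document entirely instead of parsing the truncated part, which would match the "skipped" message.

```xml
<configuration>
  <!-- Maximum number of bytes to download per document.
       20971520 (20 MB) is an assumed value, chosen to be larger
       than the 19375003-byte PDF from the log. -->
  <property>
    <name>http.content.limit</name>
    <value>20971520</value>
  </property>

  <!-- Parsing timeout in seconds. 120 is an assumed value; what is
       actually needed depends on CPU and available heap space. -->
  <property>
    <name>parser.timeout</name>
    <value>120</value>
  </property>

  <!-- Assumption: with true, documents whose content was truncated
       are skipped by the parser rather than parsed partially. -->
  <property>
    <name>parser.skip.truncated</name>
    <value>true</value>
  </property>
</configuration>
```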

