Thanks a lot. I already increased it to around 65 MB. All our PDFs are about 3 to
Any other suggestions?
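For reference, here is roughly what I put into conf/nutch-site.xml (a sketch from memory; the value 68157440 bytes is my approximation of 65 MB, and -1 would disable truncation entirely):

```xml
<!-- Raise the per-protocol content limits so large PDFs are not truncated. -->
<property>
  <name>http.content.limit</name>
  <value>68157440</value> <!-- ~65 MB; use -1 for no limit -->
</property>
<property>
  <name>ftp.content.limit</name>
  <value>68157440</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>68157440</value>
</property>
```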
> Am 09.08.2017 um 18:50 schrieb Sebastian Nagel <wastl.na...@googlemail.com>:
> Hi David,
> for PDFs you usually need to increase the following property (http.content.limit):
> <description>The length limit for downloaded content using the http
> protocol, in bytes. If this value is nonnegative (>=0), content longer
> than it will be truncated; otherwise, no truncation at all. Do not
> confuse this setting with the file.content.limit setting.
> In doubt, also set the equivalent properties ftp.content.limit and file.content.limit.
>> On 08/08/2017 03:00 PM, d.ku...@technisat.de wrote:
>> Hey,
>> currently we are on Nutch 2.3.1 and are using it to crawl our websites.
>> One of our focuses is to get all the PDFs on our website crawled. Links on
>> different websites look like: https://assets0.mysite.com/asset/DB_product.pdf
>> I tried different things:
>> In the configuration I removed every occurrence of pdf in
>> regex-urlfilter.txt and added the download URL, added parse-tika to the
>> plugins in nutch-site.xml, added application/pdf to http.accept in
>> default-site.xml, and added pdf to parse-plugins.xml.
>> But still no PDF link is being fetched.
>> +https://assets.*.mysite.com/asset
>> <mimeType name="application/pdf">
>> <plugin id="parse-tika" />
>> <description>Value of the "Accept" request header field.
>> Is there anything else I have to configure?
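For context, the steps described above amount to something like the following (a minimal sketch; the exact plugin.includes value depends on the rest of the setup, and the plugin list shown is an assumption):

```xml
<!-- conf/nutch-site.xml: make sure parse-tika is among the included plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- conf/parse-plugins.xml: route application/pdf to the Tika parser -->
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
```

In conf/regex-urlfilter.txt there would additionally be an accept rule for the asset host (like the `+https://assets.*.mysite.com/asset` line quoted above), with no earlier rule rejecting `.pdf` URLs, since the first matching rule wins.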