Hey Sebastian, thanks a lot. I already increased it to around 65 MB. All our PDFs are about 3 to 8 MB in size. Any other suggestions? ;)
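
For reference, a minimal sketch of what such an override could look like in nutch-site.xml. The value 68157440 is an assumption (65 * 1024 * 1024 bytes); the exact number that was set is not stated in this thread. Going by the property description quoted below, a negative value would turn truncation off entirely.

```xml
<!-- Sketch only: goes inside the <configuration> element of nutch-site.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- Assumed value: 65 * 1024 * 1024 = 68157440 bytes.
       Per the property description, a negative value such as -1
       means no truncation at all. -->
  <value>68157440</value>
</property>
```

With PDFs of 3 to 8 MB, a limit in this range should leave enough headroom, so truncation by http.content.limit is unlikely to be the remaining cause.
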
Thanks
David

> On 09.08.2017 at 18:50, Sebastian Nagel <[email protected]> wrote:
>
> Hi David,
>
> for PDFs you usually need to increase the following property:
>
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> In doubt, also set the equivalent properties ftp.content.limit and
> file.content.limit
>
> Best,
> Sebastian
>
>> On 08/08/2017 03:00 PM, [email protected] wrote:
>> Hey currently,
>>
>> we are on nutch 2.3.1 and using it to crawl our websites.
>> One of our focuses is to get all the pdfs on our website crawled. -> Links on
>> different Websites are like: https://assets0.mysite.com/asset /DB_product.pdf
>> I tried different things:
>> In the configuration I removed every occurrence of pdf in
>> regex-urlfilter.txt and added the download url, added parse-tika to
>> nutch-site.xml in plugins, added application/pdf in default-site.xml in
>> http-accept, added pdf to parse-plugins.xml.
>> But still no pdf link is being fetched.
>>
>> regex-urlfilter.txt
>> +https://assets.*. mysite.com/asset
>>
>> parse-plugins.xml
>> <mimeType name="application/pdf">
>>   <plugin id="parse-tika" />
>> </mimeType>
>>
>> nutch-site.xml
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
>> </property>
>>
>> default-site.xml
>> <property>
>>   <name>http.accept</name>
>>   <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>>   <description>Value of the "Accept" request header field.
>>   </description>
>> </property>
>>
>> Is there anything else I have to configure?
>>
>> Thanks
>>
>> David

