Hi David,

for PDFs you usually need to increase the following property:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

If in doubt, also set the equivalent properties ftp.content.limit and
file.content.limit (a minimal example follows below the quoted message).

Best,
Sebastian

On 08/08/2017 03:00 PM, [email protected] wrote:
> Hey,
>
> currently we are on Nutch 2.3.1 and using it to crawl our websites.
> One of our main goals is to get all the PDFs on our website crawled.
> Links on the different websites look like:
> https://assets0.mysite.com/asset/DB_product.pdf
>
> I tried different things: in the configuration I removed every
> occurrence of pdf in regex-urlfilter.txt and added the download URL,
> added parse-tika to the plugins in nutch-site.xml, added
> application/pdf to http.accept in default-site.xml, and added pdf to
> parse-plugins.xml.
> But still no PDF link gets fetched.
>
> regex-urlfilter.txt
> +https://assets.*.mysite.com/asset
>
> parse-plugins.xml
> <mimeType name="application/pdf">
>     <plugin id="parse-tika" />
> </mimeType>
>
> nutch-site.xml
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
> </property>
>
> default-site.xml
> <property>
>   <name>http.accept</name>
>   <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>   <description>Value of the "Accept" request header field.
>   </description>
> </property>
>
> Is there anything else I have to configure?
>
> Thanks
>
> David
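For illustration, a minimal sketch of the overrides in conf/nutch-site.xml
might look like the following. The 10485760 value (10 MB) is only an
illustrative choice, not a recommended setting; per the description quoted
above, a negative value such as -1 disables truncation entirely.

<!-- Sketch: raise the content limits for all three protocols so that
     large PDFs are not truncated before parse-tika sees them.
     10485760 bytes = 10 MB is an arbitrary example value; set -1 for
     "no truncation at all". -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>10485760</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>10485760</value>
</property>

Overrides belong in nutch-site.xml rather than default-site.xml, since
nutch-site.xml takes precedence and survives upgrades.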

