Hi David,

there are a couple of options to configure how links are followed by the
crawler, in particular:

  db.max.outlinks.per.page
  db.ignore.external.links
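For reference, both properties can be overridden in conf/nutch-site.xml. The snippet below is only an illustration (in Nutch, db.max.outlinks.per.page defaults to 100, with -1 meaning unlimited, and db.ignore.external.links defaults to false); pick values that fit your crawl:

  <!-- conf/nutch-site.xml: example values, adjust to your crawl -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>Maximum number of outlinks processed per page;
    -1 means no limit (the default is 100).</description>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If true, outlinks pointing to a different host
    are ignored (default: false).</description>
  </property>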
Is the white space in the URLs intended?

> https://assets0.mysite.com/asset /DB_product.pdf
>>> +https://assets.*. mysite.com/asset

URLs normally require spaces to be encoded as '%20' (percent encoding)
or '+' (form encoding, in the query string after '?').

If in doubt, start debugging:

1. check the logs

2. try
   $ bin/nutch parsechecker -dumpText http://.../xyz.html   (the page the PDFs are linked from)
   $ bin/nutch parsechecker -dumpText http://.../xyz.pdf
   $ bin/nutch indexchecker http://.../xyz.pdf

3. inspect the storage (HBase, etc.) to see what is stored for the
   fetched PDFs and for the HTML pages containing the links.

Best,
Sebastian

On 08/09/2017 07:11 PM, d.ku...@technisat.de wrote:
> Hey Sebastian,
>
> thanks a lot. I already increased it to around 65MB. All our PDFs are
> about 3 to 8 MB big.
> Any other suggestions?
> ;)
>
> Thanks
> David
>
>> Am 09.08.2017 um 18:50 schrieb Sebastian Nagel <wastl.na...@googlemail.com>:
>>
>> Hi David,
>>
>> for PDFs you usually need to increase the following property:
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>65536</value>
>>   <description>The length limit for downloaded content using the http
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
>> If in doubt, also set the equivalent properties ftp.content.limit and
>> file.content.limit.
>>
>> Best,
>> Sebastian
>>
>>> On 08/08/2017 03:00 PM, d.ku...@technisat.de wrote:
>>> Hey,
>>>
>>> we are currently on Nutch 2.3.1 and using it to crawl our websites.
>>> One of our goals is to get all the PDFs on our website crawled.
>>> Links on different websites look like:
>>> https://assets0.mysite.com/asset /DB_product.pdf
>>>
>>> I tried different things:
>>> In the configuration I removed every occurrence of pdf in
>>> regex-urlfilter.txt and added the download URL, added parse-tika to
>>> plugin.includes in nutch-site.xml, added application/pdf to
>>> http.accept in default-site.xml, and added pdf to parse-plugins.xml.
>>> But still no PDF link is being fetched.
>>>
>>> regex-urlfilter.txt
>>> +https://assets.*. mysite.com/asset
>>>
>>> parse-plugins.xml
>>> <mimeType name="application/pdf">
>>>     <plugin id="parse-tika" />
>>> </mimeType>
>>>
>>> nutch-site.xml
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
>>> </property>
>>>
>>> default-site.xml
>>> <property>
>>>   <name>http.accept</name>
>>>   <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>>>   <description>Value of the "Accept" request header field.
>>>   </description>
>>> </property>
>>>
>>> Is there anything else I have to configure?
>>>
>>> Thanks
>>>
>>> David
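P.S. The space-encoding rule mentioned above is easy to check with Python's standard library; this small sketch uses a path modeled on the example URL from this thread:

```python
from urllib.parse import quote, quote_plus

# Percent encoding for path components: a space becomes %20
path = "/asset /DB_product.pdf"
print(quote(path))               # -> /asset%20/DB_product.pdf

# Form encoding (query string, after '?'): a space becomes '+'
print(quote_plus("DB product"))  # -> DB+product
```

If the HTML on the site really contains a raw space in the href, the link is broken at the source and should be fixed (or normalized) before the crawler can fetch it.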