Hey, we are currently on Nutch 2.3.1 and using it to crawl our websites. One of our priorities is to get all the PDFs on our website crawled. Links on the different websites look like: https://assets0.mysite.com/asset/DB_product.pdf

I tried several things in the configuration: I removed every occurrence of pdf from regex-urlfilter.txt and added the download URL, added parse-tika to the plugins in nutch-site.xml, added application/pdf to http.accept in default-site.xml, and added pdf to parse-plugins.xml. But still no PDF link is being fetched.
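As a quick sanity check that the URL filter is not the culprit, the regex-urlfilter.txt pattern can be tested against the sample asset URL outside of Nutch. This is a Python sketch; Nutch's RegexURLFilter uses Java regexes with find() semantics, which behave the same for a simple pattern like this. The pattern below is my reading of the filter line (stray space removed, dots escaped), so treat it as an assumption:

```python
import re

# Assumed intent of the regex-urlfilter.txt accept rule
# (stray space removed, literal dots escaped):
accept_pattern = re.compile(r"https://assets.*\.mysite\.com/asset")

# Sample PDF link from the site
url = "https://assets0.mysite.com/asset/DB_product.pdf"

# RegexURLFilter accepts a URL if a "+" pattern is found anywhere in it
print("accepted" if accept_pattern.search(url) else "rejected")  # prints "accepted"
```

If this matches but the PDFs are still not fetched, the problem is more likely in the parse/fetch configuration than in the URL filter.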
regex-urlfilter.txt:

+https://assets.*\.mysite\.com/asset

parse-plugins.xml:

<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>

nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
</property>

default-site.xml:

<property>
  <name>http.accept</name>
  <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.</description>
</property>

Is there anything else I have to configure?

Thanks,
David