Hey,

currently we are on Nutch 2.3.1 and using it to crawl our websites.
One of our goals is to get all the PDFs on our website crawled. Links on the
different websites look like: https://assets0.mysite.com/asset/DB_product.pdf
I tried different things:
In the configuration I removed every occurrence of pdf in regex-urlfilter.txt
and added the download URL, added parse-tika to the plugins in nutch-site.xml,
added application/pdf to http.accept in default-site.xml, and added pdf to
parse-plugins.xml.
But still no PDF links are being fetched.

regex-urlfilter.txt
+https://assets.*.mysite.com/asset
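(What I want that rule to match is the asset host from the example link above.
Maybe a tighter variant with escaped dots would be safer, e.g.

+^https://assets[0-9]*\.mysite\.com/asset

but I am not sure whether that makes a difference here.)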

parse-plugins.xml
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
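(If I read the default file correctly, the aliases section already maps parse-tika
to the Tika parser extension, roughly like

<alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.TikaParser" />

so I assume that part does not need any changes.)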

nutch-site.xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
</property>
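One thing I was unsure about: the PDFs are fairly large, so do I also need to raise
the content limit? Something like this in nutch-site.xml (the value is just a guess):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Do not truncate fetched content, so PDFs are not cut off before parsing.</description>
</property>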

default-site.xml
<property>
  <name>http.accept</name>
  <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.</description>
</property>
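(I was also not sure whether editing default-site.xml is the right place at all;
maybe the override should rather go into nutch-site.xml instead, i.e.

<property>
  <name>http.accept</name>
  <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
</property>

placed there rather than in the default file.)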

Is there anything else I have to configure?
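For testing single URLs I was planning to run the parse checker against the example
link, something like

bin/nutch parsechecker https://assets0.mysite.com/asset/DB_product.pdf

(not sure whether that is the right tool for this on 2.x).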

Thanks

David


