Hi David,

for PDFs you usually need to increase the following property:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

If in doubt, also set the equivalent properties ftp.content.limit and
file.content.limit (a minimal example follows below the quoted message).

Best,
Sebastian

On 08/08/2017 03:00 PM, [email protected] wrote:
> Hey,
>
> currently we are on Nutch 2.3.1 and using it to crawl our websites.
> One of our main goals is to get all the PDFs on our website crawled.
> Links on the different websites look like:
> https://assets0.mysite.com/asset/DB_product.pdf
>
> I tried different things: in the configuration I removed every
> occurrence of pdf in regex-urlfilter.txt and added the download URL,
> added parse-tika to the plugins in nutch-site.xml, added
> application/pdf to http.accept in default-site.xml, and added pdf to
> parse-plugins.xml.
> But still no PDF link gets fetched.
>
> regex-urlfilter.txt
> +https://assets.*.mysite.com/asset
>
> parse-plugins.xml
> <mimeType name="application/pdf">
>     <plugin id="parse-tika" />
> </mimeType>
>
> nutch-site.xml
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
> </property>
>
> default-site.xml
> <property>
>   <name>http.accept</name>
>   <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>   <description>Value of the "Accept" request header field.
>   </description>
> </property>
>
> Is there anything else I have to configure?
>
> Thanks
>
> David
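For illustration, a minimal sketch of the overrides in conf/nutch-site.xml
might look like the following. The 10485760 value (10 MB) is only an
illustrative choice, not a recommended setting; per the description quoted
above, a negative value such as -1 disables truncation entirely.

<!-- Sketch: raise the content limits for all three protocols so that
     large PDFs are not truncated before parse-tika sees them.
     10485760 bytes = 10 MB is an arbitrary example value; set -1 for
     "no truncation at all". -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>10485760</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>10485760</value>
</property>

Overrides belong in nutch-site.xml rather than default-site.xml, since
nutch-site.xml takes precedence and survives upgrades.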

