Re: fetching pdfs from our website

[email protected] Wed, 09 Aug 2017 10:12:38 -0700

Hey Sebastian,

Thanks a lot. I already increased it to around 65 MB. All our PDFs are about 3 to
8 MB in size.
Any other suggestions?
;)
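
For reference, an override of roughly that size could look like the following in nutch-site.xml. This is only a sketch: the exact value (65 * 1024 * 1024 bytes) and the file it lives in are assumptions, not copied from the actual configuration.

```xml
<property>
  <name>http.content.limit</name>
  <!-- assumed value: 65 * 1024 * 1024 = 68157440 bytes (about 65 MB) -->
  <value>68157440</value>
</property>
```

Per the property description quoted below, a negative value would turn truncation off entirely.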




Thanks
David

> On 09.08.2017 at 18:50, Sebastian Nagel <[email protected]> wrote:
> 
> Hi David,
> 
> for PDFs you usually need to increase the following property:
> 
> <property>
>  <name>http.content.limit</name>
>  <value>65536</value>
>  <description>The length limit for downloaded content using the http
>  protocol, in bytes. If this value is nonnegative (>=0), content longer
>  than it will be truncated; otherwise, no truncation at all. Do not
>  confuse this setting with the file.content.limit setting.
>  </description>
> </property>
> 
> If in doubt, also set the equivalent properties ftp.content.limit and
> file.content.limit.
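> 
> A minimal sketch of those two overrides, assuming they are placed in
> nutch-site.xml and that a negative value disables truncation for them the
> same way as described for http.content.limit above (the value -1 is only
> an illustration):
> 
> ```xml
> <property>
>   <name>ftp.content.limit</name>
>   <!-- assumption: negative means no truncation, as for http.content.limit -->
>   <value>-1</value>
> </property>
> <property>
>   <name>file.content.limit</name>
>   <!-- assumption: negative means no truncation, as for http.content.limit -->
>   <value>-1</value>
> </property>
> ```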
> 
> Best,
> Sebastian
> 
>> On 08/08/2017 03:00 PM, [email protected] wrote:
>> Hey,
>> 
>> currently we are on Nutch 2.3.1 and use it to crawl our websites.
>> One of our main goals is to get all the PDFs on our website crawled. Links on
>> different websites look like: https://assets0.mysite.com/asset/DB_product.pdf
>> I tried different things:
>> In the configuration I removed every occurrence of pdf in
>> regex-urlfilter.txt and added the download URL, added parse-tika to the
>> plugins in nutch-site.xml, added application/pdf to http.accept in
>> default-site.xml, and added pdf to parse-plugins.xml.
>> But still no PDF link is being fetched.
>> 
>> regex-urlfilter.txt
>> +https://assets.*.mysite.com/asset
>> 
>> parse-plugins.xml
>> <mimeType name="application/pdf">
>>   <plugin id="parse-tika" />
>> </mimeType>
>> 
>> nutch-site.xml
>> <property>
>> <name>plugin.includes</name>
>> <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
>> </property>
>> 
>> default-site.xml
>> <property>
>>  <name>http.accept</name>
>>  <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>>  <description>Value of the "Accept" request header field.
>>  </description>
>> </property>
>> 
>> Is there anything else I have to configure?
>> 
>> Thanks
>> 
>> David
>> 
>> 
>> 
>
