Hi David,

there are a couple of options to configure how the crawler follows links,
esp.
  db.max.outlinks.per.page
  db.ignore.external.links
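
For example, a sketch for nutch-site.xml (the values are only an
illustration, not a recommendation):

  <property>
   <name>db.ignore.external.links</name>
   <value>true</value>
   <!-- if true, outlinks pointing to external hosts are ignored; note that
        assets0.mysite.com is a different host than www.mysite.com -->
  </property>
  <property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <!-- -1: keep all outlinks of a page (default: 100) -->
  </property>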

Is the whitespace in the URLs intended?
> https://assets0.mysite.com/asset /DB_product.pdf
>>> +https://assets.*. mysite.com/asset

Spaces in URLs normally need to be encoded as '%20' (percent encoding)
or as '+' (form encoding, only in the query part after '?').
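
For illustration, assuming the space really belongs to the path (if it is
just a copy-paste or line-wrap artifact, simply remove it):

  https://assets0.mysite.com/asset%20/DB_product.pdf   (percent-encoded path)
  https://assets0.mysite.com/find?q=DB+product         (form-encoded query, hypothetical URL)

The same applies to the regex-urlfilter.txt rule quoted above: the space and
the unescaped dots make the pattern match something else than intended. A
possible corrected rule (just a sketch, assuming hosts assets0, assets1, ...):

  +^https://assets[0-9]*\.mysite\.com/asset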

If in doubt, start debugging:

1. check the logs

2. try
   $ bin/nutch parsechecker -dumpText http://.../xyz.html   (the HTML page linking the PDFs)
   $ bin/nutch parsechecker -dumpText http://.../xyz.pdf
   $ bin/nutch indexchecker http://.../xyz.pdf

3. inspect the storage (HBase, etc.) to check what is stored for the fetched
   PDFs and for the HTML pages containing the links (see the sketch below).
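
For HBase, a minimal sketch assuming the Nutch 2.x defaults (table 'webpage',
possibly prefixed with your crawl id; row keys are reversed URLs; see
gora-hbase-mapping.xml for the column layout):

   $ hbase shell
   hbase> scan 'webpage', {LIMIT => 10}
   hbase> get 'webpage', 'com.mysite.assets0:https/DB_product.pdf'

If rows for the PDF URLs are missing entirely, the URL filters are the first
suspect; if the rows exist but contain no parsed text, look at the parser
setup.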

Best,
Sebastian

On 08/09/2017 07:11 PM, d.ku...@technisat.de wrote:
> Hey Sebastian,
> 
> thanks a lot. I already increased it to around 65 MB. All our PDFs are
> about 3 to 8 MB big.
> Any other suggestions?
> ;)
> 
> 
> 
> Thanks
> David
> 
>> Am 09.08.2017 um 18:50 schrieb Sebastian Nagel <wastl.na...@googlemail.com>:
>>
>> Hi David,
>>
>> for PDFs you usually need to increase the following property:
>>
>> <property>
>>  <name>http.content.limit</name>
>>  <value>65536</value>
>>  <description>The length limit for downloaded content using the http
>>  protocol, in bytes. If this value is nonnegative (>=0), content longer
>>  than it will be truncated; otherwise, no truncation at all. Do not
>>  confuse this setting with the file.content.limit setting.
>>  </description>
>> </property>
>>
>> If in doubt, also set the equivalent properties ftp.content.limit and
>> file.content.limit.
>>
>> Best,
>> Sebastian
>>
>>> On 08/08/2017 03:00 PM, d.ku...@technisat.de wrote:
>>> Hey,
>>>
>>> we are currently on Nutch 2.3.1 and use it to crawl our websites.
>>> One of our goals is to get all the PDFs on our website crawled. Links on
>>> different websites look like: https://assets0.mysite.com/asset /DB_product.pdf
>>> I tried different things:
>>> In the configuration I removed every occurrence of pdf from
>>> regex-urlfilter.txt and added the download URL, added parse-tika to the
>>> plugin list in nutch-site.xml, added application/pdf to http.accept in
>>> default-site.xml, and added pdf to parse-plugins.xml.
>>> But still no PDF link is being fetched.
>>>
>>> regex-urlfilter.txt
>>> +https://assets.*. mysite.com/asset
>>>
>>> parse-plugins.xml
>>> <mimeType name="application/pdf">
>>>   <plugin id="parse-tika" />
>>> </mimeType>
>>>
>>> nutch-site.xml
>>> <property>
>>> <name>plugin.includes</name>
>>> <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
>>> </property>
>>>
>>> default-site.xml
>>> <property>
>>>  <name>http.accept</name>
>>>  <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>>>  <description>Value of the "Accept" request header field.
>>>  </description>
>>> </property>
>>>
>>> Is there anything else I have to configure?
>>>
>>> Thanks
>>>
>>> David
