Hey Sebastian,

I already changed the value of db.max.outlinks.per.page to "-1".
And db.ignore.external.links is set to "false", as the assets are sometimes on
other domains.
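
In nutch-site.xml those two settings now look roughly like this (just a sketch of
the values described above):

<property>
 <name>db.max.outlinks.per.page</name>
 <value>-1</value>
 <description>-1 means no limit on outlinks processed per page.</description>
</property>
<property>
 <name>db.ignore.external.links</name>
 <value>false</value>
 <description>false, so links to other domains (e.g. assets0.mysite.com) are followed.</description>
</property>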

Sorry about the whitespace --> copy-paste mistake. Normally there is no
whitespace, or it is encoded the right way.

thanks

David


-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: Thursday, 10 August 2017 09:22
To: user@nutch.apache.org
Subject: Re: fetching pdfs from our website

Hi David,

there are a couple of options to configure how links are followed by the 
crawler, esp.
  db.max.outlinks.per.page
  db.ignore.external.links

Is the white space in the URLs intended?
> https://assets0.mysite.com/asset /DB_product.pdf
>>> +https://assets.*. mysite.com/asset

URLs normally require spaces to be encoded as '%20' (percent encoding) or '+'
(form encoding, after the '?').
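
For example, if the space were really part of the path, the link would have to
be written as (just an illustration of the encoding):

  https://assets0.mysite.com/asset%20/DB_product.pdf

Otherwise the space is most likely a stray character that should simply be removed.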

If in doubt, start debugging:

1. check the logs

2. try
   $ bin/nutch parsechecker -dumpText http://.../xyz.html        (PDFs linked from here)
   $ bin/nutch parsechecker -dumpText http://.../xyz.pdf
   $ bin/nutch indexchecker http://.../xyz.pdf

3. inspect the storage (HBase, etc.) to see what has been stored for the fetched
PDFs and for the HTML pages containing the links.
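
E.g. with the HBase shell (just a sketch; the table is usually called 'webpage',
possibly prefixed with your crawl id):

   $ hbase shell
   hbase> scan 'webpage', {LIMIT => 1}
   hbase> get 'webpage', '<reversed-url-row-key>'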

Best,
Sebastian

On 08/09/2017 07:11 PM, d.ku...@technisat.de wrote:
> Hey Sebastian,
> 
> thanks a lot. I already increased it to around 65 MB. All our PDFs are about
> 3 to 8 MB big.
> Any other suggestions?
> ;)
> 
> 
> 
> Thanks
> David
> 
>> On 09.08.2017 at 18:50, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>>
>> Hi David,
>>
>> for PDFs you usually need to increase the following property:
>>
>> <property>
>>  <name>http.content.limit</name>
>>  <value>65536</value>
>>  <description>The length limit for downloaded content using the http
>>  protocol, in bytes. If this value is nonnegative (>=0), content longer
>>  than it will be truncated; otherwise, no truncation at all. Do not
>>  confuse this setting with the file.content.limit setting.
>>  </description>
>> </property>
>>
>> If in doubt, also set the equivalent properties ftp.content.limit and
>> file.content.limit.
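>>
>> For PDFs of 3 to 8 MB you would need a much larger value than the 65536 shown
>> above, roughly like this (just a sketch; -1 disables truncation entirely):
>>
>> <property>
>>  <name>http.content.limit</name>
>>  <value>67108864</value>  <!-- 64 MB; or -1 for no truncation at all -->
>> </property>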
>>
>> Best,
>> Sebastian
>>
>>> On 08/08/2017 03:00 PM, d.ku...@technisat.de wrote:
>>> Hey,
>>>
>>> currently we are on nutch 2.3.1 and using it to crawl our websites.
>>> One of our goals is to get all the PDFs on our website crawled.
>>> Links on different websites look like: https://assets0.mysite.com/asset /DB_product.pdf
>>> I tried different things:
>>> In the configuration I removed every occurrence of pdf in
>>> regex-urlfilter.txt and added the download URL, added parse-tika to
>>> the plugins in nutch-site.xml, added application/pdf to http.accept in
>>> default-site.xml, and added pdf to parse-plugins.xml.
>>> But still no PDF link is being fetched.
>>>
>>> regex-urlfilter.txt
>>> +https://assets.*. mysite.com/asset
>>>
>>> parse-plugins.xml
>>> <mimeType name="application/pdf">
>>>     <plugin id="parse-tika" />
>>> </mimeType>
>>>
>>> nutch-site.xml
>>> <property>
>>>  <name>plugin.includes</name>
>>>  <value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
>>> </property>
>>>
>>> default-site.xml
>>> <property>
>>>  <name>http.accept</name>
>>>  <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>>>  <description>Value of the "Accept" request header field.
>>>  </description>
>>> </property>
>>>
>>> Is there anything else I have to configure?
>>>
>>> Thanks
>>>
>>> David
