Re: Crawling all file types (images, pdfs, etc...)

Laura McCord Fri, 21 Mar 2014 10:07:42 -0700

Thank you Vangelis, I'll give it a try :)

Laura



On 3/21/14 12:02 PM, Vangelis karv wrote:

Hi Laura! Nutch can filter URLs in several ways, including prefix, regex and domain filters.

I think if you want to crawl every file type you can erase this line from regex-urlfilter.txt:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

or change the leading - to a +.
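
To make the effect of that rule concrete, here is a minimal Python sketch of how a regex-urlfilter style rule list behaves. It assumes the usual semantics (rules are tried top to bottom, the first pattern found in the URL decides, and a URL matching no rule is rejected). The function names, the shortened suffix list and the example.com URLs are made up for illustration; this is not Nutch's own code.

```python
import re

# Hypothetical rule file in the style of regex-urlfilter.txt
# (suffix list shortened for the example).
RULES = r"""
# skip image and other suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|zip|ZIP|js|JS)$
# accept anything else
+.
"""

# The same file with the suffix rule erased.
ALLOW_ALL = "+.\n"


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped


if __name__ == "__main__":
    for name, text in (("default", RULES), ("allow all", ALLOW_ALL)):
        rules = parse_rules(text)
        for url in ("http://example.com/index.html",
                    "http://example.com/logo.png"):
            print(name, url, accepts(rules, url))
```

With the suffix rule in place the .png URL is rejected before the catch-all `+.` is reached; once the rule is erased, the catch-all accepts it.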

Also, you need to add this to nutch-site.xml:
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
    protocol, in bytes. If this value is nonnegative (>=0), content longer
    than it will be truncated; otherwise, no truncation at all. Do not
    confuse this setting with the http.content.limit setting.
  </description>
</property>

so no content gets truncated! Note that this property only covers the file
protocol; if you are fetching over HTTP, set http.content.limit to -1 in
the same way.

I primarily use urlfilter-domain but I think these steps are correct!

Have fun with Nutch,
Vangelis!

Date: Fri, 21 Mar 2014 11:34:59 -0500
From: [email protected]
To: [email protected]
Subject: Crawling all file types (images, pdfs, etc...)

Hi,

I am new to Nutch as of this morning, having just set up Nutch and
Solr. I was going through an example and was wondering how I can parse
everything in a given site. I need to gather all the images, PDFs, HTML,
forms, AutoCAD files, etc.

I did some configuring of nutch-site.xml and regex-urlfilter.txt based
on the tutorial.

In particular, I noticed this line in regex-urlfilter.txt, which I
think I need to do away with in order to get everything, right?

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

In nutch-site.xml, I'm also thinking I need to add this, or broaden
it further:

<property>
<name>http.accept</name>
<value>text/html,application/xhtml+xml,application/xml;q=0.9,application/pdf;q=0.9,*/*;q=0.8</value>
<description>Value of the "Accept" request header field.
</description>
</property>
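
In an Accept value, commas separate media ranges and semicolons attach parameters such as q to the range before them, so the separator between two types matters. The sketch below is a deliberately simplified parser (the `parse_accept` helper is made up for illustration and ignores quoted strings) that shows how a value is split up.

```python
def parse_accept(value):
    """Split an Accept header value into (media_range, params) pairs.

    Simplified: commas separate media ranges, semicolons separate a
    range from its parameters. Quoted strings are not handled.
    """
    ranges = []
    for part in value.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_range = pieces[0]
        params = {}
        for piece in pieces[1:]:
            key, _, val = piece.partition("=")
            params[key] = val
        ranges.append((media_range, params))
    return ranges


if __name__ == "__main__":
    # Semicolon between two types: the second type is read as a
    # parameter of the first, not as its own media range.
    print(parse_accept("application/xml;application/pdf;q=0.9"))
    # Comma between the two types: each is its own media range.
    print(parse_accept("application/xml;q=0.9,application/pdf;q=0.9"))
```

In the first call only application/xml comes out as a media range, so a server reading the header that way never sees application/pdf as an accepted type.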

Or is there a configuration I can use that simply means "get
everything"?

Thanks,
Laura
