RE: Crawling all file types (images, pdfs, etc...)

Vangelis karv Fri, 21 Mar 2014 10:03:23 -0700

Hi Laura! Nutch has several ways to filter URLs, including prefix, regex and domain filters.


I think if you want to crawl every file type, you can erase this line from regex-urlfilter.txt:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

or change the leading - to a + so those suffixes are accepted instead of skipped.
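
To see why erasing that rule matters, here is a rough sketch of how I understand the rule list to be evaluated. It is a simplified model in Python, not Nutch's actual code, and it assumes that rules are tried top to bottom, that the first pattern found in the URL decides, and that a URL matching no rule is rejected. The function names and the example.com URLs are made up for illustration.

```python
import re

# The suffix rule quoted in this thread, plus a catch-all "+." rule.
# Assumption: the catch-all stands in for whatever accept rule ends your file.
SUFFIX_RULE = (
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
    r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)
CATCH_ALL = "+."


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


default_rules = parse_rules([SUFFIX_RULE, CATCH_ALL])
relaxed_rules = parse_rules([CATCH_ALL])  # suffix rule erased

print(accepts(default_rules, "http://example.com/logo.png"))    # False
print(accepts(relaxed_rules, "http://example.com/logo.png"))    # True
print(accepts(default_rules, "http://example.com/index.html"))  # True
```

With the suffix rule in place the image URL is dropped before it is ever fetched; without it, the catch-all rule lets it through.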

Also, you need to add
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
   protocol, in bytes. If this value is nonnegative (>=0), content longer
   than it will be truncated; otherwise, no truncation at all. Do not
   confuse this setting with the http.content.limit setting.
  </description>
</property>

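so no content gets truncated! Note that this property covers the file protocol only; for pages fetched over HTTP, the equivalent setting is http.content.limit.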
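
The rule in that description is easy to get backwards, so here is a minimal sketch of it in Python. The helper name apply_content_limit is hypothetical and this is not Nutch's implementation; it only mirrors the wording above: a nonnegative limit truncates longer content, a negative one disables truncation.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Mimic the documented rule: truncate only when limit >= 0."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]  # keep the first `limit` bytes
    return content              # negative limit, or content already short enough


document = b"x" * 100_000  # stand-in for a large PDF or image

print(len(apply_content_limit(document, 65536)))  # 65536
print(len(apply_content_limit(document, -1)))     # 100000
```

A truncated PDF or image usually cannot be parsed, which is why -1 is the safe value when you want every file type.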

I primarily use urlfilter-domain but I think these steps are correct!

Have fun with Nutch, 
Vangelis!

> Date: Fri, 21 Mar 2014 11:34:59 -0500
> From: [email protected]
> To: [email protected]
> Subject: Crawling all file types (images, pdfs, etc...)
> 
> Hi,
> 
> I am new to Nutch as of this morning after just setting up Nutch and 
> Solr. I was going through an example and I was wondering how do I parse 
> everything in a given site? I need to gather all the images, pdfs, html, 
> forms, autocad files, etc....
> 
> I did some configuring of nutch-site.xml and regex-urlfilter.txt based 
> on the tutorial.
> 
> In particular, I noticed this line in regex-urlfilter.txt, which I 
> think I need to do away with in order to get everything, right?
> 
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> 
> 
> Also, in nutch-site.xml, I'm thinking I need to add this or broaden 
> it more:
> 
> <property>
> <name>http.accept</name>
> <value>text/html,application/xhtml+xml,application/xml;application/pdf;q=0.9,*/*;q=0.8</value>
> <description>Value of the "Accept" request header field.
> </description>
> </property>
> 
> Unless, is there a configuration that I can use that just assumes.."get 
> everything"?
> 
> Thanks,
> Laura
>
