Crawling all file types (images, pdfs, etc...)

Laura McCord Fri, 21 Mar 2014 09:35:52 -0700

Hi,

I am new to Nutch as of this morning, having just set up Nutch and Solr. I was going through an example and was wondering: how do I parse everything on a given site? I need to gather all the images, PDFs, HTML, forms, AutoCAD files, etc.

I did some configuring of nutch-site.xml and regex-urlfilter.txt based on the tutorial.

In particular, I noticed this line in regex-urlfilter.txt, which I think I need to remove in order to get everything, right?


# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
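
For reference, a minimal sketch (Python, not Nutch code) of what that rule appears to do, assuming the leading "-" marks an exclude rule and the rest of the line is a regular expression searched against each URL. The name is_skipped and the example.com URLs are made up for illustration.

```python
import re

# Pattern copied from the regex-urlfilter.txt line above, minus the leading "-"
# (assumption: "-" means "exclude URLs that match").
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP"
    r"|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG"
    r"|bmp|BMP|js|JS)$"
)


def is_skipped(url: str) -> bool:
    """Return True if the URL ends in one of the listed suffixes (hypothetical helper)."""
    return SKIP_SUFFIXES.search(url) is not None


if __name__ == "__main__":
    # Hypothetical sample URLs, one per file type of interest.
    for url in (
        "http://example.com/logo.png",
        "http://example.com/manual.pdf",
        "http://example.com/drawing.dwg",
        "http://example.com/index.html",
    ):
        print(url, "skipped" if is_skipped(url) else "kept")
```

Under that assumption, the image suffixes are caught by this rule, while .pdf, .dwg and .html URLs are not, so this line alone would only explain the missing images.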

Also, in nutch-site.xml, I'm thinking I need to add this or broaden it further:


<property>
<name>http.accept</name>
<value>text/html,application/xhtml+xml,application/xml;application/pdf;q=0.9,*/*;q=0.8</value>
<description>Value of the "Accept" request header field.
</description>
</property>

Or is there a configuration I can use that just assumes "get everything"?


Thanks,
Laura
