Hi,

I am new to Nutch as of this morning, having just set up Nutch and Solr. While going through an example, I started wondering: how do I crawl and parse everything on a given site? I need to gather all the images, PDFs, HTML, forms, AutoCAD files, etc.

I did some configuring of nutch-site.xml and regex-urlfilter.txt based on the tutorial.

In particular, I noticed this rule in regex-urlfilter.txt, which I think I need to remove (or at least trim) in order to get everything, right?

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
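
For example, I'm guessing a trimmed version like the one below would let images and office documents through while still skipping stylesheets, scripts, and binaries. The suffix list here is only my assumption, not something from the tutorial:

# skip only the suffixes I don't want fetched (hypothetical list)
-\.(css|CSS|js|JS|exe|EXE|rpm|RPM)$

As far as I understand, the rules are checked from top to bottom and the first match decides, with '-' rejecting and '+' accepting a URL, so anything not rejected here would fall through to the later rules.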


Also, in nutch-site.xml, I think I need to add this property, or broaden it further:

<property>
<name>http.accept</name>
<value>text/html,application/xhtml+xml,application/xml;q=0.9,application/pdf;q=0.9,*/*;q=0.8</value>
<description>Value of the "Accept" request header field.
</description>
</property>

Or is there a configuration I can use that simply means "get everything"?

Thanks,
Laura
