Hi,
I'm new to Nutch as of this morning, having just set up Nutch and
Solr. While going through an example, I started wondering: how do I
parse everything on a given site? I need to gather all the images, PDFs,
HTML pages, forms, AutoCAD files, etc.
I configured nutch-site.xml and regex-urlfilter.txt based on the
tutorial.
In particular, I noticed this rule in regex-urlfilter.txt, which I
think I need to remove in order to get everything, right?
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
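To check my reading of that rule, here is a rough Python sketch of how I understand the filter to behave: rules are tried in order, the first pattern found anywhere in the URL decides, a leading '-' rejects, a leading '+' accepts, and a URL that matches nothing is dropped. The accepts() helper, the RULES list, and the example.com URLs are my own illustration rather than Nutch code, and I am assuming the file ends with the usual "+." accept-anything-else rule.

```python
import re

# The suffix-exclusion pattern copied from regex-urlfilter.txt
# (without the leading '-' sign).
SUFFIX_RULE = (
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
    r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)

# Hypothetical in-memory form of the rule file: (sign, pattern) pairs.
# The trailing ("+", ".") entry is an assumption standing in for "+.".
RULES = [("-", SUFFIX_RULE), ("+", r".")]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is an accept ('+') rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):  # match anywhere in the URL
            return sign == "+"
    return False  # assumed behavior: no matching rule means the URL is dropped


if __name__ == "__main__":
    for url in ("http://example.com/logo.png",
                "http://example.com/manual.pdf",
                "http://example.com/plan.dwg"):
        print(url, accepts(url))
```

If that reading is right, PDFs and AutoCAD files already pass this rule because their suffixes are not listed, and deleting the rule (or trimming suffixes out of it) would only change the outcome for the listed types such as images.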
Also, in nutch-site.xml, I think I need to add this property, or
broaden it further:
<property>
<name>http.accept</name>
<value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
<description>Value of the "Accept" request header field.
</description>
</property>
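As I understand the HTTP syntax, an Accept value is a comma-separated list of media ranges, each optionally followed by a ";q=" weight that defaults to 1 when omitted. Below is a minimal Python sketch I would use to sanity-check a value before putting it in nutch-site.xml. The parse_accept() name and the sample header are my own, and it only handles the simple "type;q=weight" form.

```python
def parse_accept(value):
    """Split an Accept header value into (media_range, q) pairs."""
    entries = []
    for item in value.split(","):          # media ranges are comma-separated
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0]
        q = 1.0                            # default weight when no q is given
        for param in parts[1:]:            # parameters follow after ';'
            name, _, val = param.partition("=")
            if name.strip() == "q":
                q = float(val)
        entries.append((media_range, q))
    return entries


if __name__ == "__main__":
    # Sample value for illustration only.
    for media_range, q in parse_accept(
            "text/html,application/pdf;q=0.9,*/*;q=0.8"):
        print(media_range, q)
```

The "*/*" entry already tells the server that any type is acceptable, at a lower preference than the types listed before it.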
Or is there a configuration setting I can use that simply means "get
everything"?
Thanks,
Laura