Hi Laura! Nutch filters URLs through several plugins, including prefix, regex, suffix and domain.
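
As a rough illustration of how the regex filter reads regex-urlfilter.txt, here is a small Python sketch. It is not Nutch's code, only a model of the behaviour as I understand it: rules are tried top to bottom, the first pattern found in the URL decides (+ accepts, - rejects), and a URL that matches nothing is dropped. The suffix list is shortened, the trailing "+." catch-all is assumed from the stock file, and the function names and example.com URLs are made up for the example.

```python
import re

# Shortened version of the suffix-skipping rule, followed by the catch-all
# "+." accept rule (assumed to be the last rule, as in the stock file).
RULES_DEFAULT = [
    "# skip image and other suffixes we can't yet parse",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP|js|JS)$",
    "# accept anything else",
    "+.",
]

# The same file with the suffix rule removed.
RULES_EVERYTHING = ["+."]


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: the URL is filtered out


if __name__ == "__main__":
    default_rules = parse_rules(RULES_DEFAULT)
    open_rules = parse_rules(RULES_EVERYTHING)
    for url in ("http://example.com/index.html", "http://example.com/logo.png"):
        print(url, accepts(url, default_rules), accepts(url, open_rules))
```

With the suffix rule in place the .png URL is rejected before the catch-all is ever reached; once the rule is gone, the same URL falls through to "+." and is accepted.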
I think if you want to crawl every page you can delete this line from regex-urlfilter.txt, or change its leading - to a + so that the rule accepts those suffixes instead of rejecting them:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

Also, you need to add the following to nutch-site.xml:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>

so that no downloaded content gets truncated. Note that file.content.limit only covers the file protocol; since you are fetching a site over HTTP, the http.content.limit property mentioned in the description is the one that applies there, and it can be set to -1 in the same way.

I primarily use urlfilter-domain, but I think these steps are correct!

Have fun with Nutch!
Vangelis

> Date: Fri, 21 Mar 2014 11:34:59 -0500
> From: [email protected]
> To: [email protected]
> Subject: Crawling all file types (images, pdfs, etc...)
>
> Hi,
>
> I am new to Nutch as of this morning after just setting up Nutch and
> Solr. I was going through an example and I was wondering how do I parse
> everything in a given site? I need to gather all the images, pdfs, html,
> forms, autocad files, etc....
>
> I did some configuring of nutch-site.xml and regex-urlfilter.txt based
> on the tutorial.
>
> In particular, I noticed this line in regex-urlfilter.txt, which I'm
> think I need to do away with in order to get everything, right?
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
>
> Also, in nutch-site.xml, I'm also thinking I need to add this or broaden
> it more:
>
> <property>
>   <name>http.accept</name>
>   <value>text/html,application/xhtml+xml,application/xml;application/pdf;q=0.9,*/*;q=0.8</value>
>   <description>Value of the "Accept" request header field.
>   </description>
> </property>
>
> Unless, is there a configuration that I can use that just assumes.."get
> everything"?
>
> Thanks,
> Laura

