Thank you Vangelis, I'll give it a try :) Laura
On 3/21/14 12:02 PM, Vangelis karv wrote:
Hi Laura!

Nutch uses three methods to filter URLs: prefix, regex and domain. I think if you want to crawl every page you can erase this line in regex-urlfilter.txt:

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

or replace its leading - with a +. Also, you need to add

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description>
</property>

so that nothing gets truncated or missed! I primarily use urlfilter-domain, but I think these steps are correct!

Have fun with Nutch,
Vangelis!

Date: Fri, 21 Mar 2014 11:34:59 -0500
From: [email protected]
To: [email protected]
Subject: Crawling all file types (images, pdfs, etc...)

Hi,

I am new to Nutch as of this morning, having just set up Nutch and Solr. I was going through an example and was wondering: how do I parse everything in a given site? I need to gather all the images, PDFs, HTML, forms, AutoCAD files, etc.

I did some configuring of nutch-site.xml and regex-urlfilter.txt based on the tutorial. In particular, I noticed this line in regex-urlfilter.txt, which I think I need to do away with in order to get everything, right?

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

Also, in nutch-site.xml, I'm thinking I need to add this or broaden it more:

<property>
  <name>http.accept</name>
  <value>text/html,application/xhtml+xml,application/xml;application/pdf;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.</description>
</property>

Or is there a configuration I can use that just assumes "get everything"?

Thanks,
Laura
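
For reference, the effect of the suffix rule discussed in this thread can be sketched in Python. This is a rough model, not Nutch's actual code: it assumes, as the comments in the stock regex-urlfilter.txt describe, that rules are tried from top to bottom, that the first pattern matching a URL decides its fate, and that a URL matching no rule is ignored. The names parse_rules and accepts, the shortened suffix list, and the example.com URLs are made up for illustration.

```python
import re

# Illustrative rule file in the regex-urlfilter.txt style.
# A leading '-' rejects matching URLs, a leading '+' accepts them.
DEFAULT_RULES = r"""
# skip image and other suffixes we can't yet parse (list shortened here)
-\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP)$
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Assumed semantics: the first matching rule wins; no match means ignored."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    default = parse_rules(DEFAULT_RULES)
    everything = parse_rules("+.")  # the suffix rule has been erased
    for url in ("http://example.com/index.html", "http://example.com/logo.png"):
        print(url, accepts(url, default), accepts(url, everything))
```

Under these assumptions the image URL is rejected by the default rules and accepted once the suffix rule is gone, which is the behaviour the thread is after.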

