Hi everyone, I'm using Hadoop as installed by Cloudera (CDH4)... I think it's version 1.0.1. I can run a local filesystem crawl with Nutch, and it returns what I'd expect. However, I need to take advantage of the MapReduce functionality, since I want to crawl a local filesystem with many GB of files. I'm going to put all of these files on an Apache server so they can be crawled. First, though, I want to crawl a simple website, and I can't make it work.
My urls/seed.txt is on HDFS and contains this:

    http://lucene.apache.org

I run this command:

    sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl

Sometimes it fetches the URL, but it does not go beyond depth 1. When I examine the CrawlDatum in /user/hdfs/crawl/crawldb/current/part-00000/data, it has one entry: the seed URL as the key, and the value of the CrawlDatum is:

    _pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException

Okay, so I tried running the command again with -libjars nutch1.5.1.jar, and it fails with an ArrayIndexOutOfBoundsException. I tried running it with -libjars /user/hdfs/lib/tika-core-1.1.jar, and that fails with:

    12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
    12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
    12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your seed list and URL filters.
    12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl

I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib, and still 0 URLs are fetched. I'm totally at a loss. Can someone help?

Here's my regex-urlfilter:

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):
    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ |mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]
    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/
    # accept anything else
    +.
Here's my nutch-site.xml:

    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>nutchtest</value>
      </property>
      <property>
        <name>plugin.folders</name>
        <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
      </property>
    </configuration>

It also does not work if I include this part:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor) |query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
    </property>
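One detail worth double-checking (an assumption on my part, not a confirmed fix): plugin.includes is treated as a regular expression matched against plugin ids, and the value in my file contains a stray space before |query-. Whitespace inside that pattern can silently stop plugins from matching. A single-line version with no embedded whitespace would look like this (addhdfskey is my custom plugin, kept as-is):

```xml
<!-- Sketch: the same plugin list as above, collapsed onto one line so no
     whitespace ends up inside the regular expression. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
</property>
```

The same concern applies to the suffix regex in regex-urlfilter above, which has a space before |mov.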

