Hi everyone,

I'm using Hadoop as installed by Cloudera (CDH4)... I think it's version
1.0.1. I can run a local filesystem crawl with Nutch, and it returns what
I'd expect. However, I need to take advantage of the MapReduce
functionality, since I want to crawl a local filesystem with many GB of
files. I'm going to put all of these files on an Apache server so they can
be crawled. First, though, I want to crawl a simple website, and I
can't make it work.

My urls/seed.txt is on HDFS and contains this:
http://lucene.apache.org

I run this command:
sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl

Sometimes it fetches the URL, but does not go beyond depth 1... and when I
examine the CrawlDatum that's in
/user/hdfs/crawl/crawldb/current/part-00000/data, it has one entry: the
seed URL as the key, and the value of the CrawlDatum is
_pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError:
org/apache/tika/mime/MimeTypeException
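For context on that error: as far as I understand, a Hadoop .job file is just a zip archive whose bundled dependency jars live under lib/, so listing it shows whether tika-core was actually packed in. Below is a self-contained illustration using a stand-in archive that I create on the spot (jobdemo/ and demo.job are made-up names); for the real check you would list build/apache-nutch-1.5.1.job instead.

```shell
# Stand-in archive for illustration: a .job is an ordinary zip with
# dependency jars under lib/, so listing it shows what got bundled.
mkdir -p jobdemo/lib
touch jobdemo/lib/tika-core-1.1.jar
(cd jobdemo && python3 -m zipfile -c ../demo.job lib/)
# For the real job file: python3 -m zipfile -l build/apache-nutch-1.5.1.job
python3 -m zipfile -l demo.job | grep tika
```

If tika-core is missing from the real job's lib/, rebuilding the job from the source tree should bundle it, which would presumably make -libjars unnecessary.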

Okay, so I tried running the command again with -libjars nutch1.5.1.jar,
and it fails with an ArrayIndexOutOfBoundsException. I tried running it
with -libjars /user/hdfs/lib/tika-core-1.1.jar, and that fails with:

12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records selected for
fetching, exiting ...
12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to
fetch.
12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your seed list
and URL filters.
12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl

I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib, and
still 0 URLs are fetched.

I'm totally at a loss. Can someone help?

Here's my regex-urlfilter:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
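
Just to rule out the filters themselves, the suffix rule can be sanity-checked outside Nutch with grep -E (an abridged suffix list, illustration only, not the real filter chain — Nutch also ships a URLFilterChecker class for checking against the configured chain, if I recall correctly):

```shell
# Rough simulation of the suffix-skip rule (abridged suffix list,
# illustration only). The seed URL ends in no blocked suffix, so it
# should survive this rule and be kept.
url="http://lucene.apache.org"
if printf '%s\n' "$url" | grep -Eq '\.(gif|jpg|png|css|zip|js)$'; then
  echo "skipped: $url"
else
  echo "kept: $url"
fi
# prints: kept: http://lucene.apache.org
```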


here's my nutch-site.xml:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutchtest</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
  </property>
</configuration>
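
One assumption I keep coming back to: when the job runs on the cluster, the task JVMs unpack the .job and (if I understand the defaults right) load plugins from the "plugins" folder inside it, so the local build paths above may not exist where the tasks actually run. A minimal nutch-site.xml under that assumption would be:

```xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutchtest</value>
  </property>
  <!-- plugin.folders left at its default ("plugins"), which should
       resolve relative to the classpath inside the unpacked .job -->
</configuration>
```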


It also does not work if I include this property:

<property>
    <name>plugin.includes</name>

<value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
  </property>
