I would also like to add that I can run the same crawl locally and it
succeeds, so it's only the distributed mode that fails. Can anyone offer
any advice? Do you think it might be something with CDH4?
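
One rough sanity check on my regex-urlfilter (quoted below): the suffix
skip-rule shouldn't be eating the seed URL. grep -E's dialect is close to,
though not identical to, the java.util.regex patterns Nutch actually uses,
so this is only an approximation:

```shell
# Approximate check of the suffix skip-rule from my regex-urlfilter.
# (grep -E is close to, but not identical to, java.util.regex.)
suffix='\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$'
# The seed URL ends in .org, so it should NOT match the skip pattern:
echo 'http://lucene.apache.org' | grep -E "$suffix" || echo 'seed url passes'
# An image URL SHOULD match it:
echo 'http://lucene.apache.org/logo.PNG' | grep -E "$suffix" && echo 'image url skipped'
```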

On Sat, Sep 15, 2012 at 5:22 PM, Casey McTaggart
<[email protected]> wrote:

> Hi everyone,
>
> I'm using Hadoop as installed by Cloudera (CDH4)... I think it's version
> 1.0.1. I can run a local filesystem crawl with Nutch, and it returns what
> I'd expect. However, I need to take advantage of the mapreduce
> functionality, since I want to crawl a local filesystem with many GB of
> files. I'm going to put all of these files on an apache server so they can
> be crawled. First, though, I want to just crawl a simple website, and I
> can't make it work.
>
> My urls/seed.txt is on HDFS and contains a single line:
> http://lucene.apache.org
>
> I run this command:
> sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
> org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl
>
> Sometimes, it fetches the URL, but does not go beyond depth 1... and when
> I examine the CrawlDatum that's in
> /user/hdfs/crawl/crawldb/current/part-00000/data, it has one entry: the
> seed url as the key, and the value of the CrawlDatum is
> _pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError:
> org/apache/tika/mime/MimeTypeException
>
> Okay, so I tried running the command again with -libjars nutch1.5.1.jar,
> and it fails with an ArrayIndexOutOfBoundsException. I tried running it
> with -libjars /user/hdfs/lib/tika-core-1.1.jar, and that fails with:
>
> 12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records selected for
> fetching, exiting ...
> 12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to
> fetch.
> 12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your seed
> list and URL filters.
> 12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl
>
> I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib, and
> still 0 URLs are fetched.
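
Following up on my own attempts here: as I understand it, -libjars is a
Hadoop generic option, so it takes a comma-separated list of jars and has
to appear between the class name and Crawl's own arguments (Crawl runs
through ToolRunner). The invocation I intend to retry is something like
this -- untested:

```shell
# Untested sketch: -libjars takes a comma-separated jar list and goes
# before the tool's own arguments (urls/seed.txt -dir crawl).
sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job \
    org.apache.nutch.crawl.Crawl \
    -libjars lib/tika-core-1.1.jar \
    urls/seed.txt -dir crawl
```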
>
> I'm totally at a loss. Can someone help?
>
> Here's my regex-urlfilter:
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # accept anything else
> +.
>
>
> here's my nutch-site.xml:
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>nutchtest</value>
>   </property>
>   <property>
>     <name>plugin.folders</name>
>
> <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
>   </property>
> </configuration>
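
One guess I want to rule out: /projects/nutch/apache-nutch-1.5.1/... only
exists on the machine I submit from, so tasks running on other nodes can't
load plugins from those paths. Since the .job file built by ant already
bundles the plugins, my understanding is that in deploy mode plugin.folders
can be left at its default of "plugins". A minimal nutch-site.xml would
then be just -- untested:

```xml
<!-- Untested sketch: rely on the plugins packaged inside the .job file
     (the default plugin.folders is "plugins") instead of local paths
     that only exist on the submitting machine. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutchtest</value>
  </property>
</configuration>
```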
>
>
> It also does not work if I include this property:
>
> <property>
>     <name>plugin.includes</name>
>
> <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
>   </property>
>
>
