thanks Walter, I'm still unable to get anything to run. I think it's because Hadoop is for some reason not finding the Tika jar. I tried running Hadoop with -libjars, including both the Nutch jar and the Tika jar, and when I do this it gives me 0 URLs - it doesn't even fetch the seed list! When I don't run it with -libjars, it fetches the seed list, then stops with the ClassNotFound exception in the CrawlDatum.
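[One thing worth checking with the -libjars runs: -libjars is consumed by Hadoop's GenericOptionsParser, so it only takes effect when the driver runs through ToolRunner (which the 1.5.x Crawl class does, as far as I can tell), it must appear right after the main class and before the tool's own arguments, and the listed jars should be local paths on the submitting host, not HDFS paths. If it lands after the seed-list argument instead, Crawl may try to parse it itself, which could explain both the ArrayIndexOutOfBoundsException and the "0 records selected" runs. An untested sketch, with illustrative local paths:

```shell
# -libjars immediately after the main class, before Crawl's own args;
# the jar path is local to the host submitting the job.
sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job org.apache.nutch.crawl.Crawl \
    -libjars /usr/local/nutch/lib/tika-core-1.1.jar \
    urls/seed.txt -dir crawl
```
]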
I'll try the solution you just posted. But any idea why this is happening? thanks! Casey

On Mon, Sep 17, 2012 at 11:30 AM, Walter Tietze <[email protected]> wrote:

> Hi,
>
> I had the same problems and couldn't find a satisfactory way around them.
>
> I also tried nutch-2.0 on CDH4 with Yarn / MR_v2 (no MR_v1) and couldn't
> make it work either.
>
> But I found a workaround to make nutch 1.5.1 work on CDH4.
>
> Since MR_v2 it is no longer possible to pack the whole project into a
> single *nutch*.job file, and since the former JobTracker/TaskTracker pair
> has been replaced by the ResourceManager and the NodeManager, the
> NodeManager seems unable to handle the packed nutch job.
>
> (see also:
> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
> )
>
> One thing you can do is unpack the job on the NodeManager manually and
> load the classes from within the code into the current classloader.
>
> I modified org/apache/nutch/plugin/PluginManifestParser.java slightly,
> and everything works fine, at least for the moment.
>
> I attached the modified file.
>
> Please note that I don't yet know whether CDH4 removes the application
> directories and the unpacked files properly. You should check whether
> those directories are still needed after the crawl has finished.
>
> Hope this helps, cheers, Walter
>
> Am 17.09.2012 18:31, schrieb Casey McTaggart:
> > I would also like to add that I can run the same crawl locally and it's
> > successful. So it's just the distributed mode that's not working. Can
> > anyone offer any advice? Do you think it might be something with CDH4?
> >
> > On Sat, Sep 15, 2012 at 5:22 PM, Casey McTaggart
> > <[email protected]> wrote:
> >
> >> Hi everyone,
> >>
> >> I'm using Hadoop as installed by Cloudera (CDH4)... I think it's
> >> version 1.0.1. I can run a local filesystem crawl with Nutch, and it
> >> returns what I'd expect.
> >> However, I need to take advantage of the mapreduce functionality,
> >> since I want to crawl a local filesystem with many GB of files. I'm
> >> going to put all of these files on an Apache server so they can be
> >> crawled. First, though, I want to just crawl a simple website, and I
> >> can't make it work.
> >>
> >> My urls/seed.txt is on HDFS and contains this:
> >> http://lucene.apache.org
> >>
> >> I run this command:
> >> sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
> >>     org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl
> >>
> >> Sometimes it fetches the URL, but does not go beyond depth 1... and
> >> when I examine the CrawlDatum that's in
> >> /user/hdfs/crawl/crawldb/current/part-00000/data, it has one entry:
> >> the seed url as the key, and the value of the CrawlDatum is
> >> _pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError:
> >> org/apache/tika/mime/MimeTypeException
> >>
> >> Okay, so I tried running the command again with -libjars
> >> nutch1.5.1.jar, and it fails with an ArrayIndexOutOfBoundsException.
> >> I tried running it with -libjars /user/hdfs/lib/tika-core-1.1.jar,
> >> and that fails with:
> >>
> >> 12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
> >> 12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
> >> 12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your seed list and URL filters.
> >> 12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl
> >>
> >> I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib,
> >> and still 0 URLs are fetched.
> >>
> >> I'm totally at a loss. Can someone help?
> >> Here's my regex-urlfilter:
> >>
> >> # skip file: ftp: and mailto: urls
> >> -^(file|ftp|mailto):
> >> # skip image and other suffixes we can't yet parse
> >> # for a more extensive coverage use the urlfilter-suffix plugin
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >> # skip URLs containing certain characters as probable queries, etc.
> >> -[?*!@=]
> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> # accept anything else
> >> +.
> >>
> >> Here's my nutch-site.xml:
> >>
> >> <configuration>
> >>   <property>
> >>     <name>http.agent.name</name>
> >>     <value>nutchtest</value>
> >>   </property>
> >>   <property>
> >>     <name>plugin.folders</name>
> >>     <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
> >>   </property>
> >> </configuration>
> >>
> >> which also does not work if I include this part:
> >>
> >> <property>
> >>   <name>plugin.includes</name>
> >>   <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
> >> </property>
>
> --
> --------------------------------
> Walter Tietze
> Senior Software Engineer
> Research
>
> Neofonie GmbH
> Robert-Koch-Platz 4
> 10115 Berlin
>
> T +49.30 24627 318
> F +49.30 24627 120
>
> [email protected]
> http://www.neofonie.de
>
> Handelsregister
> Berlin-Charlottenburg: HRB 67460
>
> Geschäftsführung:
> Thomas Kitlitschko
> --------------------------------
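[Walter's workaround - unpacking the job on the NodeManager and pulling the unpacked jars into the running classloader - can be sketched roughly as below. This is not his actual PluginManifestParser patch (the attachment isn't shown here), just a minimal self-contained illustration of the classloading mechanism; the temp directory and the `plugin.xml` resource stand in for the unpacked job contents.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class PluginLoaderSketch {

    // Build a classloader over every entry (jar or directory) directly under
    // 'unpackedJobDir', chaining to the current context classloader so the
    // classes Hadoop already loaded stay visible.
    static URLClassLoader loaderFor(Path unpackedJobDir) throws IOException {
        try (Stream<Path> entries = Files.list(unpackedJobDir)) {
            URL[] urls = entries.map(p -> {
                try {
                    // Directories get a trailing "/", as URLClassLoader expects.
                    return p.toUri().toURL();
                } catch (MalformedURLException e) {
                    throw new IllegalStateException(e);
                }
            }).toArray(URL[]::new);
            return new URLClassLoader(urls, Thread.currentThread().getContextClassLoader());
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a job directory unpacked on the NodeManager; a plain
        // resource file plays the role of a plugin descriptor here.
        Path jobDir = Files.createTempDirectory("unpacked-nutch-job");
        Path plugins = Files.createDirectories(jobDir.resolve("plugins"));
        Files.writeString(plugins.resolve("plugin.xml"), "<plugin/>");

        URLClassLoader loader = loaderFor(jobDir);
        // Install it so later lookups (e.g. by the plugin repository) resolve
        // against the unpacked files.
        Thread.currentThread().setContextClassLoader(loader);

        System.out.println(loader.getResource("plugin.xml") != null
                ? "plugin resource visible" : "plugin resource missing");
    }
}
```

Chaining to the existing context classloader (the second URLClassLoader constructor argument) is the important design choice: it adds the unpacked plugin jars without hiding anything the task JVM has already loaded.]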

