On 18.09.2012 18:46, Casey McTaggart wrote:
> thanks Walter, I still am unable to get anything to run - I think it's
> because Hadoop is for some reason not finding the Tika jar. I tried
> running Hadoop with -libjars, including both the Nutch jar and the
> Tika jar, and when I do this it gives me 0 URLs - it doesn't even fetch
> the seed list! When I don't run it with -libjars, it fetches the seed
> list, then stops with the ClassNotFound exception in the CrawlDatum.
>
> I'll try the solution you just posted. But any idea why this is
> happening?
> thanks!
> Casey
Hi Casey,

sorry, but I think the changes I mentioned were really all the changes I made. I'll check my code again in case I forgot to post something.

Remark: I also tried to apply the workaround to the nutch-2.0 code base, but was unable to make it work, because nutch-2.0 already uses the new MapReduce classes and does not seem to implement the same loading mechanism for the plugin repository.

Any other ideas?

Cheers, Walter

> On Mon, Sep 17, 2012 at 11:30 AM, Walter Tietze <[email protected]> wrote:
>
> Hi,
>
> I had the same problems and couldn't get around them in a satisfying way.
>
> I also tried nutch-2.0 with CDH4 and YARN / MR_v2 (without MR_v1) and
> couldn't make that work either.
>
> But I found a workaround to make nutch 1.5.1 work on CDH4.
>
> With MR_v2 it is no longer possible to pack the whole project as a
> single *nutch*.job file: since the former TaskTracker is split into
> the ResourceManager and the NodeManager, the NodeManager seems unable
> to handle the packed nutch project.
>
> (see also:
> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
> )
>
> One thing you can do is unpack the job file on the NodeManager manually
> and load the classes into the current classloader from within the code.
>
> I modified org/apache/nutch/plugin/PluginManifestParser.java slightly,
> and everything works fine, at least for the moment.
>
> I attached the modified file.
>
> Please note that I don't yet know whether CDH4 removes the application
> directories and the unpacked files properly. You should check whether
> the directories are still needed after the crawl has succeeded.
>
> Hope this helps, cheers, Walter
>
> > On 17.09.2012 18:31, Casey McTaggart wrote:
> > I would also like to add that I can run the same crawl locally and
> > it's successful. So, it's just the distributed mode that's not working.
> > Can anyone offer any advice? Do you think it might be something with CDH4?
> >
> > On Sat, Sep 15, 2012 at 5:22 PM, Casey McTaggart <[email protected]> wrote:
> >
> >> Hi everyone,
> >>
> >> I'm using Hadoop as installed by Cloudera (CDH4)... I think it's version
> >> 1.0.1. I can run a local filesystem crawl with Nutch, and it returns what
> >> I'd expect. However, I need to take advantage of the MapReduce
> >> functionality, since I want to crawl a local filesystem with many GB of
> >> files. I'm going to put all of these files on an Apache server so they can
> >> be crawled. First, though, I want to just crawl a simple website, and I
> >> can't make it work.
> >>
> >> My urls/seed.txt is on HDFS and contains this:
> >> http://lucene.apache.org
> >>
> >> I run this command:
> >> sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
> >> org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl
> >>
> >> Sometimes, it fetches the URL, but does not go beyond depth 1... and when
> >> I examine the CrawlDatum that's in
> >> /user/hdfs/crawl/crawldb/current/part-00000/data, it has one entry: the
> >> seed url as the key, and the value of the CrawlDatum is
> >> _pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError:
> >> org/apache/tika/mime/MimeTypeException
> >>
> >> Okay, so I tried running the command again with -libjars nutch1.5.1.jar,
> >> and it fails with an ArrayIndexOutOfBoundsException. I tried running it
> >> with -libjars /user/hdfs/lib/tika-core-1.1.jar, and that fails with:
> >>
> >> 12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
> >> 12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
> >> 12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your seed list and URL filters.
> >> 12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl
> >>
> >> I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib, and
> >> still 0 URLs are fetched.
> >>
> >> I'm totally at a loss. Can someone help?
> >>
> >> Here's my regex-urlfilter:
> >>
> >> # skip file: ftp: and mailto: urls
> >> -^(file|ftp|mailto):
> >> # skip image and other suffixes we can't yet parse
> >> # for a more extensive coverage use the urlfilter-suffix plugin
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >> # skip URLs containing certain characters as probable queries, etc.
> >> -[?*!@=]
> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> # accept anything else
> >> +.
> >>
> >> Here's my nutch-site.xml:
> >>
> >> <configuration>
> >>   <property>
> >>     <name>http.agent.name</name>
> >>     <value>nutchtest</value>
> >>   </property>
> >>   <property>
> >>     <name>plugin.folders</name>
> >>     <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
> >>   </property>
> >> </configuration>
> >>
> >> which also does not work if I include this part:
> >>
> >>   <property>
> >>     <name>plugin.includes</name>
> >>     <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
> >>   </property>
>
> --
> --------------------------------
> Walter Tietze
> Senior Software Engineer
> Research
>
> Neofonie GmbH
> Robert-Koch-Platz 4
> 10115 Berlin
>
> T +49.30 24627 318
> F +49.30 24627 120
>
> [email protected]
> http://www.neofonie.de
>
> Handelsregister
> Berlin-Charlottenburg: HRB 67460
>
> Geschäftsführung:
> Thomas Kitlitschko
> --------------------------------
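[List archive note: the modified PluginManifestParser.java attachment is not preserved here. As a rough illustration of the classloading idea Walter describes - collecting the jars from a manually unpacked job directory and exposing them through a URLClassLoader parented to the current one - the sketch below may help. The class name, method names, and directory layout are invented for this example; this is not the actual patch.]

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: make the jars inside an unpacked *.job directory
// visible to the running task by building a classloader over them.
public class UnpackedJobClassLoader {

    // Recursively gather file:// URLs for every *.jar below `dir`.
    static List<URL> collectJarUrls(File dir) throws IOException {
        List<URL> urls = new ArrayList<URL>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            return urls; // not a directory, or unreadable
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                urls.addAll(collectJarUrls(entry));
            } else if (entry.getName().endsWith(".jar")) {
                urls.add(entry.toURI().toURL());
            }
        }
        return urls;
    }

    // Build a URLClassLoader that sees the unpacked jars and delegates
    // to `parent` (e.g. Thread.currentThread().getContextClassLoader()).
    static ClassLoader forUnpackedJob(File unpackedJobDir, ClassLoader parent)
            throws IOException {
        List<URL> urls = collectJarUrls(unpackedJobDir);
        return new URLClassLoader(urls.toArray(new URL[urls.size()]), parent);
    }
}
```

A plugin class could then be resolved with `Class.forName(name, true, loader)` against the returned loader instead of the default one. Whether the unpacked directory is cleaned up afterwards is, as Walter notes above, something to verify on your cluster.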

