including /plugins/classes in plugin.folders made it work. thank you!!!

On Tue, Sep 18, 2012 at 10:58 AM, Walter Tietze <[email protected]> wrote:
> On 18.09.2012 18:46, Casey McTaggart wrote:
> > thanks Walter, I still am unable to get anything to run - I think it's
> > because Hadoop is for some reason not finding the Tika jar. I tried
> > running Hadoop with -libjars, including both the Nutch jar and the
> > Tika jar, and when I do this it gives me 0 URLs - it doesn't even fetch
> > the seed list! When I don't run it with -libjars, it fetches the seed
> > list, then stops with the ClassNotFound exception in the CrawlDatum.
> >
> > I'll try the solution you just posted. But any idea why this is
> > happening?
> > thanks!
> > Casey
>
> Hi Casey,
>
> sorry, but I think the changes I mentioned really were all the changes I
> made. I'll check my code again to see whether I forgot to post something.
>
> Remark: I also tried to insert the workaround into the nutch-2.0 code
> base, but was unable to make it work, because nutch-2.0 already uses
> the new MapReduce classes and does not seem to implement the same
> loading mechanism for the plugin repository.
>
> Any other ideas?
>
> Cheers, Walter
>
> > On Mon, Sep 17, 2012 at 11:30 AM, Walter Tietze <[email protected]> wrote:
> >
> > Hi,
> >
> > I had the same problems and couldn't find a properly satisfying way
> > around them.
> >
> > I also tried nutch-2.0 with CDH4 and YARN / MR_v2 (without MR_v1)
> > and couldn't make it work either.
> >
> > But I found a workaround to make nutch 1.5.1 work on CDH4.
> >
> > Since MR_v2 it is no longer possible to pack the whole project as a
> > *nutch*.job, and since the former JobTracker/TaskTracker pair has been
> > split into the ResourceManager and the NodeManager, the NodeManager
> > seems unable to handle the packed nutch project.
> >
> > (see also:
> > http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
> > )
> >
> > Something one can do is to unpack the job on the NodeManager manually
> > and to load the classes from within the code into the current
> > classloader.
> >
> > I modified org/apache/nutch/plugin/PluginManifestParser.java
> > slightly and everything works fine, at least for the moment.
> >
> > I attached the modified file.
> >
> > Please note, I don't yet know whether CDH4 removes the application
> > directories and the unpacked files properly. You should consider
> > checking whether the directories are still needed after the crawl
> > has succeeded.
> >
> > Hope this helps, cheers, Walter
> >
> >
> > On 17.09.2012 18:31, Casey McTaggart wrote:
> > > I would also like to add that I can run the same crawl locally and
> > > it's successful. So, it's just the distributed mode that's not
> > > working. Can anyone offer any advice? Do you think it might be
> > > something with CDH4?
> > >
> > > On Sat, Sep 15, 2012 at 5:22 PM, Casey McTaggart
> > > <[email protected]> wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> I'm using Hadoop as installed by Cloudera (CDH4)... I think it's
> > >> version 1.0.1. I can run a local filesystem crawl with Nutch, and
> > >> it returns what I'd expect. However, I need to take advantage of
> > >> the mapreduce functionality, since I want to crawl a local
> > >> filesystem with many GB of files. I'm going to put all of these
> > >> files on an Apache server so they can be crawled. First, though,
> > >> I want to just crawl a simple website, and I can't make it work.
> > >>
> > >> My urls/seed.txt is on hdfs and contains this:
> > >> http://lucene.apache.org
> > >>
> > >> I run this command:
> > >> sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
> > >> org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl
> > >>
> > >> Sometimes it fetches the URL, but does not go beyond depth 1...
> > >> and when I examine the CrawlDatum that's in
> > >> /user/hdfs/crawl/crawldb/current/part-00000/data, it has one
> > >> entry: the seed url as the key, and the value of the CrawlDatum is
> > >> _pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError:
> > >> org/apache/tika/mime/MimeTypeException
> > >>
> > >> Okay, so I tried running the command again with -libjars
> > >> nutch1.5.1.jar, and it fails with an ArrayIndexOutOfBoundsException.
> > >> I tried running it with -libjars /user/hdfs/lib/tika-core-1.1.jar,
> > >> and that fails with:
> > >>
> > >> 12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records
> > >> selected for fetching, exiting ...
> > >> 12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more
> > >> URLs to fetch.
> > >> 12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your
> > >> seed list and URL filters.
> > >> 12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl
> > >>
> > >> I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib,
> > >> and still 0 URLs are fetched.
> > >>
> > >> I'm totally at a loss. Can someone help?
> > >>
> > >> Here's my regex-urlfilter:
> > >>
> > >> # skip file: ftp: and mailto: urls
> > >> -^(file|ftp|mailto):
> > >> # skip image and other suffixes we can't yet parse
> > >> # for a more extensive coverage use the urlfilter-suffix plugin
> > >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> > >> # skip URLs containing certain characters as probable queries, etc.
> > >> -[?*!@=]
> > >> # skip URLs with slash-delimited segment that repeats 3+ times,
> > >> # to break loops
> > >> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > >> # accept anything else
> > >> +.
> > >>
> > >> here's my nutch-site.xml:
> > >>
> > >> <configuration>
> > >>   <property>
> > >>     <name>http.agent.name</name>
> > >>     <value>nutchtest</value>
> > >>   </property>
> > >>   <property>
> > >>     <name>plugin.folders</name>
> > >>     <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
> > >>   </property>
> > >> </configuration>
> > >>
> > >> which also does not work if I include this part:
> > >>
> > >> <property>
> > >>   <name>plugin.includes</name>
> > >>   <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
> > >> </property>
> >
> > --
> > --------------------------------
> > Walter Tietze
> > Senior Software Engineer
> > Research
> >
> > Neofonie GmbH
> > Robert-Koch-Platz 4
> > 10115 Berlin
> >
> > T +49.30 24627 318
> > F +49.30 24627 120
> >
> > [email protected]
> > http://www.neofonie.de
> >
> > Handelsregister
> > Berlin-Charlottenburg: HRB 67460
> >
> > Geschäftsführung:
> > Thomas Kitlitschko
> > --------------------------------
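[Editor's note: the fix reported at the top of the thread amounts to adding the plugins/classes directory to plugin.folders in nutch-site.xml. A sketch of the resulting property, assuming the build paths quoted in the thread (the exact path prefix depends on where the Nutch build tree lives):]

```xml
<!-- Hypothetical nutch-site.xml fragment; the "/plugins/classes" entry
     is the addition Casey reports as the fix. Paths are illustrative. -->
<property>
  <name>plugin.folders</name>
  <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/build/plugins/classes,/projects/nutch/apache-nutch-1.5.1/lib</value>
</property>
```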

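[Editor's note: Walter's workaround of loading the unpacked plugin jars into the running classloader can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual PluginManifestParser change from the attachment; the class and method names here are invented for the example.]

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

// Sketch: collect the jars from an unpacked plugin directory and expose
// them through a URLClassLoader chained to the current context loader,
// so plugin classes (e.g. Tika) become resolvable at task runtime.
public class PluginClassloaderSketch {
    static ClassLoader extendClassLoader(File pluginDir) throws Exception {
        File[] jars = pluginDir.listFiles((d, n) -> n.endsWith(".jar"));
        URL[] urls = new URL[jars == null ? 0 : jars.length];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = jars[i].toURI().toURL();
        }
        // Parent delegation keeps already-loaded classes visible.
        return new URLClassLoader(urls, Thread.currentThread().getContextClassLoader());
    }

    public static void main(String[] args) throws Exception {
        ClassLoader cl = extendClassLoader(new File(System.getProperty("java.io.tmpdir")));
        // Classes from the parent loader stay resolvable through the new loader.
        System.out.println(cl.loadClass("java.lang.String").getName());
    }
}
```

In the real workaround the resulting loader would have to be installed as the thread's context classloader before the plugin repository is initialized.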

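[Editor's note: the regex-urlfilter rules quoted in the thread are plain Java regular expressions, so a quick sanity check of a rule against the seed URL is easy. The snippet below uses an abridged version of the suffix-exclusion rule, purely as an illustration:]

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    public static void main(String[] args) {
        // Abridged form of the suffix-exclusion rule from the
        // regex-urlfilter above; a match means the URL is skipped.
        Pattern skip = Pattern.compile(
            "\\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$");
        System.out.println(skip.matcher("http://lucene.apache.org/images/logo.png").find()); // true -> skipped
        System.out.println(skip.matcher("http://lucene.apache.org/index.html").find());      // false -> kept
    }
}
```

The same check applied to the query-character rule `-[?*!@=]` is worth running against one's seed URLs when the Generator reports "0 records selected for fetching".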