I have a tar.gz file containing many third-party jars that my MR job requires. The tar.gz file is located on HDFS. When configuring my MR job, I call DistributedCache.addArchiveToClassPath(), passing in the HDFS path to the tar.gz file. When the Mapper executes, I get a ClassNotFoundException because the Mapper process can't find a class from one of those jars, even though the jar is in the tar.gz archive I added to the class path via the DistributedCache.
I looked at the TaskTracker logs and saw entries showing that the tar.gz file was extracted (see below), and when I look at the extraction folder, I see the individual jar files.
I also looked in the Hadoop source: the TaskDistributedCacheManager class takes the path where the archive was unpacked and passes it to DistributedCache.addLocalArchives. I assume that in later processing this path is pulled from the configuration object and added to the class path for the mapper process.
So on the surface everything looks correct. The tar.gz file is passed to the task tracker, unpacked, and the folder it is unpacked into is passed into the task configuration object. But the Mapper still can't find the jars it needs.
Also, I invoke my MR job programmatically with Job.waitForCompletion, so using the -libjars argument from the command line isn't an option here. And I'd really rather not build a repackaged jar that has all the dependent jars unpacked into it.
Any idea what I'm doing wrong with passing an archive file into the distributed cache to be placed on the class path?
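One thing that could be checked from inside the task: as far as I know, the JVM treats a directory entry on the class path as a root for .class files and resources, and does not load jars that merely sit inside that directory. Whether that is what is happening here is only a guess. Below is a minimal sketch in plain Java (no Hadoop APIs; the class and method names are made up for illustration) that dumps the class path and flags directory entries that have jars directly inside them. It could be called from the Mapper's setup code so the result ends up in the task's stderr log.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of Hadoop.
class ClassPathCheck {

    // Returns the class path entries that are directories holding *.jar files
    // directly inside them. Such jars are not picked up by a plain directory entry.
    static List<String> dirsWithNestedJars(String classPath) {
        List<String> suspects = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            File f = new File(entry);
            if (!f.isDirectory()) {
                continue; // jar files and missing paths are not of interest here
            }
            String[] jars = f.list((dir, name) -> name.endsWith(".jar"));
            if (jars != null && jars.length > 0) {
                suspects.add(entry);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Inspect the class path of the JVM this code is running in.
        String cp = System.getProperty("java.class.path");
        System.err.println("class path: " + cp);
        for (String dir : dirsWithNestedJars(cp)) {
            System.err.println("directory entry with jars inside: " + dir);
        }
    }
}
```

If the extraction folder shows up in that output, one thing that might be worth trying is adding each jar individually with DistributedCache.addFileToClassPath instead of adding the archive as a whole.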
LOG ENTRIES:
INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating thirdParty.tar.gz in /var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/-6590247392468543587_1242751660_34490521/ubuntu/dir1/dir2/dir3/thirdParty.tar.gz-work-2060878421779901666 with rwxr-xr-x
INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Extracting /var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/-6590247392468543587_1242751660_34490521/ubuntu/dir1/dir2/dir3/thirdParty.tar.gz-work-2060878421779901666/thirdParty.tar.gz to /var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/-6590247392468543587_1242751660_34490521/ubuntu/dir1/dir2/dir3/thirdParty.tar.gz-work-2060878421779901666
INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cached hdfs://ubuntu:8020/dir1/dir2/dir3/thirdParty.tar.gz as /var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/-6590247392468543587_1242751660_34490521/ubuntu/dir1/dir2/dir3/thirdParty.tar.gz
--
Thanks,
John C
