First of all, I should clarify that I am using the CDH5 beta and managing the project with Maven, and I have googled and read a lot, e.g.
https://issues.apache.org/jira/browse/MAPREDUCE-1700
http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/
I believe the problem is quite common: when we write an MR job, we need
lots of dependencies, which may be missing from, or conflict with, the
jars on HADOOP_CLASSPATH.
There are several options, e.g.
1. Bundle all libraries into my own jar, and set HADOOP_USER_CLASSPATH_FIRST=true.
This is what I do now. It makes the jar very big, and it still doesn't
work: for example, I already package guava-16.0.jar in my jar, but the job
still uses guava-11.0.2.jar from HADOOP_CLASSPATH.
Below is my build configuration, followed by a sketch of the job-side
settings as I understand them.
<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <archive>
      <manifest>
        <mainClass>xxx.xxx.xxx.Runner</mainClass>
      </manifest>
    </archive>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
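As far as I can tell, HADOOP_USER_CLASSPATH_FIRST only reorders the
classpath of the client-side JVM, while the tasks run in their own
containers; the per-job MRv2 properties seem to be the equivalent knobs
there. A minimal sketch of setting them in the driver (the job name is
made up, Runner matches the manifest above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class Runner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Prefer classes from the job jar over the cluster's copies
        // in the map/reduce task JVMs (MRv2).
        conf.setBoolean("mapreduce.job.user.classpath.first", true);
        // Alternative: run the job with an isolated classloader so the
        // framework's own dependencies (e.g. its Guava) are not visible.
        conf.setBoolean("mapreduce.job.classloader", true);
        Job job = Job.getInstance(conf, "my-job");
        job.setJarByClass(Runner.class);
        // ... mapper, reducer, input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}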
2. Figure out which libraries are not present on HADOOP_CLASSPATH, and put
only those into the DistributedCache.
I think it's hard to distinguish them, and if a jar still conflicts, which
dependency takes precedence? (My understanding of this option is sketched
below.)
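For reference, my understanding of this option in the Hadoop 2 API is
roughly the following (a sketch; the HDFS path and jar name are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheRunner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job");
        job.setJarByClass(CacheRunner.class);
        // Ship a library that is absent from the cluster classpath;
        // it is copied to the nodes via the distributed cache and
        // added to the task classpath.
        job.addFileToClassPath(new Path("hdfs:///libs/some-missing-lib.jar"));
        // ... mapper, reducer, input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

But since user jars seem to come after the framework's own jars by
default, I doubt this would win over a conflicting jar that ships with
Hadoop.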
*What's the best practice, especially when using Maven?*