Yes, I do have the following dependencies marked as "provided":

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

However, spark-streaming-kinesis-asl has a compile-time dependency on spark-streaming, so I think that causes it and its dependencies to be pulled into the assembly. I expected that simply excluding spark-streaming in the spark-streaming-kinesis-asl dependency would solve this problem, but it does not. That is, this doesn't work either:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" exclude("org.apache.spark", "spark-streaming")

As I mentioned originally, the following solved some but not all conflicts:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" excludeAll(
      ExclusionRule(organization = "org.apache.hadoop"),
      ExclusionRule(organization = "org.apache.spark", name = "spark-streaming")
    )

(Note that ExclusionRule(organization = "org.apache.spark") without the "name" attribute does not work because that apparently causes it to exclude even spark-streaming-kinesis-asl.)

Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2

From: Tathagata Das <t...@databricks.com>
Date: Monday, March 16, 2015 at 12:45 PM
To: Jonathan Kelly <jonat...@amazon.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: problems with spark-streaming-kinesis-asl and "sbt assembly" ("different file contents found")

If you are creating an assembly, make sure spark-streaming is marked as provided. spark-streaming is already part of the Spark installation, so it will be present at run time. That might solve some of these, maybe!?

TD

On Mon, Mar 16, 2015 at 11:30 AM, Kelly, Jonathan <jonat...@amazon.com> wrote:

I'm attempting to use the Spark Kinesis Connector, so I've added the following dependency in my build.sbt:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

My app works fine with "sbt run", but I can't seem to get "sbt assembly" to work without failing with "different file contents found" errors due to different versions of various packages getting pulled into the assembly. This only occurs when I've added spark-streaming-kinesis-asl as a dependency. "sbt assembly" works fine otherwise.

Here are the conflicts that I see:

    com.esotericsoftware.kryo:kryo:2.21
    com.esotericsoftware.minlog:minlog:1.2

    com.google.guava:guava:15.0
    org.apache.spark:spark-network-common_2.10:1.3.0

(Note: The conflict is with javac.sh; why is this even getting included?)

    org.apache.spark:spark-streaming-kinesis-asl_2.10:1.3.0
    org.apache.spark:spark-streaming_2.10:1.3.0
    org.apache.spark:spark-core_2.10:1.3.0
    org.apache.spark:spark-network-common_2.10:1.3.0
    org.apache.spark:spark-network-shuffle_2.10:1.3.0

(Note: I'm actually using my own custom-built version of Spark-1.3.0 where I've upgraded to v1.9.24 of the AWS Java SDK, but that has nothing to do with all of these conflicts, as I upgraded the dependency *because* I was getting all of these conflicts with the Spark 1.3.0 artifacts from the central repo.)

    com.amazonaws:aws-java-sdk-s3:1.9.24
    net.java.dev.jets3t:jets3t:0.9.3

    commons-collections:commons-collections:3.2.1
    commons-beanutils:commons-beanutils:1.7.0
    commons-beanutils:commons-beanutils-core:1.8.0

    commons-logging:commons-logging:1.1.3
    org.slf4j:jcl-over-slf4j:1.7.10

(Note: The conflict is with a few package-info.class files, which seems really silly.)

    org.apache.hadoop:hadoop-yarn-common:2.4.0
    org.apache.hadoop:hadoop-yarn-api:2.4.0

(Note: The conflict is with org/apache/spark/unused/UnusedStubClass.class, which seems even more silly.)

    org.apache.spark:spark-streaming-kinesis-asl_2.10:1.3.0
    org.apache.spark:spark-streaming_2.10:1.3.0
    org.apache.spark:spark-core_2.10:1.3.0
    org.apache.spark:spark-network-common_2.10:1.3.0
    org.spark-project.spark:unused:1.0.0 (?!?!?!)
    org.apache.spark:spark-network-shuffle_2.10:1.3.0

I can get rid of some of the conflicts by using excludeAll() to exclude artifacts with organization = "org.apache.hadoop" or organization = "org.apache.spark" and name = "spark-streaming", and I might be able to resolve a few other conflicts this way, but the bottom line is that this is way more complicated than it should be, so either something is really broken or I'm just doing something wrong.

Many of these don't even make sense to me. For example, the very first conflict is between classes in com.esotericsoftware.kryo:kryo:2.21 and in com.esotericsoftware.minlog:minlog:1.2, but the former *depends* upon the latter, so ??? It seems wrong to me that one package would contain different versions of the same classes that are included in one of its dependencies. I guess it wouldn't make too much difference, though, if I could only get my assembly to include/exclude the right packages.

I of course don't want any of the Spark or Hadoop dependencies included (other than spark-streaming-kinesis-asl itself), but I want all of spark-streaming-kinesis-asl's dependencies included (such as the AWS Java SDK and its dependencies). That doesn't seem to be possible without what I imagine will become an unruly and fragile exclusion list, though.

Thanks,
Jonathan
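
For the conflicts above that involve only stray files (javac.sh, package-info.class, UnusedStubClass.class), one commonly used alternative to a long exclusion list is a custom merge strategy in sbt-assembly. The following is only a minimal sketch, not something proposed in this thread: it assumes sbt 0.13 with sbt-assembly 0.12.0 or later (where the key is named assemblyMergeStrategy), and the choice of MergeStrategy.first for each pattern is an assumption that the duplicate copies are interchangeable.

```scala
// build.sbt fragment (sketch; assumes sbt-assembly 0.12.0+ key names)
assemblyMergeStrategy in assembly := {
  // Stub class shipped inside several Spark artifacts: keep the first copy found.
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") =>
    MergeStrategy.first
  // package-info.class files carry no runtime behavior the app relies on (assumption).
  case path if path.endsWith("package-info.class") =>
    MergeStrategy.first
  // Stray script reported in the conflicts above.
  case path if path.endsWith("javac.sh") =>
    MergeStrategy.first
  // Everything else falls back to the plugin's default strategy.
  case other =>
    val defaultStrategy = (assemblyMergeStrategy in assembly).value
    defaultStrategy(other)
}
```

MergeStrategy.first silently keeps whichever copy appears first on the classpath, so it is only reasonable for files whose contents do not matter at run time. It does not stop Spark or Hadoop jars from being bundled; that still depends on "provided" scoping or exclusions.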