[jira] [Created] (SPARK-1698) Improve spark integration
Guoqiang Li created SPARK-1698:
--
Summary: Improve spark integration
Key: SPARK-1698
URL: https://issues.apache.org/jira/browse/SPARK-1698
Project: Spark
Issue Type: Improvement
Components: Build, Deploy
Reporter: Guoqiang Li
Assignee: Guoqiang Li
Fix For: 1.0.0

Using the shade plugin to create one big JAR with all the dependencies can cause a few problems:
1. The jars' meta information is lost
2. Some files are overwritten, e.g. plugin.xml
3. Different versions of the same jar may co-exist
4. Too big: Java 6 cannot read jars that require the ZIP64 format

--
This message was sent by Atlassian JIRA (v6.2#6252)
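Problems 2 and 4 above can be sketched in a few lines. This is only an illustrative sketch using Python's standard zipfile module (the jar names and contents are hypothetical): when several jars are naively merged into one archive, identically named entries such as plugin.xml clobber one another on lookup, and Java 6's zip reader predates ZIP64, so a fat jar must stay under the classic zip limits.

```python
import io
import zipfile

# Two hypothetical dependency jars that each ship their own plugin.xml.
jar_a = io.BytesIO()
with zipfile.ZipFile(jar_a, "w") as z:
    z.writestr("plugin.xml", "<plugin>from-jar-a</plugin>")

jar_b = io.BytesIO()
with zipfile.ZipFile(jar_b, "w") as z:
    z.writestr("plugin.xml", "<plugin>from-jar-b</plugin>")

# A naive merge into one fat jar stores both entries, but lookup by
# name can only ever see one of them -- problem 2 above.
fat = io.BytesIO()
with zipfile.ZipFile(fat, "w") as out:
    for src in (jar_a, jar_b):
        with zipfile.ZipFile(src) as z:
            for name in z.namelist():
                out.writestr(name, z.read(name))

with zipfile.ZipFile(fat) as z:
    names = z.namelist()           # both copies are physically present
    winner = z.read("plugin.xml")  # but only the last-written copy is visible

# Problem 4: Java 6 cannot read archives that need ZIP64, so a fat jar
# has to stay below this entry count (and below 4 GB total size).
JAVA6_MAX_ENTRIES = 65535
print(len(names), winner)
```

The Maven shade plugin's resource transformers exist precisely to merge such clashing entries instead of clobbering them, which is part of Sean Owen's point below.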
[jira] [Commented] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987682#comment-13987682 ]

Guoqiang Li commented on SPARK-1698:
--

[~srowen] About [SPARK-1681|https://issues.apache.org/jira/browse/SPARK-1681], there is only one solution: the datanucleus jars are added to the CLASSPATH. There may be other, better solutions, but I didn't find one. I disagree with [PR 610|https://github.com/apache/spark/pull/610]; it's not perfect. [PR 598|https://github.com/apache/spark/pull/598] references [HADOOP-7939|https://issues.apache.org/jira/browse/HADOOP-7939]; I think that approach is better. There is [another solution|https://github.com/witgo/spark/tree/standalone] based on [Invalid or corrupt JAR File built by Maven shade plugin|http://stackoverflow.com/questions/13021423/invalid-or-corrupt-jar-file-built-by-maven-shade-plugin], but it runs into [SI-6660 REPL: load transitive dependencies of JARs on classpath|https://issues.scala-lang.org/browse/SI-6660].
[jira] [Issue Comment Deleted] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guoqiang Li updated SPARK-1698:
---
Comment: was deleted (was: [The PR|https://github.com/apache/spark/pull/598])
[jira] [Commented] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987686#comment-13987686 ]

Sean Owen commented on SPARK-1698:
--

(Copying an earlier comment that went to the mailing list, but didn't make it here:)

#1 and #2 are not relevant to the issue of jar size. These can be problems in general, but I don't think there have been issues attributable to file clashes. Shading has mechanisms to deal with this anyway.

#3 is a problem in general too, but is not specific to shading. Where versions collide, build processes like Maven and shading must be used to resolve them. But this happens regardless of whether you shade a fat jar.

#4 is a real problem specific to Java 6. It does seem like it will be important to identify and remove more unnecessary dependencies to work around it. But shading per se is not the problem, and it is important to make a packaged jar for the app.

What are you proposing? Dependencies to be removed?
[jira] [Commented] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987698#comment-13987698 ]

Sean Owen commented on SPARK-1698:
--

What is the suggested change in this particular JIRA? I saw the PR, which seems to replace the shade plugin with the assembly plugin. Given the reference to https://issues.scala-lang.org/browse/SI-6660, are you suggesting that your assembly change packages things differently, by putting jars inside jars? Yes, the issue you link to is exactly the kind of problem that can occur with this approach; it comes up a bit in Hadoop as well, even though it is in theory a fine way to do things. But is that what you're getting at?
[jira] [Commented] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987699#comment-13987699 ]

Guoqiang Li commented on SPARK-1698:
--

[~srowen] In [PR 598|https://github.com/apache/spark/pull/598], #1, #2 and #4 do not occur, and #3 is very easy to find.
[jira] [Commented] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987708#comment-13987708 ]

Guoqiang Li commented on SPARK-1698:
--

[~srowen] In [PR 598|https://github.com/apache/spark/pull/598], the directory structure of Spark is similar to Hadoop 2.3.0's. There are three subcomponents: core, examples, and hive; their paths are share/spark/core, share/spark/examples, and share/spark/hive.
[jira] [Updated] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guoqiang Li updated SPARK-1698:
---
Description:
Using the shade plugin to create one big JAR with all the dependencies can cause a few problems:
1. The jars' meta information is lost
2. Some files are overwritten, e.g. plugin.xml
3. Different versions of a jar may co-exist
4. Too big: Java 6 cannot read jars that require the ZIP64 format

was:
Using the shade plugin to create one big JAR with all the dependencies can cause a few problems:
1. The jars' meta information is lost
2. Some files are overwritten, e.g. plugin.xml
3. Different versions of the jar may co-exist
4. Too big: Java 6 cannot read jars that require the ZIP64 format
[jira] [Comment Edited] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987699#comment-13987699 ]

Guoqiang Li edited comment on SPARK-1698 at 5/2/14 2:32 PM:

[~srowen] In [PR 598|https://github.com/apache/spark/pull/598], #1, #2 and #4 do not occur, and #3 is very easy to find. The directory structure of Spark is similar to Hadoop 2.3.0's. There are three subcomponents: core, examples, and hive. The directory structure of Spark:
{code}
+- SPARK_HOME
   +- bin
   |  \- *
   +- sbin
   |  \- *
   +- RELEASE
   +- conf
   |  \- *
   +- python
   |  \- *
   +- share
      \- spark
         +- core
         |  +- lib
         |  |  \- *.jar
         |  +- spark-core*.jar
         |  +- spark-repl*.jar
         |  +- spark-yarn*.jar
         |  +- spark-bagel*.jar
         |  +- spark-graphx*.jar
         |  +- spark-sql*.jar
         |  +- spark-catalyst*.jar
         |  +- spark-mllib*.jar
         |  +- spark-streaming*.jar
         +- hive
         |  +- lib
         |  |  \- *.jar
         |  \- spark-hive*.jar
         +- examples
            +- lib
            |  \- *.jar
            \- spark-examples*.jar
{code}

was (Author: gq):
[~srowen] In [PR 598|https://github.com/apache/spark/pull/598], #1, #2 and #4 do not occur, and #3 is very easy to find. The directory structure of Spark is similar to Hadoop 2.3.0's. There are three subcomponents: core, examples, and hive; their paths are share/spark/core, share/spark/examples, and share/spark/hive, and their dependency paths are share/spark/core/lib, share/spark/examples/lib, and share/spark/hive/lib.
[jira] [Comment Edited] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987699#comment-13987699 ]

Guoqiang Li edited comment on SPARK-1698 at 5/2/14 2:36 PM:

[~srowen] In [PR 598|https://github.com/apache/spark/pull/598], #1, #2 and #4 do not occur, and #3 is very easy to find. The directory structure of Spark is similar to Hadoop 2.3.0's. There are three subcomponents: core, examples, and hive. The directory structure of Spark:
{code}
+- SPARK_HOME
   +- bin
   |  \- *
   +- sbin
   |  \- *
   +- RELEASE
   +- conf
   |  \- *
   +- python
   |  \- *
   \- share
      \- spark
         +- core
         |  +- lib
         |  |  \- *.jar
         |  +- spark-core*.jar
         |  +- spark-repl*.jar
         |  +- spark-yarn*.jar
         |  +- spark-bagel*.jar
         |  +- spark-graphx*.jar
         |  +- spark-sql*.jar
         |  +- spark-catalyst*.jar
         |  +- spark-mllib*.jar
         |  \- spark-streaming*.jar
         +- hive
         |  +- lib
         |  |  \- *.jar
         |  \- spark-hive*.jar
         +- examples
            +- lib
            |  \- *.jar
            \- spark-examples*.jar
{code}

was (Author: gq):
[~srowen] In [PR 598|https://github.com/apache/spark/pull/598], #1, #2 and #4 do not occur, and #3 is very easy to find. The directory structure of Spark is similar to Hadoop 2.3.0's. There are three subcomponents: core, examples, and hive. The directory structure of Spark:
{code}
+- SPARK_HOME
   +- bin
   |  \- *
   +- sbin
   |  \- *
   +- RELEASE
   +- conf
   |  \- *
   +- python
   |  \- *
   +- share
      \- spark
         +- core
         |  +- lib
         |  |  \- *.jar
         |  +- spark-core*.jar
         |  +- spark-repl*.jar
         |  +- spark-yarn*.jar
         |  +- spark-bagel*.jar
         |  +- spark-graphx*.jar
         |  +- spark-sql*.jar
         |  +- spark-catalyst*.jar
         |  +- spark-mllib*.jar
         |  \- spark-streaming*.jar
         +- hive
         |  +- lib
         |  |  \- *.jar
         |  \- spark-hive*.jar
         +- examples
            +- lib
            |  \- *.jar
            \- spark-examples*.jar
{code}
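The layout above implies that launch scripts assemble the runtime classpath from per-component directories rather than from one fat jar. The following is only an illustrative sketch of that idea (the function name and the decision to sort entries are my own; in Spark itself this logic would live in a script like bin/compute-classpath.sh):

```python
import os

def build_classpath(spark_home):
    """Collect the component jars and their third-party dependencies,
    mirroring the proposed share/spark/{core,hive,examples} layout."""
    entries = []
    base = os.path.join(spark_home, "share", "spark")
    for component in sorted(os.listdir(base)):
        comp_dir = os.path.join(base, component)
        # The component's own spark-*.jar files sit directly in its directory.
        for name in sorted(os.listdir(comp_dir)):
            if name.endswith(".jar"):
                entries.append(os.path.join(comp_dir, name))
        # Third-party dependencies sit one level down, in lib/.
        lib_dir = os.path.join(comp_dir, "lib")
        if os.path.isdir(lib_dir):
            for name in sorted(os.listdir(lib_dir)):
                if name.endswith(".jar"):
                    entries.append(os.path.join(lib_dir, name))
    return os.pathsep.join(entries)
```

Keeping each dependency as its own file on the classpath is what sidesteps problems 1, 2 and 4: no jar is unpacked or merged, so nothing is lost or overwritten, and no single archive grows past the Java 6 limits.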
[jira] [Resolved] (SPARK-1695) java8-tests compiler error: package com.google.common.collections does not exist
[ https://issues.apache.org/jira/browse/SPARK-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-1695.
--
Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 611
[https://github.com/apache/spark/pull/611]

java8-tests compiler error: package com.google.common.collections does not exist
Key: SPARK-1695
URL: https://issues.apache.org/jira/browse/SPARK-1695
Project: Spark
Issue Type: Bug
Components: Build, Java API
Reporter: Guoqiang Li
Assignee: Guoqiang Li
Fix For: 1.0.0
[jira] [Created] (SPARK-1701) Inconsistent naming: slice or partition
Daniel Darabos created SPARK-1701:
--
Summary: Inconsistent naming: slice or partition
Key: SPARK-1701
URL: https://issues.apache.org/jira/browse/SPARK-1701
Project: Spark
Issue Type: Improvement
Reporter: Daniel Darabos
Priority: Minor

Throughout the documentation and code, slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated than a search-and-replace. I can take a stab at it, if you agree.
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988412#comment-13988412 ]

Daniel Darabos commented on SPARK-1701:
---

Some examples are mentioned in http://stackoverflow.com/questions/23436640/what-is-the-different-between-an-rdd-partition-and-a-slice-in-apache-spark
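Before attempting the search-and-replace, it helps to measure how big the rename really is. A throwaway sketch (the scan itself is hypothetical; only identifiers like numSlices actually appear in Spark) that maps each source file to the slice-derived names it contains:

```python
import os
import re

# Match identifiers built around "slice"/"Slice", e.g. numSlices, sliceSeq.
SLICE_PATTERN = re.compile(r"\b\w*[Ss]lices?\w*\b")

def slice_usages(root):
    """Map each *.scala/*.py/*.md file under root to the sorted set of
    slice-like identifiers it contains."""
    hits = {}
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith((".scala", ".py", ".md")):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as f:
                found = SLICE_PATTERN.findall(f.read())
            if found:
                hits[path] = sorted(set(found))
    return hits
```

A blind replace is risky for public API names like numSlices, so a scan like this separates documentation-only occurrences (safe to rename) from user-facing parameters (which need deprecation).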
[jira] [Updated] (SPARK-1556) jets3t dep doesn't update properly with newer Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1556:
---
Summary: jets3t dep doesn't update properly with newer Hadoop versions (was: jets3t dependency is outdated)

jets3t dep doesn't update properly with newer Hadoop versions
Key: SPARK-1556
URL: https://issues.apache.org/jira/browse/SPARK-1556
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.8.1, 0.9.0, 1.0.0
Reporter: Nan Zhu
Assignee: Nan Zhu
Fix For: 1.0.0

In Hadoop 2.2.x and newer, jets3t 0.9.0, which defines S3ServiceException/ServiceException, is introduced; however, Spark still relies on jets3t 0.7.x, which has no definition of these classes. What I hit is:
{code}
14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
    at $iwC$$iwC$$iwC$$iwC.init(console:15)
    at $iwC$$iwC$$iwC.init(console:20)
    at $iwC$$iwC.init(console:22)
    at $iwC.init(console:24)
    at init(console:26)
    at .init(console:30)
    at .clinit(console)
    at .init(console:7)
    at .clinit(console)
    at $print(console)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
    at
{code}
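A NoClassDefFoundError like the one above means no jar on the classpath provides the entry org/jets3t/service/S3ServiceException.class. A small diagnostic sketch (stdlib zipfile only; the jar file names used in the usage below are hypothetical) that reports which jars, if any, contain a given class:

```python
import zipfile

def jars_providing(class_name, jar_paths):
    """Return the jars that contain the given binary class name,
    e.g. 'org.jets3t.service.S3ServiceException'."""
    entry = class_name.replace(".", "/") + ".class"
    providers = []
    for path in jar_paths:
        with zipfile.ZipFile(path) as jar:
            if entry in jar.namelist():
                providers.append(path)
    return providers
```

If this returns an empty list for the classpath that produced the error, only a jets3t 0.9.x jar (which defines S3ServiceException) will fix it; a jets3t 0.7.x jar would show up for older classes like S3Service but not for this one.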