[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation
[ https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6391:
-----------------------------
    Target Version/s: 1.4.0
    Fix Version/s: (was: 1.4.0)

[~haoyuan] we set Fix Version when the issue is Resolved. At best, set Target Version.

Update Tachyon version compatibility documentation
--------------------------------------------------

    Key: SPARK-6391
    URL: https://issues.apache.org/jira/browse/SPARK-6391
    Project: Spark
    Issue Type: Documentation
    Components: Documentation
    Affects Versions: 1.3.0
    Reporter: Calvin Jia

Tachyon v0.6 has an API change in the client; it would be helpful to document Tachyon-Spark compatibility across versions.
[jira] [Updated] (SPARK-6391) Update Tachyon version compatibility documentation
[ https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haoyuan Li updated SPARK-6391:
------------------------------
    Fix Version/s: 1.4.0

Update Tachyon version compatibility documentation
--------------------------------------------------

    Key: SPARK-6391
    URL: https://issues.apache.org/jira/browse/SPARK-6391
    Project: Spark
    Issue Type: Documentation
    Components: Documentation
    Affects Versions: 1.3.0
    Reporter: Calvin Jia
    Fix For: 1.4.0

Tachyon v0.6 has an API change in the client; it would be helpful to document Tachyon-Spark compatibility across versions.
[jira] [Commented] (SPARK-6299) ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.
[ https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385516#comment-14385516 ]

Chip Senkbeil commented on SPARK-6299:
--------------------------------------

FYI, we had the same issue on Mesos for 1.2.1 when the class was defined through the REPL. So it was not limited to standalone mode.

ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.
---------------------------------------------------------------------------------------------

    Key: SPARK-6299
    URL: https://issues.apache.org/jira/browse/SPARK-6299
    Project: Spark
    Issue Type: Bug
    Components: Spark Shell
    Affects Versions: 1.2.1, 1.3.0
    Reporter: Kevin (Sangwoo) Kim
    Assignee: Kevin (Sangwoo) Kim
    Fix For: 1.3.1, 1.4.0

Anyone can reproduce this issue with the code below (it runs fine in local mode and in Spark 1.1.1, but throws an exception on clusters):

{code}
case class ClassA(value: String)
val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2"))))
rdd.groupByKey.collect
{code}

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
{code}
[jira] [Commented] (SPARK-6391) Update Tachyon version compatibility documentation
[ https://issues.apache.org/jira/browse/SPARK-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385584#comment-14385584 ]

Haoyuan Li commented on SPARK-6391:
-----------------------------------

Thanks [~sowen].

Update Tachyon version compatibility documentation
--------------------------------------------------

    Key: SPARK-6391
    URL: https://issues.apache.org/jira/browse/SPARK-6391
    Project: Spark
    Issue Type: Documentation
    Components: Documentation
    Affects Versions: 1.3.0
    Reporter: Calvin Jia

Tachyon v0.6 has an API change in the client; it would be helpful to document Tachyon-Spark compatibility across versions.
[jira] [Created] (SPARK-6589) SQLUserDefinedType failed in spark-shell
Benyi Wang created SPARK-6589:
------------------------------

    Summary: SQLUserDefinedType failed in spark-shell
    Key: SPARK-6589
    URL: https://issues.apache.org/jira/browse/SPARK-6589
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Environment: CDH 5.3.2
    Reporter: Benyi Wang

{{DataType.fromJson}} will fail in spark-shell if the schema includes a UDT. It works when run in an application. As a result, I cannot read a Parquet file that includes a UDT field. ({{DataType.fromCaseClass}} does not support UDT either.)

I can load the class, which shows that my UDT is on the classpath:

{code}
scala> Class.forName("com.bwang.MyTestUDT")
res6: Class[_] = class com.bwang.MyTestUDT
{code}

But DataType fails:

{code}
scala> DataType.fromJson(json)
java.lang.ClassNotFoundException: com.bwang.MyTestUDT
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:190)
    at org.apache.spark.sql.catalyst.types.DataType$.parseDataType(dataTypes.scala:77)
{code}

The reason is that DataType.fromJson tries to load {{udtClass}} using this code:

{code}
case JSortedObject(
    ("class", JString(udtClass)),
    ("pyClass", _),
    ("sqlType", _),
    ("type", JString("udt"))) =>
  Class.forName(udtClass).newInstance().asInstanceOf[UserDefinedType[_]]
{code}

Unfortunately, my UDT is loaded by {{SparkIMain$TranslatingClassLoader}}, but DataType is loaded by {{Launcher$AppClassLoader}}:

{code}
scala> DataType.getClass.getClassLoader
res2: ClassLoader = sun.misc.Launcher$AppClassLoader@6876fb1b

scala> this.getClass.getClassLoader
res3: ClassLoader = org.apache.spark.repl.SparkIMain$TranslatingClassLoader@63d36b29
{code}
[jira] [Assigned] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-5124:
-----------------------------------
    Assignee: Shixiong Zhu (was: Apache Spark)

Standardize internal RPC interface
----------------------------------

    Key: SPARK-5124
    URL: https://issues.apache.org/jira/browse/SPARK-5124
    Project: Spark
    Issue Type: Sub-task
    Components: Spark Core
    Reporter: Reynold Xin
    Assignee: Shixiong Zhu
    Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf

In Spark we use Akka as the RPC layer. It would be great if we could standardize the internal RPC interface to facilitate testing. This would also provide the foundation for trying other RPC implementations in the future.
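The design itself lives in the attached drafts; purely to illustrate what "standardizing the interface" means, here is a minimal Scala sketch. All names and signatures below (RpcEnv, RpcEndpoint, RpcCallContext) are hypothetical placeholders, not the API from the design docs:

{code}
// Hypothetical sketch only -- the real interface is defined in the attached drafts.
trait RpcCallContext {
  def reply(response: Any): Unit       // answer a request-reply (ask) message
  def sendFailure(e: Throwable): Unit  // propagate a failure to the caller
}

trait RpcEndpoint {
  // fire-and-forget messages
  def receive: PartialFunction[Any, Unit]
  // request-reply messages
  def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]
}

trait RpcEnv {
  def setupEndpoint(name: String, endpoint: RpcEndpoint): Unit
  def stop(endpoint: RpcEndpoint): Unit
}
{code}

The point of such an abstraction is that an Akka-backed {{RpcEnv}} could be swapped for a test double or another transport without touching callers, which is the testability argument in the description.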
[jira] [Assigned] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-5124:
-----------------------------------
    Assignee: Apache Spark (was: Shixiong Zhu)

Standardize internal RPC interface
----------------------------------

    Key: SPARK-5124
    URL: https://issues.apache.org/jira/browse/SPARK-5124
    Project: Spark
    Issue Type: Sub-task
    Components: Spark Core
    Reporter: Reynold Xin
    Assignee: Apache Spark
    Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf

In Spark we use Akka as the RPC layer. It would be great if we could standardize the internal RPC interface to facilitate testing. This would also provide the foundation for trying other RPC implementations in the future.
[jira] [Assigned] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-5494:
-----------------------------------
    Assignee: Apache Spark

SparkSqlSerializer Ignores KryoRegistrators
-------------------------------------------

    Key: SPARK-5494
    URL: https://issues.apache.org/jira/browse/SPARK-5494
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Reporter: Hamel Ajay Kothari
    Assignee: Apache Spark

We should make SparkSqlSerializer call {{super.newKryo}} before doing any of its custom setup, in order to make sure it picks up custom KryoRegistrators.
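A minimal sketch of the fix the description asks for, assuming the standard {{KryoSerializer}} extension point; this is illustrative, not the actual patch:

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Start from super.newKryo() so classes registered via spark.kryo.registrator
// are applied before any SQL-specific registrations.
class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()  // picks up user KryoRegistrators
    // ... register SQL-internal classes on `kryo` here ...
    kryo
  }
}
{code}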
[jira] [Assigned] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-5494:
-----------------------------------
    Assignee: (was: Apache Spark)

SparkSqlSerializer Ignores KryoRegistrators
-------------------------------------------

    Key: SPARK-5494
    URL: https://issues.apache.org/jira/browse/SPARK-5494
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Reporter: Hamel Ajay Kothari

We should make SparkSqlSerializer call {{super.newKryo}} before doing any of its custom setup, in order to make sure it picks up custom KryoRegistrators.
[jira] [Updated] (SPARK-5946) Add Python API for Kafka direct stream
[ https://issues.apache.org/jira/browse/SPARK-5946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-5946:
---------------------------------
    Target Version/s: 1.4.0

Add Python API for Kafka direct stream
--------------------------------------

    Key: SPARK-5946
    URL: https://issues.apache.org/jira/browse/SPARK-5946
    Project: Spark
    Issue Type: Improvement
    Components: PySpark, Streaming
    Affects Versions: 1.3.0
    Reporter: Saisai Shao

Add the Python API for the Kafka direct stream. For now this only adds the {{createDirectStream}} API, not {{createRDD}}, since the latter needs some Python wrappers around Java objects; this will be improved according to the review comments.
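For reference, this is the existing Scala API that the new Python {{createDirectStream}} would mirror (the broker address and topic name below are placeholders):

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(2))

// Direct (receiver-less) stream: offsets are tracked by Spark, not ZooKeeper.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.map(_._2).print()  // each record is a (key, value) pair
ssc.start()
{code}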
[jira] [Assigned] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6369:
-----------------------------------
    Assignee: Apache Spark (was: Cheng Lian)

InsertIntoHiveTable should use logic from SparkHadoopWriter
-----------------------------------------------------------

    Key: SPARK-6369
    URL: https://issues.apache.org/jira/browse/SPARK-6369
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Reporter: Michael Armbrust
    Assignee: Apache Spark
    Priority: Blocker

Right now it is possible that we will corrupt the output if there is a race between competing speculative tasks.
[jira] [Assigned] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6369:
-----------------------------------
    Assignee: Cheng Lian (was: Apache Spark)

InsertIntoHiveTable should use logic from SparkHadoopWriter
-----------------------------------------------------------

    Key: SPARK-6369
    URL: https://issues.apache.org/jira/browse/SPARK-6369
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Reporter: Michael Armbrust
    Assignee: Cheng Lian
    Priority: Blocker

Right now it is possible that we will corrupt the output if there is a race between competing speculative tasks.
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385616#comment-14385616 ]

Nan Zhu commented on SPARK-6592:
--------------------------------

also cc: [~lian cheng] [~marmbrus]

API of Row trait should be presented in Scala doc
-------------------------------------------------

    Key: SPARK-6592
    URL: https://issues.apache.org/jira/browse/SPARK-6592
    Project: Spark
    Issue Type: Bug
    Components: Documentation, SQL
    Affects Versions: 1.3.0
    Reporter: Nan Zhu

Currently the API of the Row class is not presented in Scaladoc, though we have many occasions to use it. The reason is that we ignore all files under catalyst directly in SparkBuild.scala when generating Scaladoc (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).

What's the best approach to fix this? [~rxin]
[jira] [Commented] (SPARK-6592) API of Row trait should be presented in Scala doc
[ https://issues.apache.org/jira/browse/SPARK-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385617#comment-14385617 ]

Reynold Xin commented on SPARK-6592:
------------------------------------

Can you try changing that line to "spark/sql/catalyst"? Then it should filter out only the catalyst package, not the catalyst module.

API of Row trait should be presented in Scala doc
-------------------------------------------------

    Key: SPARK-6592
    URL: https://issues.apache.org/jira/browse/SPARK-6592
    Project: Spark
    Issue Type: Bug
    Components: Documentation, SQL
    Affects Versions: 1.3.0
    Reporter: Nan Zhu

Currently the API of the Row class is not presented in Scaladoc, though we have many occasions to use it. The reason is that we ignore all files under catalyst directly in SparkBuild.scala when generating Scaladoc (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).

What's the best approach to fix this? [~rxin]
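A sketch of what that suggestion could look like at the referenced SparkBuild.scala line; the surrounding expression is paraphrased and {{unidocSources}} is a placeholder name, so treat this as illustrative only:

{code}
// Before (paraphrased): any source file whose path mentions "catalyst" is
// dropped from Scaladoc -- including public classes like Row that merely
// live in the catalyst *module*.
unidocSources.map(_.filterNot(_.getCanonicalPath.contains("catalyst")))

// After: key on the package directory instead, so only the internal
// org.apache.spark.sql.catalyst package is hidden.
unidocSources.map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/sql/catalyst")))
{code}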
[jira] [Created] (SPARK-6591) Python data source load options should auto convert common types into strings
Reynold Xin created SPARK-6591:
-------------------------------

    Summary: Python data source load options should auto convert common types into strings
    Key: SPARK-6591
    URL: https://issues.apache.org/jira/browse/SPARK-6591
    Project: Spark
    Issue Type: Improvement
    Components: PySpark, SQL
    Reporter: Reynold Xin
    Assignee: Davies Liu

See the discussion at: https://github.com/databricks/spark-csv/pull/39

If the caller invokes

{code}
sqlContext.load("com.databricks.spark.csv", path="cars.csv", header=True)
{code}

we should automatically turn {{header}} into {{"true"}} in string form. We should do this for booleans and numeric values.

cc [~yhuai]
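A minimal sketch of the conversion rule on the JVM side, assuming option values arrive through Py4J as boxed Java types; the helper name {{toOptionString}} is made up for illustration:

{code}
// Hypothetical helper: normalize Python option values to the strings a data
// source expects (Python's True arrives here as java.lang.Boolean).
def toOptionString(value: Any): String = value match {
  case b: java.lang.Boolean => b.toString  // True -> "true"
  case n: java.lang.Number  => n.toString  // 1 -> "1", 1.5 -> "1.5"
  case s: String            => s           // strings pass through unchanged
  case other                => other.toString
}
{code}

With something like this in place, {{header=True}} and {{header="true"}} would behave identically from Python.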
[jira] [Updated] (SPARK-6591) Python data source load options should auto convert common types into strings
[ https://issues.apache.org/jira/browse/SPARK-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6591:
-------------------------------
    Labels: DataFrame DataSource (was: )

Python data source load options should auto convert common types into strings
------------------------------------------------------------------------------

    Key: SPARK-6591
    URL: https://issues.apache.org/jira/browse/SPARK-6591
    Project: Spark
    Issue Type: Improvement
    Components: PySpark, SQL
    Reporter: Reynold Xin
    Assignee: Davies Liu
    Labels: DataFrame, DataSource

See the discussion at: https://github.com/databricks/spark-csv/pull/39

If the caller invokes

{code}
sqlContext.load("com.databricks.spark.csv", path="cars.csv", header=True)
{code}

we should automatically turn {{header}} into {{"true"}} in string form. We should do this for booleans and numeric values.

cc [~yhuai]
[jira] [Created] (SPARK-6592) API of Row trait should be presented in Scala doc
Nan Zhu created SPARK-6592:
---------------------------

    Summary: API of Row trait should be presented in Scala doc
    Key: SPARK-6592
    URL: https://issues.apache.org/jira/browse/SPARK-6592
    Project: Spark
    Issue Type: Bug
    Components: Documentation, SQL
    Affects Versions: 1.3.0
    Reporter: Nan Zhu

Currently the API of the Row class is not presented in Scaladoc, though we have many occasions to use it. The reason is that we ignore all files under catalyst directly in SparkBuild.scala when generating Scaladoc (https://github.com/apache/spark/blob/f75f633b21faaf911f04aeff847f25749b1ecd89/project/SparkBuild.scala#L369).

What's the best approach to fix this? [~rxin]
[jira] [Commented] (SPARK-2973) Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385606#comment-14385606 ]

Apache Spark commented on SPARK-2973:
-------------------------------------

User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/5247

Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
-------------------------------------------------------------------------

    Key: SPARK-2973
    URL: https://issues.apache.org/jira/browse/SPARK-2973
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Reporter: Aaron Davidson
    Assignee: Cheng Lian
    Priority: Blocker
    Fix For: 1.2.0

Right now, {{sql("show tables").collect()}} will start a Spark job which shows up in the UI. There should be a way to get these results without that step.
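The intuition behind the ticket, as a self-contained toy model (the names below are illustrative, not Spark's actual classes): a command's result rows already exist on the driver, so collect() can hand them back without scheduling a job:

{code}
// Toy model: a "command" computes its rows eagerly on the driver.
trait Command { def run(): Seq[Seq[Any]] }

class ExecutedCommand(cmd: Command) {
  // Rows are materialized driver-side exactly once.
  lazy val sideEffectResult: Seq[Seq[Any]] = cmd.run()
  // collect() just returns them -- no tasks, nothing in the UI.
  def executeCollect(): Array[Seq[Any]] = sideEffectResult.toArray
}
{code}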
[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-6575:
------------------------------
    Description:
        Consider a metastore Parquet table that
        # doesn't have a schema evolution issue
        # has lots of data files and/or partitions
        In this case, driver-side schema merging can be both slow and unnecessary. It would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table.
    was:
        Consider a metastore Parquet table that
        # doesn't have a schema evolution issue
        # has lots of data files and/or partitions
        In this case, driver-side schema merging can be both slow and unnecessary. It would be good to have a configuration to let the user disable schema merging when coverting such a metastore Parquet table.

Add configuration to disable schema merging while converting metastore Parquet tables
--------------------------------------------------------------------------------------

    Key: SPARK-6575
    URL: https://issues.apache.org/jira/browse/SPARK-6575
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.3.0
    Reporter: Cheng Lian
    Assignee: Cheng Lian

Consider a metastore Parquet table that
# doesn't have a schema evolution issue
# has lots of data files and/or partitions

In this case, driver-side schema merging can be both slow and unnecessary. It would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table.
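The exact configuration key was still to be decided in this ticket; as a hypothetical usage sketch (flag name is a placeholder, run in a spark-shell session where {{sqlContext}} is provided):

{code}
// Hypothetical flag name -- whatever key the ticket settles on.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")

// Subsequent reads of the converted metastore Parquet table would then take
// the schema from a single footer instead of merging all part-file schemas.
val df = sqlContext.table("my_metastore_parquet_table")
{code}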
[jira] [Updated] (SPARK-6590) Make DataFrame.where accept a string conditionExpr
[ https://issues.apache.org/jira/browse/SPARK-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-6590:
----------------------------
    Priority: Minor (was: Major)

Make DataFrame.where accept a string conditionExpr
--------------------------------------------------

    Key: SPARK-6590
    URL: https://issues.apache.org/jira/browse/SPARK-6590
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 1.3.0
    Reporter: Yin Huai
    Assignee: Yin Huai
    Priority: Minor

In our doc, we say {{where}} is an alias of {{filter}}. However, {{where}} does not support a conditionExpr given as a string.
[jira] [Created] (SPARK-6590) Make DataFrame.where accept a string conditionExpr
Yin Huai created SPARK-6590:
----------------------------

    Summary: Make DataFrame.where accept a string conditionExpr
    Key: SPARK-6590
    URL: https://issues.apache.org/jira/browse/SPARK-6590
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 1.3.0
    Reporter: Yin Huai
    Assignee: Yin Huai

In our doc, we say {{where}} is an alias of {{filter}}. However, {{where}} does not support a conditionExpr given as a string.
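Since {{filter}} already accepts a SQL expression string in 1.3, a minimal sketch of the missing overload is just a delegation (illustrative, not the actual patch):

{code}
// Inside DataFrame: mirror the existing filter(conditionExpr: String) overload.
def where(conditionExpr: String): DataFrame = filter(conditionExpr)
{code}

so that {{df.where("age > 15")}} behaves exactly like {{df.filter("age > 15")}}.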
[jira] [Commented] (SPARK-6589) SQLUserDefinedType failed in spark-shell
[ https://issues.apache.org/jira/browse/SPARK-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385499#comment-14385499 ]

Benyi Wang commented on SPARK-6589:
-----------------------------------

I found a way to work around this issue, but I still think DataType should find a better way to locate the correct class loader:

{code}
# Put the UDT jar on SPARK_CLASSPATH so that Launcher$AppClassLoader can find it.
export SPARK_CLASSPATH=myUDT.jar
spark-shell --jars myUDT.jar ...
{code}

SQLUserDefinedType failed in spark-shell
----------------------------------------

    Key: SPARK-6589
    URL: https://issues.apache.org/jira/browse/SPARK-6589
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.2.0
    Environment: CDH 5.3.2
    Reporter: Benyi Wang

{{DataType.fromJson}} will fail in spark-shell if the schema includes a UDT. It works when run in an application. As a result, I cannot read a Parquet file that includes a UDT field. ({{DataType.fromCaseClass}} does not support UDT either.)

I can load the class, which shows that my UDT is on the classpath:

{code}
scala> Class.forName("com.bwang.MyTestUDT")
res6: Class[_] = class com.bwang.MyTestUDT
{code}

But DataType fails:

{code}
scala> DataType.fromJson(json)
java.lang.ClassNotFoundException: com.bwang.MyTestUDT
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:190)
    at org.apache.spark.sql.catalyst.types.DataType$.parseDataType(dataTypes.scala:77)
{code}

The reason is that DataType.fromJson tries to load {{udtClass}} using this code:

{code}
case JSortedObject(
    ("class", JString(udtClass)),
    ("pyClass", _),
    ("sqlType", _),
    ("type", JString("udt"))) =>
  Class.forName(udtClass).newInstance().asInstanceOf[UserDefinedType[_]]
{code}

Unfortunately, my UDT is loaded by {{SparkIMain$TranslatingClassLoader}}, but DataType is loaded by {{Launcher$AppClassLoader}}:

{code}
scala> DataType.getClass.getClassLoader
res2: ClassLoader = sun.misc.Launcher$AppClassLoader@6876fb1b

scala> this.getClass.getClassLoader
res3: ClassLoader = org.apache.spark.repl.SparkIMain$TranslatingClassLoader@63d36b29
{code}
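One possible direction for that "better way", sketched under the assumption that the REPL installs its class loader as the thread context loader; the helper below is hypothetical:

{code}
// Resolve the UDT class against the context class loader first, falling back
// to the loader that loaded this class (Launcher$AppClassLoader in the report).
def resolveUdtClass(udtClass: String): Class[_] = {
  val loader = Option(Thread.currentThread().getContextClassLoader)
    .getOrElse(getClass.getClassLoader)
  Class.forName(udtClass, true, loader)
}
{code}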
[jira] [Created] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
Liang-Chi Hsieh created SPARK-6586:
-----------------------------------

    Summary: Add the capability of retrieving original logical plan of DataFrame
    Key: SPARK-6586
    URL: https://issues.apache.org/jira/browse/SPARK-6586
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Reporter: Liang-Chi Hsieh
    Priority: Minor

In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan instead of the logical plan. However, because of that we can no longer see the logical plan of a {{DataFrame}}, and retrieving the original logical plan is still useful and important in some use cases.

In this PR, we introduce the capability of retrieving the original logical plan of a {{DataFrame}}. The approach is to add an {{analyzed}} variable to {{LogicalPlan}}. Once {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to {{true}}. In {{QueryExecution}}, we keep the original logical plan inside the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added that recursively replaces the analyzed logical plan with the original logical plan and retrieves it.

Besides the capability of retrieving the original logical plan, this modification also avoids re-analyzing a plan that is already analyzed.
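A self-contained toy model of the "skip re-analysis" half of the proposal; the real change touches Spark's {{LogicalPlan}}, {{Analyzer}}, and {{QueryExecution}}, and only the {{analyzed}} name mirrors the described design:

{code}
// Toy model: a plan remembers whether analysis has already run on it.
abstract class Plan { var analyzed: Boolean = false }

class Analyzer {
  def execute(plan: Plan): Plan =
    if (plan.analyzed) plan  // already analyzed: skip the (expensive) rules
    else {
      // ... run resolution rules here ...
      plan.analyzed = true
      plan
    }
}
{code}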
[jira] [Commented] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385261#comment-14385261 ]

Apache Spark commented on SPARK-6586:
-------------------------------------

User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5241

Add the capability of retrieving original logical plan of DataFrame
---------------------------------------------------------------------

    Key: SPARK-6586
    URL: https://issues.apache.org/jira/browse/SPARK-6586
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Reporter: Liang-Chi Hsieh
    Priority: Minor

In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan instead of the logical plan. However, because of that we can no longer see the logical plan of a {{DataFrame}}, and retrieving the original logical plan is still useful and important in some use cases.

In this PR, we introduce the capability of retrieving the original logical plan of a {{DataFrame}}. The approach is to add an {{analyzed}} variable to {{LogicalPlan}}. Once {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to {{true}}. In {{QueryExecution}}, we keep the original logical plan inside the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added that recursively replaces the analyzed logical plan with the original logical plan and retrieves it.

Besides the capability of retrieving the original logical plan, this modification also avoids re-analyzing a plan that is already analyzed.
[jira] [Assigned] (SPARK-6586) Add the capability of retrieving original logical plan of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6586:
-----------------------------------
    Assignee: Apache Spark

Add the capability of retrieving original logical plan of DataFrame
---------------------------------------------------------------------

    Key: SPARK-6586
    URL: https://issues.apache.org/jira/browse/SPARK-6586
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Reporter: Liang-Chi Hsieh
    Assignee: Apache Spark
    Priority: Minor

In order to fix a bug, since #5217 {{DataFrame}} uses the analyzed plan instead of the logical plan. However, because of that we can no longer see the logical plan of a {{DataFrame}}, and retrieving the original logical plan is still useful and important in some use cases.

In this PR, we introduce the capability of retrieving the original logical plan of a {{DataFrame}}. The approach is to add an {{analyzed}} variable to {{LogicalPlan}}. Once {{Analyzer}} finishes analysis, it sets {{analyzed}} of the {{LogicalPlan}} to {{true}}. In {{QueryExecution}}, we keep the original logical plan inside the analyzed plan. In {{LogicalPlan}}, a method {{originalPlan}} is added that recursively replaces the analyzed logical plan with the original logical plan and retrieves it.

Besides the capability of retrieving the original logical plan, this modification also avoids re-analyzing a plan that is already analyzed.
[jira] [Resolved] (SPARK-4941) Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)
[ https://issues.apache.org/jira/browse/SPARK-4941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-4941.
------------------------------
    Resolution: Cannot Reproduce

OK, we can reopen this if typos etc. are ruled out and it is reproducible against at least 1.3.0.

Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)
------------------------------------------------------------------------------

    Key: SPARK-4941
    URL: https://issues.apache.org/jira/browse/SPARK-4941
    Project: Spark
    Issue Type: Bug
    Components: YARN
    Reporter: Gurpreet Singh

I am specifying additional jars and a config XML file with the --jars and --files options, to be uploaded to the driver in the following spark-submit command. However, they are not getting uploaded, which results in job failure. It was working with the Spark 1.0.2 build. (Spark build being used: spark-1.2.0.tgz)

    $SPARK_HOME/bin/spark-submit \
      --class com.ebay.inc.scala.testScalaXML \
      --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar:/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar:/apache/hadoop/share/hadoop/common/lib/guava-11.0.2.jar \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 3 \
      --driver-memory 1G \
      --executor-memory 1G \
      /export/home/b_incdata_rw/gurpreetsingh/jar/testscalaxml_2.11-1.0.jar /export/home/b_incdata_rw/gurpreetsingh/sqlFramework.xml next_gen_linking \
      --queue hdmi-spark \
      --jars /export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-api-jdo-3.2.1.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-core-3.2.2.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-rdbms-3.2.1.jar,/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar,/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar,/apache/hadoop/share/hadoop/common/lib/hadoop-lzo-0.6.0.jar,/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar \
      --files /export/home/b_incdata_rw/gurpreetsingh/spark-1.0.2-bin-2.4.1/conf/hive-site.xml

    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    14/12/22 23:00:17 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
    14/12/22 23:00:17 INFO yarn.Client: Requesting a new application from cluster with 2026 NodeManagers
    14/12/22 23:00:17 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (16384 MB per container)
    14/12/22 23:00:17 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
    14/12/22 23:00:17 INFO yarn.Client: Setting up container launch context for our AM
    14/12/22 23:00:17 INFO yarn.Client: Preparing resources for our AM container
    14/12/22 23:00:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/12/22 23:00:18 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
    14/12/22 23:00:21 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 6623380 for b_incdata_rw on 10.115.201.75:8020
    14/12/22 23:00:21 INFO yarn.Client: Uploading resource file:/home/b_incdata_rw/gurpreetsingh/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar -> hdfs://-nn.vip.xxx.com:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/spark-assembly-1.2.0-hadoop2.4.0.jar
    14/12/22 23:00:24 INFO yarn.Client: Uploading resource file:/export/home/b_incdata_rw/gurpreetsingh/jar/firstsparkcode_2.11-1.0.jar -> hdfs://-nn.vip.xxx.com:8020:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/firstsparkcode_2.11-1.0.jar
    14/12/22 23:00:25 INFO yarn.Client: Setting up the launch environment for our AM container
[jira] [Resolved] (SPARK-6552) expose start-slave.sh to user and update outdated doc
[ https://issues.apache.org/jira/browse/SPARK-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-6552.
------------------------------
    Resolution: Fixed
    Fix Version/s: 1.4.0

Issue resolved by pull request 5205 [https://github.com/apache/spark/pull/5205]

expose start-slave.sh to user and update outdated doc
-----------------------------------------------------

    Key: SPARK-6552
    URL: https://issues.apache.org/jira/browse/SPARK-6552
    Project: Spark
    Issue Type: Improvement
    Components: Deploy, Documentation
    Reporter: Tao Wang
    Priority: Minor
    Fix For: 1.4.0

It would be better to expose start-slave.sh to users, to allow starting a worker on a single node. Since the documentation described starting a worker in the foreground way, I also changed it to the background way (using start-slave.sh).
[jira] [Commented] (SPARK-6571) MatrixFactorizationModel created by load fails on predictAll
[ https://issues.apache.org/jira/browse/SPARK-6571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385414#comment-14385414 ]

Apache Spark commented on SPARK-6571:
-------------------------------------

User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5243

MatrixFactorizationModel created by load fails on predictAll
-------------------------------------------------------------

    Key: SPARK-6571
    URL: https://issues.apache.org/jira/browse/SPARK-6571
    Project: Spark
    Issue Type: Bug
    Components: MLlib, PySpark
    Affects Versions: 1.3.0
    Reporter: Charles Hayden
    Assignee: Xiangrui Meng

This code, adapted from the documentation, fails when using a loaded model:

    from pyspark.mllib.recommendation import ALS, Rating, MatrixFactorizationModel

    r1 = (1, 1, 1.0)
    r2 = (1, 2, 2.0)
    r3 = (2, 1, 2.0)
    ratings = sc.parallelize([r1, r2, r3])
    model = ALS.trainImplicit(ratings, 1, seed=10)
    print '(2, 2)', model.predict(2, 2)
    # 0.43...

    testset = sc.parallelize([(1, 2), (1, 1)])
    print 'all', model.predictAll(testset).collect()
    # [Rating(user=1, product=1, rating=1.0...), Rating(user=1, product=2, rating=1.9...)]

    import os, tempfile
    path = tempfile.mkdtemp()
    model.save(sc, path)
    sameModel = MatrixFactorizationModel.load(sc, path)
    print '(2, 2)', sameModel.predict(2,2)
    sameModel.predictAll(testset).collect()

This gives:

    (2, 2) 0.443547642944
    all [Rating(user=1, product=1, rating=1.1538351103381217), Rating(user=1, product=2, rating=0.7153473708381739)]
    (2, 2) 0.443547642944
    ---------------------------------------------------------------------------
    Py4JError                                 Traceback (most recent call last)
    <ipython-input-18-af6612bed9d0> in <module>()
         19 sameModel = MatrixFactorizationModel.load(sc, path)
         20 print '(2, 2)', sameModel.predict(2,2)
    ---> 21 sameModel.predictAll(testset).collect()
         22

    /home/ubuntu/spark/python/pyspark/mllib/recommendation.pyc in predictAll(self, user_product)
        104         assert len(first) == 2, "user_product should be RDD of (user, product)"
        105         user_product = user_product.map(lambda (u, p): (int(u), int(p)))
    --> 106         return self.call("predict", user_product)
        107
        108     def userFeatures(self):

    /home/ubuntu/spark/python/pyspark/mllib/common.pyc in call(self, name, *a)
        134     def call(self, name, *a):
        135         """Call method of java_model"""
    --> 136         return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
        137
        138

    /home/ubuntu/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
        111     """ Call Java Function """
        112     args = [_py2java(sc, a) for a in args]
    --> 113     return _java2py(sc, func(*args))
        114
        115

    /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
        536         answer = self.gateway_client.send_command(command)
        537         return_value = get_return_value(answer, self.gateway_client,
    --> 538                 self.target_id, self.name)
        539
        540         for temp_arg in temp_args:

    /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        302                 raise Py4JError(
        303                     'An error occurred while calling {0}{1}{2}. Trace:\n{3}\n'.
    --> 304                     format(target_id, '.', name, value))
        305             else:
        306                 raise Py4JError(

    Py4JError: An error occurred while calling o450.predict. Trace:
    py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
        at py4j.Gateway.invoke(Gateway.java:252)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:744)
[jira] [Updated] (SPARK-6581) Metadata is missing when saving parquet file using hadoop 1.0.4
[ https://issues.apache.org/jira/browse/SPARK-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-6581:
------------------------------
    Target Version/s: 1.4.0

Metadata is missing when saving parquet file using hadoop 1.0.4
---------------------------------------------------------------

    Key: SPARK-6581
    URL: https://issues.apache.org/jira/browse/SPARK-6581
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.3.0
    Environment: hadoop 1.0.4
    Reporter: Pei-Lun Lee

When saving a parquet file with

{code}
df.save("foo", "parquet")
{code}

it generates only _common_metadata, while _metadata is missing:

{noformat}
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
{noformat}

If saving with

{code}
df.save("foo", "parquet", SaveMode.Overwrite)
{code}

both _metadata and _common_metadata are missing:

{noformat}
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
{noformat}
[jira] [Updated] (SPARK-6570) Spark SQL arrays: explode() fails and cannot save array type to Parquet
[ https://issues.apache.org/jira/browse/SPARK-6570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-6570:
------------------------------
    Target Version/s: 1.4.0

Spark SQL arrays: explode() fails and cannot save array type to Parquet
------------------------------------------------------------------------

    Key: SPARK-6570
    URL: https://issues.apache.org/jira/browse/SPARK-6570
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.3.0
    Reporter: Jon Chase

{code}
@Rule
public TemporaryFolder tmp = new TemporaryFolder();

@Test
public void testPercentileWithExplode() throws Exception {
    StructType schema = DataTypes.createStructType(Lists.newArrayList(
        DataTypes.createStructField("col1", DataTypes.StringType, false),
        DataTypes.createStructField("col2s", DataTypes.createArrayType(DataTypes.IntegerType, true), true)
    ));

    JavaRDD<Row> rowRDD = sc.parallelize(Lists.newArrayList(
        RowFactory.create("test", new int[]{1, 2, 3})
    ));

    DataFrame df = sql.createDataFrame(rowRDD, schema);
    df.registerTempTable("df");
    df.printSchema();

    List<int[]> ints = sql.sql("select col2s from df").javaRDD()
        .map(row -> (int[]) row.get(0)).collect();
    assertEquals(1, ints.size());
    assertArrayEquals(new int[]{1, 2, 3}, ints.get(0));

    // fails: lateral view explode does not work:
    // java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    List<Integer> explodedInts = sql.sql("select col2 from df lateral view explode(col2s) splode as col2").javaRDD()
        .map(row -> row.getInt(0)).collect();
    assertEquals(3, explodedInts.size());
    assertEquals(Lists.newArrayList(1, 2, 3), explodedInts);

    // fails: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    df.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/parquet");
    DataFrame loadedDf = sql.load(tmp.getRoot().getAbsolutePath() + "/parquet");
    loadedDf.registerTempTable("loadedDf");

    List<int[]> moreInts = sql.sql("select col2s from loadedDf").javaRDD()
        .map(row -> (int[]) row.get(0)).collect();
    assertEquals(1, moreInts.size());
    assertArrayEquals(new int[]{1, 2, 3}, moreInts.get(0));
}
{code}

{code}
root
 |-- col1: string (nullable = false)
 |-- col2s: array (nullable = true)
 |    |-- element: integer (containsNull = true)

ERROR org.apache.spark.executor.Executor Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq
    at org.apache.spark.sql.catalyst.expressions.Explode.eval(generators.scala:125) ~[spark-catalyst_2.10-1.3.0.jar:1.3.0]
    at org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:70) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
    at org.apache.spark.sql.execution.Generate$$anonfun$2$$anonfun$apply$1.apply(Generate.scala:69) ~[spark-sql_2.10-1.3.0.jar:1.3.0]
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) ~[scala-library-2.10.4.jar:na]
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
    at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library-2.10.4.jar:na]
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library-2.10.4.jar:na]
{code}
[jira] [Updated] (SPARK-6529) Word2Vec transformer
[ https://issues.apache.org/jira/browse/SPARK-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6529:
-----------------------------
    Fix Version/s: (was: 1.4.0)

Word2Vec transformer
--------------------

    Key: SPARK-6529
    URL: https://issues.apache.org/jira/browse/SPARK-6529
    Project: Spark
    Issue Type: Sub-task
    Components: ML
    Reporter: Xusen Yin
[jira] [Updated] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6209:
-----------------------------
    Fix Version/s: (was: 1.3.1)
                   (was: 1.4.0)

ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
--------------------------------------------------------------------------------------------------

    Key: SPARK-6209
    URL: https://issues.apache.org/jira/browse/SPARK-6209
    Project: Spark
    Issue Type: Bug
    Components: Spark Core
    Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
    Reporter: Josh Rosen
    Assignee: Josh Rosen
    Priority: Critical

ExecutorClassLoader does not ensure proper cleanup of the network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.

Here is a simple reproduction. With

{code}
./bin/spark-shell --master local-cluster[8,8,512]
{code}

run the following command:

{code}
sc.parallelize(1 to 1000, 1000).map { x =>
  try {
    Class.forName("some.class.that.does.not.Exist")
  } catch {
    case e: Exception => // do nothing
  }
  x
}.count()
{code}

This job will run 253 tasks, then will completely freeze without any errors or failed tasks. It looks like the driver has 253 threads blocked in socketRead0() calls:

{code}
[joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
     253     759   14674
{code}

e.g.

{code}
qtp1287429402-13 daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable [0x0001159bd000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
    at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
    at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
{code}

Jstack on the executors shows blocking in loadClass / findClass, where a single thread is RUNNABLE and waiting to hear back from the driver, and other executor threads are BLOCKED on object monitor synchronization at Class.forName0(). Remotely triggering a GC on a hanging executor allows the job to progress and complete more tasks before hanging again. If I repeatedly trigger GC on all of the executors, the job runs to completion:

{code}
jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
{code}

The culprit is a {{catch}} block that ignores all exceptions and performs no cleanup: https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94

This bug has been present since Spark 1.0.0, but I suspect we haven't seen it before because it's pretty hard to reproduce. Triggering this error requires a job with tasks that trigger ClassNotFoundExceptions yet are still able to run to completion. It also requires that executors are able to leak enough open connections to exhaust the class server's Jetty thread pool limit, which requires a large number of tasks (253+) and either a large number of executors or a very low amount of GC pressure on those executors (since GC will cause the leaked connections to be closed).

The fix here is pretty simple: add proper resource cleanup to this class.
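A sketch of what that cleanup could look like, simplified from the shape of ExecutorClassLoader's fetch path; the helper below is illustrative, not the actual patch:

{code}
import java.io.InputStream

// Read all class bytes, closing the HTTP stream on every path. Previously a
// failure left the stream (and its Jetty server thread) dangling.
def readClassBytes(open: () => InputStream): Array[Byte] = {
  val in = open()
  try {
    val out = new java.io.ByteArrayOutputStream()
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
    out.toByteArray
  } finally {
    in.close()  // releases the connection even when class loading fails
  }
}
{code}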
[jira] [Updated] (SPARK-6350) Make mesosExecutorCores configurable in mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6350:
-----------------------------
    Fix Version/s: (was: 1.3.1)
                   (was: 1.4.0)

Make mesosExecutorCores configurable in mesos fine-grained mode
---------------------------------------------------------------

    Key: SPARK-6350
    URL: https://issues.apache.org/jira/browse/SPARK-6350
    Project: Spark
    Issue Type: Improvement
    Components: Mesos
    Reporter: Jongyoul Lee
    Assignee: Jongyoul Lee
    Priority: Minor

When Spark runs in Mesos fine-grained mode, the Mesos slave launches an executor with a number of CPUs and some memory. However, the number of the executor's cores is always CPU_PER_TASKS, the same as spark.task.cpus. If I set that value to 5 for running an intensive task, a Mesos executor always consumes 5 cores even without any running task. This wastes resources. We should make the executor cores a configuration variable.
[jira] [Updated] (SPARK-6530) ChiSqSelector transformer
[ https://issues.apache.org/jira/browse/SPARK-6530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6530:
-----------------------------
    Fix Version/s: (was: 1.4.0)

ChiSqSelector transformer
-------------------------

    Key: SPARK-6530
    URL: https://issues.apache.org/jira/browse/SPARK-6530
    Project: Spark
    Issue Type: Sub-task
    Components: ML
    Reporter: Xusen Yin
[jira] [Updated] (SPARK-6528) IDF transformer
[ https://issues.apache.org/jira/browse/SPARK-6528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6528:
-----------------------------
    Fix Version/s: (was: 1.4.0)

IDF transformer
---------------

    Key: SPARK-6528
    URL: https://issues.apache.org/jira/browse/SPARK-6528
    Project: Spark
    Issue Type: Sub-task
    Components: ML
    Reporter: Xusen Yin
[jira] [Updated] (SPARK-6194) collect() in PySpark will cause memory leak in JVM
[ https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6194:
-----------------------------
    Fix Version/s: (was: 1.3.1)
                   (was: 1.4.0)
                   (was: 1.2.2)

collect() in PySpark will cause memory leak in JVM
--------------------------------------------------

    Key: SPARK-6194
    URL: https://issues.apache.org/jira/browse/SPARK-6194
    Project: Spark
    Issue Type: Bug
    Components: PySpark
    Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
    Reporter: Davies Liu
    Assignee: Davies Liu
    Priority: Critical

It can be reproduced by:

{code}
for i in range(40):
    sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect()
{code}

It will fail after 2 or 3 jobs, and runs totally successfully if I add `gc.collect()` after each job. We could call _detach() for the JavaList returned by collect in Java; will send out a PR for this.

Reported by Michael and commented on by Josh:

On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen joshro...@databricks.com wrote:

Based on Py4J's Memory Model page (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model): "Because Java objects on the Python side are involved in a circular reference (JavaObject and JavaMember reference each other), these objects are not immediately garbage collected once the last reference to the object is removed (but they are guaranteed to be eventually collected if the Python garbage collector runs before the Python program exits). In doubt, users can always call the detach function on the Python gateway to explicitly delete a reference on the Java side. A call to gc.collect() also usually works."

Maybe we should be manually calling detach() when the Python side has finished consuming temporary objects from the JVM. Do you have a small workload / configuration that reproduces the OOM which we can use to test a fix? I don't think that I've seen this issue in the past, but this might be because we mistook Java OOMs as being caused by collecting too much data rather than by memory leaks.

On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario mnaza...@palantir.com wrote:

Hi Josh,

I have a question about how PySpark does memory management in the Py4J bridge between the Java driver and the Python driver. I was wondering if there have been any memory problems in this system, because the Python garbage collector does not collect circular references immediately, and Py4J has circular references in each object it receives from Java.

When I dug through the PySpark code, I seemed to find that most RDD actions return by calling collect. In collect, you end up calling the Java RDD collect and getting an iterator from that. Would this be a possible cause for a Java driver OutOfMemoryException, because there are resources in Java which do not get freed up immediately? I have also seen that trying to take a lot of values from a dataset twice in a row can cause the Java driver to OOM (while just once works). Are there some other memory considerations that are relevant in the driver?

Thanks,
Michael
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6323: - Fix Version/s: (was: 1.4.0) Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for Gram matrix generation, which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). Here g(z) can be one of the constraints from the Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use the ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do now. ALM will be capable of solving the following problems: min f(x) + g(z). 1. The loss function f(x) can be LeastSquareLoss or LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss. 2. The constraints g(z) supported are the same as above, except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications. 3. For the solver we will use breeze.optimize.proximal.NonlinearMinimizer, which in turn uses a projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using a graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
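For readability, the two objectives in the description above can be restated in standard notation; this is only a restatement of what is written there, with the usual two-sided bound notation assumed for the affine + bounds case:
{code}
% Per-block quadratic (normal-equation) subproblem handled today:
\min_x \; \tfrac{1}{2}\, x^{\top} H x + c^{\top} x + g(z)

% More general objective targeted by the proposed ALM:
\min_x \; f(x) + g(z), \qquad f \in \{\text{least squares},\ \text{log-likelihood}\},
\qquad g(z)\ \text{a proximal constraint/regularizer (e.g. } \ell_1,\ \text{bounds } lb \le x \le ub,\ \text{equality)}
{code}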
[jira] [Updated] (SPARK-6582) Support ssl for this AvroSink in Spark Streaming External
[ https://issues.apache.org/jira/browse/SPARK-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6582: - Component/s: Streaming (Components please) Support ssl for this AvroSink in Spark Streaming External - Key: SPARK-6582 URL: https://issues.apache.org/jira/browse/SPARK-6582 Project: Spark Issue Type: Improvement Components: Streaming Reporter: SaintBacchus AvroSink already supports *ssl*, so it would be good to support *ssl* in the Spark Streaming external Flume module as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6060) List type missing for catalyst's package.scala
[ https://issues.apache.org/jira/browse/SPARK-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6060: - Fix Version/s: (was: 1.3.0) List type missing for catalyst's package.scala -- Key: SPARK-6060 URL: https://issues.apache.org/jira/browse/SPARK-6060 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Linux zeno 3.18.5 #1 SMP Sun Feb 1 23:51:17 CET 2015 ppc64 GNU/Linux, java version 1.7.0_65 OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-2) OpenJDK Zero VM (build 24.65-b04, interpreted mode), sbt launcher version 0.13.7 Reporter: Stephan Drescher Priority: Minor Labels: build, error Used command line: build/sbt -mem 1024 -Pyarn -Phive -Dhadoop.version=2.4.0 -Pbigtop-dist -DskipTests assembly Output: [error] while compiling: /home/spark/Developer/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/types/package.scala [error] during phase: jvm [error] library version: version 2.10.4 [error] compiler version: version 2.10.4 [error] reconstructed args: -bootclasspath /usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/resources.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/rt.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/jsse.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/jce.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/charsets.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/rhino.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/lib/jfr.jar:/usr/lib/jvm/java-7-openjdk-powerpc/jre/classes:/home/spark/.sbt/boot/scala-2.10.4/lib/scala-library.jar -deprecation -classpath
[jira] [Updated] (SPARK-5880) Change log level of batch pruning string in InMemoryColumnarTableScan from Info to Debug
[ https://issues.apache.org/jira/browse/SPARK-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5880: - Fix Version/s: (was: 1.3.0) Change log level of batch pruning string in InMemoryColumnarTableScan from Info to Debug Key: SPARK-5880 URL: https://issues.apache.org/jira/browse/SPARK-5880 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1, 1.3.0 Reporter: Nitin Goyal Priority: Trivial In InMemoryColumnarTableScan, we build a string of the statistics of all the columns and log it at INFO level whenever batch pruning happens. This becomes a performance hit when there are a large number of batches, a good number of columns, and almost every batch gets pruned. We can make the string evaluate lazily and change the log level to DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
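A minimal sketch of the proposed change, assuming a by-name logDebug in the style of Spark's Logging trait; the names below are illustrative and not the actual InMemoryColumnarTableScan code. The point is that the statistics string is only built when DEBUG logging is enabled.
{code}
object PruningLogSketch {
  var debugEnabled = false

  // Stand-in for Logging.logDebug: the by-name parameter means the message
  // string is only evaluated when DEBUG is actually on.
  def logDebug(msg: => String): Unit = {
    if (debugEnabled) println(s"DEBUG $msg")
  }

  def onBatchPruned(batchId: Int, columnStats: Seq[(String, String)]): Unit = {
    // The potentially expensive mkString is skipped entirely at INFO level.
    logDebug {
      val stats = columnStats.map { case (name, s) => s"$name: $s" }.mkString(", ")
      s"Pruned batch $batchId using statistics [$stats]"
    }
  }
}
{code}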
[jira] [Updated] (SPARK-5684) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values
[ https://issues.apache.org/jira/browse/SPARK-5684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5684: - Fix Version/s: (was: 1.3.0) Key not found exception is thrown in case location of added partition to a parquet table is different than a path containing the partition values - Key: SPARK-5684 URL: https://issues.apache.org/jira/browse/SPARK-5684 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0 Reporter: Yash Datta Create a partitioned parquet table : create table test_table (dummy string) partitioned by (timestamp bigint) stored as parquet; Add a partition to the table and specify a different location: alter table test_table add partition (timestamp=9) location '/data/pth/different' Run a simple select * query we get an exception : 15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from db4_mi2mi_binsrc1_default limit 5] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 21, localhost): java .util.NoSuchElementException: key not found: timestamp at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:58) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:141) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:128) at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) This happens because in parquet path it is assumed that (key=value) patterns are present in the partition location, which is not always the case! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
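To make the failure mode concrete, here is a toy sketch in plain Scala (not the actual ParquetTableOperations code; the helper name is hypothetical) of the assumption described above: partition values are recovered by scanning the file path for key=value segments, so a custom partition location with no such segment produces an empty map and the later lookup of 'timestamp' fails with "key not found".
{code}
object PartitionPathSketch {
  // Hypothetical helper: extract partition values from key=value path segments.
  def partitionValuesFromPath(path: String): Map[String, String] =
    path.split("/").iterator
      .filter(_.contains("="))
      .map { seg => val Array(k, v) = seg.split("=", 2); k -> v }
      .toMap

  def main(args: Array[String]): Unit = {
    // Default location: the partition value is recoverable from the path.
    println(partitionValuesFromPath("/warehouse/test_table/timestamp=9/part-0.parquet"))
    // Map(timestamp -> 9)

    // Custom location added via ALTER TABLE ... LOCATION: no key=value segment,
    // so a subsequent apply("timestamp") would throw NoSuchElementException.
    println(partitionValuesFromPath("/data/pth/different/part-0.parquet"))
    // Map()
  }
}
{code}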
[jira] [Resolved] (SPARK-4558) History Server waits ~10s before starting up
[ https://issues.apache.org/jira/browse/SPARK-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4558. -- Resolution: Duplicate History Server waits ~10s before starting up Key: SPARK-4558 URL: https://issues.apache.org/jira/browse/SPARK-4558 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Priority: Minor After you call `sbin/start-history-server.sh`, it waits about 10s before actually starting up. I suspect this is a subtle bug related to log checking. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2253) [Core] Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2253: - Fix Version/s: (was: 1.3.0) [Core] Disable partial aggregation automatically when reduction factor is low - Key: SPARK-2253 URL: https://issues.apache.org/jira/browse/SPARK-2253 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Once we have seen enough rows during partial aggregation without observing any reduction, the Aggregator should simply turn off partial aggregation. This reduces memory usage for high-cardinality aggregations. This ticket is for Spark core; there is another ticket tracking the same change for SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
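An illustrative sketch of the idea in plain Scala (this is not Spark's Aggregator API; names, sample size, and threshold are made up): sample an initial batch of records into a combining map, and if the number of distinct keys is close to the number of records seen, stop combining and pass the rest through.
{code}
import scala.collection.mutable

object PartialAggSketch {
  def partialAggregate[K, V](
      records: Iterator[(K, V)],
      combine: (V, V) => V,
      sampleSize: Int = 10000,
      minReduction: Double = 0.5): Iterator[(K, V)] = {
    val buffer = mutable.HashMap.empty[K, V]
    var seen = 0
    while (records.hasNext && seen < sampleSize) {
      val (k, v) = records.next()
      buffer(k) = buffer.get(k).fold(v)(combine(_, v))
      seen += 1
    }
    val reduction = if (seen == 0) 0.0 else 1.0 - buffer.size.toDouble / seen
    if (reduction >= minReduction) {
      // Enough duplication observed: keep combining the remaining records.
      records.foreach { case (k, v) =>
        buffer(k) = buffer.get(k).fold(v)(combine(_, v))
      }
      buffer.iterator
    } else {
      // Low reduction factor: emit partial results and stop map-side combining.
      buffer.iterator ++ records
    }
  }
}
{code}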
[jira] [Resolved] (SPARK-6128) Update Spark Streaming Guide for Spark 1.3
[ https://issues.apache.org/jira/browse/SPARK-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6128. -- Resolution: Fixed Update Spark Streaming Guide for Spark 1.3 -- Key: SPARK-6128 URL: https://issues.apache.org/jira/browse/SPARK-6128 Project: Spark Issue Type: Improvement Components: Documentation, Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.3.0 Things to update - New Kafka Direct API - Python Kafka API - Add joins to streaming guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6582) Support ssl for this AvroSink in Spark Streaming External
[ https://issues.apache.org/jira/browse/SPARK-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6582: - Fix Version/s: (was: 1.4.0) Support ssl for this AvroSink in Spark Streaming External - Key: SPARK-6582 URL: https://issues.apache.org/jira/browse/SPARK-6582 Project: Spark Issue Type: Improvement Components: Streaming Reporter: SaintBacchus AvroSink already supports *ssl*, so it would be good to support *ssl* in the Spark Streaming external Flume module as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6194) collect() in PySpark will cause memory leak in JVM
[ https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6194: - Target Version/s: 1.3.0, 1.0.3, 1.1.2, 1.2.2 (was: 1.0.3, 1.1.2, 1.2.2, 1.3.0) Fix Version/s: 1.4.0 1.3.1 1.2.2 Same, restored Fix versions. I fixed my query now. collect() in PySpark will cause memory leak in JVM -- Key: SPARK-6194 URL: https://issues.apache.org/jira/browse/SPARK-6194 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical Fix For: 1.2.2, 1.3.1, 1.4.0 It could be reproduced by: {code} for i in range(40): sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect() {code} It will fail after 2 or 3 jobs, and run totally successfully if I add `gc.collect()` after each job. We could call _detach() for the JavaList returned by collect in Java, will send out a PR for this. Reported by Michael and commented by Josh: On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen joshro...@databricks.com wrote: Based on Py4J's Memory Model page (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model): Because Java objects on the Python side are involved in a circular reference (JavaObject and JavaMember reference each other), these objects are not immediately garbage collected once the last reference to the object is removed (but they are guaranteed to be eventually collected if the Python garbage collector runs before the Python program exits). In doubt, users can always call the detach function on the Python gateway to explicitly delete a reference on the Java side. A call to gc.collect() also usually works. Maybe we should be manually calling detach() when the Python-side has finished consuming temporary objects from the JVM. Do you have a small workload / configuration that reproduces the OOM which we can use to test a fix? I don't think that I've seen this issue in the past, but this might be because we mistook Java OOMs as being caused by collecting too much data rather than due to memory leaks. On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario mnaza...@palantir.com wrote: Hi Josh, I have a question about how PySpark does memory management in the Py4J bridge between the Java driver and the Python driver. I was wondering if there have been any memory problems in this system because the Python garbage collector does not collect circular references immediately and Py4J has circular references in each object it receives from Java. When I dug through the PySpark code, I seemed to find that most RDD actions return by calling collect. In collect, you end up calling the Java RDD collect and getting an iterator from that. Would this be a possible cause for a Java driver OutOfMemoryException because there are resources in Java which do not get freed up immediately? I have also seen that trying to take a lot of values from a dataset twice in a row can cause the Java driver to OOM (while just once works). Are there some other memory considerations that are relevant in the driver? Thanks, Michael -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6209: - Fix Version/s: 1.4.0 1.3.1 Oops, my bulk change shouldn't have caught this one. I see why it is unresolved but has Fix versions ExecutorClassLoader can leak connections after failing to load classes from the REPL class server - Key: SPARK-6209 URL: https://issues.apache.org/jira/browse/SPARK-6209 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Fix For: 1.3.1, 1.4.0 ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang. Here is a simple reproduction: With {code} ./bin/spark-shell --master local-cluster[8,8,512] {code} run the following command: {code} sc.parallelize(1 to 1000, 1000).map { x = try { Class.forName(some.class.that.does.not.Exist) } catch { case e: Exception = // do nothing } x }.count() {code} This job will run 253 tasks, then will completely freeze without any errors or failed tasks. It looks like the driver has 253 threads blocked in socketRead0() calls: {code} [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc 253 759 14674 {code} e.g. {code} qtp1287429402-13 daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable [0x0001159bd000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391) at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227) at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745) {code} Jstack on the executors shows blocking in loadClass / findClass, where a single thread is RUNNABLE and waiting to hear back from the driver and other executor threads are BLOCKED on object monitor synchronization at Class.forName0(). Remotely triggering a GC on a hanging executor allows the job to progress and complete more tasks before hanging again. If I repeatedly trigger GC on all of the executors, then the job runs to completion: {code} jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run {code} The culprit is a {{catch}} block that ignores all exceptions and performs no cleanup: https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94 This bug has been present since Spark 1.0.0, but I suspect that we haven't seen it before because it's pretty hard to reproduce. 
Triggering this error requires a job with tasks that trigger ClassNotFoundExceptions yet are still able to run to completion. It also requires that executors are able to leak enough open connections to exhaust the class server's Jetty thread pool limit, which requires that there are a large number of tasks (253+) and either a large number of executors or a very low amount of GC pressure on those executors (since GC will cause the leaked connections to be closed). The fix here is pretty simple: add proper resource cleanup to this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
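The description ends by saying the fix is proper resource cleanup; a minimal, self-contained sketch of that pattern follows (not the actual ExecutorClassLoader code, the names are illustrative). The stream is closed in a finally block so the HTTP connection to the class server is released even when the class bytes cannot be read.
{code}
import java.io.{ByteArrayOutputStream, InputStream}

object ClassFetchSketch {
  // Hypothetical helper standing in for the class-byte fetch path.
  def readClassBytes(open: () => InputStream): Option[Array[Byte]] = {
    var in: InputStream = null
    try {
      in = open()
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](8192)
      var n = in.read(buf)
      while (n != -1) {
        out.write(buf, 0, n)
        n = in.read(buf)
      }
      Some(out.toByteArray)
    } catch {
      case _: Exception => None // class missing or fetch failed
    } finally {
      if (in != null) in.close() // always release the connection
    }
  }
}
{code}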
[jira] [Updated] (SPARK-6006) Optimize count distinct in case of high cardinality columns
[ https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6006: - Fix Version/s: (was: 1.3.0) Optimize count distinct in case of high cardinality columns --- Key: SPARK-6006 URL: https://issues.apache.org/jira/browse/SPARK-6006 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.1 Reporter: Yash Datta Priority: Minor When there are a lot of distinct values, count distinct becomes too slow because it hashes all partial results into one map. It can be improved by creating buckets/partial maps in an intermediate stage, so that the same key from the multiple partial maps of the first stage hashes to the same bucket. We can then sum the sizes of these buckets to get the total distinct count. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
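A toy illustration of the bucketed idea in plain Scala (not the actual SQL operator; the bucket count and structure are made up): values are routed to buckets by hash, distinct values are tracked per bucket, and the bucket sizes are summed. The per-bucket sets are what the intermediate stage would build in parallel.
{code}
import scala.collection.mutable

object BucketedDistinctSketch {
  def distinctCount[T](values: Seq[T], numBuckets: Int = 16): Long = {
    val buckets = Vector.fill(numBuckets)(mutable.HashSet.empty[T])
    values.foreach { v =>
      // The same value always hashes to the same bucket, so no double counting.
      val b = ((v.hashCode % numBuckets) + numBuckets) % numBuckets
      buckets(b) += v
    }
    buckets.map(_.size.toLong).sum // exact distinct count, computed bucket-wise
  }

  def main(args: Array[String]): Unit = {
    println(distinctCount(Seq(1, 2, 2, 3, 3, 3, 4))) // 4
  }
}
{code}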
[jira] [Updated] (SPARK-5720) `Create Table Like` in HiveContext need support `like registered temporary table`
[ https://issues.apache.org/jira/browse/SPARK-5720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5720: - Fix Version/s: (was: 1.3.0) `Create Table Like` in HiveContext need support `like registered temporary table` - Key: SPARK-5720 URL: https://issues.apache.org/jira/browse/SPARK-5720 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Li Sheng Original Estimate: 72h Remaining Estimate: 72h -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5192) Parquet fails to parse schema contains '\r'
[ https://issues.apache.org/jira/browse/SPARK-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5192: - Fix Version/s: (was: 1.3.0) Parquet fails to parse schema contains '\r' --- Key: SPARK-5192 URL: https://issues.apache.org/jira/browse/SPARK-5192 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: windows7 + Intellj idea 13.0.2 Reporter: cen yuhai Priority: Minor I think this is actually a bug in Parquet. When I debugged 'ParquetTestData', I found the exception below. So I downloaded the source of MessageTypeParser; the function 'isWhitespace' does not check for '\r': private boolean isWhitespace(String t) { return t.equals(" ") || t.equals("\t") || t.equals("\n"); } So I replaced all '\r' to work around this issue: val subTestSchema = """message myrecord { optional boolean myboolean; optional int64 mylong; }""".replaceAll("\r", "") at line 0: message myrecord { at parquet.schema.MessageTypeParser.asRepetition(MessageTypeParser.java:203) at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:101) at parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96) at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89) at parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79) at org.apache.spark.sql.parquet.ParquetTestData$.writeFile(ParquetTestData.scala:221) at org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:92) at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) at org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:85) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) at org.apache.spark.sql.parquet.ParquetQuerySuite.run(ParquetQuerySuite.scala:85) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5264) Support `drop temporary table [if exists]` DDL command
[ https://issues.apache.org/jira/browse/SPARK-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5264: - Fix Version/s: (was: 1.3.0) Support `drop temporary table [if exists]` DDL command --- Key: SPARK-5264 URL: https://issues.apache.org/jira/browse/SPARK-5264 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0 Reporter: Li Sheng Priority: Minor Original Estimate: 72h Remaining Estimate: 72h Support `drop table` DDL command i.e DROP [TEMPORARY] TABLE [IF EXISTS]tbl_name -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4752: - Fix Version/s: (was: 1.3.0) Classifier based on artificial neural network - Key: SPARK-4752 URL: https://issues.apache.org/jira/browse/SPARK-4752 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Alexander Ulanov Original Estimate: 168h Remaining Estimate: 168h Implement classifier based on artificial neural network (ANN). Requirements: 1) Use the existing artificial neural network implementation https://issues.apache.org/jira/browse/SPARK-2352, https://github.com/apache/spark/pull/1290 2) Extend MLlib ClassificationModel trait, 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
[ https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5362: - Fix Version/s: (was: 1.3.0) Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Original Estimate: 24h Remaining Estimate: 24h Currently, Gradient and Optimizer interfaces support data in form of RDD[Double, Vector] which refers to label and features. This limits its application to classification problems. For example, artificial neural network demands Vector as output (instead of label: Double). Moreover, current interface does not support data batches. I propose to replace label: Double with output: Vector. It enables passing generic output instead of label and also passing data and output batches stored in corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5522) Accelerate the History Server start
[ https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5522. -- Resolution: Fixed Looks resolved by https://github.com/apache/spark/pull/4525 but just never got marked as such. Accelerate the History Server start --- Key: SPARK-5522 URL: https://issues.apache.org/jira/browse/SPARK-5522 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Liangliang Gu Assignee: Liangliang Gu Fix For: 1.4.0 When starting the history server, all the log files will be fetched and parsed in order to get the applications' meta data e.g. App Name, Start Time, Duration, etc. In our production cluster, there exist 2600 log files (160G) in HDFS and it costs 3 hours to restart the history server, which is a little bit too long for us. It would be better, if the history server can show logs with missing information during start-up and fill the missing information after fetching and parsing a log file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests
[ https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6470: --- Assignee: Sandy Ryza (was: Apache Spark) Allow Spark apps to put YARN node labels in their requests -- Key: SPARK-6470 URL: https://issues.apache.org/jira/browse/SPARK-6470 Project: Spark Issue Type: Improvement Components: YARN Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests
[ https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385363#comment-14385363 ] Apache Spark commented on SPARK-6470: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/5242 Allow Spark apps to put YARN node labels in their requests -- Key: SPARK-6470 URL: https://issues.apache.org/jira/browse/SPARK-6470 Project: Spark Issue Type: Improvement Components: YARN Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests
[ https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6470: --- Assignee: Apache Spark (was: Sandy Ryza) Allow Spark apps to put YARN node labels in their requests -- Key: SPARK-6470 URL: https://issues.apache.org/jira/browse/SPARK-6470 Project: Spark Issue Type: Improvement Components: YARN Reporter: Sandy Ryza Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6587) Inferring schema for case class hierarchy fails with mysterious message
Spiro Michaylov created SPARK-6587: -- Summary: Inferring schema for case class hierarchy fails with mysterious message Key: SPARK-6587 URL: https://issues.apache.org/jira/browse/SPARK-6587 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: At least Windows 8, Scala 2.11.2. Reporter: Spiro Michaylov (Don't know if this is a functionality bug, error reporting bug or an RFE ...) I define the following hierarchy: {code} private abstract class MyHolder private case class StringHolder(s: String) extends MyHolder private case class IntHolder(i: Int) extends MyHolder private case class BooleanHolder(b: Boolean) extends MyHolder {code} and a top level case class: {code} private case class Thing(key: Integer, foo: MyHolder) {code} When I try to convert it: {code} val things = Seq( Thing(1, IntHolder(42)), Thing(2, StringHolder(hello)), Thing(3, BooleanHolder(false)) ) val thingsDF = sc.parallelize(things, 4).toDF() thingsDF.registerTempTable(things) val all = sqlContext.sql(SELECT * from things) {code} I get the following stack trace: {quote} Exception in thread main scala.MatchError: sql.CaseClassSchemaProblem.MyHolder (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:159) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:157) at scala.collection.immutable.List.map(List.scala:276) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:157) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:107) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:312) at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:250) at sql.CaseClassSchemaProblem$.main(CaseClassSchemaProblem.scala:35) at sql.CaseClassSchemaProblem.main(CaseClassSchemaProblem.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {quote} I wrote this to answer [a question on StackOverflow|http://stackoverflow.com/questions/29310405/what-is-the-right-way-to-represent-an-any-type-in-spark-sql] which uses a much simpler approach and suffers the same problem. Looking at what seems to me to be the [relevant unit test suite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/ScalaReflectionRelationSuite.scala] I see that this case is not covered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-2426: Affects Version/s: (was: 1.3.0) 1.4.0 Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6588) Private VPC's and subnets currently don't work with the Spark ec2 script
Michelangelo D'Agostino created SPARK-6588: -- Summary: Private VPC's and subnets currently don't work with the Spark ec2 script Key: SPARK-6588 URL: https://issues.apache.org/jira/browse/SPARK-6588 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Michelangelo D'Agostino Priority: Minor The spark_ec2.py script currently references the ip_address and public_dns_name attributes of an instance. On private networks, these fields aren't set, so we have problems. The solution, which I've just finished coding up, is to introduce a --private-ips flag that instead refers to the private_ip_address attribute in both cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6588) Private VPC's and subnets currently don't work with the Spark ec2 script
[ https://issues.apache.org/jira/browse/SPARK-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6588. -- Resolution: Duplicate Also SPARK-5246, SPARK-6220. Have a look at the existing JIRAs and see if you can resolve one of them to this effect. Private VPC's and subnets currently don't work with the Spark ec2 script Key: SPARK-6588 URL: https://issues.apache.org/jira/browse/SPARK-6588 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Michelangelo D'Agostino Priority: Minor The spark_ec2.py script currently references the ip_address and public_dns_name attributes of an instance. On private networks, these fields aren't set, so we have problems. The solution, which I've just finished coding up, is to introduce a --private-ips flag that instead refers to the private_ip_address attribute in both cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6588) Private VPC's and subnets currently don't work with the Spark ec2 script
[ https://issues.apache.org/jira/browse/SPARK-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385435#comment-14385435 ] Apache Spark commented on SPARK-6588: - User 'mdagost' has created a pull request for this issue: https://github.com/apache/spark/pull/5244 Private VPC's and subnets currently don't work with the Spark ec2 script Key: SPARK-6588 URL: https://issues.apache.org/jira/browse/SPARK-6588 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.3.0 Reporter: Michelangelo D'Agostino Priority: Minor The spark_ec2.py script currently references the ip_address and public_dns_name attributes of an instance. On private networks, these fields aren't set, so we have problems. The solution, which I've just finished coding up, is to introduce a --private-ips flag that instead refers to the private_ip_address attribute in both cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5894) Add PolynomialMapper
[ https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5894: --- Assignee: Apache Spark Add PolynomialMapper Key: SPARK-5894 URL: https://issues.apache.org/jira/browse/SPARK-5894 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Apache Spark `PolynomialMapper` takes a vector column and outputs a vector column with polynomial feature mapping. {code} val poly = new PolynomialMapper() .setInputCol("features") .setDegree(2) .setOutputCols("polyFeatures") {code} It should handle the output feature names properly. Maybe we can find a better name for it instead of calling it `PolynomialMapper`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
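To make the transformation concrete, this is what a degree-2 polynomial feature mapping does to a two-dimensional input vector; whether a bias term is included, and the exact ordering of terms, is up to the implementation:
{code}
(x_1, x_2) \mapsto (x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2)
{code}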
[jira] [Commented] (SPARK-5894) Add PolynomialMapper
[ https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385446#comment-14385446 ] Apache Spark commented on SPARK-5894: - User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/5245 Add PolynomialMapper Key: SPARK-5894 URL: https://issues.apache.org/jira/browse/SPARK-5894 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `PolynomialMapper` takes a vector column and outputs a vector column with polynomial feature mapping. {code} val poly = new PolynomialMapper() .setInputCol("features") .setDegree(2) .setOutputCols("polyFeatures") {code} It should handle the output feature names properly. Maybe we can find a better name for it instead of calling it `PolynomialMapper`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5894) Add PolynomialMapper
[ https://issues.apache.org/jira/browse/SPARK-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5894: --- Assignee: (was: Apache Spark) Add PolynomialMapper Key: SPARK-5894 URL: https://issues.apache.org/jira/browse/SPARK-5894 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `PolynomialMapper` takes a vector column and outputs a vector column with polynomial feature mapping. {code} val poly = new PolynomialMapper() .setInputCol("features") .setDegree(2) .setOutputCols("polyFeatures") {code} It should handle the output feature names properly. Maybe we can find a better name for it instead of calling it `PolynomialMapper`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6582) Support ssl for this AvroSink in Spark Streaming External
SaintBacchus created SPARK-6582: --- Summary: Support ssl for this AvroSink in Spark Streaming External Key: SPARK-6582 URL: https://issues.apache.org/jira/browse/SPARK-6582 Project: Spark Issue Type: Improvement Reporter: SaintBacchus Fix For: 1.4.0 AvroSink already supports *ssl*, so it would be good to support *ssl* in the Spark Streaming external Flume module as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6583) Support aggregated function in order by
Yadong Qi created SPARK-6583: Summary: Support aggregated function in order by Key: SPARK-6583 URL: https://issues.apache.org/jira/browse/SPARK-6583 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.
[ https://issues.apache.org/jira/browse/SPARK-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6584: --- Assignee: Apache Spark Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location. --- Key: SPARK-6584 URL: https://issues.apache.org/jira/browse/SPARK-6584 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.4.0 Reporter: SaintBacchus Assignee: Apache Spark The function *RDD.getPreferredLocations* can only express host-level preferred locations. If an *RDD* (such as a BlockRDD) wants to be scheduled by executor, Spark can do nothing about it. So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that are aware of a partition's executor location. This mechanism can avoid data transfer when there are many executors on the same host. I think it is very useful especially for *SparkStreaming*, since the *Receiver* saves data into the *BlockManager*, which then becomes a BlockRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.
[ https://issues.apache.org/jira/browse/SPARK-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385221#comment-14385221 ] Apache Spark commented on SPARK-6584: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/5240 Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location. --- Key: SPARK-6584 URL: https://issues.apache.org/jira/browse/SPARK-6584 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.4.0 Reporter: SaintBacchus The function *RDD.getPreferredLocations* can only express host-level preferred locations. If an *RDD* (such as a BlockRDD) wants to be scheduled by executor, Spark can do nothing about it. So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that are aware of a partition's executor location. This mechanism can avoid data transfer when there are many executors on the same host. I think it is very useful especially for *SparkStreaming*, since the *Receiver* saves data into the *BlockManager*, which then becomes a BlockRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.
[ https://issues.apache.org/jira/browse/SPARK-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6584: --- Assignee: (was: Apache Spark) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location. --- Key: SPARK-6584 URL: https://issues.apache.org/jira/browse/SPARK-6584 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.4.0 Reporter: SaintBacchus The function *RDD.getPreferredLocations* can only express host-level preferred locations. If an *RDD* (such as a BlockRDD) wants to be scheduled by executor, Spark can do nothing about it. So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that are aware of a partition's executor location. This mechanism can avoid data transfer when there are many executors on the same host. I think it is very useful especially for *SparkStreaming*, since the *Receiver* saves data into the *BlockManager*, which then becomes a BlockRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd
[ https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6464: --- Assignee: (was: Apache Spark) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd --- Key: SPARK-6464 URL: https://issues.apache.org/jira/browse/SPARK-6464 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: SaintBacchus Attachments: screenshot-1.png The transformation *coalesce* is commonly used to expand or reduce the number of partitions in order to gain performance. But *coalesce* can't guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a large amount of network transfer. In the scenario mentioned in the title, a +small and cached rdd+, we want to coalesce all the partitions on the same executor into one partition and make sure the child partition is executed on that executor. This avoids network transfer, reduces task scheduling overhead, and frees CPU cores for other work. In this scenario, our performance improved by 20%. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd
[ https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6464: --- Assignee: Apache Spark Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd --- Key: SPARK-6464 URL: https://issues.apache.org/jira/browse/SPARK-6464 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: SaintBacchus Assignee: Apache Spark Attachments: screenshot-1.png The transformation *coalesce* is commonly used to expand or reduce the number of partitions in order to gain performance. But *coalesce* can't guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a large amount of network transfer. In the scenario mentioned in the title, a +small and cached rdd+, we want to coalesce all the partitions on the same executor into one partition and make sure the child partition is executed on that executor. This avoids network transfer, reduces task scheduling overhead, and frees CPU cores for other work. In this scenario, our performance improved by 20%. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5338) Support cluster mode with Mesos
[ https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5338: --- Assignee: (was: Apache Spark) Support cluster mode with Mesos --- Key: SPARK-5338 URL: https://issues.apache.org/jira/browse/SPARK-5338 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently using Spark with Mesos, the only supported deployment is client mode. It is also useful to have a cluster mode deployment that can be shared and long running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6584) Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location.
SaintBacchus created SPARK-6584: --- Summary: Provide ExecutorPrefixTaskLocation to support the rdd which can be aware of partition's executor location. Key: SPARK-6584 URL: https://issues.apache.org/jira/browse/SPARK-6584 Project: Spark Issue Type: Sub-task Affects Versions: 1.4.0 Reporter: SaintBacchus The function *RDD.getPreferredLocations* can only express host-level preferred locations. If an *RDD* (such as a BlockRDD) wants to be scheduled by executor, Spark can do nothing about it. So I want to provide *ExecutorPrefixTaskLocation* to support RDDs that are aware of a partition's executor location. This mechanism can avoid data transfer when there are many executors on the same host. I think it is very useful especially for *SparkStreaming*, since the *Receiver* saves data into the *BlockManager*, which then becomes a BlockRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5338) Support cluster mode with Mesos
[ https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5338: --- Assignee: Apache Spark Support cluster mode with Mesos --- Key: SPARK-5338 URL: https://issues.apache.org/jira/browse/SPARK-5338 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Assignee: Apache Spark Currently using Spark with Mesos, the only supported deployment is client mode. It is also useful to have a cluster mode deployment that can be shared and long running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed is some evn.
[ https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6585: --- Assignee: Apache Spark FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed is some evn. - Key: SPARK-6585 URL: https://issues.apache.org/jira/browse/SPARK-6585 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: June Assignee: Apache Spark Priority: Minor In my test machine, FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) case throw SSLException not SSLHandshakeException, suggest change to catch SSLException to improve test case 's robustness. [info] - HttpFileServer should not work with SSL when the server is untrusted *** FAILED *** (69 milliseconds) [info] Expected exception javax.net.ssl.SSLHandshakeException to be thrown, but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:1004) [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34) [info] at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
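A minimal sketch of the suggested change using ScalaTest's intercept with the broader exception type; the throw below just stands in for the real file-transfer test body. Since SSLHandshakeException is a subclass of SSLException, intercepting SSLException covers both behaviours and makes the test robust across environments.
{code}
import javax.net.ssl.{SSLException, SSLHandshakeException}
import org.scalatest.Assertions.intercept

object SslInterceptSketch {
  def main(args: Array[String]): Unit = {
    intercept[SSLException] {
      // Placeholder for the untrusted-server file transfer in the real test.
      throw new SSLHandshakeException("untrusted server")
    }
  }
}
{code}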
[jira] [Assigned] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5277: --- Assignee: Apache Spark SparkSqlSerializer does not register user specified KryoRegistrators - Key: SPARK-5277 URL: https://issues.apache.org/jira/browse/SPARK-5277 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0 Reporter: Max Seiden Assignee: Apache Spark Although the SparkSqlSerializer class extends the KryoSerializer in core, it's overridden newKryo() does not call super.newKryo(). This results in inconsistent serializer behaviors depending on whether a KryoSerializer instance or a SparkSqlSerializer instance is used. This may also be related to the TODO in KryoResourcePool, which uses KryoSerializer instead of SparkSqlSerializer due to yet-to-be-investigated test failures. An example of the divergence in behavior: The Exchange operator creates a new SparkSqlSerializer instance (with an empty conf; another issue) when it is constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the resource pool (see above). The result is that the serialized in-memory columns are created using the user provided serializers / registrators, while serialization during exchange does not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
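A hedged sketch of the direction the report points at (the class name and registration details are illustrative, not the actual Spark SQL code): start from super.newKryo(), which applies spark.kryo.registrator and other user settings, and add the SQL-specific registrations on top.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

class SqlKryoSerializerSketch(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    // super.newKryo() registers the user-specified KryoRegistrators.
    val kryo = super.newKryo()
    // kryo.register(classOf[SomeSqlInternalClass])  // SQL-specific classes, illustrative
    kryo
  }
}
{code}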
[jira] [Commented] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385172#comment-14385172 ] Apache Spark commented on SPARK-5277: - User 'mhseiden' has created a pull request for this issue: https://github.com/apache/spark/pull/5237 SparkSqlSerializer does not register user specified KryoRegistrators - Key: SPARK-5277 URL: https://issues.apache.org/jira/browse/SPARK-5277 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0 Reporter: Max Seiden Although the SparkSqlSerializer class extends the KryoSerializer in core, its overridden newKryo() does not call super.newKryo(). This results in inconsistent serializer behavior depending on whether a KryoSerializer instance or a SparkSqlSerializer instance is used. This may also be related to the TODO in KryoResourcePool, which uses KryoSerializer instead of SparkSqlSerializer due to yet-to-be-investigated test failures. An example of the divergence in behavior: the Exchange operator creates a new SparkSqlSerializer instance (with an empty conf; another issue) when it is constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the resource pool (see above). The result is that the serialized in-memory columns are created using the user-provided serializers / registrators, while serialization during exchange does not use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5277: --- Assignee: (was: Apache Spark) SparkSqlSerializer does not register user specified KryoRegistrators - Key: SPARK-5277 URL: https://issues.apache.org/jira/browse/SPARK-5277 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0 Reporter: Max Seiden Although the SparkSqlSerializer class extends the KryoSerializer in core, its overridden newKryo() does not call super.newKryo(). This results in inconsistent serializer behavior depending on whether a KryoSerializer instance or a SparkSqlSerializer instance is used. This may also be related to the TODO in KryoResourcePool, which uses KryoSerializer instead of SparkSqlSerializer due to yet-to-be-investigated test failures. An example of the divergence in behavior: the Exchange operator creates a new SparkSqlSerializer instance (with an empty conf; another issue) when it is constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the resource pool (see above). The result is that the serialized in-memory columns are created using the user-provided serializers / registrators, while serialization during exchange does not use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385183#comment-14385183 ] Joseph K. Bradley commented on SPARK-6577: -- Good point. [~mengxr], do we want to require scipy and add UDTs which handle numpy and scipy dense and sparse vectors and matrices? Or do we want to add our own SparseMatrix? SparseMatrix should be supported in PySpark --- Key: SPARK-6577 URL: https://issues.apache.org/jira/browse/SPARK-6577 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
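For reference, Scala MLlib already exposes a CSC-format SparseMatrix through Matrices.sparse, so whichever option is chosen, the PySpark API would presumably mirror something like the following spark-shell snippet (the matrix values are arbitrary illustration):
{code}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// A 3x3 diagonal matrix in compressed sparse column (CSC) form.
// colPtrs has numCols + 1 entries; rowIndices and values hold one
// entry per nonzero, column by column.
val sm: Matrix = Matrices.sparse(
  3, 3,                  // numRows, numCols
  Array(0, 1, 2, 3),     // colPtrs: column j spans indices colPtrs(j) until colPtrs(j + 1)
  Array(0, 1, 2),        // row index of each nonzero
  Array(9.0, 8.0, 7.0))  // nonzero values
{code}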
[jira] [Assigned] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed in some environments.
[ https://issues.apache.org/jira/browse/SPARK-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6585: --- Assignee: (was: Apache Spark) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed in some environments. - Key: SPARK-6585 URL: https://issues.apache.org/jira/browse/SPARK-6585 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: June Priority: Minor On my test machine, the FileServerSuite test case (HttpFileServer should not work with SSL when the server is untrusted) throws SSLException rather than SSLHandshakeException; I suggest catching SSLException to improve the test case's robustness. [info] - HttpFileServer should not work with SSL when the server is untrusted *** FAILED *** (69 milliseconds) [info] Expected exception javax.net.ssl.SSLHandshakeException to be thrown, but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:1004) [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34) [info] at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Seiden updated SPARK-5277: -- Affects Version/s: (was: 1.2.0) 1.2.1 1.3.0 SparkSqlSerializer does not register user specified KryoRegistrators - Key: SPARK-5277 URL: https://issues.apache.org/jira/browse/SPARK-5277 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0 Reporter: Max Seiden Although the SparkSqlSerializer class extends the KryoSerializer in core, its overridden newKryo() does not call super.newKryo(). This results in inconsistent serializer behavior depending on whether a KryoSerializer instance or a SparkSqlSerializer instance is used. This may also be related to the TODO in KryoResourcePool, which uses KryoSerializer instead of SparkSqlSerializer due to yet-to-be-investigated test failures. An example of the divergence in behavior: the Exchange operator creates a new SparkSqlSerializer instance (with an empty conf; another issue) when it is constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the resource pool (see above). The result is that the serialized in-memory columns are created using the user-provided serializers / registrators, while serialization during exchange does not use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6464) Add a new RDD transformation named processCoalesce to handle small and cached RDDs
[ https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-6464: Affects Version/s: (was: 1.3.0) 1.4.0 Add a new RDD transformation named processCoalesce to handle small and cached RDDs --- Key: SPARK-6464 URL: https://issues.apache.org/jira/browse/SPARK-6464 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: SaintBacchus Attachments: screenshot-1.png Nowadays, the *coalesce* transformation is commonly used to expand or reduce the number of partitions in order to gain good performance. But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to heavy network transfer. In some scenarios, such as the +small and cached rdd+ case in the title, we want to coalesce all the partitions on one executor into a single partition and make sure that child partition is executed on that same executor. This avoids network transfer, reduces task-scheduling overhead, and frees CPU cores for other work; see the sketch after this message. In this scenario, our performance improved by 20%. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
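To make the locality concern concrete: with the current API, coalesce gives no guarantee that a merged child partition lands on the executor holding its cached parent blocks, so cached data may be fetched remotely. A minimal spark-shell style sketch of the usage pattern the proposal targets, using only the existing coalesce API (partition counts are arbitrary illustration, not the proposed processCoalesce):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("coalesce-locality-sketch"))

// A small RDD, cached across the cluster with many partitions.
val small = sc.parallelize(1 to 1000, 32).cache()
small.count()  // materialize the cache on the executors

// coalesce merges partitions, but a child partition may be scheduled on an
// executor that caches none of its parents, forcing remote block fetches.
val merged = small.coalesce(4, shuffle = false)
merged.count()
{code}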
[jira] [Created] (SPARK-6585) FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed in some environments.
June created SPARK-6585: --- Summary: FileServerSuite.test (HttpFileServer should not work with SSL when the server is untrusted) failed in some environments. Key: SPARK-6585 URL: https://issues.apache.org/jira/browse/SPARK-6585 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: June Priority: Minor On my test machine, the FileServerSuite test case (HttpFileServer should not work with SSL when the server is untrusted) throws SSLException rather than SSLHandshakeException; I suggest catching SSLException to improve the test case's robustness. [info] - HttpFileServer should not work with SSL when the server is untrusted *** FAILED *** (69 milliseconds) [info] Expected exception javax.net.ssl.SSLHandshakeException to be thrown, but javax.net.ssl.SSLException was thrown. (FileServerSuite.scala:231) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.intercept(Assertions.scala:1004) [info] at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply$mcV$sp(FileServerSuite.scala:231) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.apache.spark.FileServerSuite$$anonfun$15.apply(FileServerSuite.scala:224) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.apache.spark.FileServerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(FileServerSuite.scala:34) [info] at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385223#comment-14385223 ] Manoj Kumar commented on SPARK-6577: Ah, I just noticed that SciPy is an optional dependency. In any case, I believe that having an if _have_scipy / else clause would lead to more lines of code to maintain. We could either make SciPy a hard dependency, which would mean SparseMatrix would be a wrapper around scipy CSR routines, or we could just implement our own methods. SparseMatrix should be supported in PySpark --- Key: SPARK-6577 URL: https://issues.apache.org/jira/browse/SPARK-6577 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org