[jira] [Updated] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
[ https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaik Idris Ali updated SPARK-7706: --- Labels: oozie yarn (was: ) Allow setting YARN_CONF_DIR from spark argument --- Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Labels: oozie, yarn Currently in SparkSubmitArguments.scala, when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV. However, we should additionally allow passing YARN_CONF_DIR as a command line argument; this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. The reason is that the Oozie launcher app starts in one of the containers assigned by the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for all the nodes in the cluster. Just passing an argument like -yarnconfdir with the conf dir (e.g. /etc/hadoop/conf) would avoid setting the ENV variable. This is blocking us from onboarding Spark from Oozie or Falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7704) Updating Programming Guides per SPARK-4397
[ https://issues.apache.org/jira/browse/SPARK-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7704: --- Assignee: (was: Apache Spark) Updating Programming Guides per SPARK-4397 -- Key: SPARK-7704 URL: https://issues.apache.org/jira/browse/SPARK-7704 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.3.0 Reporter: Daisuke Kobayashi The change per SPARK-4397 allows implicit objects in SparkContext to be found by the compiler automatically, so we no longer need to import SparkContext._ explicitly and can remove the statements about implicit conversions from the latest Programming Guides (1.3.0 and higher): http://spark.apache.org/docs/latest/programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
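For illustration, a minimal sketch of the guide change discussed above (the object name, app name and data are made up; assumes Spark 1.3.0 or later):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Before SPARK-4397 the following import was required to bring in the implicit
// conversions that add pair-RDD operations such as reduceByKey:
//   import org.apache.spark.SparkContext._
// From 1.3.0 on the implicits are found by the compiler automatically, so it can be dropped.
object WordCountExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[2]"))
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // compiles without importing SparkContext._ on 1.3+
    counts.collect().foreach(println)
    sc.stop()
  }
}
{code}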
[jira] [Created] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
Shaik Idris Ali created SPARK-7706: -- Summary: Allow setting YARN_CONF_DIR from spark argument Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Currently in SparkSubmitArguments.scala, when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV. However, we should additionally allow passing YARN_CONF_DIR as a command line argument; this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. The reason is that the Oozie launcher app starts in one of the containers assigned by the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for all the nodes in the cluster. Just passing an argument like -yarnconfdir with the conf dir (e.g. /etc/hadoop/conf) would avoid setting the ENV variable. This is blocking us from onboarding Spark from Oozie or Falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
[ https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548106#comment-14548106 ] Sean Owen commented on SPARK-7706: -- Can't you just use {{YARN_CONF_DIR=... command ...}}? Allow setting YARN_CONF_DIR from spark argument --- Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Labels: oozie, yarn Currently in SparkSubmitArguments.scala, when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV. However, we should additionally allow passing YARN_CONF_DIR as a command line argument; this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. The reason is that the Oozie launcher app starts in one of the containers assigned by the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for all the nodes in the cluster. Just passing an argument like -yarnconfdir with the conf dir (e.g. /etc/hadoop/conf) would avoid setting the ENV variable. This is blocking us from onboarding Spark from Oozie or Falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
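To make the two options concrete, a small sketch follows. The environment-prefix form is the workaround Sean mentions and works today; the --yarn-conf-dir flag and the lookup order are only the proposal in this ticket, not existing Spark behaviour:
{code}
// Sketch of the proposed lookup order in SparkSubmitArguments (illustrative only):
// an explicit command-line value wins, then YARN_CONF_DIR, then HADOOP_CONF_DIR.
def resolveYarnConfDir(fromCli: Option[String], env: Map[String, String]): Option[String] =
  fromCli
    .orElse(env.get("YARN_CONF_DIR"))
    .orElse(env.get("HADOOP_CONF_DIR"))

// Workaround available today, no code change needed:
//   YARN_CONF_DIR=/etc/hadoop/conf spark-submit --master yarn-cluster ...
// Proposed form (hypothetical flag name):
//   spark-submit --yarn-conf-dir /etc/hadoop/conf --master yarn-cluster ...
{code}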
[jira] [Resolved] (SPARK-6657) Fix Python doc build warnings
[ https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6657. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6221 [https://github.com/apache/spark/pull/6221] Fix Python doc build warnings - Key: SPARK-6657 URL: https://issues.apache.org/jira/browse/SPARK-6657 Project: Spark Issue Type: Documentation Components: Documentation, MLlib, PySpark, SQL, Streaming Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xiangrui Meng Priority: Trivial Fix For: 1.4.0 Reported by [~rxin] {code} /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends without a blank line; unexpected unindent. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected indentation. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase reference start-string without end-string. /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title underline too short. pyspark.streaming.kafka module {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547956#comment-14547956 ] meiyoula commented on SPARK-7699: - If so, I think spark.dynamicAllocation.initialExecutors is no longer needed. Config spark.dynamicAllocation.initialExecutors has no effect Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run the spark-shell with above configurations, the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7705) Cleanup of .sparkStaging directory fails if application is killed
[ https://issues.apache.org/jira/browse/SPARK-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547954#comment-14547954 ] Wilfred Spiegelenburg commented on SPARK-7705: -- I think the check that we currently have in ApplicationMaster.scala on line [#120|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L120] is far too limiting. The {{cleanupStagingDir(fs)}} call should be moved out of the {{if (!unregistered)}} block. I have not tested this yet, but it seems far more logical, and since we're in the shutdown hook it should also catch our case: {code} if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { // we only want to unregister if we don't want the RM to retry if (!unregistered) { unregister(finalStatus, finalMsg) } // Since we're done we should clean up the staging directory cleanupStagingDir(fs) } {code} I am not sure how to create a PR to check and discuss this change. Cleanup of .sparkStaging directory fails if application is killed - Key: SPARK-7705 URL: https://issues.apache.org/jira/browse/SPARK-7705 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Wilfred Spiegelenburg Priority: Minor When a streaming application is killed while running on YARN the .sparkStaging directory is not cleaned up. Setting spark.yarn.preserve.staging.files=false does not help and still leaves the files around. The changes in SPARK-7503 do not catch this case since there is no exception in the shutdown. When the application gets killed the AM gets told to shutdown and the shutdown hook is run but the clean up is not triggered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7565) Broken maps in jsonRDD
[ https://issues.apache.org/jira/browse/SPARK-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547957#comment-14547957 ] Paul Colomiets commented on SPARK-7565: --- The pull request is pretty trivial. I've tested it and it solves the problem for me. Broken maps in jsonRDD -- Key: SPARK-7565 URL: https://issues.apache.org/jira/browse/SPARK-7565 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Paul Colomiets When I use the following JSON: {code} {"obj": {"a": "hello"}} {code} And load it with the following Python code: {code} tf = sc.textFile('test.json') v = sqlContext.jsonRDD(tf, StructType([StructField("obj", MapType(StringType(), StringType()), True)])) v.save('test.parquet', mode='overwrite') {code} I get the following error on the spark master branch: {code} Py4JJavaError: An error occurred while calling o78.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 1 times, most recent failure: Lost task 1.0 in stage 5.0 (TID 11, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.types.UTF8String at org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:201) at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192) at org.apache.spark.sql.parquet.RowWriteSupport$$anonfun$writeMap$2.apply(ParquetTableSupport.scala:284) at org.apache.spark.sql.parquet.RowWriteSupport$$anonfun$writeMap$2.apply(ParquetTableSupport.scala:281) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.immutable.Map$Map1.foreach(Map.scala:109) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.sql.parquet.RowWriteSupport.writeMap(ParquetTableSupport.scala:281) at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171) at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134) at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:699) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:717) at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:717) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} This worked well in Spark 1.3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
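As a rough illustration of where the cast fails, not the actual patch: in 1.4 the SQL engine keeps strings internally as org.apache.spark.sql.types.UTF8String, so a map built by jsonRDD that still holds plain java.lang.String values blows up in the Parquet writer above. A conversion along the following lines, applied to map keys and values before the row reaches the writer, is the kind of change the pull request needs (the helper names are made up, and the apply-style UTF8String constructor is assumed from the 1.4-era API):
{code}
import org.apache.spark.sql.types.UTF8String

// Wrap plain JVM strings in the internal representation expected by the Parquet writer.
def toCatalystString(value: Any): Any = value match {
  case s: String => UTF8String(s)
  case other     => other
}

// Convert both keys and values of a map column before it is written out.
def toCatalystMap(m: Map[Any, Any]): Map[Any, Any] =
  m.map { case (k, v) => toCatalystString(k) -> toCatalystString(v) }
{code}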
[jira] [Updated] (SPARK-7705) Cleanup of .sparkStaging directory fails if application is killed
[ https://issues.apache.org/jira/browse/SPARK-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7705: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Minor since this just amounts to leaving around directories. Cleanup of .sparkStaging directory fails if application is killed - Key: SPARK-7705 URL: https://issues.apache.org/jira/browse/SPARK-7705 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Wilfred Spiegelenburg Priority: Minor When a streaming application is killed while running on YARN the .sparkStaging directory is not cleaned up. Setting spark.yarn.preserve.staging.files=false does not help and still leaves the files around. The changes in SPARK-7503 do not catch this case since there is no exception in the shutdown. When the application gets killed the AM gets told to shutdown and the shutdown hook is run but the clean up is not triggered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master
[ https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-5265: -- Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master -- Key: SPARK-5265 URL: https://issues.apache.org/jira/browse/SPARK-5265 Project: Spark Issue Type: Bug Components: Deploy Reporter: Roque Vassal'lo Labels: cluster, spark-submit, standalone, zookeeper Hi, this is my first JIRA here, so I hope it is clear enough. I'm using Spark 1.2.0 and trying to submit an application on a Spark Standalone cluster in cluster deploy mode with supervise. Standalone cluster is running in high availability mode, using Zookeeper to provide leader election between three available Masters (named master1, master2 and master3). As read at Spark's documentation, to register a Worker to the Standalone cluster, I provide complete cluster info as the spark route. I mean, spark://master1:7077,master2:7077,master3:7077 and that route is parsed and three attempts are launched, first one to master1:7077, second one to master2:7077 and third one to master3:7077. This works great! But if I try to do the same while submitting applications, it fails. I mean, if I provide complete cluster info as the --master option to spark-submit script, it throws an exception because it tries to connect as it was a single node. Example: spark-submit --class org.apache.spark.examples.SparkPi --master spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster --supervise examples.jar 100 This is the output I got: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mytest); users with modify permissions: Set(mytest) 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on port 53930. 15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: spark://master1:7077,master2:7077,master3:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://master1:7077,master2:7077,master3:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 
9 more Shouldn't it parse it as on Worker registration? It will not force client to know which is the current active Master of the Standalone cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
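A minimal sketch of the parsing the report asks for, mirroring what Worker registration already does with a comma-separated HA master URL (illustrative only, not the actual ClientActor change):
{code}
// Split spark://host1:port1,host2:port2,... into one URL per master so the client
// can try each of them instead of rejecting the whole string as invalid.
def parseStandaloneMasters(sparkUrl: String): Seq[String] = {
  require(sparkUrl.startsWith("spark://"), s"Invalid master URL: $sparkUrl")
  sparkUrl.stripPrefix("spark://")
    .split(",")
    .map(hostPort => s"spark://$hostPort")
    .toSeq
}

// parseStandaloneMasters("spark://master1:7077,master2:7077,master3:7077")
//   == Seq("spark://master1:7077", "spark://master2:7077", "spark://master3:7077")
{code}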
[jira] [Assigned] (SPARK-7704) Updating Programming Guides per SPARK-4397
[ https://issues.apache.org/jira/browse/SPARK-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7704: --- Assignee: Apache Spark Updating Programming Guides per SPARK-4397 -- Key: SPARK-7704 URL: https://issues.apache.org/jira/browse/SPARK-7704 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.3.0 Reporter: Daisuke Kobayashi Assignee: Apache Spark The change per SPARK-4397 allows implicit objects in SparkContext to be found by the compiler automatically, so we no longer need to import SparkContext._ explicitly and can remove the statements about implicit conversions from the latest Programming Guides (1.3.0 and higher): http://spark.apache.org/docs/latest/programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7705) Cleanup of .sparkStaging directory fails if application is killed
Wilfred Spiegelenburg created SPARK-7705: Summary: Cleanup of .sparkStaging directory fails if application is killed Key: SPARK-7705 URL: https://issues.apache.org/jira/browse/SPARK-7705 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Wilfred Spiegelenburg When a streaming application is killed while running on YARN the .sparkStaging directory is not cleaned up. Setting spark.yarn.preserve.staging.files=false does not help and still leaves the files around. The changes in SPARK-7503 do not catch this case since there is no exception in the shutdown. When the application gets killed the AM gets told to shutdown and the shutdown hook is run but the clean up is not triggered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master
[ https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547963#comment-14547963 ] Roque Vassal'lo commented on SPARK-5265: Yes, SPARK-6443 and this jira are the same. Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master -- Key: SPARK-5265 URL: https://issues.apache.org/jira/browse/SPARK-5265 Project: Spark Issue Type: Bug Components: Deploy Reporter: Roque Vassal'lo Labels: cluster, spark-submit, standalone, zookeeper Hi, this is my first JIRA here, so I hope it is clear enough. I'm using Spark 1.2.0 and trying to submit an application on a Spark Standalone cluster in cluster deploy mode with supervise. Standalone cluster is running in high availability mode, using Zookeeper to provide leader election between three available Masters (named master1, master2 and master3). As read at Spark's documentation, to register a Worker to the Standalone cluster, I provide complete cluster info as the spark route. I mean, spark://master1:7077,master2:7077,master3:7077 and that route is parsed and three attempts are launched, first one to master1:7077, second one to master2:7077 and third one to master3:7077. This works great! But if I try to do the same while submitting applications, it fails. I mean, if I provide complete cluster info as the --master option to spark-submit script, it throws an exception because it tries to connect as it was a single node. Example: spark-submit --class org.apache.spark.examples.SparkPi --master spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster --supervise examples.jar 100 This is the output I got: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mytest); users with modify permissions: Set(mytest) 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on port 53930. 
15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: spark://master1:7077,master2:7077,master3:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://master1:7077,master2:7077,master3:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 9 more Shouldn't it parse it as on Worker registration? It will not force client to know which is the current active Master of the Standalone cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547974#comment-14547974 ] Sean Owen commented on SPARK-7699: -- It would make a difference if the program immediately executed an operation that needed more than the minimum number of executors, but a spark-shell idling doesn't do that. Config spark.dynamicAllocation.initialExecutors has no effect Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run the spark-shell with above configurations, the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
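A rough sketch of the behaviour being described, not the actual ExecutorAllocationManager code: the initial request is bounded by the min/max settings, but an idle application's target decays back towards the minimum, which is why an idle spark-shell ends up at 2 executors here:
{code}
// Clamp the configured initial target into [min, max]; the property names are the real
// configuration keys, but the resolution logic is only an illustration.
def initialTarget(conf: Map[String, String]): Int = {
  val min     = conf.getOrElse("spark.dynamicAllocation.minExecutors", "0").toInt
  val max     = conf.getOrElse("spark.dynamicAllocation.maxExecutors", Int.MaxValue.toString).toInt
  val initial = conf.getOrElse("spark.dynamicAllocation.initialExecutors", min.toString).toInt
  math.min(math.max(initial, min), max)
}
{code}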
[jira] [Resolved] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master
[ https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5265. -- Resolution: Duplicate Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master -- Key: SPARK-5265 URL: https://issues.apache.org/jira/browse/SPARK-5265 Project: Spark Issue Type: Bug Components: Deploy Reporter: Roque Vassal'lo Labels: cluster, spark-submit, standalone, zookeeper Hi, this is my first JIRA here, so I hope it is clear enough. I'm using Spark 1.2.0 and trying to submit an application on a Spark Standalone cluster in cluster deploy mode with supervise. Standalone cluster is running in high availability mode, using Zookeeper to provide leader election between three available Masters (named master1, master2 and master3). As read at Spark's documentation, to register a Worker to the Standalone cluster, I provide complete cluster info as the spark route. I mean, spark://master1:7077,master2:7077,master3:7077 and that route is parsed and three attempts are launched, first one to master1:7077, second one to master2:7077 and third one to master3:7077. This works great! But if I try to do the same while submitting applications, it fails. I mean, if I provide complete cluster info as the --master option to spark-submit script, it throws an exception because it tries to connect as it was a single node. Example: spark-submit --class org.apache.spark.examples.SparkPi --master spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster --supervise examples.jar 100 This is the output I got: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mytest); users with modify permissions: Set(mytest) 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on port 53930. 15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: spark://master1:7077,master2:7077,master3:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://master1:7077,master2:7077,master3:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 
9 more Shouldn't it parse it as on Worker registration? It will not force client to know which is the current active Master of the Standalone cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7700) Spark 1.3.0 on YARN: Application failed 2 times due to AM Container
[ https://issues.apache.org/jira/browse/SPARK-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547641#comment-14547641 ] Kaveen Raajan commented on SPARK-7700: -- Hi [~srowen] I'm sure there is no space available at my hadoop and spark path. Since I running in windows environment I replace spark-env.sh to spark-env.cmd which contains _SET SPARK_JAR=hdfs://master:9000/user/spark/jar_ *Note:* spark jar files are moved to hdfs specified location. Also spark classpath are added to hadoop-config.cmd and HADOOP_CONF_DIR are set at enviroment variable. And also I didn't put any JVM arg in class belongs. While seeing at launch-container.cmd file this line are executed {panel} @C:\Hadoop\bin\winutils.exe symlink __spark__.jar \tmp\hadoop-HDFS\nm-local-dir\filecache\10\jar @C:\Hadoop\bin\winutils.exe symlink __app__.jar \tmp\hadoop-HDFS\nm-local-dir\usercache\HDFS\filecache\10\spark-examples-1.3.1-hadoop2.5.2.jar @call %JAVA_HOME%/bin/java -server -Xmx4096m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.executor.memory=2g' '-Dspark.app.name=org.apache.spark.examples.SparkPi' '-Dspark.master=yarn-cluster' -Dspark.yarn.app.container.log.dir=C:/Hadoop/logs/userlogs/application_1431924261044_0002/container_1431924261044_0002_01_01 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar file:/C:/Spark/lib/spark-examples-1.3.1-hadoop2.5.2.jar --arg '10' --executor-memory 2048m --executor-cores 1 --num-executors 3 1 C:/Hadoop/logs/userlogs/application_1431924261044_0002/container_1431924261044_0002_01_01/stdout 2 C:/Hadoop/logs/userlogs/application_1431924261044_0002/container_1431924261044_0002_01_01/stderr {panel} Spark 1.3.0 on YARN: Application failed 2 times due to AM Container --- Key: SPARK-7700 URL: https://issues.apache.org/jira/browse/SPARK-7700 Project: Spark Issue Type: Story Components: Build Affects Versions: 1.3.1 Environment: windows 8 Single language Hadoop-2.5.2 Protocol Buffer-2.5.0 Scala-2.11 Reporter: Kaveen Raajan I build SPARK on yarn mode by giving following command. Build got succeeded. {panel} mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests clean package {panel} I set following property at spark-env.cmd file {panel} SET SPARK_JAR=hdfs://master:9000/user/spark/jar {panel} *Note:* spark jar files are moved to hdfs specified location. Also spark classpath are added to hadoop-config.cmd and HADOOP_CONF_DIR are set at enviroment variable. I tried to execute following SparkPi example in yarn-cluster mode. 
{panel} spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue default S:\Hadoop\Spark\spark-1.3.1\examples\target\spark-examples_2.10-1.3.1.jar 10 {panel} My job able to submit at hadoop cluster, but it always in accepted state and Failed with following error {panel} 15/05/14 13:00:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 15/05/14 13:00:51 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers 15/05/14 13:00:51 INFO yarn.Client: Verifying our application has not requestedmore than the maximum memory capability of the cluster (8048 MB per container) 15/05/14 13:00:51 INFO yarn.Client: Will allocate AM container, with 4480 MB memory including 384 MB overhead 15/05/14 13:00:51 INFO yarn.Client: Setting up container launch context for ourAM 15/05/14 13:00:51 INFO yarn.Client: Preparing resources for our AM container 15/05/14 13:00:52 INFO yarn.Client: Source and destination file systems are thesame. Not copying hdfs://master:9000/user/spark/jar 15/05/14 13:00:52 INFO yarn.Client: Uploading resource file:/S:/Hadoop/Spark/spark-1.3.1/examples/target/spark-examples_2.10-1.3.1.jar - hdfs://master:9000/user/HDFS/.sparkStaging/application_1431587916618_0003/spark-examples_2.10-1.3.1.jar 15/05/14 13:00:52 INFO yarn.Client: Setting up the launch environment for our AM container 15/05/14 13:00:52 INFO spark.SecurityManager: Changing view acls to: HDFS 15/05/14 13:00:52 INFO spark.SecurityManager: Changing modify acls to: HDFS 15/05/14 13:00:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(HDFS); users with modify permissions: Set(HDFS) 15/05/14 13:00:52 INFO yarn.Client: Submitting application 3 to ResourceManager 15/05/14 13:00:52 INFO impl.YarnClientImpl: Submitted application application_1431587916618_0003 15/05/14 13:00:53 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:53 INFO
[jira] [Updated] (SPARK-7661) Support for dynamic allocation of resources in Kinesis Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Murtaza Kanchwala updated SPARK-7661: - Summary: Support for dynamic allocation of resources in Kinesis Spark Streaming (was: Support for dynamic allocation of executors in Kinesis Spark Streaming) Support for dynamic allocation of resources in Kinesis Spark Streaming -- Key: SPARK-7661 URL: https://issues.apache.org/jira/browse/SPARK-7661 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Environment: AWS-EMR Reporter: Murtaza Kanchwala Currently the no. of cores is (N + 1), where N is no. of shards in a Kinesis Stream. My Requirement is that if I use this Resharding util for Amazon Kinesis : Amazon Kinesis Resharding : https://github.com/awslabs/amazon-kinesis-scaling-utils Then there should be some way to allocate executors on the basis of no. of shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7700) Spark 1.3.0 on YARN: Application failed 2 times due to AM Container
[ https://issues.apache.org/jira/browse/SPARK-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7700. -- Resolution: Duplicate Given you're on Windows, this sounds almost exactly like SPARK-5754 Spark 1.3.0 on YARN: Application failed 2 times due to AM Container --- Key: SPARK-7700 URL: https://issues.apache.org/jira/browse/SPARK-7700 Project: Spark Issue Type: Story Components: Build Affects Versions: 1.3.1 Environment: windows 8 Single language Hadoop-2.5.2 Protocol Buffer-2.5.0 Scala-2.11 Reporter: Kaveen Raajan I build SPARK on yarn mode by giving following command. Build got succeeded. {panel} mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests clean package {panel} I set following property at spark-env.cmd file {panel} SET SPARK_JAR=hdfs://master:9000/user/spark/jar {panel} *Note:* spark jar files are moved to hdfs specified location. Also spark classpath are added to hadoop-config.cmd and HADOOP_CONF_DIR are set at enviroment variable. I tried to execute following SparkPi example in yarn-cluster mode. {panel} spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue default S:\Hadoop\Spark\spark-1.3.1\examples\target\spark-examples_2.10-1.3.1.jar 10 {panel} My job able to submit at hadoop cluster, but it always in accepted state and Failed with following error {panel} 15/05/14 13:00:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 15/05/14 13:00:51 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers 15/05/14 13:00:51 INFO yarn.Client: Verifying our application has not requestedmore than the maximum memory capability of the cluster (8048 MB per container) 15/05/14 13:00:51 INFO yarn.Client: Will allocate AM container, with 4480 MB memory including 384 MB overhead 15/05/14 13:00:51 INFO yarn.Client: Setting up container launch context for ourAM 15/05/14 13:00:51 INFO yarn.Client: Preparing resources for our AM container 15/05/14 13:00:52 INFO yarn.Client: Source and destination file systems are thesame. 
Not copying hdfs://master:9000/user/spark/jar 15/05/14 13:00:52 INFO yarn.Client: Uploading resource file:/S:/Hadoop/Spark/spark-1.3.1/examples/target/spark-examples_2.10-1.3.1.jar - hdfs://master:9000/user/HDFS/.sparkStaging/application_1431587916618_0003/spark-examples_2.10-1.3.1.jar 15/05/14 13:00:52 INFO yarn.Client: Setting up the launch environment for our AM container 15/05/14 13:00:52 INFO spark.SecurityManager: Changing view acls to: HDFS 15/05/14 13:00:52 INFO spark.SecurityManager: Changing modify acls to: HDFS 15/05/14 13:00:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(HDFS); users with modify permissions: Set(HDFS) 15/05/14 13:00:52 INFO yarn.Client: Submitting application 3 to ResourceManager 15/05/14 13:00:52 INFO impl.YarnClientImpl: Submitted application application_1431587916618_0003 15/05/14 13:00:53 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:53 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1431588652790 final status: UNDEFINED tracking URL: http://master:8088/proxy/application_1431587916618_0003/ user: HDFS 15/05/14 13:00:54 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:55 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:56 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:57 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:58 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:59 INFO yarn.Client: Application report for application_1431587916618_0003 (state: FAILED) 15/05/14 13:00:59 INFO yarn.Client: client token: N/A diagnostics: Application application_1431587916618_0003 failed 2 times due to AM Container for appattempt_1431587916618_0003_02 exited with exitCode: 1 For more detailed output, check application tracking page:http://master:8088/proxy/application_1431587916618_0003/Then, click on links to logs of each attempt. Diagnostics: Exception from container-launch. Container id:
[jira] [Resolved] (SPARK-7299) saving Oracle-source DataFrame to Hive changes scale
[ https://issues.apache.org/jira/browse/SPARK-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7299. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Liang-Chi Hsieh saving Oracle-source DataFrame to Hive changes scale Key: SPARK-7299 URL: https://issues.apache.org/jira/browse/SPARK-7299 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Ken Geis Assignee: Liang-Chi Hsieh Fix For: 1.4.0 When I load data from Oracle, save it to a table, then select from it, the scale is changed. For example, I have a column defined as NUMBER(12, 2). I insert 1 into the column. When I write that to a table and select from it, the result is 199.99. Some databases (e.g. H2) will return this as 1.00, but Oracle returns it as 1. I believe that when the file is written out to parquet, the scale information is taken from the schema, not the value. In an Oracle (at least) JDBC DataFrame, the scale may be different from row to row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7150) SQLContext.range()
[ https://issues.apache.org/jira/browse/SPARK-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547710#comment-14547710 ] Apache Spark commented on SPARK-7150: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6230 SQLContext.range() -- Key: SPARK-7150 URL: https://issues.apache.org/jira/browse/SPARK-7150 Project: Spark Issue Type: Sub-task Components: ML, SQL Reporter: Joseph K. Bradley Assignee: Adrian Wang Priority: Minor Labels: starter It would be handy to have easy ways to construct random columns for DataFrames. Proposed API: {code} class SQLContext { // Return a DataFrame with a single column named "id" that has consecutive values from 0 to n. def range(n: Long): DataFrame def range(n: Long, numPartitions: Int): DataFrame } {code} Usage: {code} // uniform distribution ctx.range(1000).select(rand()) // normal distribution ctx.range(1000).select(randn()) {code} We should add a RangeIterator that supports long start/stop positions, and then use it to create an RDD as the basis for this DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
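A minimal sketch of the RangeIterator idea from the description, under the assumption that range(n, numPartitions) maps one contiguous sub-range of [0, n) onto each partition (names and splitting scheme are illustrative, not the merged implementation):
{code}
// Iterator over Long positions in [start, end), usable as the per-partition source
// for the single-column "id" DataFrame.
class RangeIterator(start: Long, end: Long) extends Iterator[Long] {
  private var current = start
  override def hasNext: Boolean = current < end
  override def next(): Long = { val v = current; current += 1; v }
}

// One (start, end) pair per partition, covering [0, n) without gaps or overlap.
def partitionBounds(n: Long, numPartitions: Int): Seq[(Long, Long)] =
  (0 until numPartitions).map { i =>
    (i * n / numPartitions, (i + 1) * n / numPartitions)
  }
{code}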
[jira] [Commented] (SPARK-6416) RDD.fold() requires the operator to be commutative
[ https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547718#comment-14547718 ] Apache Spark commented on SPARK-6416: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/6231 RDD.fold() requires the operator to be commutative -- Key: SPARK-6416 URL: https://issues.apache.org/jira/browse/SPARK-6416 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Reporter: Josh Rosen Priority: Critical Spark's {{RDD.fold}} operation has some confusing behaviors when a non-commutative reduce function is used. Here's an example, which was originally reported on StackOverflow (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output): {code} sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 ) 8 {code} To understand what's going on here, let's look at the definition of Spark's `fold` operation. I'm going to show the Python version of the code, but the Scala version exhibits the exact same behavior (you can also [browse the source on GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]: {code} def fold(self, zeroValue, op): Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral zero value. The function C{op(t1, t2)} is allowed to modify C{t1} and return it as its result value to avoid object allocation; however, it should not modify C{t2}. from operator import add sc.parallelize([1, 2, 3, 4, 5]).fold(0, add) 15 def func(iterator): acc = zeroValue for obj in iterator: acc = op(obj, acc) yield acc vals = self.mapPartitions(func).collect() return reduce(op, vals, zeroValue) {code} (For comparison, see the [Scala implementation of `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]). Spark's `fold` operates by first folding each partition and then folding the results. The problem is that an empty partition gets folded down to the zero element, so the final driver-side fold ends up folding one value for _every_ partition rather than one value for each _non-empty_ partition. This means that the result of `fold` is sensitive to the number of partitions: {code} sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 ) 100 sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 ) 50 sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 ) 1 {code} In this last case, what's happening is that the single partition is being folded down to the correct value, then that value is folded with the zero-value at the driver to yield 1. I think the underlying problem here is that our fold() operation implicitly requires the operator to be commutative in addition to associative, but this isn't documented anywhere. Due to ordering non-determinism elsewhere in Spark, such as SPARK-5750, I don't think there's an easy way to fix this. Therefore, I think we should update the documentation and examples to clarify this requirement and explain that our fold acts more like a reduce with a default value than the type of ordering-sensitive fold() that users may expect in functional languages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
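The same partition-count sensitivity can be reproduced from the Scala API; a short sketch, assuming a local SparkContext named sc, with the expected results shown as comments:
{code}
val data = Seq(1, 25, 8, 4, 2)

// (a, _) => a + 1 ignores its second argument, so it is neither commutative nor
// associative; the result tracks the number of partitions rather than the data.
sc.parallelize(data, 100).fold(0)((a, _) => a + 1) // 100: one increment per partition
sc.parallelize(data, 1).fold(0)((a, _) => a + 1)   //   1

// A commutative, associative operator behaves as documented.
sc.parallelize(data, 100).fold(0)(_ + _)           //  40
{code}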
[jira] [Assigned] (SPARK-6416) RDD.fold() requires the operator to be commutative
[ https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6416: --- Assignee: (was: Apache Spark) RDD.fold() requires the operator to be commutative -- Key: SPARK-6416 URL: https://issues.apache.org/jira/browse/SPARK-6416 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Reporter: Josh Rosen Priority: Critical Spark's {{RDD.fold}} operation has some confusing behaviors when a non-commutative reduce function is used. Here's an example, which was originally reported on StackOverflow (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output): {code} sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 ) 8 {code} To understand what's going on here, let's look at the definition of Spark's `fold` operation. I'm going to show the Python version of the code, but the Scala version exhibits the exact same behavior (you can also [browse the source on GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]: {code} def fold(self, zeroValue, op): Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral zero value. The function C{op(t1, t2)} is allowed to modify C{t1} and return it as its result value to avoid object allocation; however, it should not modify C{t2}. from operator import add sc.parallelize([1, 2, 3, 4, 5]).fold(0, add) 15 def func(iterator): acc = zeroValue for obj in iterator: acc = op(obj, acc) yield acc vals = self.mapPartitions(func).collect() return reduce(op, vals, zeroValue) {code} (For comparison, see the [Scala implementation of `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]). Spark's `fold` operates by first folding each partition and then folding the results. The problem is that an empty partition gets folded down to the zero element, so the final driver-side fold ends up folding one value for _every_ partition rather than one value for each _non-empty_ partition. This means that the result of `fold` is sensitive to the number of partitions: {code} sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 ) 100 sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 ) 50 sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 ) 1 {code} In this last case, what's happening is that the single partition is being folded down to the correct value, then that value is folded with the zero-value at the driver to yield 1. I think the underlying problem here is that our fold() operation implicitly requires the operator to be commutative in addition to associative, but this isn't documented anywhere. Due to ordering non-determinism elsewhere in Spark, such as SPARK-5750, I don't think there's an easy way to fix this. Therefore, I think we should update the documentation and examples to clarify this requirement and explain that our fold acts more like a reduce with a default value than the type of ordering-sensitive fold() that users may expect in functional languages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
meiyoula created SPARK-7699: --- Summary: Config spark.dynamicAllocation.initialExecutors has no effect Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run the spark-shell with above configurations, the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7700) Spark 1.3.0 on YARN: Application failed 2 times due to AM Container
[ https://issues.apache.org/jira/browse/SPARK-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7700. -- Resolution: Not A Problem This sounds like a problem with your configuration, like you've put a JVM arg somewhere where a class belongs. What is spark-env.sh? Another random question -- do your paths have a space in them anywhere? Spark 1.3.0 on YARN: Application failed 2 times due to AM Container --- Key: SPARK-7700 URL: https://issues.apache.org/jira/browse/SPARK-7700 Project: Spark Issue Type: Story Components: Build Affects Versions: 1.3.1 Environment: windows 8 Single language Hadoop-2.5.2 Protocol Buffer-2.5.0 Scala-2.11 Reporter: Kaveen Raajan I build SPARK on yarn mode by giving following command. Build got succeeded. {panel} mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests clean package {panel} I set following property at spark-env.cmd file {panel} SET SPARK_JAR=hdfs://master:9000/user/spark/jar {panel} *Note:* spark jar files are moved to hdfs specified location. Also spark classpath are added to hadoop-config.cmd and HADOOP_CONF_DIR are set at enviroment variable. I tried to execute following SparkPi example in yarn-cluster mode. {panel} spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue default S:\Hadoop\Spark\spark-1.3.1\examples\target\spark-examples_2.10-1.3.1.jar 10 {panel} My job able to submit at hadoop cluster, but it always in accepted state and Failed with following error {panel} 15/05/14 13:00:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 15/05/14 13:00:51 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers 15/05/14 13:00:51 INFO yarn.Client: Verifying our application has not requestedmore than the maximum memory capability of the cluster (8048 MB per container) 15/05/14 13:00:51 INFO yarn.Client: Will allocate AM container, with 4480 MB memory including 384 MB overhead 15/05/14 13:00:51 INFO yarn.Client: Setting up container launch context for ourAM 15/05/14 13:00:51 INFO yarn.Client: Preparing resources for our AM container 15/05/14 13:00:52 INFO yarn.Client: Source and destination file systems are thesame. 
Not copying hdfs://master:9000/user/spark/jar 15/05/14 13:00:52 INFO yarn.Client: Uploading resource file:/S:/Hadoop/Spark/spark-1.3.1/examples/target/spark-examples_2.10-1.3.1.jar - hdfs://master:9000/user/HDFS/.sparkStaging/application_1431587916618_0003/spark-examples_2.10-1.3.1.jar 15/05/14 13:00:52 INFO yarn.Client: Setting up the launch environment for our AM container 15/05/14 13:00:52 INFO spark.SecurityManager: Changing view acls to: HDFS 15/05/14 13:00:52 INFO spark.SecurityManager: Changing modify acls to: HDFS 15/05/14 13:00:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(HDFS); users with modify permissions: Set(HDFS) 15/05/14 13:00:52 INFO yarn.Client: Submitting application 3 to ResourceManager 15/05/14 13:00:52 INFO impl.YarnClientImpl: Submitted application application_1431587916618_0003 15/05/14 13:00:53 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:53 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1431588652790 final status: UNDEFINED tracking URL: http://master:8088/proxy/application_1431587916618_0003/ user: HDFS 15/05/14 13:00:54 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:55 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:56 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:57 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:58 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:59 INFO yarn.Client: Application report for application_1431587916618_0003 (state: FAILED) 15/05/14 13:00:59 INFO yarn.Client: client token: N/A diagnostics: Application application_1431587916618_0003 failed 2 times due to AM Container for appattempt_1431587916618_0003_02 exited with exitCode: 1 For more detailed output, check application tracking
[jira] [Commented] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547642#comment-14547642 ] Murtaza Kanchwala commented on SPARK-7661: -- Yes, it works. I took 4 + 4 = 8 cores, where 4 is my total no. of cores and 4 is my total no. of shards. But there is still another thing: when the Spark consumer starts up it takes 1 executor and 4 network receivers. When I scale up my Kinesis stream by 4 more shards, i.e. to 8 shards, it still works, but the receiver count stays at 4. So is there any way to scale up the receivers or not? Support for dynamic allocation of executors in Kinesis Spark Streaming -- Key: SPARK-7661 URL: https://issues.apache.org/jira/browse/SPARK-7661 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Environment: AWS-EMR Reporter: Murtaza Kanchwala Currently the no. of cores is (N + 1), where N is the no. of shards in a Kinesis stream. My requirement is that if I use this resharding util for Amazon Kinesis: https://github.com/awslabs/amazon-kinesis-scaling-utils then there should be some way to allocate executors on the basis of the no. of shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
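A minimal sketch of the receiver-per-shard pattern under discussion, assuming the Spark 1.3-era KinesisUtils.createStream signature; the stream name, endpoint, and shard count below are illustrative. Because receivers are created when the StreamingContext starts, picking up additional shards after a reshard generally means restarting the application with a larger receiver count, which is essentially the limitation being described here.
{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-receivers"), Seconds(10))

// One receiver per shard; numShards would normally be looked up via the Kinesis API.
val numShards = 8
val shardStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(ssc, "my-kinesis-stream",
    "https://kinesis.us-east-1.amazonaws.com", Seconds(10),
    InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
}

// Union the per-shard streams into a single DStream for downstream processing.
val unified = ssc.union(shardStreams)
unified.count().print()

ssc.start()
ssc.awaitTermination()
{code}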
[jira] [Assigned] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero
[ https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7663: --- Assignee: Apache Spark [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero - Key: SPARK-7663 URL: https://issues.apache.org/jira/browse/SPARK-7663 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xusen Yin Assignee: Apache Spark Priority: Minor mllib.feature.Word2Vec at line 442: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442 uses `.head` to get the vector size, but it throws an empty iterator error if `minCount` is large enough to remove all words in the dataset. Since this is not a common scenario, maybe we can ignore it. If so, we can close the issue directly. If not, I can add some code to print a more helpful error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
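For illustration, a small sketch that reproduces the empty-vocabulary situation described above, assuming the setMinCount setter available on mllib.feature.Word2Vec in this version; every token occurs only once, so a minCount of 5 filters out the entire vocabulary before the vector size is read.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

val sc = new SparkContext(new SparkConf().setAppName("word2vec-empty-vocab").setMaster("local[2]"))

// Each token appears exactly once, so minCount = 5 removes every word.
val corpus = sc.parallelize(Seq(
  Seq("spark", "mllib", "word2vec"),
  Seq("vocabulary", "size", "zero")))

// Expected to fail with an empty-iterator error rather than a clear message.
val model = new Word2Vec().setMinCount(5).fit(corpus)
{code}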
[jira] [Assigned] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero
[ https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7663: --- Assignee: (was: Apache Spark) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero - Key: SPARK-7663 URL: https://issues.apache.org/jira/browse/SPARK-7663 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xusen Yin Priority: Minor mllib.feature.Word2Vec at line 442: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442 uses `.head` to get the vector size, but it throws an empty iterator error if `minCount` is large enough to remove all words in the dataset. Since this is not a common scenario, maybe we can ignore it. If so, we can close the issue directly. If not, I can add some code to print a more helpful error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect
[ https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547665#comment-14547665 ] Sean Owen commented on SPARK-7699: -- I think that's maybe intentional? the logic is detecting that you don't need 3 executors and ramping it down to the minimum. Config spark.dynamicAllocation.initialExecutors has no effect Key: SPARK-7699 URL: https://issues.apache.org/jira/browse/SPARK-7699 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 4 Just run the spark-shell with above configurations, the initial executor number is 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
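For reference, a minimal sketch of the reported settings expressed as SparkConf entries (the shuffle-service line is an assumption added here as the usual prerequisite for dynamic allocation on YARN, not something stated in the report); with no task backlog, the allocation manager would be expected to settle at minExecutors rather than hold the initial three.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // assumed prerequisite, not from the report
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.initialExecutors", "3")
  .set("spark.dynamicAllocation.maxExecutors", "4")
{code}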
[jira] [Reopened] (SPARK-7700) Spark 1.3.0 on YARN: Application failed 2 times due to AM Container
[ https://issues.apache.org/jira/browse/SPARK-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-7700: -- Spark 1.3.0 on YARN: Application failed 2 times due to AM Container --- Key: SPARK-7700 URL: https://issues.apache.org/jira/browse/SPARK-7700 Project: Spark Issue Type: Story Components: Build Affects Versions: 1.3.1 Environment: windows 8 Single language Hadoop-2.5.2 Protocol Buffer-2.5.0 Scala-2.11 Reporter: Kaveen Raajan I built Spark in YARN mode with the following command, and the build succeeded. {panel} mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests clean package {panel} I set the following property in the spark-env.cmd file {panel} SET SPARK_JAR=hdfs://master:9000/user/spark/jar {panel} *Note:* the Spark jar files were moved to the specified HDFS location. The Spark classpath was also added to hadoop-config.cmd, and HADOOP_CONF_DIR is set as an environment variable. I tried to execute the following SparkPi example in yarn-cluster mode. {panel} spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue default S:\Hadoop\Spark\spark-1.3.1\examples\target\spark-examples_2.10-1.3.1.jar 10 {panel} My job is submitted to the Hadoop cluster, but it always stays in the ACCEPTED state and then fails with the following error {panel} 15/05/14 13:00:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 15/05/14 13:00:51 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers 15/05/14 13:00:51 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8048 MB per container) 15/05/14 13:00:51 INFO yarn.Client: Will allocate AM container, with 4480 MB memory including 384 MB overhead 15/05/14 13:00:51 INFO yarn.Client: Setting up container launch context for our AM 15/05/14 13:00:51 INFO yarn.Client: Preparing resources for our AM container 15/05/14 13:00:52 INFO yarn.Client: Source and destination file systems are the same. 
Not copying hdfs://master:9000/user/spark/jar 15/05/14 13:00:52 INFO yarn.Client: Uploading resource file:/S:/Hadoop/Spark/spark-1.3.1/examples/target/spark-examples_2.10-1.3.1.jar - hdfs://master:9000/user/HDFS/.sparkStaging/application_1431587916618_0003/spark-examples_2.10-1.3.1.jar 15/05/14 13:00:52 INFO yarn.Client: Setting up the launch environment for our AM container 15/05/14 13:00:52 INFO spark.SecurityManager: Changing view acls to: HDFS 15/05/14 13:00:52 INFO spark.SecurityManager: Changing modify acls to: HDFS 15/05/14 13:00:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(HDFS); users with modify permissions: Set(HDFS) 15/05/14 13:00:52 INFO yarn.Client: Submitting application 3 to ResourceManager 15/05/14 13:00:52 INFO impl.YarnClientImpl: Submitted application application_1431587916618_0003 15/05/14 13:00:53 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:53 INFO yarn.Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1431588652790 final status: UNDEFINED tracking URL: http://master:8088/proxy/application_1431587916618_0003/ user: HDFS 15/05/14 13:00:54 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:55 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:56 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:57 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:58 INFO yarn.Client: Application report for application_1431587916618_0003 (state: ACCEPTED) 15/05/14 13:00:59 INFO yarn.Client: Application report for application_1431587916618_0003 (state: FAILED) 15/05/14 13:00:59 INFO yarn.Client: client token: N/A diagnostics: Application application_1431587916618_0003 failed 2 times due to AM Container for appattempt_1431587916618_0003_02 exited with exitCode: 1 For more detailed output, check application tracking page:http://master:8088/proxy/application_1431587916618_0003/Then, click on links to logs of each attempt. Diagnostics: Exception from container-launch. Container id: container_1431587916618_0003_02_01 Exit code: 1 Stack trace: ExitCodeException exitCode=1: at
[jira] [Assigned] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD
[ https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7697: --- Assignee: Apache Spark Column with an unsigned int should be treated as long in JDBCRDD Key: SPARK-7697 URL: https://issues.apache.org/jira/browse/SPARK-7697 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: DAITO Teppei Assignee: Apache Spark Columns with an unsigned numeric type in JDBC should be treated as the next 'larger' Java type in JDBCRDD#getCatalystType. https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49 {code:title=q.sql} create table t1 (id int unsigned); insert into t1 values (4234567890); {code} {code:title=T1.scala} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext object T1 { def main(args: Array[String]) { val sc = new SparkContext(new SparkConf()) val s = new SQLContext(sc) val url = "jdbc:mysql://localhost/test" val t1 = s.jdbc(url, "t1") t1.printSchema() t1.collect().foreach(println) } } {code} This code caused an error like the one below. {noformat} 15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in column '1' is outside valid range for the datatype INTEGER. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at com.mysql.jdbc.Util.handleNewInstance(Util.java:377) at com.mysql.jdbc.Util.getInstance(Util.java:360) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870) at com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090) at com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364) at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484) at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344) at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
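As an illustration of the widening being asked for, here is a standalone, hypothetical helper (not the actual JDBCRDD#getCatalystType patch): an unsigned JDBC integer type is mapped to the next larger Catalyst type so that values above the signed range still fit.
{code}
import java.sql.Types
import org.apache.spark.sql.types._

// Hypothetical mapping for this sketch only: widen unsigned integral JDBC types.
def widenedCatalystType(sqlType: Int, signed: Boolean): DataType = sqlType match {
  case Types.INTEGER if !signed => LongType           // unsigned INT needs 64 bits
  case Types.INTEGER            => IntegerType
  case Types.BIGINT if !signed  => DecimalType(20, 0) // unsigned BIGINT exceeds Long
  case Types.BIGINT             => LongType
  case _                        => StringType         // fallback for the sketch
}
{code}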
[jira] [Commented] (SPARK-4852) Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers
[ https://issues.apache.org/jira/browse/SPARK-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547708#comment-14547708 ] Manku Timma commented on SPARK-4852: The following diff could solve the problem. File is sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala. {code} import java.io.{OutputStream, InputStream} - import com.esotericsoftware.kryo.Kryo + import org.apache.hive.com.esotericsoftware.kryo.Kryo import org.apache.spark.util.Utils._ import org.apache.hadoop.hive.ql.exec.Utilities import org.apache.hadoop.hive.ql.exec.UDF {code} Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers -- Key: SPARK-4852 URL: https://issues.apache.org/jira/browse/SPARK-4852 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Cheng Lian Priority: Minor When adding Hive 0.13.1 support for Spark SQL Thrift server in PR [2685|https://github.com/apache/spark/pull/2685], Kryo 2.22 used by original hive-exec-0.13.1.jar was shaded by Kryo 2.21 used by Spark SQL because of dependency hell. Unfortunately, Kryo 2.21 has a known bug that may cause Hive query plan deserialization failure. This bug was fixed in Kryo 2.22. Normally, this issue doesn't affect Spark SQL because we don't even generate Hive query plan. But when running Hive test suites like {{HiveCompatibilitySuite}}, golden answer files must be generated by Hive, and thus triggers this issue. A workaround is to replace {{hive-exec-0.13.1.jar}} under {{$HIVE_HOME/lib}} with Spark's {{hive-exec-0.13.1a.jar}} and {{kryo-2.21.jar}} under {{$SPARK_DEV_HOME/lib_managed/jars}}. Then add {{$HIVE_HOME/lib}} to {{$HADOOP_CLASSPATH}}. Upgrading to some newer version of Kryo which is binary compatible with Kryo 2.22 (if there is one) may fix this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD
[ https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547709#comment-14547709 ] Apache Spark commented on SPARK-7697: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6229 Column with an unsigned int should be treated as long in JDBCRDD Key: SPARK-7697 URL: https://issues.apache.org/jira/browse/SPARK-7697 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: DAITO Teppei Columns with an unsigned numeric type in JDBC should be treated as the next 'larger' Java type in JDBCRDD#getCatalystType . https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49 {code:title=q.sql} create table t1 (id int unsigned); insert into t1 values (4234567890); {code} {code:title=T1.scala} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext object T1 { def main(args: Array[String]) { val sc = new SparkContext(new SparkConf()) val s = new SQLContext(sc) val url = jdbc:mysql://localhost/test val t1 = s.jdbc(url, t1) t1.printSchema() t1.collect().foreach(println) } } {code} This code caused error like below. {noformat} 15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in column '1' is outside valid range for the datatype INTEGER. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at com.mysql.jdbc.Util.handleNewInstance(Util.java:377) at com.mysql.jdbc.Util.getInstance(Util.java:360) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870) at com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090) at com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364) at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484) at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344) at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD
[ https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7697: --- Assignee: (was: Apache Spark) Column with an unsigned int should be treated as long in JDBCRDD Key: SPARK-7697 URL: https://issues.apache.org/jira/browse/SPARK-7697 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: DAITO Teppei Columns with an unsigned numeric type in JDBC should be treated as the next 'larger' Java type in JDBCRDD#getCatalystType . https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49 {code:title=q.sql} create table t1 (id int unsigned); insert into t1 values (4234567890); {code} {code:title=T1.scala} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext object T1 { def main(args: Array[String]) { val sc = new SparkContext(new SparkConf()) val s = new SQLContext(sc) val url = jdbc:mysql://localhost/test val t1 = s.jdbc(url, t1) t1.printSchema() t1.collect().foreach(println) } } {code} This code caused error like below. {noformat} 15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in column '1' is outside valid range for the datatype INTEGER. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at com.mysql.jdbc.Util.handleNewInstance(Util.java:377) at com.mysql.jdbc.Util.getInstance(Util.java:360) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870) at com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090) at com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364) at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484) at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344) at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6416) RDD.fold() requires the operator to be commutative
[ https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6416: --- Assignee: Apache Spark RDD.fold() requires the operator to be commutative -- Key: SPARK-6416 URL: https://issues.apache.org/jira/browse/SPARK-6416 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Reporter: Josh Rosen Assignee: Apache Spark Priority: Critical Spark's {{RDD.fold}} operation has some confusing behaviors when a non-commutative reduce function is used. Here's an example, which was originally reported on StackOverflow (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output): {code} sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 ) 8 {code} To understand what's going on here, let's look at the definition of Spark's `fold` operation. I'm going to show the Python version of the code, but the Scala version exhibits the exact same behavior (you can also [browse the source on GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]: {code} def fold(self, zeroValue, op): Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral zero value. The function C{op(t1, t2)} is allowed to modify C{t1} and return it as its result value to avoid object allocation; however, it should not modify C{t2}. from operator import add sc.parallelize([1, 2, 3, 4, 5]).fold(0, add) 15 def func(iterator): acc = zeroValue for obj in iterator: acc = op(obj, acc) yield acc vals = self.mapPartitions(func).collect() return reduce(op, vals, zeroValue) {code} (For comparison, see the [Scala implementation of `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]). Spark's `fold` operates by first folding each partition and then folding the results. The problem is that an empty partition gets folded down to the zero element, so the final driver-side fold ends up folding one value for _every_ partition rather than one value for each _non-empty_ partition. This means that the result of `fold` is sensitive to the number of partitions: {code} sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 ) 100 sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 ) 50 sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 ) 1 {code} In this last case, what's happening is that the single partition is being folded down to the correct value, then that value is folded with the zero-value at the driver to yield 1. I think the underlying problem here is that our fold() operation implicitly requires the operator to be commutative in addition to associative, but this isn't documented anywhere. Due to ordering non-determinism elsewhere in Spark, such as SPARK-5750, I don't think there's an easy way to fix this. Therefore, I think we should update the documentation and examples to clarify this requirement and explain that our fold acts more like a reduce with a default value than the type of ordering-sensitive fold() that users may expect in functional languages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
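To make the partition-count sensitivity concrete in Scala as well, here is a small sketch assuming a local master; the first operator is the non-commutative "count" from the report, while the second is both associative and commutative, so the number of partitions stops mattering.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("fold-demo").setMaster("local[4]"))
val data = sc.parallelize(Seq(1, 25, 8, 4, 2), numSlices = 100)

// Non-commutative operator: every partition (including the 95 empty ones) folds to
// its element count, and the driver-side merge then adds one per partition result.
val counted = data.fold(0)((acc, _) => acc + 1)  // 100 here, not 5

// Commutative and associative operator: result is independent of numSlices.
val summed = data.fold(0)(_ + _)                 // 40
{code}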
[jira] [Updated] (SPARK-7443) MLlib 1.4 QA plan
[ https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7443: - Description: TODO: create JIRAs for each task and assign them accordingly. h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) (SPARK-7537) ** Java compatibility (SPARK-7529) ** Python API coverage (SPARK-7536) * audit Pipeline APIs (SPARK-7535) * graduate spark.ml from alpha ** remove AlphaComponent annotations ** remove mima excludes for spark.ml ** mark concrete classes final wherever reasonable h2. Algorithms and performance *Performance* * _List any other missing performance tests from spark-perf here_ * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * Bernoulli naive Bayes (SPARK-7453) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python (SPARK-7539) *Correctness* * PMML ** scoring using PMML evaluator vs. MLlib models (SPARK-7540) * model save/load (SPARK-7541) h2. Documentation and example code * Create JIRAs for the user guide to each new algorithm and assign them to the corresponding author. Link here as requires ** Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. *** The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. *** We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. * Create example code for major components. Link here as requires ** cross validation in python ** pipeline with complex feature transformations (scala/java/python) ** elastic-net (possibly with cross validation) ** kernel density was: TODO: create JIRAs for each task and assign them accordingly. h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) (SPARK-7537) ** Java compatibility (SPARK-7529) ** Python API coverage (SPARK-7536) * audit Pipeline APIs (SPARK-7535) * graduate spark.ml from alpha ** remove AlphaComponent annotations ** remove mima excludes for spark.ml ** mark concrete classes final wherever reasonable h2. Algorithms and performance *Performance* * _List any other missing performance tests from spark-perf here_ * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * Bernoulli naive Bayes (SPARK-7453) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python (SPARK-7539) *Correctness* * PMML ** scoring using PMML evaluator vs. MLlib models (SPARK-7540) * model save/load (SPARK-7541) h2. Documentation and example code * Create JIRAs for the user guide to each new algorithm and assign them to the corresponding author. Link here as requires ** Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. *** The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. *** We should not duplicate info in the spark.ml guides. 
Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. * Create example code for major components. Link here as requires ** cross validation in python ** pipeline with complex feature transformations (scala/java/python) ** elastic-net (possibly with cross validation) MLlib 1.4 QA plan - Key: SPARK-7443 URL: https://issues.apache.org/jira/browse/SPARK-7443 Project: Spark Issue Type: Umbrella Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Priority: Critical TODO: create JIRAs for each task and assign them accordingly. h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) (SPARK-7537) ** Java compatibility (SPARK-7529) ** Python API coverage (SPARK-7536) * audit Pipeline APIs (SPARK-7535) * graduate spark.ml from alpha ** remove AlphaComponent annotations ** remove mima excludes for spark.ml ** mark concrete classes final wherever reasonable h2. Algorithms and performance *Performance* * _List any other missing performance tests from
[jira] [Created] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
Akshat Aranya created SPARK-7708: Summary: Incorrect task serialization with Kryo closure serializer Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3334) Spark causes mesos-master memory leak
[ https://issues.apache.org/jira/browse/SPARK-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3334. -- Resolution: Not A Problem Spark causes mesos-master memory leak - Key: SPARK-3334 URL: https://issues.apache.org/jira/browse/SPARK-3334 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.2 Environment: Mesos 0.16.0/0.19.0 CentOS 6.4 Reporter: Iven Hsu The {{akkaFrameSize}} is set to {{Long.MaxValue}} in MesosBackend to workaround SPARK-1112, this causes all serialized task result is sent using Mesos TaskStatus. mesos-master stores TaskStatus in memory, and when running Spark, its memory grows very fast, and will be OOM killed. See MESOS-1746 for more. I've tried to set {{akkaFrameSize}} to 0, mesos-master won't be killed, however, the driver will block after success unless I use {{sc.stop()}} to quit it manually. Not sure if it's related to SPARK-1112. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
[ https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548242#comment-14548242 ] Sean Owen commented on SPARK-7706: -- I am not sure Spark is designed to be invoked this way. You may need to invoke it in a shell. but why isn't the YARN_CONF_DIR in the environment on a Hadoop cluster not correct and sufficient? that's the idea. Allow setting YARN_CONF_DIR from spark argument --- Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Labels: oozie, yarn Currently in SparkSubmitArguments.scala when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks if YARN_CONF_DIR or HADOOP_CONF_DIR is set in EVN. However we should additionally allow passing YARN_CONF_DIR from command line argument this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. Reason being, oozie launcher App starts in one of the container assigned by Yarn RM and we do not want to set YARN_CONF_DIR in ENV for all the nodes in cluster. Just passing the argument like -yarnconfdir with conf dir (ex: /etc/hadoop/conf) should avoid setting the ENV variable. This is blocking us to onboard spark from oozie or falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
[ https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548243#comment-14548243 ] Sean Owen commented on SPARK-7706: -- I am not sure Spark is designed to be invoked this way. You may need to invoke it in a shell. but why isn't the YARN_CONF_DIR in the environment on a Hadoop cluster not correct and sufficient? that's the idea. Allow setting YARN_CONF_DIR from spark argument --- Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Labels: oozie, yarn Currently in SparkSubmitArguments.scala when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks if YARN_CONF_DIR or HADOOP_CONF_DIR is set in EVN. However we should additionally allow passing YARN_CONF_DIR from command line argument this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. Reason being, oozie launcher App starts in one of the container assigned by Yarn RM and we do not want to set YARN_CONF_DIR in ENV for all the nodes in cluster. Just passing the argument like -yarnconfdir with conf dir (ex: /etc/hadoop/conf) should avoid setting the ENV variable. This is blocking us to onboard spark from oozie or falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
[ https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7706: - Comment: was deleted (was: I am not sure Spark is designed to be invoked this way. You may need to invoke it in a shell. but why isn't the YARN_CONF_DIR in the environment on a Hadoop cluster not correct and sufficient? that's the idea.) Allow setting YARN_CONF_DIR from spark argument --- Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Labels: oozie, yarn Currently in SparkSubmitArguments.scala when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks if YARN_CONF_DIR or HADOOP_CONF_DIR is set in EVN. However we should additionally allow passing YARN_CONF_DIR from command line argument this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. Reason being, oozie launcher App starts in one of the container assigned by Yarn RM and we do not want to set YARN_CONF_DIR in ENV for all the nodes in cluster. Just passing the argument like -yarnconfdir with conf dir (ex: /etc/hadoop/conf) should avoid setting the ENV variable. This is blocking us to onboard spark from oozie or falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
[ https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548322#comment-14548322 ] Sean Owen commented on SPARK-7706: -- YARN_CONF_DIR is a YARN env variable right? not Spark-specific or app-specific. It should point to the cluster's YARN configuration, which is not app-specific either. I think any gateway machine will have this set. Right, or, is this not valid for Oozie somehow? Allow setting YARN_CONF_DIR from spark argument --- Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Labels: oozie, yarn Currently in SparkSubmitArguments.scala when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks if YARN_CONF_DIR or HADOOP_CONF_DIR is set in EVN. However we should additionally allow passing YARN_CONF_DIR from command line argument this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. Reason being, oozie launcher App starts in one of the container assigned by Yarn RM and we do not want to set YARN_CONF_DIR in ENV for all the nodes in cluster. Just passing the argument like -yarnconfdir with conf dir (ex: /etc/hadoop/conf) should avoid setting the ENV variable. This is blocking us to onboard spark from oozie or falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
[ https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548418#comment-14548418 ] Shaik Idris Ali commented on SPARK-7706: I think the cleaner way in SparkSubmitArguments is to support both class arguments and env variables; in fact, that is how it is done for most of the variables, and the Spark java command can then set these on the classpath appropriately. Maybe it uses ENV because these are generic variables. {code} executorMemory = Option(executorMemory) .orElse(sparkProperties.get("spark.executor.memory")) .orElse(env.get("SPARK_EXECUTOR_MEMORY")) .orNull {code} Meanwhile I will check if we have a simpler solution without fixing this in Spark. Thanks. Allow setting YARN_CONF_DIR from spark argument --- Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Labels: oozie, yarn Currently in SparkSubmitArguments.scala when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks if YARN_CONF_DIR or HADOOP_CONF_DIR is set in EVN. However we should additionally allow passing YARN_CONF_DIR from command line argument this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. Reason being, oozie launcher App starts in one of the container assigned by Yarn RM and we do not want to set YARN_CONF_DIR in ENV for all the nodes in cluster. Just passing the argument like -yarnconfdir with conf dir (ex: /etc/hadoop/conf) should avoid setting the ENV variable. This is blocking us to onboard spark from oozie or falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
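If the precedence idea above were followed inside SparkSubmitArguments, a hypothetical resolution for a YARN configuration directory could mirror the quoted pattern; the yarnConfDir field and the "spark.yarn.conf.dir" property below are illustrative names only and do not exist in Spark.
{code}
// Hypothetical sketch mirroring the argument -> property -> environment precedence
// quoted above; yarnConfDir and "spark.yarn.conf.dir" are illustrative, not real options.
yarnConfDir = Option(yarnConfDir)
  .orElse(sparkProperties.get("spark.yarn.conf.dir"))
  .orElse(env.get("YARN_CONF_DIR"))
  .orElse(env.get("HADOOP_CONF_DIR"))
  .orNull
{code}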
[jira] [Commented] (SPARK-7540) PMML correctness check
[ https://issues.apache.org/jira/browse/SPARK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548326#comment-14548326 ] Shuo Xiang commented on SPARK-7540: --- Schema and required field verification done using both [~vfed]`s script and manual check. The output of [spark-pmml-exporter-validator]( https://github.com/selvinsource/spark-pmml-exporter-validator) is checked against the required fields for each of the following model. - Linear Regression and Lasso - LR - Clustering - linear SVM. This is implemented as a RegressionModel (functionName=classification ), instead of the SupportVectorMachineModel. The evaluation results on built-in numerical examples look fine. PMML correctness check -- Key: SPARK-7540 URL: https://issues.apache.org/jira/browse/SPARK-7540 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Shuo Xiang Check correctness of PMML export for MLlib models by using PMML evaluator to load and run the models. This unfortunately needs to be done externally (not in spark-perf) because of licensing. A record of tests run and the results can be posted in this JIRA, as well as a link to the repo hosting the testing code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7693) Remove import scala.concurrent.ExecutionContext.Implicits.global
[ https://issues.apache.org/jira/browse/SPARK-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7693: - Assignee: Shixiong Zhu Remove import scala.concurrent.ExecutionContext.Implicits.global -- Key: SPARK-7693 URL: https://issues.apache.org/jira/browse/SPARK-7693 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.4.0 Learnt a lesson from SPARK-7655: Spark should avoid to use scala.concurrent.ExecutionContext.Implicits.global because the user may submit blocking actions to `scala.concurrent.ExecutionContext.Implicits.global` and exhaust all threads in it. This could crash Spark. So Spark should always use its own thread pools for safety. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
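A minimal sketch of the kind of dedicated pool this change points toward, assuming a plain java.util.concurrent executor wrapped as an ExecutionContext (the pool size is illustrative); blocking work submitted to a private pool like this can no longer starve the shared global pool.
{code}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Dedicated, bounded pool owned by the component instead of
// scala.concurrent.ExecutionContext.Implicits.global.
val threadPool = Executors.newFixedThreadPool(8)
implicit val executionContext: ExecutionContext = ExecutionContext.fromExecutorService(threadPool)

// Potentially blocking work runs on the private pool.
val result = Future {
  // ... blocking call here ...
  42
}

// Shut the pool down when the owning component stops:
// threadPool.shutdown()
{code}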
[jira] [Resolved] (SPARK-7272) User guide update for PMML model export
[ https://issues.apache.org/jira/browse/SPARK-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7272. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6219 [https://github.com/apache/spark/pull/6219] User guide update for PMML model export --- Key: SPARK-7272 URL: https://issues.apache.org/jira/browse/SPARK-7272 Project: Spark Issue Type: Documentation Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Vincenzo Selvaggio Fix For: 1.4.0 Add user guide for PMML model export with some example code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7458) Check 1.3- 1.4 MLlib API compliance using java-compliance-checker
[ https://issues.apache.org/jira/browse/SPARK-7458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-7458: Assignee: Xiangrui Meng Check 1.3- 1.4 MLlib API compliance using java-compliance-checker -- Key: SPARK-7458 URL: https://issues.apache.org/jira/browse/SPARK-7458 Project: Spark Issue Type: Sub-task Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should do this after 1.4-rc1 is cut. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7707) User guide and example code for Statistics.kernelDensity
Xiangrui Meng created SPARK-7707: Summary: User guide and example code for Statistics.kernelDensity Key: SPARK-7707 URL: https://issues.apache.org/jira/browse/SPARK-7707 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
[ https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548306#comment-14548306 ] Shaik Idris Ali commented on SPARK-7706: We might have multiple types of applications running on Yarn cluster and it might not be a good idea to setup application specific Env variables in all the nodes of a cluster. I understand that for submitting Ad-hoc jobs from command line it make sense to have YARN_CONF_DIR set in one of the machine which acts like Spark client. We can support both Spark argument and Env variable for such configurations. Allow setting YARN_CONF_DIR from spark argument --- Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Labels: oozie, yarn Currently in SparkSubmitArguments.scala when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks if YARN_CONF_DIR or HADOOP_CONF_DIR is set in EVN. However we should additionally allow passing YARN_CONF_DIR from command line argument this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. Reason being, oozie launcher App starts in one of the container assigned by Yarn RM and we do not want to set YARN_CONF_DIR in ENV for all the nodes in cluster. Just passing the argument like -yarnconfdir with conf dir (ex: /etc/hadoop/conf) should avoid setting the ENV variable. This is blocking us to onboard spark from oozie or falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7709) spark-submit option to quit after submitting in cluster mode
Shay Rojansky created SPARK-7709: Summary: spark-submit option to quit after submitting in cluster mode Key: SPARK-7709 URL: https://issues.apache.org/jira/browse/SPARK-7709 Project: Spark Issue Type: New Feature Components: Deploy Affects Versions: 1.3.1 Reporter: Shay Rojansky Priority: Minor When deploying in cluster mode, spark-submit continues polling the application every second. While this is a useful feature, there should be an option to have spark-submit exit immediately after submission completes. This would allow scripts to figure out that a job was successfully (or unsuccessfully) submitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548282#comment-14548282 ] Akshat Aranya commented on SPARK-7708: -- This happens because the TaskDescription contains a SerializableBuffer, which has no fields from the point of view of Kryo. It only has a transient var and a def. When this is serialized by Kryo, it finds no member fields to serialize, so it doesn't write out the underlying buffer. Incorrect task serialization with Kryo closure serializer - Key: SPARK-7708 URL: https://issues.apache.org/jira/browse/SPARK-7708 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2 Reporter: Akshat Aranya I've been investigating the use of Kryo for closure serialization with Spark 1.2, and it seems like I've hit upon a bug: When a task is serialized before scheduling, the following log message is generated: [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, host, PROCESS_LOCAL, 302 bytes) This message comes from TaskSetManager which serializes the task using the closure serializer. Before the message is sent out, the TaskDescription (which included the original task as a byte array), is serialized again into a byte array with the closure serializer. I added a log message for this in CoarseGrainedSchedulerBackend, which produces the following output: [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ than serialized task that it contains (302 bytes). This implies that TaskDescription.buffer is not getting serialized correctly. On the executor side, the deserialization produces a null value for TaskDescription.buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
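To illustrate the failure mode being described, here is a hedged sketch with a simplified stand-in class (not Spark's actual SerializableBuffer): Kryo's default field serializer sees no instance fields to write when the only state is a @transient var, so the buffer silently round-trips as null unless a custom serializer handles it explicitly.
{code}
import java.nio.ByteBuffer
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

// Simplified stand-in for a wrapper whose only state is transient.
class BufferWrapper(@transient var buffer: ByteBuffer) extends Serializable

// Explicit Kryo serializer that writes the buffer contents by hand.
class BufferWrapperSerializer extends Serializer[BufferWrapper] {
  override def write(kryo: Kryo, output: Output, wrapper: BufferWrapper): Unit = {
    val bytes = new Array[Byte](wrapper.buffer.remaining())
    wrapper.buffer.duplicate().get(bytes)
    output.writeInt(bytes.length)
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[BufferWrapper]): BufferWrapper = {
    val length = input.readInt()
    new BufferWrapper(ByteBuffer.wrap(input.readBytes(length)))
  }
}

// Registration sketch: kryo.register(classOf[BufferWrapper], new BufferWrapperSerializer)
{code}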
[jira] [Updated] (SPARK-7443) MLlib 1.4 QA plan
[ https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7443: - Description: TODO: create JIRAs for each task and assign them accordingly. h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) (SPARK-7537) ** Java compatibility (SPARK-7529) ** Python API coverage (SPARK-7536) * audit Pipeline APIs (SPARK-7535) * graduate spark.ml from alpha ** remove AlphaComponent annotations ** remove mima excludes for spark.ml ** mark concrete classes final wherever reasonable h2. Algorithms and performance *Performance* * _List any other missing performance tests from spark-perf here_ * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * Bernoulli naive Bayes (SPARK-7453) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python (SPARK-7539) *Correctness* * PMML ** scoring using PMML evaluator vs. MLlib models (SPARK-7540) * model save/load (SPARK-7541) h2. Documentation and example code * Create JIRAs for the user guide to each new algorithm and assign them to the corresponding author. Link here as requires ** Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. *** The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. *** We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. * Create example code for major components. Link here as requires ** cross validation in python (SPARK-7387) ** pipeline with complex feature transformations (scala/java/python) (SPARK-7546) ** elastic-net (possibly with cross validation) (SPARK-7547) ** kernel density (SPARK-7707) was: TODO: create JIRAs for each task and assign them accordingly. h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) (SPARK-7537) ** Java compatibility (SPARK-7529) ** Python API coverage (SPARK-7536) * audit Pipeline APIs (SPARK-7535) * graduate spark.ml from alpha ** remove AlphaComponent annotations ** remove mima excludes for spark.ml ** mark concrete classes final wherever reasonable h2. Algorithms and performance *Performance* * _List any other missing performance tests from spark-perf here_ * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * Bernoulli naive Bayes (SPARK-7453) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python (SPARK-7539) *Correctness* * PMML ** scoring using PMML evaluator vs. MLlib models (SPARK-7540) * model save/load (SPARK-7541) h2. Documentation and example code * Create JIRAs for the user guide to each new algorithm and assign them to the corresponding author. Link here as requires ** Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. *** The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. 
*** We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. * Create example code for major components. Link here as requires ** cross validation in python ** pipeline with complex feature transformations (scala/java/python) ** elastic-net (possibly with cross validation) ** kernel density MLlib 1.4 QA plan - Key: SPARK-7443 URL: https://issues.apache.org/jira/browse/SPARK-7443 Project: Spark Issue Type: Umbrella Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Priority: Critical TODO: create JIRAs for each task and assign them accordingly. h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) (SPARK-7537) ** Java compatibility (SPARK-7529) ** Python API coverage (SPARK-7536) * audit Pipeline APIs (SPARK-7535) * graduate spark.ml from alpha ** remove AlphaComponent annotations ** remove mima excludes for spark.ml ** mark concrete classes final wherever reasonable h2. Algorithms and performance
[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument
[ https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548203#comment-14548203 ] Shaik Idris Ali commented on SPARK-7706: Hi, [~srowen], Thanks for the quick response; sorry, I did not get that. Basically, the way actions are launched in Oozie or any other scheduler is from a Java program, which takes the main class and a bunch of arguments to that class. Ex: org.apache.spark.deploy.SparkSubmit.main(args); and we are not required to set anything in system ENV variables. Link to Oozie code: https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org.apache.oozie.action.hadoop/SparkMain.java#L104 Allow setting YARN_CONF_DIR from spark argument --- Key: SPARK-7706 URL: https://issues.apache.org/jira/browse/SPARK-7706 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.1 Reporter: Shaik Idris Ali Labels: oozie, yarn Currently in SparkSubmitArguments.scala when master is set to yarn (yarn-cluster mode) https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236 Spark checks if YARN_CONF_DIR or HADOOP_CONF_DIR is set in EVN. However we should additionally allow passing YARN_CONF_DIR from command line argument this is particularly handy when Spark is being launched from schedulers like OOZIE or FALCON. Reason being, oozie launcher App starts in one of the container assigned by Yarn RM and we do not want to set YARN_CONF_DIR in ENV for all the nodes in cluster. Just passing the argument like -yarnconfdir with conf dir (ex: /etc/hadoop/conf) should avoid setting the ENV variable. This is blocking us to onboard spark from oozie or falcon. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3334) Spark causes mesos-master memory leak
[ https://issues.apache.org/jira/browse/SPARK-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548220#comment-14548220 ] Iven Hsu commented on SPARK-3334: - I'm not using Spark currently, you can close it as you wish. Spark causes mesos-master memory leak - Key: SPARK-3334 URL: https://issues.apache.org/jira/browse/SPARK-3334 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.2 Environment: Mesos 0.16.0/0.19.0 CentOS 6.4 Reporter: Iven Hsu The {{akkaFrameSize}} is set to {{Long.MaxValue}} in MesosBackend to workaround SPARK-1112, this causes all serialized task result is sent using Mesos TaskStatus. mesos-master stores TaskStatus in memory, and when running Spark, its memory grows very fast, and will be OOM killed. See MESOS-1746 for more. I've tried to set {{akkaFrameSize}} to 0, mesos-master won't be killed, however, the driver will block after success unless I use {{sc.stop()}} to quit it manually. Not sure if it's related to SPARK-1112. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7110) when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication
[ https://issues.apache.org/jira/browse/SPARK-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548248#comment-14548248 ] Thomas Graves commented on SPARK-7110: -- Are you using Spark 1.1.0 as reported in the jira? If so, then this is probably https://issues.apache.org/jira/browse/SPARK-3778, which was fixed in Spark 1.3. Can you try the newer version? Otherwise you could try patching 1.1. It's calling into org.apache.spark.rdd.NewHadoopRDD.getPartitions, which ends up only calling into org.apache.hadoop.fs.FileSystem.addDelegationTokens if the tokens aren't already present. Since that is a NewHadoopRDD instance, it should have already populated them at that point. That is why I'm thinking SPARK-3778 might be the issue. Do you have a snippet of the code where you are creating the NewHadoopRDD? Are you using newAPIHadoopFile or newAPIHadoopRDD, for instance? when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication -- Key: SPARK-7110 URL: https://issues.apache.org/jira/browse/SPARK-7110 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: gu-chi Assignee: Sean Owen Under yarn-client mode, this issue occurs randomly. The authentication method is set to Kerberos, and saveAsNewAPIHadoopFile in PairRDDFunctions is used to save data to HDFS; the exception then comes as: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
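The question above is about how the NewHadoopRDD gets created. A minimal sketch of the two calls being asked about, assuming an existing SparkContext sc, placeholder paths, and the standard Hadoop text formats (none of this is taken from the reporter's job):
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.SparkContext._   // pair-RDD implicits, needed on 1.1.x

// Reading through the new Hadoop API creates a NewHadoopRDD under the hood...
val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///tmp/in")

// ...and saveAsNewAPIHadoopFile is the PairRDDFunctions method that then needs
// the delegation tokens discussed in this thread.
rdd.saveAsNewAPIHadoopFile[TextOutputFormat[LongWritable, Text]]("hdfs:///tmp/out")
{code}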
[jira] [Assigned] (SPARK-3251) Clarify learning interfaces
[ https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3251: --- Assignee: (was: Apache Spark) Clarify learning interfaces Key: SPARK-3251 URL: https://issues.apache.org/jira/browse/SPARK-3251 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0, 1.1.1 Reporter: Christoph Sawade *Make threshold mandatory* Currently, the output of predict for an example is either the score or the class. This side effect is caused by clearThreshold. To clarify that behaviour, three different types of predict (predictScore, predictClass, predictProbability) were introduced; the threshold is no longer optional. *Clarify classification interfaces* Currently, some functionality is spread over multiple models. In order to clarify the structure and simplify the implementation of more complex models (like multinomial logistic regression), two new classes are introduced: - BinaryClassificationModel: for all models that derive a binary classification from a single weight vector. It comprises the thresholding functionality to derive a prediction from a score. It basically captures SVMModel and LogisticRegressionModel. - ProbabilisticClassificationModel: this trait defines the interface for models that return a calibrated confidence score (aka probability). *Misc* - some renaming - add a test for probabilistic output -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3251) Clarify learning interfaces
[ https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3251: --- Assignee: Apache Spark Clarify learning interfaces Key: SPARK-3251 URL: https://issues.apache.org/jira/browse/SPARK-3251 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0, 1.1.1 Reporter: Christoph Sawade Assignee: Apache Spark *Make threshold mandatory* Currently, the output of predict for an example is either the score or the class. This side effect is caused by clearThreshold. To clarify that behaviour, three different types of predict (predictScore, predictClass, predictProbability) were introduced; the threshold is no longer optional. *Clarify classification interfaces* Currently, some functionality is spread over multiple models. In order to clarify the structure and simplify the implementation of more complex models (like multinomial logistic regression), two new classes are introduced: - BinaryClassificationModel: for all models that derive a binary classification from a single weight vector. It comprises the thresholding functionality to derive a prediction from a score. It basically captures SVMModel and LogisticRegressionModel. - ProbabilisticClassificationModel: this trait defines the interface for models that return a calibrated confidence score (aka probability). *Misc* - some renaming - add a test for probabilistic output -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
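A hedged Scala sketch of the interface split described in this ticket; the trait and method names follow the description (predictScore, predictClass, predictProbability), but the exact shapes are illustrative, not the actual MLlib code:
{code}
import org.apache.spark.mllib.linalg.Vector

trait BinaryClassificationModel {
  def threshold: Double                        // mandatory; no clearThreshold side effect
  def predictScore(features: Vector): Double   // raw score / margin
  def predictClass(features: Vector): Double   // 0.0 or 1.0 after thresholding
}

trait ProbabilisticClassificationModel extends BinaryClassificationModel {
  def predictProbability(features: Vector): Double  // calibrated confidence score
}
{code}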
[jira] [Resolved] (SPARK-4962) Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period
[ https://issues.apache.org/jira/browse/SPARK-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4962. -- Resolution: Won't Fix Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period --- Key: SPARK-4962 URL: https://issues.apache.org/jira/browse/SPARK-4962 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.0.0 Reporter: YanTang Zhai Priority: Minor When SparkContext object is instantiated, TaskScheduler is started and some resources are allocated from cluster. However, these resources may be not used for the moment. For example, DAGScheduler.JobSubmitted is processing and so on. These resources are wasted in this period. Thus, we want to put TaskScheduler.start back to shorten cluster resources occupation period specially for busy cluster. TaskScheduler could be started just before running stages. We could analyse and compare the resources occupation period before and after optimization. TaskScheduler.start execution time: [time1__] DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_] HadoopRDD.getPartitions execution time: [time3___] Stages execution time: [time4_] The cluster resources occupation period before optimization is [time2_][time3___][time4_]. The cluster resources occupation period after optimization is[time3___][time4_]. In summary, the cluster resources occupation period after optimization is less than before. If HadoopRDD.getPartitions could be put forward (SPARK-4961), the period may be shorten more which is [time4_]. The resources saving is important for busy cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4962) Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period
[ https://issues.apache.org/jira/browse/SPARK-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4962: --- Assignee: (was: Apache Spark) Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period --- Key: SPARK-4962 URL: https://issues.apache.org/jira/browse/SPARK-4962 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.0.0 Reporter: YanTang Zhai Priority: Minor When SparkContext object is instantiated, TaskScheduler is started and some resources are allocated from cluster. However, these resources may be not used for the moment. For example, DAGScheduler.JobSubmitted is processing and so on. These resources are wasted in this period. Thus, we want to put TaskScheduler.start back to shorten cluster resources occupation period specially for busy cluster. TaskScheduler could be started just before running stages. We could analyse and compare the resources occupation period before and after optimization. TaskScheduler.start execution time: [time1__] DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_] HadoopRDD.getPartitions execution time: [time3___] Stages execution time: [time4_] The cluster resources occupation period before optimization is [time2_][time3___][time4_]. The cluster resources occupation period after optimization is[time3___][time4_]. In summary, the cluster resources occupation period after optimization is less than before. If HadoopRDD.getPartitions could be put forward (SPARK-4961), the period may be shorten more which is [time4_]. The resources saving is important for busy cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4962) Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period
[ https://issues.apache.org/jira/browse/SPARK-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4962: --- Assignee: Apache Spark Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period --- Key: SPARK-4962 URL: https://issues.apache.org/jira/browse/SPARK-4962 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.0.0 Reporter: YanTang Zhai Assignee: Apache Spark Priority: Minor When SparkContext object is instantiated, TaskScheduler is started and some resources are allocated from cluster. However, these resources may be not used for the moment. For example, DAGScheduler.JobSubmitted is processing and so on. These resources are wasted in this period. Thus, we want to put TaskScheduler.start back to shorten cluster resources occupation period specially for busy cluster. TaskScheduler could be started just before running stages. We could analyse and compare the resources occupation period before and after optimization. TaskScheduler.start execution time: [time1__] DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_] HadoopRDD.getPartitions execution time: [time3___] Stages execution time: [time4_] The cluster resources occupation period before optimization is [time2_][time3___][time4_]. The cluster resources occupation period after optimization is[time3___][time4_]. In summary, the cluster resources occupation period after optimization is less than before. If HadoopRDD.getPartitions could be put forward (SPARK-4961), the period may be shorten more which is [time4_]. The resources saving is important for busy cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7710) User guide and example code for math/stat functions in DataFrames
Xiangrui Meng created SPARK-7710: Summary: User guide and example code for math/stat functions in DataFrames Key: SPARK-7710 URL: https://issues.apache.org/jira/browse/SPARK-7710 Project: Spark Issue Type: Documentation Components: Documentation, SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Burak Yavuz Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
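For reference, a small hedged example of the kind of DataFrame math/stat functions the requested guide would cover, based on the 1.4 API; column names are made up, and a spark-shell style sc and sqlContext are assumed:
{code}
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = sc.parallelize(1 to 100).map(i => (i.toDouble, i * 2.0)).toDF("x", "y")

df.select(sqrt($"x"), abs($"y" - 100)).show()  // math functions on columns
val corr = df.stat.corr("x", "y")              // Pearson correlation
val cov  = df.stat.cov("x", "y")               // sample covariance
df.stat.freqItems(Seq("x")).show()             // frequent-items sketch
{code}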
[jira] [Assigned] (SPARK-5316) DAGScheduler may make shuffleToMapStage leak if getParentStages fails
[ https://issues.apache.org/jira/browse/SPARK-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5316: --- Assignee: (was: Apache Spark) DAGScheduler may make shuffleToMapStage leak if getParentStages fails -- Key: SPARK-5316 URL: https://issues.apache.org/jira/browse/SPARK-5316 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.0.0 Reporter: YanTang Zhai DAGScheduler may make shuffleToMapStage leak if getParentStages fails. If getParentStages throws an exception, for example because an input path does not exist, DAGScheduler fails to handle the job submission, but shuffleToMapStage may already have had some records put into it during getParentStages. These records in shuffleToMapStage are never cleaned up. A simple job as follows: {code:java} val inputFile1 = ... // Input path does not exist when this job submits val inputFile2 = ... val outputFile = ... val conf = new SparkConf() val sc = new SparkContext(conf) val rdd1 = sc.textFile(inputFile1) .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _, 1) val rdd2 = sc.textFile(inputFile2) .flatMap(line => line.split(",")) .map(word => (word, 1)) .reduceByKey(_ + _, 1) try { val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1) rdd3.saveAsTextFile(outputFile) } catch { case e: Exception => logError(e) } // print the information of DAGScheduler's shuffleToMapStage to check // whether it still has uncleaned records. ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5316) DAGScheduler may make shuffleToMapStage leak if getParentStages fails
[ https://issues.apache.org/jira/browse/SPARK-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5316: --- Assignee: Apache Spark DAGScheduler may make shuffleToMapStage leak if getParentStages fails -- Key: SPARK-5316 URL: https://issues.apache.org/jira/browse/SPARK-5316 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.0.0 Reporter: YanTang Zhai Assignee: Apache Spark DAGScheduler may make shuffleToMapStage leak if getParentStages fails. If getParentStages throws an exception, for example because an input path does not exist, DAGScheduler fails to handle the job submission, but shuffleToMapStage may already have had some records put into it during getParentStages. These records in shuffleToMapStage are never cleaned up. A simple job as follows: {code:java} val inputFile1 = ... // Input path does not exist when this job submits val inputFile2 = ... val outputFile = ... val conf = new SparkConf() val sc = new SparkContext(conf) val rdd1 = sc.textFile(inputFile1) .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _, 1) val rdd2 = sc.textFile(inputFile2) .flatMap(line => line.split(",")) .map(word => (word, 1)) .reduceByKey(_ + _, 1) try { val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1) rdd3.saveAsTextFile(outputFile) } catch { case e: Exception => logError(e) } // print the information of DAGScheduler's shuffleToMapStage to check // whether it still has uncleaned records. ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6888) Make DriverQuirks editable
[ https://issues.apache.org/jira/browse/SPARK-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6888. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request [https://github.com/apache/spark/pull/] Make DriverQuirks editable -- Key: SPARK-6888 URL: https://issues.apache.org/jira/browse/SPARK-6888 Project: Spark Issue Type: Improvement Components: SQL Reporter: Rene Treffer Priority: Minor Fix For: 1.4.0 JDBC type conversion is currently handled by spark with the help of DriverQuirks (org.apache.spark.sql.jdbc.DriverQuirks). However some cases can't be resolved, e.g. MySQL BIGINT UNSIGNED. (other UNSIGNED conversions won't work either but could be resolved automatically by using the next larger type) An invalid type conversion (e.g. loading an unsigned bigint with the highest bit set as a long value) causes the jdbc driver to throw an exception. The target type is determined automatically and bound to the resulting DataFrame where it's immutable. Alternative solutions: - Subqueries. Produce extra load on the server - SQLContext / jdbc methods with schema support - Making it possible to change the schema of data frames -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548433#comment-14548433 ] Josh Rosen commented on SPARK-4105: --- I just noticed something interesting, although perhaps it's a red herring: in most of the stacktraces posted here, it looks like shuffled data is being processed with CoGroupedRDD. In most (all?) of these cases, it looks like the error is occurring when inserting an iterator of values into an external append-only map inside of CoGroupedRDD. https://github.com/apache/spark/pull/1607 was one of the last PRs to touch this file in 1.1, so I wonder whether there could be some sort of odd corner-case that we're hitting there; might be a lead worth exploring further. FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Attachments: JavaObjectToSerialize.java, SparkFailedToUncompressGenerator.scala We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at 
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at
[jira] [Assigned] (SPARK-4991) Worker should reconnect to Master when Master actor restart
[ https://issues.apache.org/jira/browse/SPARK-4991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4991: --- Assignee: Apache Spark Worker should reconnect to Master when Master actor restart --- Key: SPARK-4991 URL: https://issues.apache.org/jira/browse/SPARK-4991 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 1.0.0, 1.1.0, 1.2.0 Reporter: Zhang, Liye Assignee: Apache Spark This is a follow-up JIRA to [SPARK-4989|https://issues.apache.org/jira/browse/SPARK-4989]. When the Master akka actor encounters an exception, the Master will restart (an akka actor restart, not a JVM restart), and all old information is cleared on the Master (including workers, applications, etc.). However, the workers are not aware of this at all. The state of the cluster is then: the master is up and all workers are also up, but the master is not aware of the existence of the workers and will ignore all workers' heartbeats because the workers are not registered, so the whole cluster is unavailable. For more information about this, please refer to [SPARK-3736|https://issues.apache.org/jira/browse/SPARK-3736] and [SPARK-4592|https://issues.apache.org/jira/browse/SPARK-4592] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4991) Worker should reconnect to Master when Master actor restart
[ https://issues.apache.org/jira/browse/SPARK-4991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4991: --- Assignee: (was: Apache Spark) Worker should reconnect to Master when Master actor restart --- Key: SPARK-4991 URL: https://issues.apache.org/jira/browse/SPARK-4991 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 1.0.0, 1.1.0, 1.2.0 Reporter: Zhang, Liye This is a follow-up JIRA to [SPARK-4989|https://issues.apache.org/jira/browse/SPARK-4989]. When the Master akka actor encounters an exception, the Master will restart (an akka actor restart, not a JVM restart), and all old information is cleared on the Master (including workers, applications, etc.). However, the workers are not aware of this at all. The state of the cluster is then: the master is up and all workers are also up, but the master is not aware of the existence of the workers and will ignore all workers' heartbeats because the workers are not registered, so the whole cluster is unavailable. For more information about this, please refer to [SPARK-3736|https://issues.apache.org/jira/browse/SPARK-3736] and [SPARK-4592|https://issues.apache.org/jira/browse/SPARK-4592] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3219: --- Assignee: Apache Spark (was: Derrick Burns) K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Apache Spark Labels: clustering The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
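A hedged Scala sketch of what a pluggable distance function could look like, with squared Euclidean as the classic Bregman special case; this is illustrative only and not the patch referenced in the ticket:
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

trait BregmanDivergence extends Serializable {
  def divergence(x: Vector, y: Vector): Double
}

// Squared Euclidean distance is the Bregman divergence of the squared norm,
// which recovers the standard K-Means objective.
object SquaredEuclidean extends BregmanDivergence {
  def divergence(x: Vector, y: Vector): Double = Vectors.sqdist(x, y)
}
{code}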
[jira] [Assigned] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3261: --- Assignee: Apache Spark (was: Derrick Burns) KMeans clusterer can return duplicate cluster centers - Key: SPARK-3261 URL: https://issues.apache.org/jira/browse/SPARK-3261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns Assignee: Apache Spark Labels: clustering This is a bad design choice. I think that it is preferable to produce no duplicate cluster centers. So instead of forcing the number of clusters to be K, return at most K clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3218) K-Means clusterer can fail on degenerate data
[ https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3218: --- Assignee: Derrick Burns (was: Apache Spark) K-Means clusterer can fail on degenerate data - Key: SPARK-3218 URL: https://issues.apache.org/jira/browse/SPARK-3218 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns Assignee: Derrick Burns Priority: Minor Labels: clustering The KMeans parallel implementation selects points to be cluster centers with probability weighted by their distance to cluster centers. However, if there are fewer than k DISTINCT points in the data set, this approach will fail. Further, the recent checkin to work around this problem results in selection of the same point repeatedly as a cluster center. The fix is to allow fewer than k cluster centers to be selected. This requires several changes to the code, as the number of cluster centers is woven into the implementation. I have a version of the code that addresses this problem, AND generalizes the distance metric. However, I see that there are literally hundreds of outstanding pull requests. If someone will commit to working with me to sponsor the pull request, I will create it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3261: --- Assignee: Derrick Burns (was: Apache Spark) KMeans clusterer can return duplicate cluster centers - Key: SPARK-3261 URL: https://issues.apache.org/jira/browse/SPARK-3261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns Assignee: Derrick Burns Labels: clustering This is a bad design choice. I think that it is preferable to produce no duplicate cluster centers. So instead of forcing the number of clusters to be K, return at most K clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3219: --- Assignee: Derrick Burns (was: Apache Spark) K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns Labels: clustering The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3218) K-Means clusterer can fail on degenerate data
[ https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3218: --- Assignee: Apache Spark (was: Derrick Burns) K-Means clusterer can fail on degenerate data - Key: SPARK-3218 URL: https://issues.apache.org/jira/browse/SPARK-3218 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns Assignee: Apache Spark Priority: Minor Labels: clustering The KMeans parallel implementation selects points to be cluster centers with probability weighted by their distance to cluster centers. However, if there are fewer than k DISTINCT points in the data set, this approach will fail. Further, the recent checkin to work around this problem results in selection of the same point repeatedly as a cluster center. The fix is to allow fewer than k cluster centers to be selected. This requires several changes to the code, as the number of cluster centers is woven into the implementation. I have a version of the code that addresses this problem, AND generalizes the distance metric. However, I see that there are literally hundreds of outstanding pull requests. If someone will commit to working with me to sponsor the pull request, I will create it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4094) checkpoint should still be available after rdd actions
[ https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4094. -- Resolution: Won't Fix checkpoint should still be available after rdd actions -- Key: SPARK-4094 URL: https://issues.apache.org/jira/browse/SPARK-4094 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Zhang, Liye Assignee: Zhang, Liye rdd.checkpoint() must be called before any action on the RDD; if any other action was run before it, the checkpoint will never succeed. Take the following code as an example: *rdd = sc.makeRDD(...)* *rdd.collect()* *rdd.checkpoint()* *rdd.count()* This RDD will never be checkpointed. Algorithms with many iterations have a problem here. Graph algorithms, for example, run many iterations, which makes the RDD lineage very long, so the RDD may need a checkpoint after a certain number of iterations. And if there is also an action inside the iteration loop, the checkpoint() operation will never work for the iterations that come after the one calling the action. This does not happen with RDD caching: cache() always takes effect before subsequent actions, no matter whether any action was called before cache(). So rdd.checkpoint() should behave the same way as rdd.cache(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
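A minimal sketch of the ordering constraint described above, assuming a checkpoint directory has been configured; the comments restate the ticket's claim rather than asserting current behaviour:
{code}
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val rdd = sc.makeRDD(1 to 100)
rdd.checkpoint()   // requested before any action: the checkpoint is written
rdd.count()        // first action triggers the job and materialises the checkpoint

val rdd2 = sc.makeRDD(1 to 100)
rdd2.collect()     // an action runs first...
rdd2.checkpoint()  // ...so, per this ticket, the later request never takes effect
rdd2.count()
{code}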
[jira] [Assigned] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4630: --- Assignee: Apache Spark (was: Kostas Sakellis) Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Apache Spark Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions to the number of tasks - larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large partitions should be that get executed by a task. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to gc pressure and spilling to disk. To increase performance of jobs, users often hand optimize the number(size) of partitions that the next stage gets. Factors that come into play are: Incoming partition sizes from previous stage number of available executors available memory per executor (taking into account spark.shuffle.memoryFraction) Spark has access to this data and so should be able to automatically do the partition sizing for the user. This feature can be turned off/on with a configuration option. To make this happen, we propose modifying the DAGScheduler to take into account partition sizes upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4630: --- Assignee: Kostas Sakellis (was: Apache Spark) Dynamically determine optimal number of partitions -- Key: SPARK-4630 URL: https://issues.apache.org/jira/browse/SPARK-4630 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions to the number of tasks - larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large partitions should be that get executed by a task. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to gc pressure and spilling to disk. To increase performance of jobs, users often hand optimize the number(size) of partitions that the next stage gets. Factors that come into play are: Incoming partition sizes from previous stage number of available executors available memory per executor (taking into account spark.shuffle.memoryFraction) Spark has access to this data and so should be able to automatically do the partition sizing for the user. This feature can be turned off/on with a configuration option. To make this happen, we propose modifying the DAGScheduler to take into account partition sizes upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number tasks to create. Since this change requires non-trivial modifications to the DAGScheduler, a detailed design doc will be attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
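The hand-tuning this ticket wants to automate typically looks like the hedged snippet below, where the shuffle partition count is picked manually per stage (the path and the number 200 are arbitrary):
{code}
val counts = sc.textFile("hdfs:///tmp/input")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 200)  // partition count chosen by hand today
{code}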
[jira] [Created] (SPARK-7711) startTime() is missing
Sam Steingold created SPARK-7711: Summary: startTime() is missing Key: SPARK-7711 URL: https://issues.apache.org/jira/browse/SPARK-7711 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Reporter: Sam Steingold In PySpark 1.3, there appears to be no [startTime|https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#startTime%28%29]: {code} sc.startTime Traceback (most recent call last): File stdin, line 1, in module AttributeError: 'SparkContext' object has no attribute 'startTime' {code} Scala has the method: {code} scala sc.startTime res1: Long = 1431974499272 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7712) Native Spark Window Functions Performance Improvements
Herman van Hovell tot Westerflier created SPARK-7712: Summary: Native Spark Window Functions Performance Improvements Key: SPARK-7712 URL: https://issues.apache.org/jira/browse/SPARK-7712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Herman van Hovell tot Westerflier Fix For: 1.5.0 Hi All, After playing with the current Spark window implementation, I tried to take this to the next level. My main goal is/was to address the following issues: - Native Spark SQL: the current implementation relies only on Hive UDAFs. The improved implementation uses Spark SQL aggregates; Hive UDAFs are still supported though. - Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. - Increased optimization opportunities. AggregateEvaluation-style optimization should be possible for in-frame processing. Tungsten might also provide interesting optimization opportunities. The current work is available at the following location: https://github.com/hvanhovell/spark-window I will try to turn this into a PR in the next couple of days. Meanwhile, comments, feedback and other discussion are much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
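For readers unfamiliar with the frame types mentioned above, a hedged example of a running window (UNBOUNDED PRECEDING to CURRENT ROW), written as SQL submitted through a HiveContext since the current implementation goes through Hive UDAFs; the table and column names are made up:
{code}
val running = hiveContext.sql("""
  SELECT key, value,
         SUM(value) OVER (PARTITION BY key ORDER BY ts
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum
  FROM events
""")
running.show()
{code}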
[jira] [Resolved] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2883. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6194 [https://github.com/apache/spark/pull/6194] Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Priority: Critical Fix For: 1.4.0 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in spark, fix issues if exists and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
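A hedged sketch of the ORC usage this resolution enables, assuming the data source is reachable under the short name orc when using a HiveContext (if the short name is not registered in a given build, the fully qualified data source class would be needed instead); paths and the df DataFrame are placeholders:
{code}
// Write a DataFrame as ORC and read it back through the data source API.
df.write.format("orc").save("/tmp/people_orc")
val people = hiveContext.read.format("orc").load("/tmp/people_orc")
people.printSchema()
{code}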
[jira] [Closed] (SPARK-7327) DataFrame show() method doesn't like empty dataframes
[ https://issues.apache.org/jira/browse/SPARK-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Girardot closed SPARK-7327. --- Resolution: Cannot Reproduce I can't seem to reproduce the issue and I did not attach any repro step... Sorry I'm closing this as Cannot reproduce until I find back my failing example. Thanks for your time. DataFrame show() method doesn't like empty dataframes - Key: SPARK-7327 URL: https://issues.apache.org/jira/browse/SPARK-7327 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Olivier Girardot Priority: Minor For an empty DataFrame (for exemple after a filter) any call to show() ends up with : {code} java.util.MissingFormatWidthException: -0s at java.util.Formatter$FormatSpecifier.checkGeneral(Formatter.java:2906) at java.util.Formatter$FormatSpecifier.init(Formatter.java:2680) at java.util.Formatter.parse(Formatter.java:2528) at java.util.Formatter.format(Formatter.java:2469) at java.util.Formatter.format(Formatter.java:2423) at java.lang.String.format(String.java:2790) at org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:200) at org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:199) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:199) at org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:198) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:198) at org.apache.spark.sql.DataFrame.show(DataFrame.scala:314) at org.apache.spark.sql.DataFrame.show(DataFrame.scala:320) {code} If no-one takes it by next friday, I'll fix it, the problem seems to come from the colWidths method : {code} // Compute the width of each column val colWidths = Array.fill(numCols)(0) for (row - rows) { for ((cell, i) - row.zipWithIndex) { colWidths(i) = math.max(colWidths(i), cell.length) } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7711) startTime() is missing
[ https://issues.apache.org/jira/browse/SPARK-7711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7711: - Priority: Minor (was: Major) startTime() is missing -- Key: SPARK-7711 URL: https://issues.apache.org/jira/browse/SPARK-7711 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Reporter: Sam Steingold Priority: Minor In PySpark 1.3, there appears to be no [startTime|https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#startTime%28%29]: {code} sc.startTime Traceback (most recent call last): File stdin, line 1, in module AttributeError: 'SparkContext' object has no attribute 'startTime' {code} Scala has the method: {code} scala sc.startTime res1: Long = 1431974499272 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers
[ https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548577#comment-14548577 ] Joseph K. Bradley commented on SPARK-7690: -- +1 We should also check for implementations in other packages to find out what arguments and argument names (including micro and macro) are most common. Perhaps R, Weka, etc. MulticlassClassificationEvaluator for tuning Multiclass Classifiers --- Key: SPARK-7690 URL: https://issues.apache.org/jira/browse/SPARK-7690 Project: Spark Issue Type: Improvement Components: ML Reporter: Ram Sriharsha Assignee: Ram Sriharsha Provide a MulticlassClassificationEvaluator with weighted F1-score to tune multiclass classifiers using Pipeline API. MLLib already provides a MulticlassMetrics functionality which can be wrapped around a MulticlassClassificationEvaluator to expose weighted F1-score as metric. The functionality could be similar to scikit(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) in that we can support micro, macro and weighted versions of the F1-score (with weighted being default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
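A hedged sketch of the wrapping idea described above, using the existing MLlib MulticlassMetrics; predictionAndLabels is a placeholder RDD of (prediction, label) pairs, and the weighted F1 is what the proposed evaluator would expose by default:
{code}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD

def weightedF1(predictionAndLabels: RDD[(Double, Double)]): Double =
  new MulticlassMetrics(predictionAndLabels).weightedFMeasure
{code}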
[jira] [Assigned] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly
[ https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6785: --- Assignee: (was: Apache Spark) DateUtils can not handle date before 1970/01/01 correctly - Key: SPARK-6785 URL: https://issues.apache.org/jira/browse/SPARK-6785 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu {code} scala val d = new Date(100) d: java.sql.Date = 1969-12-31 scala DateUtils.toJavaDate(DateUtils.fromJavaDate(d)) res1: java.sql.Date = 1970-01-01 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly
[ https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548584#comment-14548584 ] Apache Spark commented on SPARK-6785: - User 'ckadner' has created a pull request for this issue: https://github.com/apache/spark/pull/6236 DateUtils can not handle date before 1970/01/01 correctly - Key: SPARK-6785 URL: https://issues.apache.org/jira/browse/SPARK-6785 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu {code} scala val d = new Date(100) d: java.sql.Date = 1969-12-31 scala DateUtils.toJavaDate(DateUtils.fromJavaDate(d)) res1: java.sql.Date = 1970-01-01 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7696) Aggregate function's result should be nullable only if the input expression is nullable
[ https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7696: --- Assignee: Apache Spark Aggregate function's result should be nullable only if the input expression is nullable --- Key: SPARK-7696 URL: https://issues.apache.org/jira/browse/SPARK-7696 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Haopu Wang Assignee: Apache Spark Priority: Minor In SparkSQL, the aggregate function's result currently is always nullable. It will make sense to change the behavior as: if the input expression is nullable, the result is nullable; Otherwise, the result is non-nullable. Please see the following discussion: From: Olivier Girardot [mailto:ssab...@gmail.com] Sent: Tuesday, May 12, 2015 5:12 AM To: Reynold Xin Cc: Haopu Wang; user Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable? I'll look into it - not sure yet what I can get out of exprs :p Le lun. 11 mai 2015 à 22:35, Reynold Xin r...@databricks.com a écrit : Thanks for catching this. I didn't read carefully enough. It'd make sense to have the udaf result be non-nullable, if the exprs are indeed non-nullable. On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot ssab...@gmail.com wrote: Hi Haopu, actually here `key` is nullable because this is your input's schema : scala result.printSchema root |-- key: string (nullable = true) |-- SUM(value): long (nullable = true) scala df.printSchema root |-- key: string (nullable = true) |-- value: long (nullable = false) I tried it with a schema where the key is not flagged as nullable, and the schema is actually respected. What you can argue however is that SUM(value) should also be not nullable since value is not nullable. @rxin do you think it would be reasonable to flag the Sum aggregation function as nullable (or not) depending on the input expression's schema ? Regards, Olivier. Le lun. 11 mai 2015 à 22:07, Reynold Xin r...@databricks.com a écrit : Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API. However, I find the result field of groupBy columns are always nullable even the source field is not nullable. I want to know if this is by design, thank you! Below is the simple code to show the issue. == import sqlContext.implicits._ import org.apache.spark.sql.functions._ case class Test(key: String, value: Long) val df = sc.makeRDD(Seq(Test(k1,2),Test(k1,1))).toDF val result = df.groupBy(key).agg($key, sum(value)) // From the output, you can see the key column is nullable, why?? result.printSchema //root // |-- key: string (nullable = true) // |-- SUM(value): long (nullable = true) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7696) Aggregate function's result should be nullable only if the input expression is nullable
[ https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548617#comment-14548617 ] Apache Spark commented on SPARK-7696: - User 'ogirardot' has created a pull request for this issue: https://github.com/apache/spark/pull/6237 Aggregate function's result should be nullable only if the input expression is nullable --- Key: SPARK-7696 URL: https://issues.apache.org/jira/browse/SPARK-7696 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Haopu Wang Priority: Minor In SparkSQL, the aggregate function's result currently is always nullable. It will make sense to change the behavior as: if the input expression is nullable, the result is nullable; Otherwise, the result is non-nullable. Please see the following discussion: From: Olivier Girardot [mailto:ssab...@gmail.com] Sent: Tuesday, May 12, 2015 5:12 AM To: Reynold Xin Cc: Haopu Wang; user Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable? I'll look into it - not sure yet what I can get out of exprs :p Le lun. 11 mai 2015 à 22:35, Reynold Xin r...@databricks.com a écrit : Thanks for catching this. I didn't read carefully enough. It'd make sense to have the udaf result be non-nullable, if the exprs are indeed non-nullable. On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot ssab...@gmail.com wrote: Hi Haopu, actually here `key` is nullable because this is your input's schema : scala result.printSchema root |-- key: string (nullable = true) |-- SUM(value): long (nullable = true) scala df.printSchema root |-- key: string (nullable = true) |-- value: long (nullable = false) I tried it with a schema where the key is not flagged as nullable, and the schema is actually respected. What you can argue however is that SUM(value) should also be not nullable since value is not nullable. @rxin do you think it would be reasonable to flag the Sum aggregation function as nullable (or not) depending on the input expression's schema ? Regards, Olivier. Le lun. 11 mai 2015 à 22:07, Reynold Xin r...@databricks.com a écrit : Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API. However, I find the result field of groupBy columns are always nullable even the source field is not nullable. I want to know if this is by design, thank you! Below is the simple code to show the issue. == import sqlContext.implicits._ import org.apache.spark.sql.functions._ case class Test(key: String, value: Long) val df = sc.makeRDD(Seq(Test(k1,2),Test(k1,1))).toDF val result = df.groupBy(key).agg($key, sum(value)) // From the output, you can see the key column is nullable, why?? result.printSchema //root // |-- key: string (nullable = true) // |-- SUM(value): long (nullable = true) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7696) Aggregate function's result should be nullable only if the input expression is nullable
[ https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7696: --- Assignee: (was: Apache Spark) Aggregate function's result should be nullable only if the input expression is nullable --- Key: SPARK-7696 URL: https://issues.apache.org/jira/browse/SPARK-7696 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Haopu Wang Priority: Minor In SparkSQL, the aggregate function's result currently is always nullable. It will make sense to change the behavior as: if the input expression is nullable, the result is nullable; Otherwise, the result is non-nullable. Please see the following discussion: From: Olivier Girardot [mailto:ssab...@gmail.com] Sent: Tuesday, May 12, 2015 5:12 AM To: Reynold Xin Cc: Haopu Wang; user Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable? I'll look into it - not sure yet what I can get out of exprs :p Le lun. 11 mai 2015 à 22:35, Reynold Xin r...@databricks.com a écrit : Thanks for catching this. I didn't read carefully enough. It'd make sense to have the udaf result be non-nullable, if the exprs are indeed non-nullable. On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot ssab...@gmail.com wrote: Hi Haopu, actually here `key` is nullable because this is your input's schema : scala result.printSchema root |-- key: string (nullable = true) |-- SUM(value): long (nullable = true) scala df.printSchema root |-- key: string (nullable = true) |-- value: long (nullable = false) I tried it with a schema where the key is not flagged as nullable, and the schema is actually respected. What you can argue however is that SUM(value) should also be not nullable since value is not nullable. @rxin do you think it would be reasonable to flag the Sum aggregation function as nullable (or not) depending on the input expression's schema ? Regards, Olivier. Le lun. 11 mai 2015 à 22:07, Reynold Xin r...@databricks.com a écrit : Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API. However, I find the result field of groupBy columns are always nullable even the source field is not nullable. I want to know if this is by design, thank you! Below is the simple code to show the issue. == import sqlContext.implicits._ import org.apache.spark.sql.functions._ case class Test(key: String, value: Long) val df = sc.makeRDD(Seq(Test(k1,2),Test(k1,1))).toDF val result = df.groupBy(key).agg($key, sum(value)) // From the output, you can see the key column is nullable, why?? result.printSchema //root // |-- key: string (nullable = true) // |-- SUM(value): long (nullable = true) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
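The snippet quoted in the thread above has lost some characters in the mail (string quotes and the $ column interpolators). A hedged reconstruction of what it most likely looked like, for readers who want to reproduce the nullability check:
{code}
import sqlContext.implicits._
import org.apache.spark.sql.functions._

case class Test(key: String, value: Long)
val df = sc.makeRDD(Seq(Test("k1", 2), Test("k1", 1))).toDF()
val result = df.groupBy("key").agg($"key", sum("value"))
result.printSchema()
// root
//  |-- key: string (nullable = true)
//  |-- SUM(value): long (nullable = true)  <- the ticket argues this should be false
{code}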
[jira] [Resolved] (SPARK-7570) Ignore _temporary folders during partition discovery
[ https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7570. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6091 [https://github.com/apache/spark/pull/6091] Ignore _temporary folders during partition discovery Key: SPARK-7570 URL: https://issues.apache.org/jira/browse/SPARK-7570 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical Fix For: 1.4.0 When speculation is turned on, directories named {{_temporary}} may be left in data directories after saving a DataFrame. These directories should be ignored. Currently they simply fail partition discovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
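A minimal sketch of the fix direction, not Spark's actual implementation: when recursively listing files for partition discovery, skip {{_temporary}} (and other hidden) entries so that leftovers from speculative tasks cannot break the scan. The PartitionFileLister name and the skip rule below are illustrative assumptions.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object PartitionFileLister {
  // _temporary directories (left behind by speculative tasks) and other hidden
  // entries should not take part in partition discovery.
  private def shouldSkip(name: String): Boolean =
    name.startsWith("_") || name.startsWith(".")

  def listLeafFiles(fs: FileSystem, dir: Path): Seq[FileStatus] =
    fs.listStatus(dir).toSeq.flatMap { status =>
      if (shouldSkip(status.getPath.getName)) {
        Nil                                   // ignore _temporary and friends
      } else if (status.isDirectory) {
        listLeafFiles(fs, status.getPath)     // recurse into real partition dirs
      } else {
        Seq(status)
      }
    }
}

// Usage with a hypothetical table location:
// val fs = FileSystem.get(new Configuration())
// PartitionFileLister.listLeafFiles(fs, new Path("/data/my_table")).foreach(println)
{code}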
[jira] [Resolved] (SPARK-7380) Python: Transformer/Estimator should be copyable
[ https://issues.apache.org/jira/browse/SPARK-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7380. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6088 [https://github.com/apache/spark/pull/6088] Python: Transformer/Estimator should be copyable Key: SPARK-7380 URL: https://issues.apache.org/jira/browse/SPARK-7380 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.4.0 Same as [SPARK-5956] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7631) treenode argString should not print children
[ https://issues.apache.org/jira/browse/SPARK-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7631. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6144 [https://github.com/apache/spark/pull/6144] treenode argString should not print children Key: SPARK-7631 URL: https://issues.apache.org/jira/browse/SPARK-7631 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Fei Wang Priority: Minor Fix For: 1.4.0 spark-sql explain extended select * from ( select key from src union all select key from src) t; the spark plan will print children in argString == Physical Plan == Union[ HiveTableScan [key#1], (MetastoreRelation default, src, None), None, HiveTableScan [key#3], (MetastoreRelation default, src, None), None] HiveTableScan [key#1], (MetastoreRelation default, src, None), None HiveTableScan [key#3], (MetastoreRelation default, src, None), None -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
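The duplication above happens because a node's constructor arguments can include its children, and both the per-line tree rendering and argString print them. The following simplified, self-contained sketch shows the idea of filtering child nodes out of argString; Node, argString and treeString here are illustrative stand-ins, not Catalyst's TreeNode API.
{code}
// Simplified stand-in for a plan node: some constructor args may themselves be children.
case class Node(name: String, args: Seq[Any], children: Seq[Node]) {

  // Print only the arguments that are not children; children get their own tree lines.
  def argString: String =
    args.filterNot {
      case n: Node   => children.contains(n)
      case s: Seq[_] => s.nonEmpty && s.forall(x => children.contains(x)) // e.g. Union(children)
      case _         => false
    }.mkString(", ")

  def treeString(indent: Int = 0): String =
    (" " * indent) + (name + " " + argString).trim + "\n" +
      children.map(_.treeString(indent + 2)).mkString
}

object ArgStringDemo extends App {
  val scan1 = Node("HiveTableScan", Seq("[key#1]"), Nil)
  val scan2 = Node("HiveTableScan", Seq("[key#3]"), Nil)
  val union = Node("Union", Seq(Seq(scan1, scan2)), Seq(scan1, scan2))
  // Union no longer repeats its scans inside its own argString.
  print(union.treeString())
}
{code}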
[jira] [Resolved] (SPARK-7269) Incorrect aggregation analysis
[ https://issues.apache.org/jira/browse/SPARK-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7269. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6173 [https://github.com/apache/spark/pull/6173] Incorrect aggregation analysis -- Key: SPARK-7269 URL: https://issues.apache.org/jira/browse/SPARK-7269 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Fix For: 1.4.0 In a case-insensitive analyzer (HiveContext), differences in attribute-name capitalization will fail the analysis check for aggregation. {code} test("check analysis failed in case in-sensitive") { Seq(1,2,3).map(i => (i, i.toString)).toDF("key", "value").registerTempTable("df_analysis") sql("SELECT kEy from df_analysis group by key") } {code} {noformat} expression 'kEy' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get.; org.apache.spark.sql.AnalysisException: expression 'kEy' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get.; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:39) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:85) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:101) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:89) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:39) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1121) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:97) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply$mcV$sp(SQLQuerySuite.scala:408) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
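A minimal sketch of the fix direction: the check that a selected attribute "is present in the group by" has to use the same name resolver the analyzer used (case-insensitive under HiveContext), not exact equality. Everything below (Resolver, isValidSelection, AggregateCheckSketch) is an illustrative stand-in rather than Catalyst's actual CheckAnalysis code.
{code}
object AggregateCheckSketch {
  type Resolver = (String, String) => Boolean
  val caseSensitive: Resolver   = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // A selected column is valid if it matches some grouping column under the resolver.
  def isValidSelection(selected: String, groupBy: Seq[String], resolver: Resolver): Boolean =
    groupBy.exists(g => resolver(selected, g))

  def main(args: Array[String]): Unit = {
    val groupBy = Seq("key")
    // The reported failure: "kEy" was resolved case-insensitively by the analyzer,
    // but the aggregate check still compared names case-sensitively.
    println(isValidSelection("kEy", groupBy, caseSensitive))   // false -> AnalysisException
    println(isValidSelection("kEy", groupBy, caseInsensitive)) // true  -> expected in HiveContext
  }
}
{code}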
[jira] [Assigned] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly
[ https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6785: --- Assignee: Apache Spark DateUtils can not handle date before 1970/01/01 correctly - Key: SPARK-6785 URL: https://issues.apache.org/jira/browse/SPARK-6785 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Apache Spark {code} scala val d = new Date(100) d: java.sql.Date = 1969-12-31 scala DateUtils.toJavaDate(DateUtils.fromJavaDate(d)) res1: java.sql.Date = 1970-01-01 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
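The root cause is plain integer arithmetic: converting epoch milliseconds to a day count with `/` truncates toward zero, so any instant just before 1970-01-01 lands on day 0 and round-trips to the wrong date. A minimal sketch of that arithmetic follows; time zone handling, which the real DateUtils also has to account for, is deliberately ignored, and Math.floorDiv assumes Java 8 (any floor-division implementation works).
{code}
object DateRoundTripSketch extends App {
  val MillisPerDay = 24L * 60 * 60 * 1000

  // Buggy for negative millis: -100 / 86400000 == 0 (truncation toward zero).
  def toDaysTruncating(millis: Long): Long = millis / MillisPerDay

  // Correct for pre-epoch instants: floor division maps -100 to day -1.
  def toDaysFloor(millis: Long): Long = Math.floorDiv(millis, MillisPerDay)

  val justBeforeEpoch = -100L                 // 1969-12-31 23:59:59.900 UTC
  println(toDaysTruncating(justBeforeEpoch))  // 0  -> round-trips to 1970-01-01
  println(toDaysFloor(justBeforeEpoch))       // -1 -> round-trips to 1969-12-31
}
{code}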
[jira] [Commented] (SPARK-7497) test_count_by_value_and_window is flaky
[ https://issues.apache.org/jira/browse/SPARK-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548630#comment-14548630 ] Apache Spark commented on SPARK-7497: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6239 test_count_by_value_and_window is flaky --- Key: SPARK-7497 URL: https://issues.apache.org/jira/browse/SPARK-7497 Project: Spark Issue Type: Bug Components: PySpark, Streaming Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Davies Liu Priority: Critical Labels: flaky-test Saw this test failure in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32268/console {code} == FAIL: test_count_by_value_and_window (__main__.WindowFunctionTests) -- Traceback (most recent call last): File pyspark/streaming/tests.py, line 418, in test_count_by_value_and_window self._test_func(input, func, expected) File pyspark/streaming/tests.py, line 133, in _test_func self.assertEqual(expected, result) AssertionError: Lists differ: [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]] != [[1], [2], [3], [4], [5], [6], [6], [6]] First list contains 2 additional elements. First extra element 8: [6] - [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]] ? -- + [[1], [2], [3], [4], [5], [6], [6], [6]] -- {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7497) test_count_by_value_and_window is flaky
[ https://issues.apache.org/jira/browse/SPARK-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7497: --- Assignee: Davies Liu (was: Apache Spark) test_count_by_value_and_window is flaky --- Key: SPARK-7497 URL: https://issues.apache.org/jira/browse/SPARK-7497 Project: Spark Issue Type: Bug Components: PySpark, Streaming Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Davies Liu Priority: Critical Labels: flaky-test Saw this test failure in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32268/console {code} == FAIL: test_count_by_value_and_window (__main__.WindowFunctionTests) -- Traceback (most recent call last): File pyspark/streaming/tests.py, line 418, in test_count_by_value_and_window self._test_func(input, func, expected) File pyspark/streaming/tests.py, line 133, in _test_func self.assertEqual(expected, result) AssertionError: Lists differ: [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]] != [[1], [2], [3], [4], [5], [6], [6], [6]] First list contains 2 additional elements. First extra element 8: [6] - [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]] ? -- + [[1], [2], [3], [4], [5], [6], [6], [6]] -- {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7497) test_count_by_value_and_window is flaky
[ https://issues.apache.org/jira/browse/SPARK-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7497: --- Assignee: Apache Spark (was: Davies Liu) test_count_by_value_and_window is flaky --- Key: SPARK-7497 URL: https://issues.apache.org/jira/browse/SPARK-7497 Project: Spark Issue Type: Bug Components: PySpark, Streaming Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark Priority: Critical Labels: flaky-test Saw this test failure in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32268/console {code} == FAIL: test_count_by_value_and_window (__main__.WindowFunctionTests) -- Traceback (most recent call last): File pyspark/streaming/tests.py, line 418, in test_count_by_value_and_window self._test_func(input, func, expected) File pyspark/streaming/tests.py, line 133, in _test_func self.assertEqual(expected, result) AssertionError: Lists differ: [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]] != [[1], [2], [3], [4], [5], [6], [6], [6]] First list contains 2 additional elements. First extra element 8: [6] - [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]] ? -- + [[1], [2], [3], [4], [5], [6], [6], [6]] -- {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7673) DataSourceStrategy's buildPartitionedTableScan always lists file status for all data files
[ https://issues.apache.org/jira/browse/SPARK-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7673. - Resolution: Fixed Fix Version/s: 1.4.0 Issue has been addressed by https://github.com/apache/spark/commit/9dadf019b93038e1e18336ccd06c5eecb4bae32f. DataSourceStrategy's buildPartitionedTableScan always lists file status for all data files --- Key: SPARK-7673 URL: https://issues.apache.org/jira/browse/SPARK-7673 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker Fix For: 1.4.0 See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala#L134-141 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7713) Use shared broadcast hadoop conf for partitioned table scan.
Yin Huai created SPARK-7713: --- Summary: Use shared broadcast hadoop conf for partitioned table scan. Key: SPARK-7713 URL: https://issues.apache.org/jira/browse/SPARK-7713 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker While debugging SPARK-7673, we also found that we are broadcasting a Hadoop conf for every partition (backed by a Hadoop RDD). This also causes a performance regression when compiling a query that involves a large number of partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
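A sketch of the intended change, under the assumption that the per-partition scans can accept an already-created broadcast: broadcast the Hadoop configuration once per table scan and reuse it for every partition, instead of broadcasting a new conf for each of potentially thousands of partitions. SerializableWritable is Spark's wrapper for Hadoop Writables (Configuration is not java.io.Serializable); the helper names below are illustrative.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.broadcast.Broadcast

object SharedHadoopConfSketch {
  // Broadcast the configuration once for the whole partitioned table scan...
  def broadcastConf(sc: SparkContext, conf: Configuration): Broadcast[SerializableWritable[Configuration]] =
    sc.broadcast(new SerializableWritable(conf))

  // ...and hand the same broadcast to every per-partition RDD instead of
  // re-broadcasting a conf for each partition.
  def confForPartition(shared: Broadcast[SerializableWritable[Configuration]]): Configuration =
    shared.value.value
}
{code}
The win is twofold: fewer broadcast variables for the driver to track, and less work when compiling a query over a table with many partitions.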
[jira] [Resolved] (SPARK-6216) Check Python version in worker before running PySpark job
[ https://issues.apache.org/jira/browse/SPARK-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6216. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6203 [https://github.com/apache/spark/pull/6203] Check Python version in worker before running PySpark job - Key: SPARK-6216 URL: https://issues.apache.org/jira/browse/SPARK-6216 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.4.0 PySpark can only run when the driver and the worker use the same major Python version (both 2.6 or both 2.7); it will cause random errors if the driver has 2.7 and the worker has 2.6, or vice versa. For example: {code} davies@localhost:~/work/spark$ PYSPARK_PYTHON=python2.6 PYSPARK_DRIVER_PYTHON=python2.7 bin/pyspark Using Python version 2.7.7 (default, Jun 2 2014 12:48:16) SparkContext available as sc, SQLContext available as sqlCtx. sc.textFile('LICENSE').map(lambda l: l.split()).count() org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/davies/work/spark/python/pyspark/worker.py, line 101, in main process() File /Users/davies/work/spark/python/pyspark/worker.py, line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 2251, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/davies/work/spark/python/pyspark/rdd.py, line 281, in func return f(iterator) File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/davies/work/spark/python/pyspark/rdd.py, line 931, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File stdin, line 1, in lambda TypeError: 'bool' object is not callable at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:177) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
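The actual fix lives on the Python worker side, but the idea is simply an up-front major.minor comparison that fails fast with a clear message instead of letting mismatched pickled closures blow up with errors like the TypeError above. A language-neutral sketch of that check, written here in Scala with illustrative names:
{code}
object PythonVersionCheckSketch {
  private def majorMinor(version: String): String =
    version.split('.').take(2).mkString(".")

  // Fail fast when the driver's and worker's Python versions differ in major.minor.
  def check(driverVersion: String, workerVersion: String): Unit =
    require(
      majorMinor(driverVersion) == majorMinor(workerVersion),
      s"Python in worker ($workerVersion) has a different version than that in " +
        s"driver ($driverVersion); PySpark cannot run with different minor versions.")
}

// PythonVersionCheckSketch.check("2.7.7", "2.7.9")  // ok
// PythonVersionCheckSketch.check("2.7.7", "2.6.9")  // IllegalArgumentException with a clear message
{code}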
[jira] [Resolved] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization
[ https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3267. - Resolution: Cannot Reproduce Deadlock between ScalaReflectionLock and Data type initialization - Key: SPARK-3267 URL: https://issues.apache.org/jira/browse/SPARK-3267 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Aaron Davidson Priority: Critical Deadlock here: {code} Executor task launch worker-0 daemon prio=10 tid=0x7fab50036000 nid=0x27a in Object.wait() [0x7fab60c2e000] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:202) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scala:175) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:304) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:314) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) ... {code} and {code} Executor task launch worker-2 daemon prio=10 tid=0x7fab100f0800 nid=0x27e in Object.wait() [0x7fab0eeec000] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250) - locked 0x00064e5d9a48 (a org.apache.spark.sql.catalyst.expressions.Cast) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126) at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
[jira] [Resolved] (SPARK-4523) Improve handling of serialized schema information
[ https://issues.apache.org/jira/browse/SPARK-4523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4523. - Resolution: Won't Fix We haven't changed this for a few releases now, and it seems unlikely that we will, so I'm going to close this issue. Improve handling of serialized schema information - Key: SPARK-4523 URL: https://issues.apache.org/jira/browse/SPARK-4523 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Critical There are several issues with our current handling of metadata serialization, which is especially troublesome since this is the only place that we persist information directly using Spark SQL. Moving forward we should do the following: - Relax the parsing so that it does not fail when optional fields are missing (e.g. containsNull or metadata) - Include a regression suite that attempts to read old parquet files written by previous versions of Spark SQL. - Provide better warning messages when various forms of parsing fail (I think that it is silent right now, which makes tracking down bugs more difficult than it needs to be). - Deprecate (display a warning) when reading data with the old case class schema representation and eventually remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
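For the first of those points (relaxing the parsing), the change is mostly about defaulting instead of failing when optional fields are absent from the persisted schema JSON. A small sketch of that idea, with a plain Map standing in for the parsed JSON and ArrayTypeInfo as a hypothetical stand-in for the real data type:
{code}
case class ArrayTypeInfo(elementType: String, containsNull: Boolean)

object LenientSchemaParsingSketch {
  // Older schema files may omit optional fields such as containsNull; fall back to a
  // permissive default rather than failing the whole schema parse.
  def parseArrayType(json: Map[String, Any]): ArrayTypeInfo =
    ArrayTypeInfo(
      elementType = json("elementType").toString,
      containsNull = json.get("containsNull") match {
        case Some(b: Boolean) => b
        case _                => true
      })
}

// LenientSchemaParsingSketch.parseArrayType(Map("elementType" -> "string"))
//   => ArrayTypeInfo("string", true)   // a file written without containsNull still parses
{code}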
[jira] [Resolved] (SPARK-6241) hiveql ANALYZE TABLE doesn't work for external tables
[ https://issues.apache.org/jira/browse/SPARK-6241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6241. - Resolution: Won't Fix Datasource tables have their own mechanism for reporting statistics that does not rely on ANALYZE. Please reopen if that is not working for you. hiveql ANALYZE TABLE doesn't work for external tables - Key: SPARK-6241 URL: https://issues.apache.org/jira/browse/SPARK-6241 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Kai Zeng Priority: Critical ANALYZE TABLE does not collect statistics for external tables, but works well for tables created by CREATE AS SELECT. I also tried REFRESH TABLE to refresh the metadata cache, but got a NullPointerException: java.util.concurrent.ExecutionException: java.lang.NullPointerException at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344) at com.google.common.cache.LocalCache$Segment$1.run(LocalCache.java:2327) at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:101) at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:170) at com.google.common.cache.LocalCache$Segment.loadAsync(LocalCache.java:2322) at com.google.common.cache.LocalCache$Segment.refresh(LocalCache.java:2385) at com.google.common.cache.LocalCache.refresh(LocalCache.java:4085) at com.google.common.cache.LocalCache$LocalLoadingCache.refresh(LocalCache.java:4825) at org.apache.spark.sql.hive.HiveMetastoreCatalog.refreshTable(HiveMetastoreCatalog.scala:108) at org.apache.spark.sql.sources.RefreshTable.run(ddl.scala:404) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:134) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:117) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org