[jira] [Created] (SPARK-8891) Calling aggregation expressions on null literals fails at runtime
Josh Rosen created SPARK-8891: - Summary: Calling aggregation expressions on null literals fails at runtime Key: SPARK-8891 URL: https://issues.apache.org/jira/browse/SPARK-8891 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.3.1, 1.4.1, 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Minor Queries that call aggregate expressions with null literals, such as {{select avg(null)}} or {{select sum(null)}} fail with various errors due to mishandling of the internal NullType type. For instance, with codegen disabled on a recent 1.5 master: {code} scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:407) at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:426) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:426) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:428) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:48) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:268) at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:48) at org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:147) at org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:536) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:132) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:125) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} When codegen is enabled, the resulting code fails to compile. The fix for this issue involves changes to Cast and Sum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
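For reference, the failure is reproducible straight from SQL; a minimal sketch, assuming a running SparkContext {{sc}} on one of the affected versions:

{code}
// Minimal reproduction sketch for the report above (assumes a plain
// SQLContext; `sc` is an existing SparkContext).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Both aggregates operate on a NullType literal. Per the report, with
// codegen disabled this throws the MatchError on NullType shown above,
// and with codegen enabled the generated code fails to compile.
sqlContext.sql("select avg(null)").collect()
sqlContext.sql("select sum(null)").collect()
{code}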
[jira] [Assigned] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8600: --- Assignee: Apache Spark (was: Yanbo Liang) Naive Bayes API for spark.ml Pipelines -- Key: SPARK-8600 URL: https://issues.apache.org/jira/browse/SPARK-8600 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Apache Spark Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the existing NaiveBayes implementation under spark.mllib package. Should also keep the parameter names consistent. The output columns could include both the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
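For context, a spark.ml NaiveBayes would be used through the Pipelines API roughly as below. This is an illustrative sketch only: the final class, parameter, and column names were settled in the pull request, so the names here ({{NaiveBayes}}, {{setSmoothing}}, {{training}}) are assumptions based on spark.ml conventions.

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes  // assumed class name

// Hypothetical usage sketch, not the final API.
val nb = new NaiveBayes()
  .setSmoothing(1.0)          // would mirror spark.mllib's lambda parameter
  .setFeaturesCol("features")
  .setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(nb))
// `training` is assumed to be a DataFrame with label and features columns;
// transform() on the fitted model would append prediction (and confidence) columns.
val model = pipeline.fit(training)
{code}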
[jira] [Commented] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618219#comment-14618219 ] Apache Spark commented on SPARK-8600: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7284 Naive Bayes API for spark.ml Pipelines -- Key: SPARK-8600 URL: https://issues.apache.org/jira/browse/SPARK-8600 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Yanbo Liang Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the existing NaiveBayes implementation under spark.mllib package. Should also keep the parameter names consistent. The output columns could include both the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8600: --- Assignee: Yanbo Liang (was: Apache Spark) Naive Bayes API for spark.ml Pipelines -- Key: SPARK-8600 URL: https://issues.apache.org/jira/browse/SPARK-8600 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Yanbo Liang Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the existing NaiveBayes implementation under spark.mllib package. Should also keep the parameter names consistent. The output columns could include both the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7050) Fix Python Kafka test assembly jar not found issue under Maven build
[ https://issues.apache.org/jira/browse/SPARK-7050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7050. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 5632 [https://github.com/apache/spark/pull/5632] Fix Python Kafka test assembly jar not found issue under Maven build Key: SPARK-7050 URL: https://issues.apache.org/jira/browse/SPARK-7050 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.1 Reporter: Saisai Shao Priority: Minor Fix For: 1.5.0 The behavior of {{mvn package}} and {{sbt kafka-assembly/assembly}} under the kafka-assembly module is different: sbt will generate an assembly jar under target/scala-<version>/, while mvn generates this jar under target/, which makes the Python Kafka streaming unit tests fail to find the related jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8894) Example code errors in SparkR documentation
Sun Rui created SPARK-8894: -- Summary: Example code errors in SparkR documentation Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui There are errors in SparkR-related documentation. 1. in https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. in https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop
Aniket Bhatnagar created SPARK-8895: --- Summary: MetricsSystem.removeSource not called in StreamingContext.stop Key: SPARK-8895 URL: https://issues.apache.org/jira/browse/SPARK-8895 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar Priority: Minor StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
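The fix implied by the title is to deregister the source on shutdown. A sketch of the direction only, assuming the source object is kept in a field; the exact structure of StreamingContext.stop is an assumption:

{code}
// Sketch only: register the source in the constructor, keep the reference,
// and remove it in stop() so that a new StreamingContext with the same
// application name can re-register metrics in the same JVM.
private val streamingSource = new StreamingSource(this)
env.metricsSystem.registerSource(streamingSource)

def stop(): Unit = {
  env.metricsSystem.removeSource(streamingSource)  // the missing call
  // ... existing shutdown logic ...
}
{code}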
[jira] [Created] (SPARK-8896) StreamingSource should choose a unique name
Aniket Bhatnagar created SPARK-8896: --- Summary: StreamingSource should choose a unique name Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
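One way to satisfy the title's requirement is to derive the source name from a per-JVM counter. A self-contained sketch of the naming scheme only; the suffix format is an assumption, not the committed fix:

{code}
import java.util.concurrent.atomic.AtomicInteger

// Illustrative naming helper: the first StreamingContext keeps the old
// name, later ones get a numeric suffix, so MetricRegistry.register
// never sees two sources with the same name.
object StreamingSourceNaming {
  private val counter = new AtomicInteger(0)

  def uniqueSourceName(appName: String): String = {
    val n = counter.getAndIncrement()
    if (n == 0) s"$appName.StreamingMetrics"
    else s"$appName.StreamingMetrics.$n"
  }
}
{code}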
[jira] [Updated] (SPARK-7050) Fix Python Kafka test assembly jar not found issue under Maven build
[ https://issues.apache.org/jira/browse/SPARK-7050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7050: - Assignee: Saisai Shao Fix Python Kafka test assembly jar not found issue under Maven build Key: SPARK-7050 URL: https://issues.apache.org/jira/browse/SPARK-7050 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.1 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Minor Fix For: 1.5.0 The behavior of {{mvn package}} and {{sbt kafka-assembly/assembly}} under the kafka-assembly module is different: sbt will generate an assembly jar under target/scala-<version>/, while mvn generates this jar under target/, which makes the Python Kafka streaming unit tests fail to find the related jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8894: --- Assignee: (was: Apache Spark) Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui There are errors in SparkR-related documentation. 1. in https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. in https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618508#comment-14618508 ] Apache Spark commented on SPARK-8894: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/7287 Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui There are errors in SparkR-related documentation. 1. in https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. in https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8894: --- Assignee: Apache Spark Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Assignee: Apache Spark There are errors in SparkR-related documentation. 1. in https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. in https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8896: Description: If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] was: If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. {quote} [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported.
[ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop
[ https://issues.apache.org/jira/browse/SPARK-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8895: Description: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: ?? [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] ?? was: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: {{ [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] }} MetricsSystem.removeSource not called in StreamingContext.stop -- Key: SPARK-8895 URL: https://issues.apache.org/jira/browse/SPARK-8895 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar Priority: Minor StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: ??
[info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] ?? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8896: Description: If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. {quote} [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} was: If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported.
{quote} [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8068) Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618403#comment-14618403 ] Apache Spark commented on SPARK-8068: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7286 Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib -- Key: SPARK-8068 URL: https://issues.apache.org/jira/browse/SPARK-8068 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Priority: Minor There is no confusionMatrix method at class MulticlassMetrics in pyspark/mllib. This method is actually implemented in scala mllib. To achieve this, we just need to add a function call to the corresponding one in scala mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8068) Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8068: --- Assignee: Apache Spark Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib -- Key: SPARK-8068 URL: https://issues.apache.org/jira/browse/SPARK-8068 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Assignee: Apache Spark Priority: Minor There is no confusionMatrix method at class MulticlassMetrics in pyspark/mllib. This method is actually implemented in scala mllib. To achieve this, we just need to add a function call to the corresponding one in scala mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop
[ https://issues.apache.org/jira/browse/SPARK-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8895: Description: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: {quote} [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} was: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] MetricsSystem.removeSource not called in StreamingContext.stop -- Key: SPARK-8895 URL: https://issues.apache.org/jira/browse/SPARK-8895 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar Priority: Minor StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource.
Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: {quote} [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8881) Scheduling fails if num_executors < num_workers
[ https://issues.apache.org/jira/browse/SPARK-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618313#comment-14618313 ] Sean Owen commented on SPARK-8881: -- Yes, the punchline is that each worker is asked for 48/4 = 12 cores, but 12 is less than the 16 cores each executor needs, so for every worker, 0 executors are allocated. Grabbing cores in chunks of 16 in this case works, as does only considering 3 workers to allocate 3 executors, since the problem is that it never makes sense to try allocating N executors over M > N workers. Scheduling fails if num_executors < num_workers --- Key: SPARK-8881 URL: https://issues.apache.org/jira/browse/SPARK-8881 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0, 1.5.0 Reporter: Nishkam Ravi Current scheduling algorithm (in Master.scala) has two issues: 1. cores are allocated one at a time instead of spark.executor.cores at a time 2. when spark.cores.max/spark.executor.cores < num_workers, executors are not launched and the app hangs (due to 1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
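The arithmetic in the comment above, together with the chunked allocation it suggests, in a small self-contained sketch (illustrative only, not the code in Master.scala):

{code}
val coresMax = 48        // spark.cores.max
val executorCores = 16   // spark.executor.cores
val numWorkers = 4

// One-core-at-a-time allocation spreads 48 cores evenly, 12 per worker,
// and 12 / 16 = 0 executors fit on every worker, so nothing launches.
val perWorker = coresMax / numWorkers               // 12
val executorsPerWorker = perWorker / executorCores  // 0

// Allocating in executor-sized chunks hands out whole executors instead:
// 48 / 16 = 3 executors, placed on 3 of the 4 workers.
val totalExecutors = coresMax / executorCores       // 3
{code}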
[jira] [Commented] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618352#comment-14618352 ] Kashif Rasul commented on SPARK-8872: - Ok, should be ready for review. Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor Labels: starter Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Darabos updated SPARK-8893: -- Description: What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. was: What does {{sc.parallelize(1 to 3).repartition(x).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{x}}. But if {{x}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{x}} < 0. But the behavior for {{x}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
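The proposed guard is a one-liner. A standalone sketch; in Spark it would live in RDD.repartition / coalesce, which is an assumption here:

{code}
// Illustrative validation helper mirroring the proposed behavior: reject
// non-positive partition counts instead of silently returning Array().
def validatedPartitionCount(numPartitions: Int): Int = {
  require(numPartitions > 0,
    s"Number of partitions ($numPartitions) must be positive.")
  numPartitions
}

validatedPartitionCount(3)       // ok
validatedPartitionCount(12 / 25) // rounds down to 0 -> IllegalArgumentException
{code}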
[jira] [Commented] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618142#comment-14618142 ] Apache Spark commented on SPARK-8866: - User 'yijieshen' has created a pull request for this issue: https://github.com/apache/spark/pull/7283 Use 1 microsecond (us) precision for TimestampType -- Key: SPARK-8866 URL: https://issues.apache.org/jira/browse/SPARK-8866 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Yijie Shen 100ns is slightly weird to compute. Let's use 1us to be more consistent with other systems (e.g. Postgres) and less error prone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
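For a sense of why a 100ns unit is awkward, compare the conversions from java.sql.Timestamp under each unit; illustrative only (not Spark's internal code), assuming non-negative timestamps:

{code}
// 1us precision: microseconds since the epoch.
def toMicros(t: java.sql.Timestamp): Long =
  t.getTime * 1000L + (t.getNanos / 1000L) % 1000L

// 100ns precision needs the odd factor-of-10000 arithmetic instead.
def toHundredNanos(t: java.sql.Timestamp): Long =
  t.getTime * 10000L + (t.getNanos / 100L) % 10000L
{code}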
[jira] [Assigned] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8866: --- Assignee: Apache Spark (was: Yijie Shen) Use 1 microsecond (us) precision for TimestampType -- Key: SPARK-8866 URL: https://issues.apache.org/jira/browse/SPARK-8866 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark 100ns is slightly weird to compute. Let's use 1us to be more consistent with other systems (e.g. Postgres) and less error prone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-8864: - Comment: was deleted (was: Thanks for explanation. The design looks good to me now.) Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7917. -- Resolution: Duplicate A-ha, that's the ticket. I knew there was something like this already resolved. Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. It's a pretty simple repro: Run a job that does some shuffling, wait for the shuffle files to get cleaned up, go and look on disk at spark.local.dir and notice that the directory(s) are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8881) Scheduling fails if num_executors < num_workers
[ https://issues.apache.org/jira/browse/SPARK-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618362#comment-14618362 ] Nishkam Ravi commented on SPARK-8881: - There's more to it. Consider the following: three workers with num_cores (8, 8, 2). spark.cores.max = 12, spark.executor.cores = 4. Core allocation would be (5, 5, 2). num_executors = num_workers and nothing gets launched! Problem isn't that num_workers > num_executors (that's just a place it manifests in practice). Problem is we are allocating one core at a time and ignoring spark.executor.cores during allocation. Scheduling fails if num_executors < num_workers --- Key: SPARK-8881 URL: https://issues.apache.org/jira/browse/SPARK-8881 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0, 1.5.0 Reporter: Nishkam Ravi Current scheduling algorithm (in Master.scala) has two issues: 1. cores are allocated one at a time instead of spark.executor.cores at a time 2. when spark.cores.max/spark.executor.cores < num_workers, executors are not launched and the app hangs (due to 1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618402#comment-14618402 ] Apache Spark commented on SPARK-8893: - User 'darabos' has created a pull request for this issue: https://github.com/apache/spark/pull/7285 Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8893: --- Assignee: Apache Spark Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Assignee: Apache Spark Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8893: --- Assignee: (was: Apache Spark) Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618172#comment-14618172 ] Qian Huang commented on SPARK-8514: --- [~zhaoxiangyu] Not yet. [~shivaram] shared a communication-efficient version, which looks like a better version than ScaLAPACK's. I am reading the paper and working on this. LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
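For reference, the block-wise scheme discussed above factors the matrix recursively; in 2x2 block form (standard block LU, stated here only for context):

{code}
A = | A11 A12 | = | L11  0  | | U11 U12 |
    | A21 A22 |   | L21 L22 | |  0  U22 |

A11 = L11 U11            -- local LU of the corner block
U12 = inv(L11) A12       -- triangular solve
L21 = A21 inv(U11)       -- triangular solve
L22 U22 = A22 - L21 U12  -- LU of the Schur complement; recurse / pipeline
{code}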
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618200#comment-14618200 ] Cheng Hao commented on SPARK-8864: -- Thanks for explanation. The design looks good to me now. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618201#comment-14618201 ] Cheng Hao commented on SPARK-8864: -- Thanks for explanation. The design looks good to me now. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618222#comment-14618222 ] Vinay commented on SPARK-7442: -- Tried and tested. Steps to submit a spark job when the jar file resides in S3: Step 1: Add these dependencies in the pom file A. hadoop-common.jar (optional if already present in classpath) B. hadoop-aws.jar C. aws-java-sdk.jar These steps have to be followed for both master & slaves Step 2: export AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY in bash export AWS_ACCESS_KEY_ID= export AWS_SECRET_ACCESS_KEY= Note: These properties can also be set as per AWS environment. Step 3: Add the below dependencies in spark-env.sh; these steps are to be followed on the slaves SPARK_CLASSPATH=../lib/hadoop-aws-2.6.0.jar SPARK_CLASSPATH=$SPARK_CLASSPATH:../lib/aws-java-sdk-1.7.4.jar SPARK_CLASSPATH=$SPARK_CLASSPATH:../lib/guava-11.0.2.jar Note: guava will be available in Hadoop, or else you can download it Step 4: When running the Spark job, append --deploy-mode cluster Sample command to submit a spark job: spark-submit --class com.x.y.z --master spark://master:7077 --deploy-mode cluster s3://bucket name/xyz.jar args Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access - Key: SPARK-7442 URL: https://issues.apache.org/jira/browse/SPARK-7442 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.1 Environment: OS X Reporter: Nicholas Chammas # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html]. # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}} # Fire up PySpark and try reading from S3 with something like this: {code}sc.textFile('s3n://bucket/file_*').count(){code} # You will get an error like this: {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.io.IOException: No FileSystem for scheme: s3n{code} {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works. It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618277#comment-14618277 ] Ma Xiaoyu commented on SPARK-5159: -- I was investigating this issue and it seems doAs in the HiveServer2 code was working. The problem is that when it forwards events in DAGScheduler, the event goes through a different thread, and the ticket in the receiving-side thread is not the same as on the sending side. The proxy user becomes the real user who started the HiveServer2 service. Is that the root cause? I might make a patch if so. Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618155#comment-14618155 ] zhaoxiangyu commented on SPARK-8514: I want to know if you have implemented the LU factorization on BlockMatrix. LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618191#comment-14618191 ] zhaoxiangyu commented on SPARK-8514: Can you give me the link to Shivaram Venkataraman's work? LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8892) Column.cast(LongType) does not work for large values
Jason Moore created SPARK-8892: -- Summary: Column.cast(LongType) does not work for large values Key: SPARK-8892 URL: https://issues.apache.org/jira/browse/SPARK-8892 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jason Moore Casting a column from String to Long seems to go through an intermediate step of being cast to a Double (it hits Cast.scala line 328 in castToDecimal). The result is that for large values, the wrong value is returned. This test reveals the bug:
{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FlatSpec

import scala.util.Random

class DataFrameCastBug extends FlatSpec {
  "DataFrame" should "cast StringType to LongType correctly" in {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("app"))
    val qc = new SQLContext(sc)

    val values = Seq.fill(10)(Random.nextLong)
    val source = qc.createDataFrame(
      sc.parallelize(values.map(v => Row(v))),
      StructType(Seq(StructField("value", LongType))))

    val result = source.select(
      source("value"),
      source("value").cast(StringType).cast(LongType).as("castValue"))

    assert(result.where(result("value") !== result("castValue")).count() === 0)
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
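A quick illustration of the suspected mechanism (my own sketch, not from the report): a Long routed through a Double loses precision above 2^53, which matches the "wrong value for large values" symptom:
{code}
val v = 8999999999999999999L      // > 2^53, not exactly representable as a Double
val roundTripped = v.toDouble.toLong
// roundTripped == 9000000000000000000, not the original value
assert(roundTripped != v)
{code}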
[jira] [Resolved] (SPARK-8885) libgplcompression.so already loaded in another classloader
[ https://issues.apache.org/jira/browse/SPARK-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8885. -- Resolution: Invalid [~cenyuhai] There is a lot wrong with this JIRA, most importantly that it's a question for user@. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first before opening a JIRA libgplcompression.so already loaded in another classloader -- Key: SPARK-8885 URL: https://issues.apache.org/jira/browse/SPARK-8885 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: CentOS 6.2 JDK 1.7.0_51 Spark version: 1.4.0 Hadoop version: 2.2.0 hadoop native lib 32bit Reporter: cen yuhai Fix For: 1.4.2, 1.5.0 Hi all, I found an exception when using spark-sql: java.lang.UnsatisfiedLinkError: Native Library /data/lib/native/libgplcompression.so already loaded in another classloader ... I set spark.sql.hive.metastore.jars=. in spark-defaults.conf. It does not happen every time. Who knows why? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8893) Require positive partition counts in RDD.repartition
Daniel Darabos created SPARK-8893: - Summary: Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(x).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{x}}. But if {{x}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{x}} < 0. But the behavior for {{x}} = 0 is also error-prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
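A minimal sketch of the guard being proposed (the wrapper name is hypothetical; the real change would live inside RDD.repartition/coalesce):
{code}
import org.apache.spark.rdd.RDD

def checkedRepartition[T](rdd: RDD[T], numPartitions: Int): RDD[T] = {
  // Fail fast on 0 or negative counts instead of silently returning
  // an empty RDD.
  require(numPartitions > 0,
    s"Number of partitions ($numPartitions) must be positive")
  rdd.repartition(numPartitions)
}
{code}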
[jira] [Assigned] (SPARK-8068) Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8068: --- Assignee: (was: Apache Spark) Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib -- Key: SPARK-8068 URL: https://issues.apache.org/jira/browse/SPARK-8068 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Priority: Minor There is no confusionMatrix method at class MulticlassMetrics in pyspark/mllib. This method is actually implemented in scala mllib. To achieve this, we just need add a function call to the corresponding one in scala mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8897) SparkR DataFrame fail to return data of float type
Sun Rui created SPARK-8897: -- Summary: SparkR DataFrame fail to return data of float type Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6266: --- Assignee: (was: Apache Spark) PySpark SparseVector missing doc for size, indices, values -- Key: SPARK-6266 URL: https://issues.apache.org/jira/browse/SPARK-6266 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Need to add doc for size, indices, values attributes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6266: --- Assignee: Apache Spark PySpark SparseVector missing doc for size, indices, values -- Key: SPARK-6266 URL: https://issues.apache.org/jira/browse/SPARK-6266 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor Need to add doc for size, indices, values attributes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8898) Jets3t hangs with more than 1 core
Daniel Darabos created SPARK-8898: - Summary: Jets3t hangs with more than 1 core Key: SPARK-8898 URL: https://issues.apache.org/jira/browse/SPARK-8898 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: S3 Reporter: Daniel Darabos If I have an RDD that reads from S3 ({{newAPIHadoopFile}}) and try to write it back to S3 ({{saveAsNewAPIHadoopFile}}), it hangs if I have more than 1 core per executor. It sounds like a race condition, but so far I have seen it trigger 100% of the time. If it were a race for a limited number of connections, I would expect at least one task to succeed at least some of the time. But I never saw a single completed task, except when running with 1-core executors. All executor threads hang with one of the following two stack traces: {noformat:title=Stack trace 1} java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:342) at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestHead(RestS3Service.java:718) at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1599) at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectDetailsImpl(RestS3Service.java:1535) at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1987) at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1332) at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:107) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83) at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown Source) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:341) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:851) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:832) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:731) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1030) at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} {noformat:title=Stack trace 2} java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618624#comment-14618624 ] Greg Senia commented on SPARK-5159: --- Yes, that is the exact issue. It doesn't execute as the proxy user. This works correctly with the native HiveServer2 in Hive, but not with the Spark SQL thrift server. Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions, as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8897: --- Assignee: Apache Spark SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Assignee: Apache Spark This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618567#comment-14618567 ] Sean Owen commented on SPARK-8896: -- Why do you have multiple contexts? I think that's not allowed, right? StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618607#comment-14618607 ] Apache Spark commented on SPARK-8897: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/7289 SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8897: --- Assignee: (was: Apache Spark) SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8600: - Shepherd: Joseph K. Bradley Naive Bayes API for spark.ml Pipelines -- Key: SPARK-8600 URL: https://issues.apache.org/jira/browse/SPARK-8600 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Yanbo Liang Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the existing NaiveBayes implementation under spark.mllib package. Should also keep the parameter names consistent. The output columns could include both the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618764#comment-14618764 ] Aniket Bhatnagar commented on SPARK-8896: - As per the documentation (scala docs), only one SparkContext per JVM is allowed. However, no such documentation exists for StreamingContext. SparkContext makes an effort to ensure only one SparkContext instance exists in the JVM using the contextBeingConstructed variable. However, no such effort is made in StreamingContext. This led me to believe that multiple StreamingContexts are allowed in a JVM. I can understand why only one SparkContext is allowed per JVM (global state, et al), but I don't think that's true for StreamingContext. There may be genuine use cases for having 2 StreamingContext instances using the same SparkContext to leverage the same workers and have different batch intervals. StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
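The collision itself is easy to reproduce outside Spark with the codahale registry; this sketch (mine, not Spark code) registers the same gauge name twice and fails exactly like the log above:
{code}
import com.codahale.metrics.{Gauge, MetricRegistry}

val registry = new MetricRegistry()
val name = "AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime"

def register(): Unit =
  registry.register(name, new Gauge[Long] { override def getValue: Long = 0L })

register()  // ok
register()  // throws IllegalArgumentException: A metric named ... already exists
{code}
So any fix presumably amounts to making each StreamingContext contribute a unique component to its metrics source name.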
[jira] [Commented] (SPARK-8899) remove duplicated equals method for Row
[ https://issues.apache.org/jira/browse/SPARK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618744#comment-14618744 ] Apache Spark commented on SPARK-8899: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7291 remove duplicated equals method for Row --- Key: SPARK-8899 URL: https://issues.apache.org/jira/browse/SPARK-8899 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8899) remove duplicated equals method for Row
[ https://issues.apache.org/jira/browse/SPARK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8899: --- Assignee: (was: Apache Spark) remove duplicated equals method for Row --- Key: SPARK-8899 URL: https://issues.apache.org/jira/browse/SPARK-8899 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8899) remove duplicated equals method for Row
[ https://issues.apache.org/jira/browse/SPARK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8899: --- Assignee: Apache Spark remove duplicated equals method for Row --- Key: SPARK-8899 URL: https://issues.apache.org/jira/browse/SPARK-8899 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8898) Jets3t hangs with more than 1 core
[ https://issues.apache.org/jira/browse/SPARK-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618747#comment-14618747 ] Sean Owen commented on SPARK-8898: -- Yeah, this is a jets3t problem. You will have to manage to update it in your build, or get EC2 + Hadoop 2 to work, which I think can be done. At least, this is just a subset of "EC2 should support Hadoop 2" and/or "the EC2 support should move out of Spark" anyway. I don't know that there's another action to take in Spark. Jets3t hangs with more than 1 core -- Key: SPARK-8898 URL: https://issues.apache.org/jira/browse/SPARK-8898 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: S3 Reporter: Daniel Darabos If I have an RDD that reads from S3 ({{newAPIHadoopFile}}) and try to write it back to S3 ({{saveAsNewAPIHadoopFile}}), it hangs if I have more than 1 core per executor. It sounds like a race condition, but so far I have seen it trigger 100% of the time. If it were a race for a limited number of connections, I would expect at least one task to succeed at least some of the time. But I never saw a single completed task, except when running with 1-core executors. All executor threads hang with one of the following two stack traces: {noformat:title=Stack trace 1} java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:342) at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestHead(RestS3Service.java:718) at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1599) at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectDetailsImpl(RestS3Service.java:1535) at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1987) at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1332) at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:107) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83) at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown Source) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:341) at 
org.apache.hadoop.fs.FileSystem.create(FileSystem.java:851) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:832) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:731) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1030) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at
[jira] [Created] (SPARK-8899) remove duplicated equals method for Row
Wenchen Fan created SPARK-8899: -- Summary: remove duplicated equals method for Row Key: SPARK-8899 URL: https://issues.apache.org/jira/browse/SPARK-8899 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8118: - Target Version/s: 1.5.0 (was: 1.4.1, 1.5.0) Turn off noisy log output produced by Parquet 1.7.0 --- Key: SPARK-8118 URL: https://issues.apache.org/jira/browse/SPARK-8118 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.1, 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Fix For: 1.5.0 Parquet 1.7.0 renames package name to org.apache.parquet, need to adjust {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8872: - Assignee: Kashif Rasul Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Kashif Rasul Priority: Minor Labels: starter Fix For: 1.5.0 Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8873) Support cleaning up shuffle files for drivers launched with Mesos
[ https://issues.apache.org/jira/browse/SPARK-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8873: - Priority: Minor (was: Major) Component/s: Mesos Support cleaning up shuffle files for drivers launched with Mesos - Key: SPARK-8873 URL: https://issues.apache.org/jira/browse/SPARK-8873 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Priority: Minor Labels: mesos With dynamic allocation enabled with Mesos, drivers can launch with shuffle data cached in the external shuffle service. However, there is no reliable way to let the shuffle service clean up the shuffle data when the driver exits, since it may crash before it notifies the shuffle service and shuffle data will be cached forever. We need to implement a reliable way to detect driver termination and clean up shuffle data accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8872. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7269 [https://github.com/apache/spark/pull/7269] Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor Labels: starter Fix For: 1.5.0 Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618822#comment-14618822 ] Kashif Rasul commented on SPARK-8872: - [~mengxr] my jira username is krasul Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor Labels: starter Fix For: 1.5.0 Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618835#comment-14618835 ] Shivaram Venkataraman commented on SPARK-8514: -- I posted a link to a paper on communication-efficient LU before (http://dl.acm.org/citation.cfm?id=1413400). There might be an MPI implementation at https://github.com/solomonik/NuLAB that could also be helpful. LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
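As a pointer for implementers, the block-wise scheme the description alludes to is the standard 2x2 block LU recursion (my transcription of the textbook formulation, not material from this ticket):
{noformat}
A = [ A11  A12 ] = [ L11   0  ] [ U11  U12 ]
    [ A21  A22 ]   [ L21  L22 ] [  0   U22 ]

L11 * U11 = A11                    (LU of the leading block)
U12 = inv(L11) * A12               (lower-triangular solve)
L21 = A21 * inv(U11)               (upper-triangular solve)
L22 * U22 = A22 - L21 * U12        (LU of the Schur complement, recursively)
{noformat}
The pipelining mentioned in the description comes from overlapping the triangular solves with the trailing Schur-complement update.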
[jira] [Updated] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8893: - Component/s: Spark Core Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error-prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8897. -- Resolution: Duplicate SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618863#comment-14618863 ] Shivaram Venkataraman commented on SPARK-8596: -- Thanks for the PR. Will review this today. And we don't have anything like this open for ipython as far as I know. You can open a new JIRA and discuss this. Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618849#comment-14618849 ] Shivaram Venkataraman commented on SPARK-8897: -- [~sunrui] I'm closing this as we already have a PR and an issue open in SPARK-8840 SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8866: --- Assignee: Yijie Shen (was: Apache Spark) Use 1 microsecond (us) precision for TimestampType -- Key: SPARK-8866 URL: https://issues.apache.org/jira/browse/SPARK-8866 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Yijie Shen 100ns is slightly weird to compute. Let's use 1us to be more consistent with other systems (e.g. Postgres) and less error prone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
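As a concrete reading of the 1us representation: a value would be microseconds since the epoch, derivable from java.sql.Timestamp as in this sketch (mine, assuming non-negative timestamps; not the actual Spark conversion code):
{code}
import java.sql.Timestamp

def toMicros(t: Timestamp): Long =
  (t.getTime / 1000L) * 1000000L +  // whole seconds -> microseconds
    t.getNanos / 1000L              // fractional part, truncated to microseconds
{code}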
[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618179#comment-14618179 ] Mingyu Kim commented on SPARK-7917: --- This looks like a duplicate of SPARK-5970, which is fixed in Spark 1.4. Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs, however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. Its a pretty simple repro: Run a job that does some shuffling, wait for the shuffle files to get cleaned up, go and look on disk at spark.local.dir and notice that the directory(s) are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8881) Scheduling fails if num_executors < num_workers
[ https://issues.apache.org/jira/browse/SPARK-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618369#comment-14618369 ] Nishkam Ravi commented on SPARK-8881: - This isn't the best example, because the third worker will get screened out. Consider the following instead: three workers with num_cores (8, 8, 3), spark.cores.max=8, spark.executor.cores=2. Core allocation would be (3, 3, 2): 3 executors launched instead of 4. You get the drift. Scheduling fails if num_executors < num_workers --- Key: SPARK-8881 URL: https://issues.apache.org/jira/browse/SPARK-8881 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0, 1.5.0 Reporter: Nishkam Ravi Current scheduling algorithm (in Master.scala) has two issues: 1. cores are allocated one at a time instead of spark.executor.cores at a time 2. when spark.cores.max/spark.executor.cores < num_workers, executors are not launched and the app hangs (due to 1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
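To make the arithmetic concrete, here is a toy version (mine, not Master.scala) of allocating in chunks of spark.executor.cores rather than one core at a time; with workers offering (8, 8, 3) cores, executor.cores=2 and cores.max=8 it yields four 2-core executors instead of three:
{code}
def assignCores(freeCores: Array[Int], coresPerExecutor: Int, maxCores: Int): Array[Int] = {
  val assigned = Array.fill(freeCores.length)(0)
  var remaining = maxCores
  var progress = true
  // Round-robin over workers, but always in whole-executor chunks.
  while (remaining >= coresPerExecutor && progress) {
    progress = false
    for (i <- freeCores.indices
         if remaining >= coresPerExecutor &&
            freeCores(i) - assigned(i) >= coresPerExecutor) {
      assigned(i) += coresPerExecutor  // one more executor on worker i
      remaining -= coresPerExecutor
      progress = true
    }
  }
  assigned
}

assignCores(Array(8, 8, 3), 2, 8)  // Array(4, 2, 2): four 2-core executors
{code}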
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619098#comment-14619098 ] shane knapp commented on SPARK-8571: ok, upon auditing all of the spark builds, i think i found the culprits:
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-pre-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.4-Maven-with-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.4-Maven-pre-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.3-Maven-with-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.3-Maven-pre-YARN/configure
* #!/bin/bash is NOT set (defaulting to behavior where if mvn fails, the lsof/xargs kill commands will never run)
* the lsof/xargs kill line will potentially pollute the exit code of the build block
* set -e looks to be impossible to set due to the lsof/xargs kill being the last line of the block
proposed:
* store the retcodes of the mvn commands, and if either one fails, fail the build after the lsof/xargs kill command
* add #!/bin/bash
spark streaming hanging processes upon build exit - Key: SPARK-8571 URL: https://issues.apache.org/jira/browse/SPARK-8571 Project: Spark Issue Type: Bug Components: Build, Streaming Environment: centos 6.6 amplab build system Reporter: shane knapp Assignee: shane knapp Priority: Minor Labels: build, test over the past 3 months i've been noticing that there are occasionally hanging processes on our build system workers after various spark builds have finished. these are all spark streaming processes. today i noticed a 3+ hour spark build that was timed out after 200 minutes (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/), and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on amp-jenkins-worker-02. after the timeout, it left the following process (and all of its children) hanging. the process' CLI command was: {quote} [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714 jenkins1714 733 2.7 21342148 3642740 ?Sl 07:52 1713:41 java -Dderby.system.durability=test -Djava.awt.headless=true -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp -Dspark.driver.allowMultipleContexts=true -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos -Dspark.testing=1 -Dspark.ui.enabled=false -Dspark.ui.showConsoleProgress=false -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m org.scalatest.tools.Runner -R /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes -o -f /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt -u /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/. 
{quote} stracing that process doesn't give us much: {quote} [root@amp-jenkins-worker-02 ~]# strace -p 1714 Process 1714 attached - interrupt to quit futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL {quote} stracing its children gives us a *little* bit more... some loop like this: {quote} snip futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0 futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0 futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0 futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, ) = -1 ETIMEDOUT (Connection timed out) {quote} and others loop on ptrace_attach (no such process) or restart_syscall (resuming interrupted call). even though this behavior has been solidly pinned to jobs timing out (which end w/an aborted, not failed, build), i've seen it happen for failed builds as well. if i see any hanging processes from failed (not aborted) builds, i will investigate them and update this bug as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet
[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619100#comment-14619100 ] Matt Massie commented on SPARK-7263: The Spark shuffle manager APIs, in their current state, don't support a standalone shuffle implementation. If you like, I can split my pull request into two parts: (a) changes to Spark, e.g. [serializing class info|https://github.com/massie/spark/commit/fc03c0bd29fa71ff390b86a8f6fd31c1cbef960f], making APIs public, etc., and (b) the new Parquet implementation. I think your comment that we're creating a whole new shuffle subsystem for one data type is technically correct, but it misses the bigger point. The currently supported data type, {{IndexedRecord}}, is the base type for all Avro objects and includes three methods -- {{get}}, {{put}} and {{getSchema}} -- the primitives necessary for describing, storing and building objects. Since Parquet supports Thrift and Protobuf too, it would be straightforward to add their base types here too, which perform similar functions. I reached out to Michael Armbrust and looked at the Spark SQL code, in depth, before I wrote this. I had hoped to piggyback on the Spark SQL work but found that it wasn't a good match. If you like, I can list all the issues that I found. I'd like to know why you think this would be a maintenance nightmare. I think otherwise, but of course I wrote this. Can you be more specific with your concerns around maintenance? Add new shuffle manager which stores shuffle blocks in Parquet -- Key: SPARK-7263 URL: https://issues.apache.org/jira/browse/SPARK-7263 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Matt Massie I have a working prototype of this feature that can be viewed at https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1 Setting the spark.shuffle.manager to parquet enables this shuffle manager. The dictionary support that Parquet provides appreciably reduces the amount of memory that objects use; however, once Parquet data is shuffled, all the dictionary information is lost and the column-oriented data is written to shuffle blocks in a record-oriented fashion. This shuffle manager addresses this issue by reading and writing all shuffle blocks in the Parquet format. If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a Parquet schema and used directly; otherwise, the Parquet schema is generated via reflection. Currently, the only non-Avro keys supported are primitive types. The reflection code can be improved (or replaced) to support complex records. The ParquetShufflePair class allows the shuffle key and value to be stored in Parquet blocks as a single record with a single schema. This commit adds the following new Spark configuration options: spark.shuffle.parquet.compression - sets the Parquet compression codec spark.shuffle.parquet.blocksize - sets the Parquet block size spark.shuffle.parquet.pagesize - sets the Parquet page size spark.shuffle.parquet.enabledictionary - turns dictionary encoding on/off Parquet does not (and has no plans to) support a streaming API. Metadata sections are scattered through a Parquet file, making a streaming API difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch the entire contents of map outputs into temporary blocks before loading the data into the reducer. 
Interesting future asides: o There is no need to define a data serializer (although Spark requires it) o Parquet supports predicate pushdown and projection, which could be used between shuffle stages to improve performance in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
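For readers unfamiliar with the Avro base type the comment leans on: IndexedRecord's getSchema/get/put are indeed the whole contract, and any generic record satisfies it. A small sketch of my own, with a made-up schema:
{code}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

val schema = new Schema.Parser().parse(
  """{"type": "record", "name": "KV", "fields": [
    |  {"name": "key", "type": "long"},
    |  {"name": "value", "type": "string"}
    |]}""".stripMargin)

val rec: GenericRecord = new GenericData.Record(schema)  // an IndexedRecord
rec.put("key", 1L)
rec.put("value", "a")
// Enough to describe (getSchema), build (put) and read back (get) the object.
assert(rec.getSchema == schema && rec.get("key") == 1L)
{code}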
[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet
[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619105#comment-14619105 ] Reynold Xin commented on SPARK-7263: Can you please list the issues that made Spark SQL a bad fit for this? Add new shuffle manager which stores shuffle blocks in Parquet -- Key: SPARK-7263 URL: https://issues.apache.org/jira/browse/SPARK-7263 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Matt Massie I have a working prototype of this feature that can be viewed at https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1 Setting the spark.shuffle.manager to parquet enables this shuffle manager. The dictionary support that Parquet provides appreciably reduces the amount of memory that objects use; however, once Parquet data is shuffled, all the dictionary information is lost and the column-oriented data is written to shuffle blocks in a record-oriented fashion. This shuffle manager addresses this issue by reading and writing all shuffle blocks in the Parquet format. If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a Parquet schema and used directly; otherwise, the Parquet schema is generated via reflection. Currently, the only non-Avro keys supported are primitive types. The reflection code can be improved (or replaced) to support complex records. The ParquetShufflePair class allows the shuffle key and value to be stored in Parquet blocks as a single record with a single schema. This commit adds the following new Spark configuration options: spark.shuffle.parquet.compression - sets the Parquet compression codec spark.shuffle.parquet.blocksize - sets the Parquet block size spark.shuffle.parquet.pagesize - sets the Parquet page size spark.shuffle.parquet.enabledictionary - turns dictionary encoding on/off Parquet does not (and has no plans to) support a streaming API. Metadata sections are scattered through a Parquet file, making a streaming API difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch the entire contents of map outputs into temporary blocks before loading the data into the reducer. Interesting future asides: o There is no need to define a data serializer (although Spark requires it) o Parquet supports predicate pushdown and projection, which could be used between shuffle stages to improve performance in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8657) Fail to upload conf archive to viewfs
[ https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8657. -- Resolution: Fixed Fail to upload conf archive to viewfs - Key: SPARK-8657 URL: https://issues.apache.org/jira/browse/SPARK-8657 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: spark-1.4.2 hadoop-2.5.0-cdh5.3.2 Reporter: Tao Li Assignee: Tao Li Priority: Minor Labels: distributed_cache, viewfs Fix For: 1.4.2, 1.5.0 When I run in spark-1.4 yarn-client mode, it throws the following exception when trying to upload the conf archive to viewfs: 15/06/26 17:56:37 INFO yarn.Client: Uploading resource file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661.zip -> viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory .sparkStaging/application_1434370929997_191242 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Wrong FS: hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip, expected: viewfs://nsX/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) at org.apache.spark.SparkContext.<init>(SparkContext.scala:497) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017) at $line3.$read$$iwC$$iwC.<init>(<console>:9) at $line3.$read$$iwC.<init>(<console>:18) at $line3.$read.<init>(<console>:20) at $line3.$read$.<init>(<console>:24) at $line3.$read$.<clinit>(<console>) at $line3.$eval$.<init>(<console>:7) at $line3.$eval$.<clinit>(<console>) at $line3.$eval.$print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) The bug is easy to fix: we should pass the correct file system object to addResource. A similar issue is: https://github.com/apache/spark/pull/1483. I will attach my bug fix PR very soon. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
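As a hedged sketch of the fix direction described above (not the actual patch): the key point is to resolve the FileSystem from the destination path itself, so that a viewfs:// path yields a viewfs FileSystem instead of the default hdfs:// one. The path value below is illustrative.
{code}
// Hedged sketch of the fix direction only; the path is illustrative.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val destPath = new Path("viewfs://nsX/user/ultraman/.sparkStaging/app_1/__hadoop_conf__.zip")

// Resolve the FileSystem from the destination path itself; this is the
// object that should be passed down to addResource, per the report above.
val destFs: FileSystem = destPath.getFileSystem(hadoopConf)
{code}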
[jira] [Updated] (SPARK-8657) Fail to upload conf archive to viewfs
[ https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8657: - Assignee: Tao Li Target Version/s: (was: 1.5.0) Priority: Minor (was: Major) Fix Version/s: 1.5.0 1.4.2 Fail to upload conf archive to viewfs - Key: SPARK-8657 URL: https://issues.apache.org/jira/browse/SPARK-8657 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: spark-1.4.2 hadoop-2.5.0-cdh5.3.2 Reporter: Tao Li Assignee: Tao Li Priority: Minor Labels: distributed_cache, viewfs Fix For: 1.4.2, 1.5.0 When I run in spark-1.4 yarn-client mode, it throws the following exception when trying to upload the conf archive to viewfs: 15/06/26 17:56:37 INFO yarn.Client: Uploading resource file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661.zip -> viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory .sparkStaging/application_1434370929997_191242 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Wrong FS: hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip, expected: viewfs://nsX/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) at org.apache.spark.SparkContext.<init>(SparkContext.scala:497) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017) at $line3.$read$$iwC$$iwC.<init>(<console>:9) at $line3.$read$$iwC.<init>(<console>:18) at $line3.$read.<init>(<console>:20) at $line3.$read$.<init>(<console>:24) at $line3.$read$.<clinit>(<console>) at $line3.$eval$.<init>(<console>:7) at $line3.$eval$.<clinit>(<console>) at $line3.$eval.$print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) The bug is easy to fix: we should pass the correct file system object to addResource. A similar issue is: https://github.com/apache/spark/pull/1483. I will attach my bug fix PR very soon. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619135#comment-14619135 ] shane knapp commented on SPARK-8571: basically the code would look something like:
#!/bin/bash
rm -rf ./work
git clean -fdx
export BLAH
build/mvn BLAH BLAH
retcode1=$?
build/mvn WHEE ZOMG
retcode2=$?
lsof -t | xargs kill  # -t prints bare PIDs so kill receives valid arguments
if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
  exit 1
fi
spark streaming hanging processes upon build exit - Key: SPARK-8571 URL: https://issues.apache.org/jira/browse/SPARK-8571 Project: Spark Issue Type: Bug Components: Build, Streaming Environment: centos 6.6 amplab build system Reporter: shane knapp Assignee: shane knapp Priority: Minor Labels: build, test over the past 3 months i've been noticing that there are occasionally hanging processes on our build system workers after various spark builds have finished. these are all spark streaming processes. today i noticed a 3+ hour spark build that was timed out after 200 minutes (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/), and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on amp-jenkins-worker-02. after the timeout, it left the following process (and all of its children) hanging. the process' CLI command was: {quote} [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714 jenkins 1714 733 2.7 21342148 3642740 ? Sl 07:52 1713:41 java -Dderby.system.durability=test -Djava.awt.headless=true -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp -Dspark.driver.allowMultipleContexts=true -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos -Dspark.testing=1 -Dspark.ui.enabled=false -Dspark.ui.showConsoleProgress=false -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m org.scalatest.tools.Runner -R /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes -o -f /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt -u /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/. {quote} stracing that process doesn't give us much: {quote} [root@amp-jenkins-worker-02 ~]# strace -p 1714 Process 1714 attached - interrupt to quit futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL {quote} stracing its children gives us a *little* bit more... 
some loop like this: {quote}
<snip>
futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, ) = -1 ETIMEDOUT (Connection timed out)
{quote} and others loop on ptrace_attach (no such process) or restart_syscall (resuming interrupted call). even though this behavior has been solidly pinned to jobs timing out (which ends w/an aborted, not failed, build), i've seen it happen for failed builds as well. if i see any hanging processes from failed (not aborted) builds, i will investigate them and update this bug as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619144#comment-14619144 ] RJ Nowling commented on SPARK-3644: --- [~joshrosen] The issue and corresponding PR you reference only seem to provide read-only access. Is that correct? If so, then are there open issues to address the needs of the users above? Thanks! REST API for Spark application info (jobs / stages / tasks / storage info) -- Key: SPARK-3644 URL: https://issues.apache.org/jira/browse/SPARK-3644 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Reporter: Josh Rosen Assignee: Imran Rashid Fix For: 1.4.0 This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first principles. We can discuss which URLs / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8894. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.2 Issue resolved by pull request 7287 [https://github.com/apache/spark/pull/7287] Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Fix For: 1.4.2, 1.5.0 There are errors in SparkR-related documentation. 1. In https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. In https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8894: - Assignee: Sun Rui Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Assignee: Sun Rui Fix For: 1.4.2, 1.5.0 There are errors in SparkR-related documentation. 1. In https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. In https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8900) sparkPackages flag name is wrong in the documentation
[ https://issues.apache.org/jira/browse/SPARK-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618928#comment-14618928 ] Apache Spark commented on SPARK-8900: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/7293 sparkPackages flag name is wrong in the documentation - Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8900) sparkPackages flag name is wrong in the documentation
[ https://issues.apache.org/jira/browse/SPARK-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8900: --- Assignee: (was: Apache Spark) sparkPackages flag name is wrong in the documentation - Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8785: -- Assignee: Liang-Chi Hsieh Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Currently, Parquet schema merging (ParquetRelation2.readSchema) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge those later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
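A hedged sketch of the optimization proposed above: deduplicate the per-file schemas before the pairwise merge, so each distinct schema is merged only once. mergeSchemas is a hypothetical stand-in for whatever merge ParquetRelation2.readSchema actually performs, not a real API call.
{code}
// Hedged sketch only; mergeSchemas is a hypothetical stand-in for the
// pairwise merge used by ParquetRelation2.readSchema.
import org.apache.spark.sql.types.StructType

def mergeSchemas(a: StructType, b: StructType): StructType = ??? // placeholder

def readSchema(schemas: Seq[StructType]): Option[StructType] =
  schemas.distinct.reduceOption(mergeSchemas) // merge each distinct schema once
{code}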
[jira] [Resolved] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8785. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7182 [https://github.com/apache/spark/pull/7182] Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Currently, Parquet schema merging (ParquetRelation2.readSchema) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge those later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6912. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7257 [https://github.com/apache/spark/pull/7257] Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro Fix For: 1.5.0 The current implementation can't handle Map<K,V> as a return type in a Hive UDF. We assume a UDF like the one below: public class UDFToIntIntMap extends UDF { public Map<Integer, Integer> evaluate(Object o); } Hive supports this type; see the links below for details: https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
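For reference, a Scala rendering of the UDF shape the issue describes might look like the sketch below; only the class name and the evaluate signature come from the issue, and the body is invented.
{code}
// Hedged Scala sketch of the UDF shape named in the issue; the body is invented.
import org.apache.hadoop.hive.ql.exec.UDF

class UDFToIntIntMap extends UDF {
  def evaluate(o: AnyRef): java.util.Map[Integer, Integer] = {
    val m = new java.util.HashMap[Integer, Integer]()
    m.put(Integer.valueOf(1), Integer.valueOf(1)) // placeholder content
    m
  }
}
{code}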
[jira] [Created] (SPARK-8900) sparkPackages flag name is wrong in the documentation
Shivaram Venkataraman created SPARK-8900: Summary: sparkPackages flag name is wrong in the documentation Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8900) sparkPackages flag name is wrong in the documentation
[ https://issues.apache.org/jira/browse/SPARK-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8900: --- Assignee: Apache Spark sparkPackages flag name is wrong in the documentation - Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman Assignee: Apache Spark The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6266: - Shepherd: Xiangrui Meng PySpark SparseVector missing doc for size, indices, values -- Key: SPARK-6266 URL: https://issues.apache.org/jira/browse/SPARK-6266 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Need to add doc for size, indices, values attributes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618885#comment-14618885 ] Neelesh Srinivas Salian commented on SPARK-7736: My 2 cents: to have a YARN job marked as failed, the ApplicationMaster running the driver needs to fail. Scenario: 1) It fails once; YARN retries and succeeds if the exception has been handled correctly. This results in a successful YARN job (assuming the child tasks (executors) succeeded). 2) The retries fail and the YARN job fails completely. You need the Spark application to cause a failure in YARN for it to be marked as a failure. Moreover, the ApplicationMaster.java code from /hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java in the Hadoop project should help. Reference: [1] http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html So, I would say this is expected behavior. Hope that helps. Please add/correct me if needed. Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8785: -- Shepherd: Cheng Lian Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Currently, Parquet schema merging (ParquetRelation2.readSchema) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge those later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6266: - Assignee: Kai Sasaki PySpark SparseVector missing doc for size, indices, values -- Key: SPARK-6266 URL: https://issues.apache.org/jira/browse/SPARK-6266 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Kai Sasaki Priority: Minor Need to add doc for size, indices, values attributes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6912: Assignee: Takeshi Yamamuro Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Fix For: 1.5.0 The current implementation can't handle Map<K,V> as a return type in a Hive UDF. We assume a UDF like the one below: public class UDFToIntIntMap extends UDF { public Map<Integer, Integer> evaluate(Object o); } Hive supports this type; see the links below for details: https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception
[ https://issues.apache.org/jira/browse/SPARK-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5707. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7272 [https://github.com/apache/spark/pull/7272] Enabling spark.sql.codegen throws ClassNotFound exception - Key: SPARK-5707 URL: https://issues.apache.org/jira/browse/SPARK-5707 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.1 Environment: yarn-client mode, spark.sql.codegen=true Reporter: Yi Yao Assignee: Ram Sriharsha Priority: Blocker Fix For: 1.5.0 Exception thrown: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 133.0 (TID 3066, cdh52-node2): java.io.IOException: com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1 Serialization trace: hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at
[jira] [Commented] (SPARK-8900) sparkPackages flag name is wrong in the documentation
[ https://issues.apache.org/jira/browse/SPARK-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619036#comment-14619036 ] Shivaram Venkataraman commented on SPARK-8900: -- [~bashyal] You can take a look at the PR I have open. sparkPackages flag name is wrong in the documentation - Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5427) Add support for floor function in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-5427: -- Description: floor() function is supported in Hive SQL. This issue is to add floor() function to Spark SQL. Related thread: http://search-hadoop.com/m/JW1q563fc22 was: floor() function is supported in Hive SQL. This issue is to add floor() function to Spark SQL. Related thread: http://search-hadoop.com/m/JW1q563fc22 Add support for floor function in Spark SQL --- Key: SPARK-5427 URL: https://issues.apache.org/jira/browse/SPARK-5427 Project: Spark Issue Type: Improvement Components: SQL Reporter: Ted Yu Labels: math floor() function is supported in Hive SQL. This issue is to add floor() function to Spark SQL. Related thread: http://search-hadoop.com/m/JW1q563fc22 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
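Once implemented, the intended usage would presumably look like the sketch below; sqlContext is assumed to be an existing SQLContext, and the expected result is an assumption based on Hive's floor() semantics.
{code}
// Hedged sketch of the requested behavior; sqlContext is an existing SQLContext.
val df = sqlContext.sql("SELECT floor(3.7) AS f")
df.show() // expected to print 3 once floor() is supported, matching Hive SQL
{code}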
[jira] [Updated] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8571: -- Assignee: shane knapp spark streaming hanging processes upon build exit - Key: SPARK-8571 URL: https://issues.apache.org/jira/browse/SPARK-8571 Project: Spark Issue Type: Bug Components: Build, Streaming Environment: centos 6.6 amplab build system Reporter: shane knapp Assignee: shane knapp Priority: Minor Labels: build, test over the past 3 months i've been noticing that there are occasionally hanging processes on our build system workers after various spark builds have finished. these are all spark streaming processes. today i noticed a 3+ hour spark build that was timed out after 200 minutes (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/), and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on amp-jenkins-worker-02. after the timeout, it left the following process (and all of its children) hanging. the process' CLI command was: {quote} [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714 jenkins 1714 733 2.7 21342148 3642740 ? Sl 07:52 1713:41 java -Dderby.system.durability=test -Djava.awt.headless=true -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp -Dspark.driver.allowMultipleContexts=true -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos -Dspark.testing=1 -Dspark.ui.enabled=false -Dspark.ui.showConsoleProgress=false -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m org.scalatest.tools.Runner -R /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes -o -f /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt -u /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/. {quote} stracing that process doesn't give us much: {quote} [root@amp-jenkins-worker-02 ~]# strace -p 1714 Process 1714 attached - interrupt to quit futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL {quote} stracing its children gives us a *little* bit more... some loop like this: {quote}
<snip>
futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, ) = -1 ETIMEDOUT (Connection timed out)
{quote} and others loop on ptrace_attach (no such process) or restart_syscall (resuming interrupted call). even though this behavior has been solidly pinned to jobs timing out (which ends w/an aborted, not failed, build), i've seen it happen for failed builds as well. if i see any hanging processes from failed (not aborted) builds, i will investigate them and update this bug as well. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8901) [SparkR] Documentation Incorrect for sparkR.init
Pradeep Bashyal created SPARK-8901: -- Summary: [SparkR] Documentation Incorrect for sparkR.init Key: SPARK-8901 URL: https://issues.apache.org/jira/browse/SPARK-8901 Project: Spark Issue Type: Documentation Components: R Affects Versions: 1.4.1 Environment: R Reporter: Pradeep Bashyal Priority: Minor The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8901) [SparkR] Documentation Incorrect for sparkR.init
[ https://issues.apache.org/jira/browse/SPARK-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-8901. --- Resolution: Duplicate [SparkR] Documentation Incorrect for sparkR.init Key: SPARK-8901 URL: https://issues.apache.org/jira/browse/SPARK-8901 Project: Spark Issue Type: Documentation Components: R Affects Versions: 1.4.1 Environment: R Reporter: Pradeep Bashyal Priority: Minor Labels: documentation Original Estimate: 1h Remaining Estimate: 1h The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8753) Create an IntervalType data type
[ https://issues.apache.org/jira/browse/SPARK-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8753. Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 1.5.0 Create an IntervalType data type Key: SPARK-8753 URL: https://issues.apache.org/jira/browse/SPARK-8753 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan Fix For: 1.5.0 We should create an IntervalType data type that represents time intervals. Internally, we can use a long value to store it, similar to Timestamp (i.e. 100ns precision). This data type initially cannot be stored externally; it can only be used in expressions. 1. Add the IntervalType data type. 2. Add parser support for SQL expressions of the form {code} INTERVAL [number] [unit] {code} where unit can be YEAR[S], MONTH[S], WEEK[S], DAY[S], HOUR[S], MINUTE[S], SECOND[S], MILLISECOND[S], MICROSECOND[S], or NANOSECOND[S]. 3. Add a check in the analyzer to make sure we throw an exception to prevent saving a dataframe/table with IntervalType out to external systems. Related Hive ticket: https://issues.apache.org/jira/browse/HIVE-9792 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
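A hedged sketch of how the proposed literal syntax from item 2 might appear in queries; the table and column names are made up, and sqlContext is assumed to be an existing SQLContext.
{code}
// Hedged sketch of the proposed INTERVAL literal grammar; names are made up.
val shifted = sqlContext.sql("SELECT ts + INTERVAL 3 DAYS FROM events")
val ninety  = sqlContext.sql("SELECT INTERVAL 90 MINUTES")
{code}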
[jira] [Commented] (SPARK-3164) Store DecisionTree Split.categories as Set
[ https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619215#comment-14619215 ] Rekha Joshi commented on SPARK-3164: It's all good, but thanks for taking the feedback, [~josephkb]. Overall I am delighted by responsive, helpful reviewers :-) Store DecisionTree Split.categories as Set -- Key: SPARK-3164 URL: https://issues.apache.org/jira/browse/SPARK-3164 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.4.0 Improvement: computation For categorical features with many categories, it could be more efficient to store Split.categories as a Set, not a List. (It is currently a List.) A Set might be more scalable (for log n lookups), though tests would need to be done to ensure that Sets do not incur too much more overhead than Lists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
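The trade-off described in the issue can be sketched directly; this is a toy illustration, not actual DecisionTree code.
{code}
// Toy illustration of the proposed change, not actual DecisionTree code.
val categoriesList: List[Double] = List(0.0, 2.0, 5.0)
val categoriesSet: Set[Double] = categoriesList.toSet

categoriesList.contains(5.0) // linear scan through the list
categoriesSet.contains(5.0)  // hashed (or log n for tree-based sets) lookup
{code}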
[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet
[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619246#comment-14619246 ] Matt Massie commented on SPARK-7263: Also, let me know if you'd like me to split the PR into (a) Spark changes and (b) the Parquet implementation. Add new shuffle manager which stores shuffle blocks in Parquet -- Key: SPARK-7263 URL: https://issues.apache.org/jira/browse/SPARK-7263 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Matt Massie I have a working prototype of this feature that can be viewed at https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1 Setting the spark.shuffle.manager to parquet enables this shuffle manager. The dictionary support that Parquet provides appreciably reduces the amount of memory that objects use; however, once Parquet data is shuffled, all the dictionary information is lost and the column-oriented data is written to shuffle blocks in a record-oriented fashion. This shuffle manager addresses this issue by reading and writing all shuffle blocks in the Parquet format. If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a Parquet schema and used directly; otherwise, the Parquet schema is generated via reflection. Currently, the only non-Avro keys supported are primitive types. The reflection code can be improved (or replaced) to support complex records. The ParquetShufflePair class allows the shuffle key and value to be stored in Parquet blocks as a single record with a single schema. This commit adds the following new Spark configuration options:
spark.shuffle.parquet.compression - sets the Parquet compression codec
spark.shuffle.parquet.blocksize - sets the Parquet block size
spark.shuffle.parquet.pagesize - sets the Parquet page size
spark.shuffle.parquet.enabledictionary - turns dictionary encoding on/off
Parquet does not (and has no plans to) support a streaming API. Metadata sections are scattered through a Parquet file, making a streaming API difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch the entire contents of map outputs into temporary blocks before loading the data into the reducer. Interesting future asides:
o There is no need to define a data serializer (although Spark requires it)
o Parquet supports predicate pushdown and projection, which could be used between shuffle stages to improve performance in the future
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8905) Spark Streaming receiving socket data sporadically on Mesos 0.22.1
[ https://issues.apache.org/jira/browse/SPARK-8905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619256#comment-14619256 ] Brandon Bradley commented on SPARK-8905: I also tested Spark 1.3.1 against Mesos 0.22.1. Both examples are still exhibiting the same behavior. Spark Streaming receiving socket data sporadically on Mesos 0.22.1 -- Key: SPARK-8905 URL: https://issues.apache.org/jira/browse/SPARK-8905 Project: Spark Issue Type: Bug Affects Versions: 1.4.0 Reporter: Brandon Bradley Hello! When submitting the PySpark Streaming job network_wordcount.py from the examples to a Mesos cluster, I am encountering a few 'Return message: null' errors and many py4j 'Connection channel' errors afterwards. If I let the job run for a long time, there are more py4j errors than my ~2000 line buffer can hold. {code} 15/07/08 14:27:16 ERROR JobScheduler: Error running job streaming job 1436383595000 ms.0 py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: null at py4j.Protocol.getReturnValue(Protocol.java:417) at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113) at com.sun.proxy.$Proxy14.call(Unknown Source) at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/07/08 14:27:16 ERROR JobScheduler: Error running job streaming job 1436383596000 ms.0 py4j.Py4JException: An exception was raised by the Python Proxy. 
Return Message: null at py4j.Protocol.getReturnValue(Protocol.java:417) at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113) at com.sun.proxy.$Proxy14.call(Unknown Source) at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at