[jira] [Closed] (SPARK-20337) Support upgrading a jar dependency without restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang closed SPARK-20337.
Resolution: Won't Fix

> Support upgrading a jar dependency without restarting SparkContext
>
> Key: SPARK-20337
> URL: https://issues.apache.org/jira/browse/SPARK-20337
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Yuming Wang
> Priority: Minor
>
> Suppose we need to upgrade a jar dependency without restarting {{SparkContext}}; something like this:
> {code}
> sc.addJar("breeze-natives_2.11-0.12.jar")
> // do something
> sc.removeJar("spark://192.168.26.200:23420/jar/breeze-natives_2.11-0.12.jar")
> sc.addJar("breeze-natives_2.11-0.13.jar")
> // do something
> {code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
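The semantics the issue asks for can be sketched as a small versioned registry. This is a hypothetical illustration only: `SparkContext` exposes `addJar` but `removeJar` was never implemented (the issue was resolved Won't Fix), and the class and host/port below are invented for the sketch.

```python
# Minimal sketch of the proposed add/remove semantics. Hypothetical
# illustration, not Spark's API: SparkContext has addJar only, and
# removeJar was never added (issue closed as Won't Fix).

class JarRegistry:
    def __init__(self, host="192.168.26.200", port=23420):
        self._host, self._port = host, port
        self._jars = {}  # served URI -> local jar name

    def add_jar(self, name):
        """Register a jar and return the URI it would be served under."""
        uri = f"spark://{self._host}:{self._port}/jar/{name}"
        self._jars[uri] = name
        return uri

    def remove_jar(self, uri):
        """Deregister a jar by its served URI, as in the issue's example."""
        self._jars.pop(uri, None)

registry = JarRegistry()
uri = registry.add_jar("breeze-natives_2.11-0.12.jar")
registry.remove_jar(uri)
registry.add_jar("breeze-natives_2.11-0.13.jar")
```

The upgrade is then just remove-then-add, with no context restart in between.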
[jira] [Assigned] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution
[ https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20343:
Assignee: (was: Apache Spark)

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.2.0
> Reporter: Hyukjin Kwon
>
> Please refer to https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123: value createDatumWriter is not a member of org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, GenericData.get.createDatumWriter(schema))
> [error]
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves Avro differently from Maven in some cases.
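The underlying hazard, two build tools resolving different versions of the same dependency, can be sketched abstractly. This is a toy model only: the versions are illustrative, and the real Maven ("nearest wins") and sbt/Ivy (default "latest revision") algorithms are considerably more involved.

```python
# Toy model of dependency conflict resolution (hypothetical; real Maven
# and sbt/Ivy algorithms are more involved). Each candidate is a pair
# (version, depth-in-dependency-tree).

candidates = [("1.7.7", 1), ("1.8.2", 3)]  # e.g. two requested Avro versions

def nearest_wins(cands):
    # Maven-style: pick the version declared closest to the root.
    return min(cands, key=lambda c: c[1])[0]

def latest_wins(cands):
    # Ivy/sbt default: pick the highest revision.
    return max(cands, key=lambda c: tuple(map(int, c[0].split("."))))[0]

# The two strategies disagree, so the same dependency graph can compile
# under one tool and fail under the other (e.g. a method such as
# createDatumWriter existing only in the newer version).
print(nearest_wins(candidates), latest_wins(candidates))  # 1.7.7 1.8.2
```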
[jira] [Assigned] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution
[ https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20343:
Assignee: Apache Spark

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution
> Key: SPARK-20343
[jira] [Commented] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution
[ https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969738#comment-15969738 ]

Apache Spark commented on SPARK-20343:
User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17642

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution
> Key: SPARK-20343
[jira] [Updated] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution
[ https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-20343:
Summary: SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution (was: Resolve SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution)

> SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution
> Key: SPARK-20343
[jira] [Updated] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution
[ https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-20343:
Summary: SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution (was: SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution)

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution
> Key: SPARK-20343
[jira] [Updated] (SPARK-20343) Resolve SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution
[ https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-20343:
Summary: Resolve SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution (was: Resolve SBT master build for Hadoop 2.6 due to Avro version resolution)

> Resolve SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution
> Key: SPARK-20343
[jira] [Commented] (SPARK-20343) Resolve SBT master build for Hadoop 2.6 due to Avro version resolution
[ https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969709#comment-15969709 ]

Hyukjin Kwon commented on SPARK-20343:
I don't know the {{Priority}} in this case. Please correct this if anyone knows.

> Resolve SBT master build for Hadoop 2.6 due to Avro version resolution
> Key: SPARK-20343
[jira] [Created] (SPARK-20343) Resolve SBT master build for Hadoop 2.6 due to Avro version resolution
Hyukjin Kwon created SPARK-20343:
Summary: Resolve SBT master build for Hadoop 2.6 due to Avro version resolution
Key: SPARK-20343
URL: https://issues.apache.org/jira/browse/SPARK-20343
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
[jira] [Commented] (SPARK-20340) Size estimate very wrong in ExternalAppendOnlyMap from CoGroupedRDD, causing OOM
[ https://issues.apache.org/jira/browse/SPARK-20340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969680#comment-15969680 ]

Shixiong Zhu commented on SPARK-20340:
I think it's just a trade-off between accuracy and performance.

> Size estimate very wrong in ExternalAppendOnlyMap from CoGroupedRDD, causing OOM
>
> Key: SPARK-20340
> URL: https://issues.apache.org/jira/browse/SPARK-20340
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Thomas Graves
>
> I had a user doing a basic join operation. The values are images' binary data (in base64 format) and vary widely in size.
> The job failed with out of memory. It originally failed on YARN from using too much overhead memory; after turning spark.shuffle.io.preferDirectBufs to false, it failed with out-of-heap memory. I debugged it down to the shuffle: when CoGroupedRDD puts things into the ExternalAppendOnlyMap, it computes an estimated size to determine when to spill. SizeEstimator handles arrays such that if an array is larger than 400 elements, it samples 100 elements. The estimate is coming back GBs different from the actual size: it claims 1GB when the map is actually using close to 5GB.
> A temporary workaround is to increase the memory to be very large (10GB executors), but that isn't really acceptable here. The user did the same thing in Pig and it easily handled the data with 1.5GB of memory.
> It seems risky to be using an estimate for something this critical. If the estimate is wrong, you are going to run out of memory and fail the job.
> I'm still looking closer at the user's data to get more insights.
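The failure mode described above can be reproduced numerically with a toy sampler. This is not Spark's actual SizeEstimator, just the general "sample 100 elements when the array exceeds 400" idea applied to a skewed size distribution, where the rare huge elements dominate the total but are easily missed by the sample.

```python
import random

# Toy sketch of sampling-based size estimation (not Spark's actual
# SizeEstimator): for arrays over 400 elements, estimate the total
# size by sampling 100 elements and extrapolating.

def estimate_size(sizes, sample_n=100, threshold=400, seed=0):
    if len(sizes) <= threshold:
        return sum(sizes)
    rng = random.Random(seed)
    sample = rng.sample(sizes, sample_n)
    return sum(sample) * len(sizes) / sample_n

# Heavily skewed data: most elements are ~1KB, a handful are ~100MB,
# mimicking base64 image blobs that vary widely in size.
sizes = [1_000] * 10_000 + [100_000_000] * 20
actual = sum(sizes)               # 2.01e9 bytes, ~2 GB
estimated = estimate_size(sizes)  # can be GBs off, depending on whether
                                  # the sample happens to hit the rare
                                  # huge elements
```

With a sample of 100 out of 10,020 elements, the expected number of huge elements in the sample is about 0.2, so the estimate is usually dominated by the small elements and wildly wrong, which is exactly the spill-threshold hazard the issue describes.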
[jira] [Updated] (SPARK-20341) Support BigInteger values > 19 precision
[ https://issues.apache.org/jira/browse/SPARK-20341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-20341:
Component/s: SQL (was: Spark Core)

> Support BigInteger values > 19 precision
>
> Key: SPARK-20341
> URL: https://issues.apache.org/jira/browse/SPARK-20341
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.1.0
> Reporter: Paul Zaczkieiwcz
>
> If you create a {{Dataset\[scala.math.BigInt\]}}, then you can't have a precision > 19.
> {code}
> scala> case class BigIntWrapper(value:scala.math.BigInt)
> defined class BigIntWrapper
> scala> val longDf = spark.createDataset(BigIntWrapper(scala.math.BigInt("1002"))::Nil)
> 17/04/14 19:45:15 main INFO CodeGenerator: Code generated in 211.949738 ms
> java.lang.RuntimeException: Error while encoding: java.lang.IllegalArgumentException: requirement failed: BigInteger 1002 too large for decimal
> staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(input[0, BigIntWrapper, true], top level non-flat input object).value, true) AS value#16
> +- staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(input[0, BigIntWrapper, true], top level non-flat input object).value, true)
>    +- assertnotnull(input[0, BigIntWrapper, true], top level non-flat input object).value
>       +- assertnotnull(input[0, BigIntWrapper, true], top level non-flat input object)
>          +- input[0, BigIntWrapper, true]
>   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:280)
>   at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
>   at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:421)
>   ... 54 elided
> Caused by: java.lang.IllegalArgumentException: requirement failed: BigInteger 1002 too large for decimal
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:137)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:419)
>   at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:277)
>   ... 62 more
> {code}
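The check that trips in {{Decimal.set}} boils down to a significant-digit count against a maximum precision. A few-line sketch of that general idea, assuming the 38-digit maximum implied by DecimalType(38,0) above (this is not Spark's actual implementation):

```python
# Sketch of a fixed-precision check like the one rejecting the BigInt
# above: a decimal with precision p holds at most p significant digits.
# Assumes a maximum precision of 38, per DecimalType(38,0) in the error.

MAX_PRECISION = 38

def fits_decimal(value: int, precision: int = MAX_PRECISION) -> bool:
    """True iff the integer can be stored without losing digits."""
    return len(str(abs(value))) <= precision

assert fits_decimal(10**19)      # 20 digits: fine at precision 38
assert not fits_decimal(10**38)  # 39 digits: "too large for decimal"
```

So values past 19 digits overflow a Long but still fit DecimalType(38,0); only past 38 significant digits does the "too large for decimal" requirement necessarily fail.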
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969622#comment-15969622 ]

Yan Facai (颜发才) commented on SPARK-20199:
ping [~jkbreuer] [~sethah] [~mengxr]. Which one is better?

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate parameter. This parameter is available in H2O and XGBoost.
> Sample from H2O.ai: gbmParams._col_sample_rate
> Please provide the parameter.
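What the requested parameter means can be shown in a few lines: each boosting iteration trains its tree on a random fraction of the feature columns. This is an illustration of the concept as found in H2O/XGBoost, not Spark MLlib code, and the function names are invented for the sketch.

```python
import random

# Sketch of per-tree column subsampling (col_sample_rate): each boosting
# iteration trains on a random fraction of the feature columns.
# Illustrates the requested parameter; not Spark MLlib code.

def sample_columns(n_features, col_sample_rate, rng):
    """Pick a sorted random subset of column indices for one tree."""
    k = max(1, int(n_features * col_sample_rate))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(42)
n_features = 10
# With col_sample_rate=0.3, each of the 3 trees sees 3 of the 10 columns.
per_tree_cols = [sample_columns(n_features, 0.3, rng) for _ in range(3)]
```

Like row subsampling, this decorrelates the trees and acts as a regularizer, which is why both H2O and XGBoost expose it.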
[jira] [Created] (SPARK-20342) DAGScheduler sends SparkListenerTaskEnd before updating task's accumulators
Marcelo Vanzin created SPARK-20342:
Summary: DAGScheduler sends SparkListenerTaskEnd before updating task's accumulators
Key: SPARK-20342
URL: https://issues.apache.org/jira/browse/SPARK-20342
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.0
Reporter: Marcelo Vanzin

Hit this on 2.2, but it has probably been there forever. This is similar in spirit to SPARK-20205.

The event is sent here, around L1154:

{code}
listenerBus.post(SparkListenerTaskEnd(
  stageId, task.stageAttemptId, taskType, event.reason, event.taskInfo, taskMetrics))
{code}

Accumulators are updated later, around L1173:

{code}
val stage = stageIdToStage(task.stageId)
event.reason match {
  case Success =>
    task match {
      case rt: ResultTask[_, _] =>
        // Cast to ResultStage here because it's part of the ResultTask
        // TODO Refactor this out to a function that accepts a ResultStage
        val resultStage = stage.asInstanceOf[ResultStage]
        resultStage.activeJob match {
          case Some(job) =>
            if (!job.finished(rt.outputId)) {
              updateAccumulators(event)
{code}

The same thing applies here as in SPARK-20205: the UI shows correct info because it points at the mutable {{TaskInfo}} structure, but the event log, for example, may record the wrong information.
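The ordering bug reduces to a small sketch (hypothetical names, not Spark's DAGScheduler): a listener that snapshots state when the end event is posted records the pre-update value, while anything reading the mutable structure afterwards sees the final one.

```python
# Toy sketch of the event-ordering bug (hypothetical names, not Spark's
# DAGScheduler): the task-end event is posted before accumulators are
# updated, so a listener that copies values at post time logs stale data.

class Accumulator:
    def __init__(self):
        self.value = 0

def handle_task_end_buggy(acc, listener_log, task_update):
    # 1. Event posted first: the listener snapshots the accumulator now.
    listener_log.append(acc.value)  # the "event log" records this snapshot
    # 2. Accumulator updated only afterwards.
    acc.value += task_update

acc = Accumulator()
log = []
handle_task_end_buggy(acc, log, task_update=42)

# A live UI reading the mutable object sees 42, but the logged snapshot
# is the stale pre-update value 0.
```

Swapping the two steps (update first, then post) is the shape of the fix.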
[jira] [Resolved] (SPARK-16899) Structured Streaming Checkpointing Example invalid
[ https://issues.apache.org/jira/browse/SPARK-16899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-16899.
Resolution: Not A Problem

This has been fixed. I believe you are using an old version of Spark.

> Structured Streaming Checkpointing Example invalid
>
> Key: SPARK-16899
> URL: https://issues.apache.org/jira/browse/SPARK-16899
> Project: Spark
> Issue Type: Bug
> Components: Documentation, Structured Streaming
> Reporter: Vladimir Feinberg
> Priority: Minor
>
> The structured streaming checkpointing example at the bottom of the page (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing) has the following excerpt:
> {code}
> aggDF
>   .writeStream
>   .outputMode("complete")
>   .option("checkpointLocation", "path/to/HDFS/dir")
>   .format("memory")
>   .start()
> {code}
> But memory sinks are not fault-tolerant. Indeed, trying this out, I get the following error:
> {{This query does not support recovering from checkpoint location. Delete /tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start over.;}}
> The documentation should be changed to demonstrate checkpointing for a non-aggregation streaming task, and explicitly mention there is no way to checkpoint aggregates.
[jira] [Updated] (SPARK-16899) Structured Streaming Checkpointing Example invalid
[ https://issues.apache.org/jira/browse/SPARK-16899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-16899:
Component/s: Structured Streaming

> Structured Streaming Checkpointing Example invalid
> Key: SPARK-16899
[jira] [Commented] (SPARK-20329) Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion
[ https://issues.apache.org/jira/browse/SPARK-20329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969483#comment-15969483 ]

Apache Spark commented on SPARK-20329:
User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/17641

> Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion
>
> Key: SPARK-20329
> URL: https://issues.apache.org/jira/browse/SPARK-20329
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Josh Rosen
> Priority: Blocker
>
> The following example runs without error on Spark 2.0.x and 2.1.x but fails in the current Spark master:
> {code}
> create temporary view foo (a, b) as values (cast(1 as bigint), 2), (cast(3 as bigint), 4);
> select a + b from foo group by a + b having (a + b) > 1
> {code}
> The error is
> {code}
> Error in SQL statement: AnalysisException: cannot resolve '`a`' given input columns: [(a + CAST(b AS BIGINT))]; line 1 pos 45;
> 'Filter (('a + 'b) > 1)
> +- Aggregate [(a#249243L + cast(b#249244 as bigint))], [(a#249243L + cast(b#249244 as bigint)) AS (a + CAST(b AS BIGINT))#249246L]
>    +- SubqueryAlias foo
>       +- Project [col1#249241L AS a#249243L, col2#249242 AS b#249244]
>          +- LocalRelation [col1#249241L, col2#249242]
> {code}
> I think what's happening here is that the implicit cast is breaking things: if we change the types so that both columns are integers then the analysis error disappears. Similarly, adding explicit casts, as in
> {code}
> select a + cast(b as bigint) from foo group by a + cast(b as bigint) having (a + cast(b as bigint)) > 1
> {code}
> works, so I'm pretty sure that the resolution problem is being introduced when the casts are automatically added by the type coercion rule.
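The resolution failure comes down to an expression-equality check: the HAVING clause is parsed as `a + b`, while the aggregate's output is the coerced `a + cast(b as bigint)`, so a structural comparison no longer matches. A toy model (hypothetical classes, not Catalyst code):

```python
# Toy model of why the HAVING expression stops matching after implicit
# coercion (hypothetical, not Catalyst): expressions compare structurally,
# and a Cast node inserted on one side breaks equality.

from dataclasses import dataclass

@dataclass(frozen=True)
class Col:
    name: str

@dataclass(frozen=True)
class Cast:
    child: object
    to: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

having_expr = Add(Col("a"), Col("b"))                 # as written: a + b
group_expr = Add(Col("a"), Cast(Col("b"), "bigint"))  # after coercion

# Structural equality fails, so the HAVING clause cannot be matched
# against the aggregate's output column.
print(having_expr == group_expr)  # False
```

This also explains why integer-only columns and hand-written explicit casts both work: in those cases the two trees are structurally identical.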
[jira] [Assigned] (SPARK-20329) Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion
[ https://issues.apache.org/jira/browse/SPARK-20329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20329:
Assignee: Apache Spark

> Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion
> Key: SPARK-20329
[jira] [Assigned] (SPARK-20329) Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion
[ https://issues.apache.org/jira/browse/SPARK-20329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20329:
Assignee: (was: Apache Spark)

> Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion
> Key: SPARK-20329
[jira] [Created] (SPARK-20341) Support BigInteger values > 19 precision
Paul Zaczkieiwcz created SPARK-20341:
Summary: Support BigInteger values > 19 precision
Key: SPARK-20341
URL: https://issues.apache.org/jira/browse/SPARK-20341
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0, 2.0.2
Reporter: Paul Zaczkieiwcz
[jira] [Created] (SPARK-20340) Size estimate very wrong in ExternalAppendOnlyMap from CoGroupedRDD, causes OOM
Thomas Graves created SPARK-20340: - Summary: Size estimate very wrong in ExternalAppendOnlyMap from CoGroupedRDD, causes OOM Key: SPARK-20340 URL: https://issues.apache.org/jira/browse/SPARK-20340 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Thomas Graves I had a user doing a basic join operation. The values are images' binary data (in base64 format) and vary widely in size. The job failed with an out-of-memory error. It originally failed on YARN from using too much overhead memory; after turning spark.shuffle.io.preferDirectBufs to false it failed with an out-of-heap-memory error. I debugged it down to the shuffle: when CoGroupedRDD puts things into the ExternalAppendOnlyMap, it computes an estimated size to determine when to spill. SizeEstimator handles arrays such that if an array is larger than 400 elements, it samples 100 elements. The estimate is coming back GBs different from the actual size: it claims 1GB when it is actually using close to 5GB. A temporary workaround is to increase the memory to be very large (10GB executors), but that isn't really acceptable here. The user did the same thing in Pig and it easily handled the data with 1.5GB of memory. It seems risky to be using an estimate for something this critical. If the estimate is wrong you are going to run out of memory and fail the job. I'm still looking closer at the user's data to get more insights. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
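The failure mode described in this report can be reproduced with a toy model: extrapolate the total size of a large array from 100 sampled elements, the way SizeEstimator does for arrays longer than 400 elements. With a few huge outliers among many small records, the sample misses almost all of the bytes. This is an illustration with synthetic sizes, not Spark's actual SizeEstimator code, and for determinism it samples a fixed slice rather than sampling randomly as Spark does:

```java
import java.util.Arrays;

public class SamplingEstimateSkew {
    // Sum of the real per-record sizes.
    static long actualTotal(long[] sizes) {
        long total = 0;
        for (long s : sizes) total += s;
        return total;
    }

    // Extrapolate from a 100-element sample, as SizeEstimator does for
    // arrays longer than 400 elements. For determinism this sketch samples
    // the first 100 records; Spark's estimator samples pseudo-randomly.
    static long sampledEstimate(long[] sizes) {
        int samples = 100;
        long sampleSum = 0;
        for (int i = 0; i < samples; i++) sampleSum += sizes[i];
        return (sampleSum / samples) * sizes.length;
    }

    public static void main(String[] args) {
        // One million ~100-byte records plus 50 outliers of ~50 MB each,
        // mimicking base64 image blobs of widely varying size.
        long[] sizes = new long[1_000_000];
        Arrays.fill(sizes, 100);
        for (int i = 0; i < 50; i++) sizes[1_000 + i * 20_000] = 50_000_000;

        long actual = actualTotal(sizes);       // ~2.6 GB
        long estimate = sampledEstimate(sizes); // 100 MB: every outlier missed
        System.out.println("actual=" + actual + " estimate=" + estimate);
    }
}
```

A sample that happens to miss the outliers reports ~26x less memory than is actually in use, which is exactly the "claims 1GB, uses 5GB" shape reported above.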
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969422#comment-15969422 ] Ritesh Tijoriwala commented on SPARK-650: - [~Skamandros] - I would also like to know about hooking 'JavaSerializer'. I have a similar use case where I need to initialize a set of objects/resources on each executor. I would also like to know if anybody has a way to hook into some "clean up" on each executor 1) when the executor shuts down and 2) when a batch finishes, before the next batch starts > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
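Absent a setup-hook API, the workaround usually suggested on this ticket is per-JVM lazy initialization: every task calls into a holder that initializes the shared resource once per executor JVM (e.g. from inside {{mapPartitions}}). A minimal sketch of the pattern; the class names and the String "client" are stand-ins for a real reporting/metrics library, not a Spark API:

```java
public class ExecutorSideResource {
    static int initCount = 0; // exposed only so the demo can show init ran once

    // Initialization-on-demand holder: the JVM guarantees CLIENT is
    // created exactly once, on first access, thread-safely.
    private static final class Holder {
        static final String CLIENT = init();
    }

    private static String init() {
        initCount++;
        // A real job would construct its reporting client here.
        return "reporting-client-ready";
    }

    // Every task on the executor calls this; only the first one pays.
    public static String client() {
        return Holder.CLIENT;
    }

    public static void main(String[] args) {
        for (int task = 0; task < 5; task++) client(); // five "tasks", one init
        System.out.println(initCount); // 1
    }
}
```

This gives per-executor setup but, as the comment notes, still offers no clean-up hook on executor shutdown (a JVM shutdown hook is the usual best-effort substitute).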
[jira] [Updated] (SPARK-20339) Issue in regex_replace in Apache Spark Java
[ https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nischay updated SPARK-20339: Description: We are currently facing a couple of issues: 1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator grows beyond 64 KB" 2. "java.lang.StackOverflowError" The first issue is reported as a Major bug in Apache Spark's JIRA: https://issues.apache.org/jira/browse/SPARK-18492 We hit these issues with the following program. We are trying to replace each manufacturer name with its equivalent alternate name. These issues occur only when we have a huge number of alternate names to replace; for a small number of replacements it works with no issues. dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString())); Kindly suggest an alternative method or a solution to work around this problem. Hashtable manufacturerNames = new Hashtable(); Enumeration names; String str; double bal; manufacturerNames.put("Allen","Apex Tool Group"); manufacturerNames.put("Armstrong","Apex Tool Group"); manufacturerNames.put("Campbell","Apex Tool Group"); manufacturerNames.put("Lubriplate","Apex Tool Group"); manufacturerNames.put("Delta","Apex Tool Group"); manufacturerNames.put("Gearwrench","Apex Tool Group"); manufacturerNames.put("H.K. 
Porter","Apex Tool Group"); manufacturerNames.put("Jacobs","Apex Tool Group"); manufacturerNames.put("Jobox","Apex Tool Group"); manufacturerNames.put("Lufkin","Apex Tool Group"); manufacturerNames.put("Nicholson","Apex Tool Group"); manufacturerNames.put("Plumb","Apex Tool Group"); manufacturerNames.put("Wiss","Apex Tool Group"); manufacturerNames.put("Covert","Apex Tool Group"); manufacturerNames.put("Apex-Geta","Apex Tool Group"); manufacturerNames.put("Dotco-Airetool","Apex Tool Group"); manufacturerNames.put("Apex","Apex Tool Group"); manufacturerNames.put("Cleco","Apex Tool Group"); manufacturerNames.put("Dotco","Apex Tool Group"); manufacturerNames.put("Erem","Apex Tool Group"); manufacturerNames.put("Master Power","Apex Tool Group"); manufacturerNames.put("Recoules Quackenbush","Apex Tool Group"); manufacturerNames.put("Apex-Utica","Apex Tool Group"); manufacturerNames.put("Weller","Apex Tool Group"); manufacturerNames.put("Xcelite","Apex Tool Group"); manufacturerNames.put("JET","JPW Industries"); manufacturerNames.put("Powermatic","JPW Industries"); manufacturerNames.put("Wilton","JPW Industries"); manufacturerNames.put("Black+Decker","StanleyBlack & Decker"); manufacturerNames.put("BlackhawkBy Proto","StanleyBlack & Decker"); manufacturerNames.put("Bostitch","StanleyBlack & Decker"); manufacturerNames.put("Cribmaster","StanleyBlack & Decker"); manufacturerNames.put("DeWALT","StanleyBlack & Decker"); manufacturerNames.put("Expert (Hand Tools & Accessories); Expert (Wrenches)","StanleyBlack & Decker"); manufacturerNames.put("Facom","StanleyBlack & Decker"); manufacturerNames.put("Mac","StanleyBlack & Decker"); manufacturerNames.put("Lista","StanleyBlack & Decker"); manufacturerNames.put("Porter-Cable","StanleyBlack & Decker"); manufacturerNames.put("Powers","StanleyBlack & Decker"); manufacturerNames.put("Proto","StanleyBlack & Decker"); manufacturerNames.put("Stanley","StanleyBlack & Decker"); manufacturerNames.put("Vidmar","StanleyBlack & Decker"); 
manufacturerNames.put("Abell-Howe","Columbus McKinnon"); manufacturerNames.put("Budgit Hoists","Columbus McKinnon"); manufacturerNames.put("Cady Lifters","Columbus McKinnon"); manufacturerNames.put("Chester Hoist","Columbus McKinnon");
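One alternative to the chained {{regexp_replace}} calls above (each call grows the generated code toward the 64 KB method limit) is a single map lookup applied once per row, e.g. from a UDF. A sketch in plain Java using a few of the entries from the description; registering the function as a Spark UDF is omitted, and note the assumption that the goal is exact-name matching, whereas {{regexp_replace}} also rewrites substring matches:

```java
import java.util.HashMap;
import java.util.Map;

public class ManufacturerNormalizer {
    // One shared lookup table instead of one regexp_replace call per entry.
    static final Map<String, String> NAMES = new HashMap<>();
    static {
        NAMES.put("Allen", "Apex Tool Group");
        NAMES.put("JET", "JPW Industries");
        NAMES.put("DeWALT", "StanleyBlack & Decker");
        // ... remaining alternate names from the table above ...
    }

    // A single lookup replaces the whole chain of regexp_replace calls;
    // unknown names pass through unchanged.
    static String normalize(String source) {
        return NAMES.getOrDefault(source, source);
    }

    public static void main(String[] args) {
        System.out.println(normalize("DeWALT")); // mapped to its parent company
        System.out.println(normalize("Makita")); // no mapping: returned as-is
    }
}
```

Because the lookup happens inside one function, the generated code size no longer grows with the number of alternate names.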
[jira] [Created] (SPARK-20339) Issue in regex_replace in Apache Spark Java
Nischay created SPARK-20339: --- Summary: Issue in regex_replace in Apache Spark Java Key: SPARK-20339 URL: https://issues.apache.org/jira/browse/SPARK-20339 Project: Spark Issue Type: Question Components: Java API, Spark Core, SQL Affects Versions: 2.1.0 Reporter: Nischay We are currently facing a couple of issues: 1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator grows beyond 64 KB" 2. "java.lang.StackOverflowError" The first issue is reported as a Major bug in Apache Spark's JIRA: https://issues.apache.org/jira/browse/SPARK-18492 We hit these issues with the following program. We are trying to replace each manufacturer name with its equivalent alternate name. These issues occur only when we have a huge number of alternate names to replace; for a small number of replacements it works with no issues. `dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));` Kindly suggest an alternative method or a solution to work around this problem. Hashtable manufacturerNames = new Hashtable(); Enumeration names; String str; double bal; manufacturerNames.put("Allen","Apex Tool Group"); manufacturerNames.put("Armstrong","Apex Tool Group"); manufacturerNames.put("Campbell","Apex Tool Group"); manufacturerNames.put("Lubriplate","Apex Tool Group"); manufacturerNames.put("Delta","Apex Tool Group"); manufacturerNames.put("Gearwrench","Apex Tool Group"); manufacturerNames.put("H.K. 
Porter","Apex Tool Group"); manufacturerNames.put("Jacobs","Apex Tool Group"); manufacturerNames.put("Jobox","Apex Tool Group"); manufacturerNames.put("Lufkin","Apex Tool Group"); manufacturerNames.put("Nicholson","Apex Tool Group"); manufacturerNames.put("Plumb","Apex Tool Group"); manufacturerNames.put("Wiss","Apex Tool Group"); manufacturerNames.put("Covert","Apex Tool Group"); manufacturerNames.put("Apex-Geta","Apex Tool Group"); manufacturerNames.put("Dotco-Airetool","Apex Tool Group"); manufacturerNames.put("Apex","Apex Tool Group"); manufacturerNames.put("Cleco","Apex Tool Group"); manufacturerNames.put("Dotco","Apex Tool Group"); manufacturerNames.put("Erem","Apex Tool Group"); manufacturerNames.put("Master Power","Apex Tool Group"); manufacturerNames.put("Recoules Quackenbush","Apex Tool Group"); manufacturerNames.put("Apex-Utica","Apex Tool Group"); manufacturerNames.put("Weller","Apex Tool Group"); manufacturerNames.put("Xcelite","Apex Tool Group"); manufacturerNames.put("JET","JPW Industries"); manufacturerNames.put("Powermatic","JPW Industries"); manufacturerNames.put("Wilton","JPW Industries"); manufacturerNames.put("Black+Decker","StanleyBlack & Decker"); manufacturerNames.put("BlackhawkBy Proto","StanleyBlack & Decker"); manufacturerNames.put("Bostitch","StanleyBlack & Decker"); manufacturerNames.put("Cribmaster","StanleyBlack & Decker"); manufacturerNames.put("DeWALT","StanleyBlack & Decker"); manufacturerNames.put("Expert (Hand Tools & Accessories); Expert (Wrenches)","StanleyBlack & Decker"); manufacturerNames.put("Facom","StanleyBlack & Decker"); manufacturerNames.put("Mac","StanleyBlack & Decker"); manufacturerNames.put("Lista","StanleyBlack & Decker"); manufacturerNames.put("Porter-Cable","StanleyBlack & Decker"); manufacturerNames.put("Powers","StanleyBlack & Decker"); manufacturerNames.put("Proto","StanleyBlack & Decker"); manufacturerNames.put("Stanley","StanleyBlack & Decker"); manufacturerNames.put("Vidmar","StanleyBlack & Decker"); 
manufacturerNames.put("Abell-Howe","Columbus McKinnon"); manufacturerNames.put("Budgit
[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures
[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969388#comment-15969388 ] Thomas Graves commented on SPARK-20178: --- One thing I ran into today which is somewhat related to this is a combination of failure types. In this case it was broadcast fetch failures combined with shuffle fetch failures which led to 4 task failures and failed the job. I believe they were all from the same host and happened really quickly (within 3 seconds). It seems like this should fall under the fetch-failure case as well. Failed to get broadcast_646_piece0 of broadcast_646 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:691) at org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:204) at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:143) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:47) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:147) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > Improve Scheduler fetch failures > > > Key: SPARK-20178 > URL: https://issues.apache.org/jira/browse/SPARK-20178 > Project: Spark > Issue Type: Epic > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > We have been having a lot of discussions around improving the handling of > fetch failures. There are 4 JIRAs currently related to this. > We should try to get a list of things we want to improve and come up with one > cohesive design. > SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753 > I will put my initial thoughts in a follow-on comment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969384#comment-15969384 ] Yongqin Xiao commented on SPARK-18406: -- The same JIRA is found under https://issues-test.apache.org/jira/browse/SPARK-18406. The same issue is observed in Spark 2.1.0 as well. The issue is observed on a simple Spark query that compiles into 3 stages. A custom RDD is used in this case, and it is registered for persistence. In the compute() method of the custom RDD, I spawn a new thread to compute the data in the background, and then immediately return an abstract iterator (wrapped in an InterruptibleIterator) that gets data from the background computation on demand. The assertion happens when the iterator of the parent RDD reaches the end of the data. This issue doesn't always happen when the custom RDD is used in the query, regardless of whether it is used once or multiple times. The issue is related to the new thread I created, which accesses data from the input iterator of the parent RDD. The new thread is missing the TSS (thread-specific storage) for TaskContext. I see BlockInfoManager is using this TSS TaskContext as the key to search the storage. Here is a log line showing the task ID being unset in this thread: Line 1819: 2017/04/13 15:01:01.674 [Thread-33]: TRACE storage.BlockInfoManager: Task -1024 releasing lock for rdd_25_0 However, I have no way to set the TSS for my thread now because the method is made protected as below: object TaskContext { ... private[this] val taskContext: ThreadLocal[TaskContext] = new ThreadLocal[TaskContext] // Note: protected[spark] instead of private[spark] to prevent the following two from // showing up in JavaDoc. /** Set the thread local TaskContext. Internal to Spark. */ protected[spark] def setTaskContext(tc: TaskContext): Unit = taskContext.set(tc) Just to confirm my theory, I made TaskContext.setTaskContext public and called it at the beginning of my thread. 
The use cases that were failing consistently with an assertion on lock release now run successfully in all scenarios I have tried, which include different numbers of source/shuffle partitions, different numbers of executors, and async vs. sequential execution (with multiple downstream custom RDDs pulling data from the upstream RDD). > Race between end-of-task and completion iterator read lock release > -- > > Key: SPARK-18406 > URL: https://issues.apache.org/jira/browse/SPARK-18406 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.0.0, 2.0.1 >Reporter: Josh Rosen > > The following log comes from a production streaming job where executors > periodically die due to uncaught exceptions during block release: > {code} > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921 > 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922 > 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923 > 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923) > 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable > 2721 > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924 > 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924) > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as > bytes in memory (estimated size 5.0 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took > 3 ms > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in > memory (estimated size 9.4 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally > 16/11/07 17:11:06 INFO BlockManager: Found 
block rdd_2741_2 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = > 567, finish = 1 > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = > 541, finish = 6 > 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID > 7923). 1429 bytes result sent to driver > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = > 533, finish = 7 > 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID > 7924). 1429 bytes result sent to driver > 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID > 7921) > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at
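The root cause described in the comment above is generic: a {{ThreadLocal}} (like Spark's TaskContext storage) set on the task thread is not visible in a thread spawned inside {{compute()}} unless the parent captures the value and the child re-sets it. A minimal sketch of that hand-off; the {{taskContext}} field here is a stand-in for Spark's {{TaskContext}}, which is why making {{setTaskContext}} callable fixes the assertion:

```java
public class ThreadLocalPropagation {
    // Stand-in for Spark's TaskContext thread-local storage.
    static final ThreadLocal<String> taskContext = new ThreadLocal<>();

    // Returns what a spawned worker thread sees after the parent
    // explicitly hands its context over.
    static String runChildWithContext() throws InterruptedException {
        taskContext.set("task-42");
        final String captured = taskContext.get(); // capture on the parent thread
        final String[] seenInChild = new String[1];

        Thread child = new Thread(() -> {
            // Without this set(), get() returns null in the child thread --
            // the analogue of BlockInfoManager logging "Task -1024".
            taskContext.set(captured);
            seenInChild[0] = taskContext.get();
        });
        child.start();
        child.join();
        return seenInChild[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runChildWithContext()); // the child sees "task-42"
    }
}
```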
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969379#comment-15969379 ] Michael Gummelt commented on SPARK-20328: - Ah, yes, of course. Thanks. > HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs > - > > Key: SPARK-20328 > URL: https://issues.apache.org/jira/browse/SPARK-20328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1, 2.1.2 >Reporter: Michael Gummelt > > In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a > MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 > Semantically, this is a problem because a HadoopRDD does not represent a > Hadoop MapReduce job. Practically, this is a problem because this line: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 > results in this MapReduce-specific security code being called: > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, > which assumes the MapReduce master is configured (e.g. via > {{yarn.resourcemanager.*}}). If it isn't, an exception is thrown. 
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for > the Spark Mesos scheduler: > {code} > Exception in thread "main" java.io.IOException: Can't get Master Kerberos > principal for use as renewer > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) > {code} > I have a workaround where I set a YARN-specific configuration variable to > trick {{TokenCache}} into thinking YARN is configured, but this is obviously > suboptimal. > The proper fix to this would likely require significant {{hadoop}} > refactoring to make split information available without going through > {{JobConf}}, so I'm not yet sure what the best course of action is. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969341#comment-15969341 ] Michael Gummelt commented on SPARK-16742: - [~jerryshao] No, but you can look at our solution here: https://github.com/mesosphere/spark/commit/0a2cc4248039ca989e177e96e92a594a025661fe#diff-79391110e9f26657e415aa169a004998R129 > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969341#comment-15969341 ] Michael Gummelt edited comment on SPARK-16742 at 4/14/17 6:01 PM: -- [~jerryshao] No, but you can look at our solution here: https://github.com/mesosphere/spark/commit/0a2cc4248039ca989e177e96e92a594a025661fe#diff-79391110e9f26657e415aa169a004998R129 The code we upstream will be quite different, but the delegation token handling will be similar. was (Author: mgummelt): [~jerryshao] No, but you can look at our solution here: https://github.com/mesosphere/spark/commit/0a2cc4248039ca989e177e96e92a594a025661fe#diff-79391110e9f26657e415aa169a004998R129 > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17608) Long type has incorrect serialization/deserialization
[ https://issues.apache.org/jira/browse/SPARK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17608: Assignee: (was: Apache Spark) > Long type has incorrect serialization/deserialization > - > > Key: SPARK-17608 > URL: https://issues.apache.org/jira/browse/SPARK-17608 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Thomas Powell > > Am hitting issues when using {{dapply}} on a data frame that contains a > {{bigint}} in its schema. When this is converted to a SparkR data frame a > "bigint" gets converted to an R {{numeric}} type: > https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25. > However, the R {{numeric}} type gets converted to > {{org.apache.spark.sql.types.DoubleType}}: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L97. > The two directions therefore aren't compatible. If I use the same schema when > using dapply (and just an identity function) I will get type collisions > because the output type is a double but the schema expects a bigint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17608) Long type has incorrect serialization/deserialization
[ https://issues.apache.org/jira/browse/SPARK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17608: Assignee: Apache Spark > Long type has incorrect serialization/deserialization > - > > Key: SPARK-17608 > URL: https://issues.apache.org/jira/browse/SPARK-17608 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Thomas Powell >Assignee: Apache Spark > > Am hitting issues when using {{dapply}} on a data frame that contains a > {{bigint}} in its schema. When this is converted to a SparkR data frame a > "bigint" gets converted to an R {{numeric}} type: > https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25. > However, the R {{numeric}} type gets converted to > {{org.apache.spark.sql.types.DoubleType}}: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L97. > The two directions therefore aren't compatible. If I use the same schema when > using dapply (and just an identity function) I will get type collisions > because the output type is a double but the schema expects a bigint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17608) Long type has incorrect serialization/deserialization
[ https://issues.apache.org/jira/browse/SPARK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969328#comment-15969328 ] Apache Spark commented on SPARK-17608: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/17640 > Long type has incorrect serialization/deserialization > - > > Key: SPARK-17608 > URL: https://issues.apache.org/jira/browse/SPARK-17608 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Thomas Powell > > Am hitting issues when using {{dapply}} on a data frame that contains a > {{bigint}} in its schema. When this is converted to a SparkR data frame a > "bigint" gets converted to an R {{numeric}} type: > https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25. > However, the R {{numeric}} type gets converted to > {{org.apache.spark.sql.types.DoubleType}}: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L97. > The two directions therefore aren't compatible. If I use the same schema when > using dapply (and just an identity function) I will get type collisions > because the output type is a double but the schema expects a bigint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
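The incompatibility in this issue boils down to IEEE-754: R's {{numeric}} is a 64-bit double with a 53-bit significand, so it cannot represent every 64-bit integer, and a bigint → numeric → DoubleType round trip silently loses precision past 2^53. A minimal demonstration of that boundary in plain Java (no SparkR involved):

```java
public class LongDoubleRoundTrip {
    // Does this long survive a trip through a 64-bit double?
    static boolean roundTrips(long v) {
        return (long) (double) v == v;
    }

    public static void main(String[] args) {
        long exact = 1L << 53; // 2^53: the last value in the contiguous exact range
        System.out.println(roundTrips(exact));     // survives
        System.out.println(roundTrips(exact + 1)); // lost: rounds back to 2^53
    }
}
```

This is why mapping bigint to R's double-backed numeric (and back) cannot be made lossless in general.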
[jira] [Updated] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race
[ https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-20243: -- Fix Version/s: 2.1.1 > DebugFilesystem.assertNoOpenStreams thread race > --- > > Key: SPARK-20243 > URL: https://issues.apache.org/jira/browse/SPARK-20243 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu > Fix For: 2.1.1, 2.2.0 > > > Introduced by SPARK-19946. > DebugFilesystem.assertNoOpenStreams gets the size of the openStreams > ConcurrentHashMap and then later, if the size was > 0, accesses the first > element in openStreams.values. But, the ConcurrentHashMap might be cleared by > another thread between getting its size and accessing it, resulting in an > exception when trying to call .head on an empty collection. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
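The bug described above is a classic check-then-act race: {{size()}} and the subsequent read of the first value are two separate operations on a {{ConcurrentHashMap}}, and another thread may clear the map in between. The race-free shape is to skip the size check and take a single pass over the values, handling emptiness explicitly. An illustrative Java sketch (DebugFilesystem's actual code is Scala; the method name here is hypothetical):

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentHashMap;

public class OpenStreamCheck {
    // Race-prone shape:
    //   if (map.size() > 0) use(map.values().iterator().next());
    // A concurrent clear() between the two calls makes next() throw.
    // Safe shape: read once and branch on what was actually seen.
    static String firstOpenStreamOrNull(ConcurrentHashMap<String, String> openStreams) {
        Iterator<String> it = openStreams.values().iterator();
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, String> streams = new ConcurrentHashMap<>();
        System.out.println(firstOpenStreamOrNull(streams)); // null, no exception
        streams.put("part-00000", "open");
        System.out.println(firstOpenStreamOrNull(streams)); // "open"
    }
}
```

ConcurrentHashMap's iterators are weakly consistent and never throw ConcurrentModificationException, so this single-pass read stays safe even while other threads mutate the map.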
[jira] [Commented] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array
[ https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969051#comment-15969051 ] Apache Spark commented on SPARK-19716: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17639 > Dataset should allow by-name resolution for struct type elements in array > - > > Key: SPARK-19716 > URL: https://issues.apache.org/jira/browse/SPARK-19716 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > > if we have a DataFrame with schema {{a: int, b: int, c: int}}, and convert it > to Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will > extract the `a` and `c` columns to build the Data. > However, if the struct is inside array, e.g. schema is {{arr: array<struct<a: int, b: int, c: int>>}}, and we want to convert it to Dataset with {{case class > ComplexData(arr: Seq[Data])}}, we will fail. The reason is, to allow > compatible types, e.g. convert {{a: int}} to {{case class A(a: Long)}}, we > will add cast for each field, except struct type field, because struct type > is flexible, the number of columns can mismatch. We should probably also skip > cast for array and map type. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang
[ https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969042#comment-15969042 ] Andrei Taleanu commented on SPARK-20323: Ok, thank you. Unfortunately I can't even get at-least-once - you can see that from the example I provided in the last comment. Anyway, I'll try that. > Calling stop in a transform stage causes the app to hang > > > Key: SPARK-20323 > URL: https://issues.apache.org/jira/browse/SPARK-20323 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrei Taleanu > > I'm not sure if this is a bug or just the way it needs to happen but I've run > into this issue with the following code: > {noformat} > object ImmortalStreamingJob extends App { > val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]") > val ssc = new StreamingContext(conf, Seconds(1)) > val elems = (1 to 1000).grouped(10) > .map(seq => ssc.sparkContext.parallelize(seq)) > .toSeq > val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*)) > val transformed = stream.transform { rdd => > try { > if (Random.nextInt(6) == 5) throw new RuntimeException("boom") > else println("lucky bastard") > rdd > } catch { > case e: Throwable => > println("stopping streaming context", e) > ssc.stop(stopSparkContext = true, stopGracefully = false) > throw e > } > } > transformed.foreachRDD { rdd => > println(rdd.collect().mkString(",")) > } > ssc.start() > ssc.awaitTermination() > } > {noformat} > There are two things I can note here: > * if the exception is thrown in the first transformation (when the first RDD > is processed), the spark context is stopped and the app dies > * if the exception is thrown after at least one RDD has been processed, the > app hangs after printing the error message and never stops > I think there's some sort of deadlock in the second case, is that normal? 
I > also asked this > [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624] > but up two this point there's no answer pointing exactly to what happens, > only guidelines. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang
[ https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969030#comment-15969030 ] Sean Owen commented on SPARK-20323: --- It's normal to continue processing if one batch fails. Achieving the semantics you want is up to your app and depends on your source and destination. I don't think you can get exactly-once semantics here because there's no way to make the DB write and the Kafka update atomic. You can get at-least-once or at-most-once, though. But that is not the question here. You're just asking how to stop the context, and I think it's simply best to do so with a try-finally wrapping ssc.awaitTermination(). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
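The try-finally suggestion above can be sketched with a small pure-Python stand-in (no Spark; every name here is hypothetical): a failing batch only records the failure and signals the driver loop, while the shutdown itself happens on the driver thread, outside any streaming operation.

```python
import threading

class FakeStreamingContext:
    """A stand-in for a streaming driver: runs batches until asked to stop."""
    def __init__(self):
        self._stop = threading.Event()
        self.failure = None

    def signal_stop(self, exc):
        # Called from a failing batch: record the failure and ask the
        # driver loop to stop -- do NOT tear anything down here.
        self.failure = exc
        self._stop.set()

    def await_termination(self, batches, process):
        for batch in batches:
            if self._stop.is_set():
                break
            try:
                process(batch)
            except Exception as e:
                self.signal_stop(e)

def process(batch):
    if batch == 3:
        raise RuntimeError("boom")

ssc = FakeStreamingContext()
try:
    ssc.await_termination(range(10), process)
finally:
    # The actual shutdown lives here, on the driver thread, mirroring
    # try { ssc.awaitTermination() } finally { ssc.stop() } in Spark.
    stopped = True

print("stopped, failure =", ssc.failure)
```

The design point is the separation: batch code never calls stop directly (which is where the reported deadlock arises); it only flags the driver, which owns the shutdown.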
[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang
[ https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968969#comment-15968969 ] Andrei Taleanu commented on SPARK-20323: [~srowen] I see. Let me describe the problem better. Short version: I have a *streaming job*. Although a *batch's processing fails, the processing continues* if I let Spark alone handle the thrown exceptions. This translates to data loss and the loss of at-least-once semantics. Detailed version: I started originally from an app we run on Spark 2.1.0 on top of Mesos w/ Hadoop 2.6, checkpointing disabled (it's done "manually" as you'll see below). I tried narrowing it down as much as possible to reproduce a similar issue in local mode, just for illustration purposes (that's where the code I put in the issue description came from). Consider the following use-case:
{noformat}
1) read data from a Kafka source
2) transform the dstream:
   a) get data from an external service to avoid too many calls from executors (might fail)
   b) broadcast the data
   c) map the RDD using the broadcast value
3) cache the transformed dstream
4) foreach RDD: write cached data into a db (might fail)
5) foreach RDD:
   a) write cached data in Kafka (might fail)
   b) manually commit the new Kafka offsets (because I need a human-readable format)
{noformat}
There are multiple points of failure here (e.g. 2.a) and what I need is to fail as soon as possible (see 5.b, which means data loss if anything prior to it failed in a micro-batch). Obviously manipulating the context in transform is wrong. Obviously doing this in foreachRDD in the same thread is again wrong (as recommended by [~zsxwing] in SPARK-20321). What's the recommended way to handle this? If I just let Spark alone handle exceptions it seems to somehow ignore them (the 2.a case, for example) and continue processing. Since this means data loss I need to avoid it (I need at-least-once guarantees).
Thanks again :) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12717) pyspark broadcast fails when using multiple threads
[ https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968945#comment-15968945 ] Maciej Bryński edited comment on SPARK-12717 at 4/14/17 12:12 PM: -- Same problem with Python3. CC: [~davies] was (Author: maver1ck): Same here. CC: [~davies] > pyspark broadcast fails when using multiple threads > --- > > Key: SPARK-12717 > URL: https://issues.apache.org/jira/browse/SPARK-12717 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 > Environment: Linux, python 2.6 or python 2.7. >Reporter: Edward Walker >Priority: Critical > > The following multi-threaded program that uses broadcast variables > consistently throws exceptions like: *Exception("Broadcast variable '18' not > loaded!",)* --- even when run with "--master local[10]".
> {code:title=bug_spark.py|borderStyle=solid}
> try:
>     import pyspark
> except:
>     pass
>
> from optparse import OptionParser
>
> def my_option_parser():
>     op = OptionParser()
>     op.add_option("--parallelism", dest="parallelism", type="int", default=20)
>     return op
>
> def do_process(x, w):
>     return x * w.value
>
> def func(name, rdd, conf):
>     new_rdd = rdd.map(lambda x: do_process(x, conf))
>     total = new_rdd.reduce(lambda x, y: x + y)
>     count = rdd.count()
>     print name, 1.0 * total / count
>
> if __name__ == "__main__":
>     import threading
>     op = my_option_parser()
>     options, args = op.parse_args()
>     sc = pyspark.SparkContext(appName="Buggy")
>     data_rdd = sc.parallelize(range(0, 1000), 1)
>     confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
>     threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]]) for i in xrange(options.parallelism)]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()
> {code}
> Abridged run output: > {code:title=abridge_run.txt|borderStyle=solid} > % spark-submit --master local[10] bug_spark.py
[jira] [Comment Edited] (SPARK-12717) pyspark broadcast fails when using multiple threads
[ https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968945#comment-15968945 ] Maciej Bryński edited comment on SPARK-12717 at 4/14/17 12:10 PM: -- Same here. CC: [~davies] was (Author: maver1ck): Same here.
[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang
[ https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968943#comment-15968943 ] Sean Owen commented on SPARK-20323: --- Not being able to use a context in a remote operation? I think it's kind of implicit because it doesn't make sense in the context of the Spark model, but I don't know if it's explicitly stated. It will throw an obvious error in most other cases where you try it. The examples show stopping the context at the end of the program. They don't consistently show using a try-finally block for it, but that is good practice. You just do it outside of streaming operations. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12717) pyspark broadcast fails when using multiple threads
[ https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968945#comment-15968945 ] Maciej Bryński commented on SPARK-12717: Same here.
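The symptom reported here ("Broadcast variable 'N' not loaded!" under concurrency) is, in miniature, a shared registry consulted from many threads. The sketch below is a plain-Python stand-in, not PySpark's actual internals; the class and method names are made up. It shows the pattern that keeps such lookups consistent: guard the registry's mutable state with a lock.

```python
import threading

class BroadcastRegistry:
    """Stand-in for a broadcast-value registry shared across driver threads."""
    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()
        self._next_id = 0

    def register(self, value):
        # Registration mutates shared state, so it takes the lock.
        with self._lock:
            bid = self._next_id
            self._next_id += 1
            self._values[bid] = value
            return bid

    def value(self, bid):
        # Lookups take the same lock, so a reader never sees a half-updated map.
        with self._lock:
            if bid not in self._values:
                raise Exception("Broadcast variable '%d' not loaded!" % bid)
            return self._values[bid]

reg = BroadcastRegistry()
ids = [reg.register(i) for i in range(20)]

results = []
threads = [threading.Thread(target=lambda b=b: results.append(reg.value(b)))
           for b in ids]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # all 20 values resolve; no "not loaded" errors
```

With the lock removed and registration interleaved with lookups, the same structure can intermittently raise the "not loaded" exception, which is the flavor of race the bug report describes.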
[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang
[ https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968940#comment-15968940 ] Andrei Taleanu commented on SPARK-20323: [~srowen] could you please give me a link that documents this bad practice? I understood and expected that this was an incorrect approach, since transformations are lazy. However, I asked the question because of what I would call a common problem: you have a streaming app, and when something goes wrong either on the driver or the executors you want to fail fast in order to avoid data loss or corruption. Stopping the context seems the most straightforward way, but doing so appears not to be that easy. So how should one handle this case? Thanks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10109) NPE when saving Parquet To HDFS
[ https://issues.apache.org/jira/browse/SPARK-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968934#comment-15968934 ] Steve Loughran commented on SPARK-10109: SPARK-20038 should stop the failure being so dramatic > NPE when saving Parquet To HDFS > --- > > Key: SPARK-10109 > URL: https://issues.apache.org/jira/browse/SPARK-10109 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Sparc-ec2, standalone cluster on amazon >Reporter: Virgil Palanciuc > > Very simple code, trying to save a dataframe > I get this in the driver > {quote} > 15/08/19 11:21:41 INFO TaskSetManager: Lost task 9.2 in stage 217.0 (TID > 4748) on executor 172.xx.xx.xx: java.lang.NullPointerException (null) > and (not for that task): > 15/08/19 11:21:46 WARN TaskSetManager: Lost task 5.0 in stage 543.0 (TID > 5607, 172.yy.yy.yy): java.lang.NullPointerException > at > parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146) > at > parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) > at > parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) > at > org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:88) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536) > at > scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107) > at > scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107) > at > 
org.apache.spark.sql.sources.DynamicPartitionWriterContainer.clearOutputWriters(commands.scala:536) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer.abortTask(commands.scala:552) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:269) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > I get this in the executor log: > {quote} > 15/08/19 11:21:41 WARN DFSClient: DataStreamer Exception > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): > No lease on > /gglogs/2015-07-27/_temporary/_attempt_201508191119_0217_m_09_2/dpid=18432/pid=1109/part-r-9-46ac3a79-a95c-4d9c-a2f1-b3ee76f6a46c.snappy.parquet > File does not exist. Holder DFSClient_NONMAPREDUCE_1730998114_63 does not > have any open files. 
> at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691) > at java.security.AccessController.doPrivileged(Native Method) > at
[jira] [Assigned] (SPARK-20318) Use Catalyst type for min/max in ColumnStat for ease of estimation
[ https://issues.apache.org/jira/browse/SPARK-20318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20318: --- Assignee: Zhenhua Wang > Use Catalyst type for min/max in ColumnStat for ease of estimation > -- > > Key: SPARK-20318 > URL: https://issues.apache.org/jira/browse/SPARK-20318 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.2.0 > > > Currently when estimating predicates like col > literal or col = literal, we > will update min or max in column stats based on literal value. However, > literal value is of Catalyst type (internal type), while min/max is of > external type. This causes unnecessary conversion when comparing them and > updating column stats. > To solve this, we can use Catalyst type for min/max in ColumnStat for ease of > estimation. Note that the persistent form in metastore is still of external > type, so there's no inconsistency for statistics in metastore. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20318) Use Catalyst type for min/max in ColumnStat for ease of estimation
[ https://issues.apache.org/jira/browse/SPARK-20318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20318. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17630 [https://github.com/apache/spark/pull/17630] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
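The conversion cost described in SPARK-20318 can be sketched abstractly: dates stand in for the external type and days-since-epoch for the internal (Catalyst-style) type. None of the names below are Spark's actual ColumnStat code; the sketch only contrasts converting on every estimate versus converting once when the stat is built.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)

def to_internal(d):
    # External (datetime.date) -> internal (days since epoch), standing in
    # for the external-vs-Catalyst representations the issue mentions.
    return (d - EPOCH).days

class ExternalStat:
    """min/max kept in the external type: every estimate converts again."""
    def __init__(self, min_d, max_d):
        self.min, self.max = min_d, max_d
    def may_match_gt(self, internal_literal):
        return to_internal(self.max) > internal_literal  # per-call conversion

class InternalStat:
    """min/max kept in the internal type: converted once, compared directly."""
    def __init__(self, min_d, max_d):
        self.min, self.max = to_internal(min_d), to_internal(max_d)
    def may_match_gt(self, internal_literal):
        return self.max > internal_literal  # plain comparison

lo, hi = datetime.date(2017, 1, 1), datetime.date(2017, 4, 14)
lit = to_internal(datetime.date(2017, 2, 1))
print(ExternalStat(lo, hi).may_match_gt(lit),
      InternalStat(lo, hi).may_match_gt(lit))  # True True
```

Both layouts give the same answer; the internal layout just pays the conversion once per stat instead of once per comparison, which matches the issue's note that the metastore can still persist the external form.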
[jira] [Commented] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled
[ https://issues.apache.org/jira/browse/SPARK-20338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968882#comment-15968882 ] Apache Spark commented on SPARK-20338: -- User 'zuotingbing' has created a pull request for this issue: https://github.com/apache/spark/pull/17638 > Spaces in spark.eventLog.dir are not correctly handled > -- > > Key: SPARK-20338 > URL: https://issues.apache.org/jira/browse/SPARK-20338 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: zuotingbing >Priority: Minor > > set spark.eventLog.dir=/home/mr/event log and submit an app ,we got error as > follows: > 017-04-14 17:28:40,378 INFO org.apache.spark.SparkContext: Successfully > stopped SparkContext > Exception in thread "main" ExitCodeException exitCode=1: chmod: cannot access > `/home/mr/event%20log/app-20170414172839-.inprogress': No such file or > directory > at org.apache.hadoop.util.Shell.runCommand(Shell.java:561) > at org.apache.hadoop.util.Shell.run(Shell.java:478) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:831) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:814) > at > org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:712) > at > org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:506) > at > org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:125) > at org.apache.spark.SparkContext.(SparkContext.scala:516) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:879) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:871) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:871) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
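The `event%20log` in the chmod error shows the shape of the bug: the configured directory is escaped for use in a URI, and the escaped string is then handed to the filesystem as a literal path. A minimal Python illustration (not Spark's code) of the mismatch and of decoding back before touching the filesystem:

```python
from urllib.parse import quote, unquote

log_dir = "/home/mr/event log"

# Escaping the path for a URI is fine on its own...
encoded = quote(log_dir)
print(encoded)  # /home/mr/event%20log

# ...but the chmod in the stack trace received this *encoded* form as a raw
# path, and no directory named 'event%20log' exists on disk.
assert encoded != log_dir

# Decoding back before any filesystem call recovers the real path.
assert unquote(encoded) == log_dir
```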
[jira] [Assigned] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled
[ https://issues.apache.org/jira/browse/SPARK-20338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20338: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled
[ https://issues.apache.org/jira/browse/SPARK-20338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20338: Assignee: Apache Spark
[jira] [Created] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled
zuotingbing created SPARK-20338:
------------------------------------

Summary: Spaces in spark.eventLog.dir are not correctly handled
Key: SPARK-20338
URL: https://issues.apache.org/jira/browse/SPARK-20338
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0
Reporter: zuotingbing
Priority: Minor

Set spark.eventLog.dir=/home/mr/event log and submit an app; we get the following error:

2017-04-14 17:28:40,378 INFO org.apache.spark.SparkContext: Successfully stopped SparkContext
Exception in thread "main" ExitCodeException exitCode=1: chmod: cannot access `/home/mr/event%20log/app-20170414172839-.inprogress': No such file or directory
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
	at org.apache.hadoop.util.Shell.run(Shell.java:478)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:831)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:814)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:712)
	at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:506)
	at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:125)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:516)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:879)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:871)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:871)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:288)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
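The chmod failure above comes from the event-log directory being round-tripped through a URI without decoding: the space becomes %20, and the encoded string is then used as a literal filesystem path. A minimal Python sketch of that encode/decode mismatch (illustrative only, not Spark's actual code path):

```python
from urllib.parse import quote, unquote, urlparse

log_dir = "/home/mr/event log"

# Turning the local path into a URI percent-encodes the space.
uri = "file://" + quote(log_dir)
print(uri)  # file:///home/mr/event%20log

# Taking the URI's path component without decoding keeps a literal
# "%20", which names a directory that does not exist on disk...
raw_path = urlparse(uri).path
print(raw_path)           # /home/mr/event%20log

# ...while decoding first recovers the real directory name.
print(unquote(raw_path))  # /home/mr/event log
```

A filesystem call such as chmod on the encoded form fails exactly as in the stack trace above.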
[jira] [Assigned] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext
[ https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20337: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext
[ https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20337: Assignee: Apache Spark
[jira] [Commented] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext
[ https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968868#comment-15968868 ] Apache Spark commented on SPARK-20337: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/17637
[jira] [Commented] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext
[ https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968843#comment-15968843 ] Sean Owen commented on SPARK-20337: --- I don't think this is a reasonable use case to support. Spark jobs are conceptually not long-lived, certainly not shells. Just restart. This raises all kinds of problems with references to instances of existing classes.
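The objection about "references to instances of existing classes" can be illustrated outside Spark. Here is a hedged Python sketch using module replacement as a stand-in for swapping a jar; the `dep` module and `Thing` class are invented for illustration. Objects created from the old code keep pointing at it even after the "upgrade":

```python
import sys
import types

def load_dep(version):
    """Simulate loading one version of a dependency (stand-in for a jar)."""
    src = (
        f"VERSION = '{version}'\n"
        "class Thing:\n"
        "    def version(self):\n"
        "        return VERSION\n"
    )
    mod = types.ModuleType("dep")
    exec(src, mod.__dict__)
    sys.modules["dep"] = mod  # the "upgrade" replaces the registered module
    return mod

dep_v12 = load_dep("0.12")
old_instance = dep_v12.Thing()

dep_v13 = load_dep("0.13")  # dependency "upgraded" in place
new_instance = dep_v13.Thing()

print(old_instance.version())  # 0.12 -- old instances still see the old code
print(new_instance.version())  # 0.13
print(type(old_instance) is type(new_instance))  # False: two distinct classes
```

The same split-brain happens on the JVM with classloaders, which is why removing a loaded jar from a live context is hard to do safely.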
[jira] [Created] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext
Yuming Wang created SPARK-20337:
-----------------------------------

Summary: Support upgrade a jar dependency and don't restart SparkContext
Key: SPARK-20337
URL: https://issues.apache.org/jira/browse/SPARK-20337
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.1.0
Reporter: Yuming Wang
Priority: Minor

Suppose that we need to upgrade a jar dependency and don't want to restart {{SparkContext}}. Something like this:

{code}
sc.addJar("breeze-natives_2.11-0.12.jar")
// do something
sc.removeJar("spark://192.168.26.200:23420/jar/breeze-natives_2.11-0.12.jar")
sc.addJar("breeze-natives_2.11-0.13.jar")
// do something
{code}
[jira] [Resolved] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app
[ https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20321.
---
Resolution: Not A Problem

> Spark UI cannot be shutdown in spark streaming app
>
> Key: SPARK-20321
> URL: https://issues.apache.org/jira/browse/SPARK-20321
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Andrei Taleanu
>
> When an exception thrown in the transform stage is handled in foreachRDD and the streaming context is forced to stop, the SparkUI appears to hang and continually dumps the following logs in an infinite loop:
> {noformat}
> ...
> 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from select, 0/0 selected
> 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from select, 0/0 selected
> 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> ...
> {noformat}
> Unfortunately I don't have a minimal example that reproduces this issue, but here is what I can share:
> {noformat}
> val dstream = pull data from kafka
> val mapped = dstream.transform { rdd =>
>   val data = getData // Perform a call that potentially throws an exception
>   // broadcast the data
>   // flatMap the RDD using the data
> }
> mapped.foreachRDD {
>   try {
>     // write some data in a DB
>   } catch {
>     case t: Throwable =>
>       dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> mapped.foreachRDD {
>   try {
>     // write data to Kafka
>     // manually checkpoint the Kafka offsets (because I need them in JSON format)
>   } catch {
>     case t: Throwable =>
>       dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> {noformat}
> The issue appears when stop is invoked. At the point when the SparkUI is stopped, it enters that infinite loop. Initially I thought it was related to Jetty, as the version used in the SparkUI had some bugs (e.g. [this one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to a more recent version (March 2017) and built Spark 2.1.0 with that one, but still got the error.
> I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of Mesos.
[jira] [Resolved] (SPARK-20323) Calling stop in a transform stage causes the app to hang
[ https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20323.
---
Resolution: Not A Problem

> Calling stop in a transform stage causes the app to hang
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen, but I've run into this issue with the following code:
> {noformat}
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
>     .map(seq => ssc.sparkContext.parallelize(seq))
>     .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
>     try {
>       if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>       else println("lucky bastard")
>       rdd
>     } catch {
>       case e: Throwable =>
>         println("stopping streaming context", e)
>         ssc.stop(stopSparkContext = true, stopGracefully = false)
>         throw e
>     }
>   }
>   transformed.foreachRDD { rdd =>
>     println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD is processed), the spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case, is that normal? I also asked this [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624] but up to this point there's no answer pointing exactly to what happens, only guidelines.
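Both SPARK-20321 and SPARK-20323 boil down to stopping a framework from inside one of its own callbacks, which risks a thread waiting on itself. A generic Python threading sketch of the safer pattern (this is not Spark API; all names here are illustrative): the callback only signals shutdown, and the actual stop is driven from outside the worker.

```python
import queue
import threading

stop_requested = threading.Event()
tasks = queue.Queue()

def process(item):
    # Analogue of the transform/foreachRDD body: on a fatal condition,
    # only *signal* shutdown. Stopping the worker from inside its own
    # callback is the self-join that causes the hang.
    if item == "boom":
        stop_requested.set()

def worker():
    # Analogue of the streaming loop: checks the flag between batches.
    while not stop_requested.is_set():
        try:
            item = tasks.get(timeout=0.1)
        except queue.Empty:
            continue
        process(item)

t = threading.Thread(target=worker)
t.start()
for item in ["a", "b", "boom", "c"]:
    tasks.put(item)

t.join()  # shutdown is driven from outside the worker thread
print(stop_requested.is_set())  # True
```

The equivalent in a streaming job would be setting a flag (or calling stop from a separate thread) rather than invoking `ssc.stop(...)` inside `transform` or `foreachRDD`.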
[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968753#comment-15968753 ] Hyukjin Kwon commented on SPARK-20336: -- Oh, and I would also appreciate it if you are able to test this with JSON with {{wholeFile}} enabled. Both work similarly, and trying this out helps to verify this case.
[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968749#comment-15968749 ] Hyukjin Kwon commented on SPARK-20336: -- I just tried to reproduce it as below:

Scala

{code}
scala> spark.read.option("wholeFile", true).option("header", true).csv("tmp.csv").show()
+----+----+--------------------+
|col1|col2|                col3|
+----+----+--------------------+
|   1|   a|                text|
|   2|   b|            テキスト|
|   3|   c|              텍스트|
|   4|   d|text テキスト 텍스트|
|   5|   e|                last|
+----+----+--------------------+
{code}

Python

{code}
>>> spark.read.csv("tmp.csv", header=True, wholeFile=True).show()
+----+----+--------------------+
|col1|col2|                col3|
+----+----+--------------------+
|   1|   a|                text|
|   2|   b|            テキスト|
|   3|   c|              텍스트|
|   4|   d|text テキスト 텍스트|
|   5|   e|                last|
+----+----+--------------------+
{code}

It seems to work fine. Do you mind double-checking the encoding of the file and your environment, please? I would like to verify this case.
[jira] [Issue Comment Deleted] (SPARK-20215) ReuseExchange is broken in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-20215:
---
Comment: was deleted (was: Seems to be fixed in SPARK-20229)

> ReuseExchange is broken in SparkSQL
>
> Key: SPARK-20215
> URL: https://issues.apache.org/jira/browse/SPARK-20215
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Zhan Zhang
> Priority: Minor
>
> Currently, if we have a query like A join B union A join C... with the same join key, table A will be scanned multiple times in SQL. This is because the MetastoreRelation is not shared by the two joins, and the ExprIds are different. The canonicalized form in Expression will not be able to unify them, so the two Exchanges will not be compatible and cannot be reused.
[jira] [Commented] (SPARK-20215) ReuseExchange is broken in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968734#comment-15968734 ] Zhan Zhang commented on SPARK-20215: -- Seems to be fixed in SPARK-20229
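The ExprId point in SPARK-20215 can be sketched generically: two structurally identical attribute lists carry different globally-unique ids, so naive equality fails, and a canonicalization pass that rewrites ids to position-based ones is what lets the two sides compare equal. A hypothetical Python model of that idea (these are not Spark's actual classes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int  # globally unique per plan instance, like Spark's ExprId

def canonicalize(attrs):
    """Rewrite unique ids as position-of-first-appearance indices so that
    structurally identical attribute lists compare equal."""
    first_seen = {}
    out = []
    for a in attrs:
        if a.expr_id not in first_seen:
            first_seen[a.expr_id] = len(first_seen)
        out.append(Attr(a.name, first_seen[a.expr_id]))
    return tuple(out)

# The same scan of table A, built twice, gets fresh ids each time:
scan1 = [Attr("key", 101), Attr("value", 102)]
scan2 = [Attr("key", 201), Attr("value", 202)]

print(scan1 == scan2)                              # False: raw ids differ
print(canonicalize(scan1) == canonicalize(scan2))  # True: canonical forms match
```

If canonicalization fails to unify the two scans (as the report describes), the exchange-reuse rule sees two different plans and table A is scanned twice.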
[jira] [Updated] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HanCheol Cho updated SPARK-20336:
---
Description:

I used the spark.read.csv() method with the wholeFile=True option to load data that has multi-line records. However, non-ASCII characters are not properly loaded.

The following is a sample data for the test:

{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트"
5,e,last
{code}

When it is loaded without the wholeFile=True option, non-ASCII characters are shown correctly although multi-line records are parsed incorrectly, as follows:

{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
+--------+----+----+
|    col1|col2|col3|
+--------+----+----+
|       1|   a|text|
|       2|   b|テキスト|
|       3|   c|텍스트|
|       4|   d|text|
|テキスト|null|null|
| 텍스트"|null|null|
|       5|   e|last|
+--------+----+----+
{code}

When the wholeFile=True option is used, non-ASCII characters are broken, as follows:

{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
testdf_wholefile.show()
+----+----+--------+
|col1|col2|    col3|
+----+----+--------+
|   1|   a|    text|
|   2|   b|        |
|   3|   c|       �|
|   4|   d|text ...|
|   5|   e|    last|
+----+----+--------+
{code}

The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.

was:

I used the spark.read.csv() method with the wholeFile=True option to load data that has multi-line records. However, non-ASCII characters are not properly loaded.

The following is a sample data for the test:

{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트
5,e,last
{code}

When it is loaded without the wholeFile=True option, non-ASCII characters are shown correctly although multi-line records are parsed incorrectly, as follows:

{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
+--------+----+----+
|    col1|col2|col3|
+--------+----+----+
|       1|   a|text|
|       2|   b|テキスト|
|       3|   c|텍스트|
|       4|   d|text|
|テキスト|null|null|
|  텍스트|null|null|
|       5|   e|last|
+--------+----+----+
{code}

When the wholeFile=True option is used, non-ASCII characters are broken, as follows:

{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
testdf_wholefile.show()
+----+----+--------+
|col1|col2|    col3|
+----+----+--------+
|   1|   a|    text|
|   2|   b|        |
|   3|   c|       �|
|   4|   d|text ...|
+----+----+--------+
{code}

The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.
[jira] [Created] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
HanCheol Cho created SPARK-20336:
--

Summary: spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
Key: SPARK-20336
URL: https://issues.apache.org/jira/browse/SPARK-20336
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Environment: Spark 2.2.0 (master branch is downloaded from Github), PySpark
Reporter: HanCheol Cho

I used the spark.read.csv() method with the wholeFile=True option to load data that has multi-line records. However, non-ASCII characters are not properly loaded.

The following is a sample data for the test:

{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트
5,e,last
{code}

When it is loaded without the wholeFile=True option, non-ASCII characters are shown correctly although multi-line records are parsed incorrectly, as follows:

{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
+--------+----+----+
|    col1|col2|col3|
+--------+----+----+
|       1|   a|text|
|       2|   b|テキスト|
|       3|   c|텍스트|
|       4|   d|text|
|テキスト|null|null|
|  텍스트|null|null|
|       5|   e|last|
+--------+----+----+
{code}

When the wholeFile=True option is used, non-ASCII characters are broken, as follows:

{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
testdf_wholefile.show()
+----+----+--------+
|col1|col2|    col3|
+----+----+--------+
|   1|   a|    text|
|   2|   b|        |
|   3|   c|       �|
|   4|   d|text ...|
+----+----+--------+
{code}

The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.
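As a Spark-independent sanity check of the sample above, the stdlib csv parser reads the same UTF-8 text and keeps the quoted multi-line field as one record, which is the behavior wholeFile=True is expected to reproduce. A minimal sketch (the sample is inlined; a closing quote is added after the last 텍스트 so the quoted record terminates):

```python
import csv
import io

sample = (
    "col1,col2,col3\n"
    "1,a,text\n"
    "2,b,テキスト\n"
    "3,c,텍스트\n"
    '4,d,"text\nテキスト\n텍스트"\n'
    "5,e,last\n"
)

# The quoted field spanning three physical lines must come back
# as a single record with its embedded newlines preserved.
rows = list(csv.reader(io.StringIO(sample)))

print(len(rows))  # 6: header plus 5 records
print(rows[4])    # record 4 keeps the multi-line col3 value intact
```

If the same bytes produce mojibake only under wholeFile=True, that points at the decoding step of the multi-line reader rather than at the data itself.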