[jira] [Closed] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext

2017-04-14 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang closed SPARK-20337.
---
Resolution: Won't Fix

> Support upgrade a jar dependency and don't restart SparkContext
> ---
>
> Key: SPARK-20337
> URL: https://issues.apache.org/jira/browse/SPARK-20337
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Priority: Minor
>
> Suppose we need to upgrade a jar dependency without restarting the
> {{SparkContext}}. Something like this:
> {code}
> sc.addJar("breeze-natives_2.11-0.12.jar")
> // do something
> sc.removeJar("spark://192.168.26.200:23420/jar/breeze-natives_2.11-0.12.jar")
> sc.addJar("breeze-natives_2.11-0.13.jar")
> // do something
> {code}






[jira] [Assigned] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20343:


Assignee: (was: Apache Spark)

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version 
> resolution 
> 
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Please refer to https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves Avro to a different version than Maven does in some cases.
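(Editorial illustration, not part of the report above: when an sbt build resolves a transitive dependency to a different version than the Maven build, one generic sbt mechanism for pinning it is a {{dependencyOverrides}} entry. The coordinates and version below are placeholders, not the actual resolution adopted for this ticket.)

{code}
// Sketch only: pin the Avro version sbt resolves so it matches the Maven build.
// "1.7.7" is a placeholder version, not this ticket's actual fix.
dependencyOverrides += "org.apache.avro" % "avro" % "1.7.7"
{code}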






[jira] [Assigned] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20343:


Assignee: Apache Spark

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version 
> resolution 
> 
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> Please refer to https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves Avro to a different version than Maven does in some cases.






[jira] [Commented] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution

2017-04-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969738#comment-15969738
 ] 

Apache Spark commented on SPARK-20343:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17642

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version 
> resolution 
> 
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Please refer to https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves Avro to a different version than Maven does in some cases.






[jira] [Updated] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution

2017-04-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20343:
-
Summary: SBT master build for Hadoop 2.6 in Jenkins due to Avro version 
resolution   (was: Resolve SBT master build for Hadoop 2.6 in Jenkins due to 
Avro version resolution )

> SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution 
> --
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Please refer to https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves Avro to a different version than Maven does in some cases.






[jira] [Updated] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution

2017-04-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20343:
-
Summary: SBT master build for Hadoop 2.6 in Jenkins fails due to Avro 
version resolution   (was: SBT master build for Hadoop 2.6 in Jenkins due to 
Avro version resolution )

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version 
> resolution 
> 
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Please refer to https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves Avro to a different version than Maven does in some cases.






[jira] [Updated] (SPARK-20343) Resolve SBT master build for Hadoop 2.6 in Jenkins due to Avro version resolution

2017-04-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20343:
-
Summary: Resolve SBT master build for Hadoop 2.6 in Jenkins due to Avro 
version resolution   (was: Resolve SBT master build for Hadoop 2.6 due to Avro 
version resolution )

> Resolve SBT master build for Hadoop 2.6 in Jenkins due to Avro version 
> resolution 
> --
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Please refer to https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves Avro to a different version than Maven does in some cases.






[jira] [Commented] (SPARK-20343) Resolve SBT master build for Hadoop 2.6 due to Avro version resolution

2017-04-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969709#comment-15969709
 ] 

Hyukjin Kwon commented on SPARK-20343:
--

I don't know the {{Priority}} in this case. Please correct this if anyone knows.

> Resolve SBT master build for Hadoop 2.6 due to Avro version resolution 
> ---
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> Please refer to https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves Avro to a different version than Maven does in some cases.






[jira] [Created] (SPARK-20343) Resolve SBT master build for Hadoop 2.6 due to Avro version resolution

2017-04-14 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-20343:


 Summary: Resolve SBT master build for Hadoop 2.6 due to Avro 
version resolution 
 Key: SPARK-20343
 URL: https://issues.apache.org/jira/browse/SPARK-20343
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon


Please refer to https://github.com/apache/spark/pull/17477#issuecomment-293942637

{quote}
[error] 
/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
 value createDatumWriter is not a member of org.apache.avro.generic.GenericData
[error] writerCache.getOrElseUpdate(schema, 
GenericData.get.createDatumWriter(schema))
[error] 
{quote}

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull

It seems sbt resolves Avro to a different version than Maven does in some cases.






[jira] [Commented] (SPARK-20340) Size estimate very wrong in ExternalAppendOnlyMap from CoGroupedRDD, cause OOM

2017-04-14 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969680#comment-15969680
 ] 

Shixiong Zhu commented on SPARK-20340:
--

I think it's just a trade-off between accuracy and performance.

> Size estimate very wrong in ExternalAppendOnlyMap from CoGroupedRDD, cause OOM
> --
>
> Key: SPARK-20340
> URL: https://issues.apache.org/jira/browse/SPARK-20340
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> I had a user doing a basic join operation. The values are images' binary data 
> (in base64 format) and vary widely in size.
> The job failed with out-of-memory errors. It originally failed on YARN from 
> using too much overhead memory; after turning spark.shuffle.io.preferDirectBufs 
> to false it failed with out-of-heap memory. I debugged it down to the shuffle, 
> where CoGroupedRDD puts things into the ExternalAppendOnlyMap and it computes 
> an estimated size to determine when to spill. In this case SizeEstimator 
> handles arrays such that if an array is larger than 400 elements, it samples 
> 100 elements. The estimate is coming back GBs different from the actual size: 
> it claims 1GB when the map is actually using close to 5GB.
> A temporary workaround is to increase the memory to be very large (10GB 
> executors), but that isn't really acceptable here. The user did the same thing 
> in Pig and it easily handled the data with 1.5GB of memory.
> It seems risky to use an estimate for such a critical decision. If the 
> estimate is wrong you are going to run out of memory and fail the job.
> I'm still looking closer at the user's data to get more insights.






[jira] [Updated] (SPARK-20341) Support BigInteger values > 19 precision

2017-04-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20341:
-
Component/s: (was: Spark Core)
 SQL

> Support BigInteger values > 19 precision
> 
>
> Key: SPARK-20341
> URL: https://issues.apache.org/jira/browse/SPARK-20341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Paul Zaczkieiwcz
>
> If you create a {{Dataset\[scala.math.BigInt\]}}, then you can't have a 
> precision > 19.
> {code}
> scala> case class BigIntWrapper(value:scala.math.BigInt)
> defined class BigIntWrapper
> scala> val longDf = 
> spark.createDataset(BigIntWrapper(scala.math.BigInt("1002"))::Nil)
> 17/04/14 19:45:15 main INFO CodeGenerator: Code generated in 211.949738 ms
> java.lang.RuntimeException: Error while encoding: 
> java.lang.IllegalArgumentException: requirement failed: BigInteger 
> 1002 too large for decimal
> staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), 
> apply, assertnotnull(input[0, BigIntWrapper, true], top level non-flat input 
> object).value, true) AS value#16
> +- staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), 
> apply, assertnotnull(input[0, BigIntWrapper, true], top level non-flat input 
> object).value, true)
>+- assertnotnull(input[0, BigIntWrapper, true], top level non-flat input 
> object).value
>   +- assertnotnull(input[0, BigIntWrapper, true], top level non-flat 
> input object)
>  +- input[0, BigIntWrapper, true]
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:280)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:421)
>   ... 54 elided
> Caused by: java.lang.IllegalArgumentException: requirement failed: BigInteger 
> 1002 too large for decimal
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.sql.types.Decimal.set(Decimal.scala:137)
>   at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:419)
>   at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:277)
>   ... 62 more
> {code}






[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter

2017-04-14 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969622#comment-15969622
 ] 

Yan Facai (颜发才) commented on SPARK-20199:
-

ping [~jkbreuer] [~sethah] [~mengxr].

Which one is better?

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate 
> parameter. This parameter is available in H2O and XGBoost.
> Sample from H2O.ai: gbmParams._col_sample_rate
> Please provide the parameter.






[jira] [Created] (SPARK-20342) DAGScheduler sends SparkListenerTaskEnd before updating task's accumulators

2017-04-14 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-20342:
--

 Summary: DAGScheduler sends SparkListenerTaskEnd before updating 
task's accumulators
 Key: SPARK-20342
 URL: https://issues.apache.org/jira/browse/SPARK-20342
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Marcelo Vanzin


Hit this on 2.2, but probably has been there forever. This is similar in spirit 
to SPARK-20205.

Event is sent here, around L1154:

{code}
listenerBus.post(SparkListenerTaskEnd(
   stageId, task.stageAttemptId, taskType, event.reason, event.taskInfo, 
taskMetrics))
{code}

Accumulators are updated later, around L1173:

{code}
val stage = stageIdToStage(task.stageId)
event.reason match {
  case Success =>
task match {
  case rt: ResultTask[_, _] =>
// Cast to ResultStage here because it's part of the ResultTask
// TODO Refactor this out to a function that accepts a ResultStage
val resultStage = stage.asInstanceOf[ResultStage]
resultStage.activeJob match {
  case Some(job) =>
if (!job.finished(rt.outputId)) {
  updateAccumulators(event)
{code}

Same thing applies here; UI shows correct info because it's pointing at the 
mutable {{TaskInfo}} structure. But the event log, for example, may record the 
wrong information.






[jira] [Resolved] (SPARK-16899) Structured Streaming Checkpointing Example invalid

2017-04-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-16899.
--
Resolution: Not A Problem

This has been fixed.  I believe you are using an old version of Spark.

> Structured Streaming Checkpointing Example invalid
> --
>
> Key: SPARK-16899
> URL: https://issues.apache.org/jira/browse/SPARK-16899
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Structured Streaming
>Reporter: Vladimir Feinberg
>Priority: Minor
>
> The structured streaming checkpointing example at the bottom of the page 
> (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
>  has the following excerpt:
> {code}
> aggDF
>.writeStream
>.outputMode("complete")
>.option(“checkpointLocation”, “path/to/HDFS/dir”)
>.format("memory")
>.start()
> {code}
> But memory sinks are not fault-tolerant. Indeed, trying this out, I get the 
> following error: 
> {{This query does not support recovering from checkpoint location. Delete 
> /tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start 
> over.;}}
> The documentation should be changed to demonstrate checkpointing for a 
> non-aggregation streaming task, and explicitly mention there is no way to 
> checkpoint aggregates.
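(Editorial illustration, not part of the original report: a checkpointable variant of the example would pair the checkpoint location with a fault-tolerant sink and a non-aggregation query. {{inputDF}} and the paths below are placeholders.)

{code}
// Sketch only: a non-aggregation streaming query writing to a fault-tolerant
// file sink, so recovery from the checkpoint location is supported.
// inputDF is assumed to be an existing streaming DataFrame; paths are placeholders.
inputDF
  .writeStream
  .format("parquet")
  .option("path", "path/to/HDFS/output")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .start()
{code}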






[jira] [Updated] (SPARK-16899) Structured Streaming Checkpointing Example invalid

2017-04-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-16899:
-
Component/s: Structured Streaming

> Structured Streaming Checkpointing Example invalid
> --
>
> Key: SPARK-16899
> URL: https://issues.apache.org/jira/browse/SPARK-16899
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Structured Streaming
>Reporter: Vladimir Feinberg
>Priority: Minor
>
> The structured streaming checkpointing example at the bottom of the page 
> (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing)
>  has the following excerpt:
> {code}
> aggDF
>.writeStream
>.outputMode("complete")
>.option(“checkpointLocation”, “path/to/HDFS/dir”)
>.format("memory")
>.start()
> {code}
> But memory sinks are not fault-tolerant. Indeed, trying this out, I get the 
> following error: 
> {{This query does not support recovering from checkpoint location. Delete 
> /tmp/streaming.metadata-625631e5-baee-41da-acd1-f16c82f68a40/offsets to start 
> over.;}}
> The documentation should be changed to demonstrate checkpointing for a 
> non-aggregation streaming task, and explicitly mention there is no way to 
> checkpoint aggregates.






[jira] [Commented] (SPARK-20329) Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion

2017-04-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969483#comment-15969483
 ] 

Apache Spark commented on SPARK-20329:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17641

> Resolution error when HAVING clause uses GROUP BY expression that involves 
> implicit type coercion
> -
>
> Key: SPARK-20329
> URL: https://issues.apache.org/jira/browse/SPARK-20329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> The following example runs without error on Spark 2.0.x and 2.1.x but fails 
> in the current Spark master:
> {code}
> create temporary view foo (a, b) as values (cast(1 as bigint), 2), (cast(3 as 
> bigint), 4);
> select a + b from foo group by a + b having (a + b) > 1 
> {code}
> The error is
> {code}
> Error in SQL statement: AnalysisException: cannot resolve '`a`' given input 
> columns: [(a + CAST(b AS BIGINT))]; line 1 pos 45;
> 'Filter (('a + 'b) > 1)
> +- Aggregate [(a#249243L + cast(b#249244 as bigint))], [(a#249243L + 
> cast(b#249244 as bigint)) AS (a + CAST(b AS BIGINT))#249246L]
>+- SubqueryAlias foo
>   +- Project [col1#249241L AS a#249243L, col2#249242 AS b#249244]
>  +- LocalRelation [col1#249241L, col2#249242]
> {code}
> I think what's happening here is that the implicit cast is breaking things: 
> if we change the types so that both columns are integers then the analysis 
> error disappears. Similarly, adding explicit casts, as in
> {code}
> select a + cast(b as bigint) from foo group by a + cast(b as bigint) having 
> (a + cast(b as bigint)) > 1 
> {code}
> works so I'm pretty sure that the resolution problem is being introduced when 
> the casts are automatically added by the type coercion rule.






[jira] [Assigned] (SPARK-20329) Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20329:


Assignee: Apache Spark

> Resolution error when HAVING clause uses GROUP BY expression that involves 
> implicit type coercion
> -
>
> Key: SPARK-20329
> URL: https://issues.apache.org/jira/browse/SPARK-20329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
>
> The following example runs without error on Spark 2.0.x and 2.1.x but fails 
> in the current Spark master:
> {code}
> create temporary view foo (a, b) as values (cast(1 as bigint), 2), (cast(3 as 
> bigint), 4);
> select a + b from foo group by a + b having (a + b) > 1 
> {code}
> The error is
> {code}
> Error in SQL statement: AnalysisException: cannot resolve '`a`' given input 
> columns: [(a + CAST(b AS BIGINT))]; line 1 pos 45;
> 'Filter (('a + 'b) > 1)
> +- Aggregate [(a#249243L + cast(b#249244 as bigint))], [(a#249243L + 
> cast(b#249244 as bigint)) AS (a + CAST(b AS BIGINT))#249246L]
>+- SubqueryAlias foo
>   +- Project [col1#249241L AS a#249243L, col2#249242 AS b#249244]
>  +- LocalRelation [col1#249241L, col2#249242]
> {code}
> I think what's happening here is that the implicit cast is breaking things: 
> if we change the types so that both columns are integers then the analysis 
> error disappears. Similarly, adding explicit casts, as in
> {code}
> select a + cast(b as bigint) from foo group by a + cast(b as bigint) having 
> (a + cast(b as bigint)) > 1 
> {code}
> works so I'm pretty sure that the resolution problem is being introduced when 
> the casts are automatically added by the type coercion rule.






[jira] [Assigned] (SPARK-20329) Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20329:


Assignee: (was: Apache Spark)

> Resolution error when HAVING clause uses GROUP BY expression that involves 
> implicit type coercion
> -
>
> Key: SPARK-20329
> URL: https://issues.apache.org/jira/browse/SPARK-20329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> The following example runs without error on Spark 2.0.x and 2.1.x but fails 
> in the current Spark master:
> {code}
> create temporary view foo (a, b) as values (cast(1 as bigint), 2), (cast(3 as 
> bigint), 4);
> select a + b from foo group by a + b having (a + b) > 1 
> {code}
> The error is
> {code}
> Error in SQL statement: AnalysisException: cannot resolve '`a`' given input 
> columns: [(a + CAST(b AS BIGINT))]; line 1 pos 45;
> 'Filter (('a + 'b) > 1)
> +- Aggregate [(a#249243L + cast(b#249244 as bigint))], [(a#249243L + 
> cast(b#249244 as bigint)) AS (a + CAST(b AS BIGINT))#249246L]
>+- SubqueryAlias foo
>   +- Project [col1#249241L AS a#249243L, col2#249242 AS b#249244]
>  +- LocalRelation [col1#249241L, col2#249242]
> {code}
> I think what's happening here is that the implicit cast is breaking things: 
> if we change the types so that both columns are integers then the analysis 
> error disappears. Similarly, adding explicit casts, as in
> {code}
> select a + cast(b as bigint) from foo group by a + cast(b as bigint) having 
> (a + cast(b as bigint)) > 1 
> {code}
> works so I'm pretty sure that the resolution problem is being introduced when 
> the casts are automatically added by the type coercion rule.






[jira] [Created] (SPARK-20341) Support BigInteger values > 19 precision

2017-04-14 Thread Paul Zaczkieiwcz (JIRA)
Paul Zaczkieiwcz created SPARK-20341:


 Summary: Support BigInteger values > 19 precision
 Key: SPARK-20341
 URL: https://issues.apache.org/jira/browse/SPARK-20341
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.0.2
Reporter: Paul Zaczkieiwcz


If you create a {{Dataset\[scala.math.BigInt\]}}, then you can't have a 
precision > 19.
{code}
scala> case class BigIntWrapper(value:scala.math.BigInt)
defined class BigIntWrapper

scala> val longDf = 
spark.createDataset(BigIntWrapper(scala.math.BigInt("1002"))::Nil)
17/04/14 19:45:15 main INFO CodeGenerator: Code generated in 211.949738 ms
java.lang.RuntimeException: Error while encoding: 
java.lang.IllegalArgumentException: requirement failed: BigInteger 
1002 too large for decimal
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), 
apply, assertnotnull(input[0, BigIntWrapper, true], top level non-flat input 
object).value, true) AS value#16
+- staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), 
apply, assertnotnull(input[0, BigIntWrapper, true], top level non-flat input 
object).value, true)
   +- assertnotnull(input[0, BigIntWrapper, true], top level non-flat input 
object).value
  +- assertnotnull(input[0, BigIntWrapper, true], top level non-flat input 
object)
 +- input[0, BigIntWrapper, true]

  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:280)
  at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
  at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:421)
  ... 54 elided
Caused by: java.lang.IllegalArgumentException: requirement failed: BigInteger 
1002 too large for decimal
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.sql.types.Decimal.set(Decimal.scala:137)
  at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:419)
  at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:277)
  ... 62 more
{code}
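(Editorial sketch, not from the ticket: one workaround for values wider than the maximum decimal precision is to carry them as strings in the Dataset and convert back at the edges. The wrapper class is hypothetical, and {{spark}} is assumed to be an existing SparkSession, e.g. in spark-shell.)

{code}
// Sketch only: round-trip a BigInt wider than Decimal(38,0) through a String column.
case class BigIntAsString(value: String)   // hypothetical wrapper

import spark.implicits._
val ds = spark.createDataset(Seq(BigIntAsString((BigInt(10).pow(40) + 2).toString)))
val roundTripped: Array[BigInt] = ds.collect().map(r => BigInt(r.value))
{code}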






[jira] [Created] (SPARK-20340) Size estimate very wrong in ExternalAppendOnlyMap from CoGroupedRDD, cause OOM

2017-04-14 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-20340:
-

 Summary: Size estimate very wrong in ExternalAppendOnlyMap from 
CoGroupedRDD, cause OOM
 Key: SPARK-20340
 URL: https://issues.apache.org/jira/browse/SPARK-20340
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Thomas Graves


I had a user doing a basic join operation. The values are images' binary data 
(in base64 format) and vary widely in size.

The job failed with out-of-memory errors. It originally failed on YARN from 
using too much overhead memory; after turning spark.shuffle.io.preferDirectBufs 
to false it failed with out-of-heap memory. I debugged it down to the shuffle, 
where CoGroupedRDD puts things into the ExternalAppendOnlyMap and it computes an 
estimated size to determine when to spill. In this case SizeEstimator handles 
arrays such that if an array is larger than 400 elements, it samples 100 
elements. The estimate is coming back GBs different from the actual size: it 
claims 1GB when the map is actually using close to 5GB.

A temporary workaround is to increase the memory to be very large (10GB 
executors), but that isn't really acceptable here. The user did the same thing 
in Pig and it easily handled the data with 1.5GB of memory.

It seems risky to use an estimate for such a critical decision. If the estimate 
is wrong you are going to run out of memory and fail the job.

I'm still looking closer at the user's data to get more insights.
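(Editorial illustration of the sampling behaviour described above, not taken from the ticket: the sketch below compares SizeEstimator's estimate of a skewed array against the sum of per-element estimates. The data is synthetic and only needs spark-core on the classpath.)

{code}
// Sketch only: an array whose element sizes are highly skewed can be
// misestimated when only 100 of its elements are sampled.
import org.apache.spark.util.SizeEstimator

// 10,000 tiny strings plus a handful of very large ones at the end.
val skewed: Array[String] =
  Array.tabulate(10000)(i => "item" + i) ++ Array.tabulate(5)(i => "y" * (1000000 + i))

val wholeArrayEstimate = SizeEstimator.estimate(skewed)
val perElementSum = skewed.map(s => SizeEstimator.estimate(s)).sum

println(s"whole-array estimate:         $wholeArrayEstimate bytes")
println(s"sum of per-element estimates: $perElementSum bytes")
{code}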






[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-04-14 Thread Ritesh Tijoriwala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969422#comment-15969422
 ] 

Ritesh Tijoriwala commented on SPARK-650:
-

[~Skamandros] - I would also like to know about hooking 'JavaSerializer'. I 
have a similar use case where I need to initialize a set of objects/resources on 
each executor. I would also like to know if anybody has a way to hook into some 
"clean up" on each executor 1) when the executor shuts down and 2) when a batch 
finishes and before the next batch starts.
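(Editorial sketch, not an existing Spark API: in the absence of a setup hook, a common stand-in is a lazily initialized singleton object that the first task on each executor JVM forces; the object name below is hypothetical and {{rdd}} is assumed to be an existing RDD.)

{code}
// Sketch only: per-executor one-time initialization via a lazy val.
object ExecutorSetup {                      // hypothetical helper object
  lazy val initialized: Boolean = {
    // one-time per-JVM setup, e.g. configuring a reporting library
    println(s"setup on ${java.net.InetAddress.getLocalHost.getHostName}")
    true
  }
}

rdd.mapPartitions { iter =>
  val _ = ExecutorSetup.initialized         // first task on the JVM runs the setup
  iter
}
{code}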

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries






[jira] [Updated] (SPARK-20339) Issue in regex_replace in Apache Spark Java

2017-04-14 Thread Nischay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nischay updated SPARK-20339:

Description: 
We are currently facing a couple of issues:

1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator grows beyond 64 KB"
2. "java.lang.StackOverflowError"
The first issue is reported as a major bug in the Apache Spark JIRA: https://issues.apache.org/jira/browse/SPARK-18492

We hit these issues with the following program. We are trying to replace the 
manufacturer name with its equivalent alternate name.

These issues occur only when we have a huge number of alternate names to 
replace; for a small number of replacements it works with no issues.
dataFileContent = dataFileContent.withColumn("ManufacturerSource", 
regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));

Kindly suggest an alternative method or a solution to work around this problem.

Hashtable manufacturerNames = new Hashtable();
  Enumeration names;
  String str;
  double bal;

  manufacturerNames.put("Allen","Apex Tool Group");
  manufacturerNames.put("Armstrong","Apex Tool Group");
  manufacturerNames.put("Campbell","Apex Tool Group");
  manufacturerNames.put("Lubriplate","Apex Tool Group");
  manufacturerNames.put("Delta","Apex Tool Group");
  manufacturerNames.put("Gearwrench","Apex Tool Group");
  manufacturerNames.put("H.K. Porter","Apex Tool 
Group");
  manufacturerNames.put("Jacobs","Apex Tool Group");
  manufacturerNames.put("Jobox","Apex Tool Group");
  manufacturerNames.put("Lufkin","Apex Tool Group");
  manufacturerNames.put("Nicholson","Apex Tool Group");
  manufacturerNames.put("Plumb","Apex Tool Group");
  manufacturerNames.put("Wiss","Apex Tool Group");
  manufacturerNames.put("Covert","Apex Tool Group");
  manufacturerNames.put("Apex-Geta","Apex Tool Group");
  manufacturerNames.put("Dotco-Airetool","Apex Tool 
Group");
  manufacturerNames.put("Apex","Apex Tool Group");
  manufacturerNames.put("Cleco","Apex Tool Group");
  manufacturerNames.put("Dotco","Apex Tool Group");
  manufacturerNames.put("Erem","Apex Tool Group");
  manufacturerNames.put("Master Power","Apex Tool 
Group");
  manufacturerNames.put("Recoules Quackenbush","Apex 
Tool Group");
  manufacturerNames.put("Apex-Utica","Apex Tool Group");
  manufacturerNames.put("Weller","Apex Tool Group");
  manufacturerNames.put("Xcelite","Apex Tool Group");
  manufacturerNames.put("JET","JPW Industries");
  manufacturerNames.put("Powermatic","JPW Industries");
  manufacturerNames.put("Wilton","JPW Industries");
  manufacturerNames.put("Black+Decker","StanleyBlack & 
Decker");
  manufacturerNames.put("BlackhawkBy 
Proto","StanleyBlack & Decker");
  manufacturerNames.put("Bostitch","StanleyBlack & 
Decker");
  manufacturerNames.put("Cribmaster","StanleyBlack & 
Decker");
  manufacturerNames.put("DeWALT","StanleyBlack & 
Decker");
  manufacturerNames.put("Expert (Hand Tools & 
Accessories); Expert (Wrenches)","StanleyBlack & Decker");
  manufacturerNames.put("Facom","StanleyBlack & 
Decker");
  manufacturerNames.put("Mac","StanleyBlack & Decker");
  manufacturerNames.put("Lista","StanleyBlack & 
Decker");
  manufacturerNames.put("Porter-Cable","StanleyBlack & 
Decker");
  manufacturerNames.put("Powers","StanleyBlack & 
Decker");
  manufacturerNames.put("Proto","StanleyBlack & 
Decker");
  manufacturerNames.put("Stanley","StanleyBlack & 
Decker");
  manufacturerNames.put("Vidmar","StanleyBlack & 
Decker");
  manufacturerNames.put("Abell-Howe","Columbus 
McKinnon");
  manufacturerNames.put("Budgit Hoists","Columbus 
McKinnon");
  manufacturerNames.put("Cady Lifters","Columbus 
McKinnon");
  manufacturerNames.put("Chester Hoist","Columbus 
McKinnon");

[jira] [Created] (SPARK-20339) Issue in regex_replace in Apache Spark Java

2017-04-14 Thread Nischay (JIRA)
Nischay created SPARK-20339:
---

 Summary: Issue in regex_replace in Apache Spark Java
 Key: SPARK-20339
 URL: https://issues.apache.org/jira/browse/SPARK-20339
 Project: Spark
  Issue Type: Question
  Components: Java API, Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Nischay


We are currently facing a couple of issues:
1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator grows beyond 64 KB"
2. "java.lang.StackOverflowError"
The first issue is reported as a major bug in the Apache Spark JIRA: https://issues.apache.org/jira/browse/SPARK-18492

We hit these issues with the following program. We are trying to replace the 
manufacturer name with its equivalent alternate name.

These issues occur only when we have a huge number of alternate names to 
replace; for a small number of replacements it works with no issues.
dataFileContent = dataFileContent.withColumn("ManufacturerSource", 
regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));

Kindly suggest an alternative method or a solution to work around this problem.

Hashtable manufacturerNames = new Hashtable();
  Enumeration names;
  String str;
  double bal;

  manufacturerNames.put("Allen","Apex Tool Group");
  manufacturerNames.put("Armstrong","Apex Tool Group");
  manufacturerNames.put("Campbell","Apex Tool Group");
  manufacturerNames.put("Lubriplate","Apex Tool Group");
  manufacturerNames.put("Delta","Apex Tool Group");
  manufacturerNames.put("Gearwrench","Apex Tool Group");
  manufacturerNames.put("H.K. Porter","Apex Tool 
Group");
  manufacturerNames.put("Jacobs","Apex Tool Group");
  manufacturerNames.put("Jobox","Apex Tool Group");
  manufacturerNames.put("Lufkin","Apex Tool Group");
  manufacturerNames.put("Nicholson","Apex Tool Group");
  manufacturerNames.put("Plumb","Apex Tool Group");
  manufacturerNames.put("Wiss","Apex Tool Group");
  manufacturerNames.put("Covert","Apex Tool Group");
  manufacturerNames.put("Apex-Geta","Apex Tool Group");
  manufacturerNames.put("Dotco-Airetool","Apex Tool 
Group");
  manufacturerNames.put("Apex","Apex Tool Group");
  manufacturerNames.put("Cleco","Apex Tool Group");
  manufacturerNames.put("Dotco","Apex Tool Group");
  manufacturerNames.put("Erem","Apex Tool Group");
  manufacturerNames.put("Master Power","Apex Tool 
Group");
  manufacturerNames.put("Recoules Quackenbush","Apex 
Tool Group");
  manufacturerNames.put("Apex-Utica","Apex Tool Group");
  manufacturerNames.put("Weller","Apex Tool Group");
  manufacturerNames.put("Xcelite","Apex Tool Group");
  manufacturerNames.put("JET","JPW Industries");
  manufacturerNames.put("Powermatic","JPW Industries");
  manufacturerNames.put("Wilton","JPW Industries");
  manufacturerNames.put("Black+Decker","StanleyBlack & 
Decker");
  manufacturerNames.put("BlackhawkBy 
Proto","StanleyBlack & Decker");
  manufacturerNames.put("Bostitch","StanleyBlack & 
Decker");
  manufacturerNames.put("Cribmaster","StanleyBlack & 
Decker");
  manufacturerNames.put("DeWALT","StanleyBlack & 
Decker");
  manufacturerNames.put("Expert (Hand Tools & 
Accessories); Expert (Wrenches)","StanleyBlack & Decker");
  manufacturerNames.put("Facom","StanleyBlack & 
Decker");
  manufacturerNames.put("Mac","StanleyBlack & Decker");
  manufacturerNames.put("Lista","StanleyBlack & 
Decker");
  manufacturerNames.put("Porter-Cable","StanleyBlack & 
Decker");
  manufacturerNames.put("Powers","StanleyBlack & 
Decker");
  manufacturerNames.put("Proto","StanleyBlack & 
Decker");
  manufacturerNames.put("Stanley","StanleyBlack & 
Decker");
  manufacturerNames.put("Vidmar","StanleyBlack & 
Decker");
  manufacturerNames.put("Abell-Howe","Columbus 
McKinnon");
  manufacturerNames.put("Budgit 

[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-04-14 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969388#comment-15969388
 ] 

Thomas Graves commented on SPARK-20178:
---

One thing I ran into today which is somewhat related to this is a combination 
of failure types. In this case it was broadcast fetch failures combined with 
shuffle fetch failures, which led to 4 task failures and failed the job. I 
believe they were all from the same host and happened really quickly (within 3 
seconds). This seems like it should fall under the fetch failure case as well.

 Failed to get broadcast_646_piece0 of broadcast_646
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:691)
at 
org.apache.spark.MapOutputTracker.getStatuses(MapOutputTracker.scala:204)
at 
org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:143)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:47)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:147)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)

> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures. There are 4 JIRAs currently related to this.
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753
> I will put my initial thoughts in a follow-on comment.






[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-04-14 Thread Yongqin Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969384#comment-15969384
 ] 

Yongqin Xiao commented on SPARK-18406:
--

The same JIRA is found under 
https://issues-test.apache.org/jira/browse/SPARK-18406.
The same issue is observed in Spark 2.1.0 as well.

The issue is observed on a simple Spark query that compiles into 3 stages. I 
have a custom RDD being used in this case, and it is registered for 
persistence. In the compute() method of the custom RDD, I spawn a new thread to 
compute the data in the background, and then immediately return an abstract 
iterator (wrapped in an InterruptibleIterator) that gets data from the 
background computation on demand. The assertion happens when the iterator of 
the parent RDD reaches the end of the data. This issue doesn't always happen 
when the custom RDD is used in the query, regardless of whether it is used once 
or multiple times.

The issue is related to the new thread I created, which accesses data from the 
input iterator of the parent RDD. The new thread is missing the TSS 
(thread-specific storage) for TaskContext. I see BlockInfoManager is using this 
TSS TaskContext as the key to search the storage.

Here is a log line showing the task ID being unset in this thread: Line 1819: 
2017/04/13 15:01:01.674 [Thread-33]: TRACE storage.BlockInfoManager: Task -1024 
releasing lock for rdd_25_0

However, I have no way to set the TSS for my thread now because the method is 
protected, as below:
object TaskContext {
  ...
  private[this] val taskContext: ThreadLocal[TaskContext] = new ThreadLocal[TaskContext]

  // Note: protected[spark] instead of private[spark] to prevent the following two from
  // showing up in JavaDoc.
  /** Set the thread local TaskContext. Internal to Spark. */
  protected[spark] def setTaskContext(tc: TaskContext): Unit = taskContext.set(tc)

Just to confirm my theory, I made TaskContext.setTaskContext public and called 
it at the beginning of my thread. The use cases that were failing consistently 
with the assertion on lock release now run successfully in all scenarios I have 
tried, which include different numbers of source/shuffle partitions, different 
numbers of executors, and async vs. sequential execution (for having multiple 
downstream custom RDDs pull data from the upstream RDD).
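(Editorial sketch of the workaround described above, not an endorsed API: TaskContext.setTaskContext is protected[spark], so this only compiles if placed in an org.apache.spark.* package, or if the setter were made public as the commenter requests.)

{code}
// Sketch only: propagate the task's TaskContext into a background thread so
// BlockInfoManager sees the right thread-local context when locks are released.
import org.apache.spark.TaskContext

val tc = TaskContext.get()                  // captured on the task's main thread
val worker = new Thread(new Runnable {
  override def run(): Unit = {
    TaskContext.setTaskContext(tc)          // assumes access to the protected[spark] setter
    // ... consume the parent RDD's iterator here ...
  }
})
worker.start()
{code}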

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 

[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs

2017-04-14 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969379#comment-15969379
 ] 

Michael Gummelt commented on SPARK-20328:
-

Ah, yes, of course.  Thanks.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -
>
> Key: SPARK-20328
> URL: https://issues.apache.org/jira/browse/SPARK-20328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.1.2
>Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos 
> principal for use as renewer
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>   at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.






[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-14 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969341#comment-15969341
 ] 

Michael Gummelt commented on SPARK-16742:
-

[~jerryshao] No, but you can look at our solution here: 
https://github.com/mesosphere/spark/commit/0a2cc4248039ca989e177e96e92a594a025661fe#diff-79391110e9f26657e415aa169a004998R129

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa






[jira] [Comment Edited] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-14 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969341#comment-15969341
 ] 

Michael Gummelt edited comment on SPARK-16742 at 4/14/17 6:01 PM:
--

[~jerryshao] No, but you can look at our solution here: 
https://github.com/mesosphere/spark/commit/0a2cc4248039ca989e177e96e92a594a025661fe#diff-79391110e9f26657e415aa169a004998R129

The code we upstream will be quite different, but the delegation token handling 
will be similar.


was (Author: mgummelt):
[~jerryshao] No, but you can look at our solution here: 
https://github.com/mesosphere/spark/commit/0a2cc4248039ca989e177e96e92a594a025661fe#diff-79391110e9f26657e415aa169a004998R129

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17608) Long type has incorrect serialization/deserialization

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17608:


Assignee: (was: Apache Spark)

> Long type has incorrect serialization/deserialization
> -
>
> Key: SPARK-17608
> URL: https://issues.apache.org/jira/browse/SPARK-17608
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Thomas Powell
>
> Am hitting issues when using {{dapply}} on a data frame that contains a 
> {{bigint}} in its schema. When this is converted to a SparkR data frame, a 
> "bigint" gets converted to an R {{numeric}} type: 
> https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25.
> However, the R {{numeric}} type gets converted to 
> {{org.apache.spark.sql.types.DoubleType}}: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L97.
> The two directions therefore aren't compatible. If I use the same schema when 
> using dapply (and just an identity function) I will get type collisions 
> because the output type is a double but the schema expects a bigint. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17608) Long type has incorrect serialization/deserialization

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17608:


Assignee: Apache Spark

> Long type has incorrect serialization/deserialization
> -
>
> Key: SPARK-17608
> URL: https://issues.apache.org/jira/browse/SPARK-17608
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Thomas Powell
>Assignee: Apache Spark
>
> Am hitting issues when using {{dapply}} on a data frame that contains a 
> {{bigint}} in its schema. When this is converted to a SparkR data frame, a 
> "bigint" gets converted to an R {{numeric}} type: 
> https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25.
> However, the R {{numeric}} type gets converted to 
> {{org.apache.spark.sql.types.DoubleType}}: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L97.
> The two directions therefore aren't compatible. If I use the same schema when 
> using dapply (and just an identity function) I will get type collisions 
> because the output type is a double but the schema expects a bigint. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17608) Long type has incorrect serialization/deserialization

2017-04-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969328#comment-15969328
 ] 

Apache Spark commented on SPARK-17608:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/17640

> Long type has incorrect serialization/deserialization
> -
>
> Key: SPARK-17608
> URL: https://issues.apache.org/jira/browse/SPARK-17608
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Thomas Powell
>
> Am hitting issues when using {{dapply}} on a data frame that contains a 
> {{bigint}} in its schema. When this is converted to a SparkR data frame, a 
> "bigint" gets converted to an R {{numeric}} type: 
> https://github.com/apache/spark/blob/master/R/pkg/R/types.R#L25.
> However, the R {{numeric}} type gets converted to 
> {{org.apache.spark.sql.types.DoubleType}}: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L97.
> The two directions therefore aren't compatible. If I use the same schema when 
> using dapply (and just an identity function) I will get type collisions 
> because the output type is a double but the schema expects a bigint. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race

2017-04-14 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-20243:
--
Fix Version/s: 2.1.1

> DebugFilesystem.assertNoOpenStreams thread race
> ---
>
> Key: SPARK-20243
> URL: https://issues.apache.org/jira/browse/SPARK-20243
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
> Fix For: 2.1.1, 2.2.0
>
>
> Introduced by SPARK-19946.
> DebugFilesystem.assertNoOpenStreams gets the size of the openStreams 
> ConcurrentHashMap and then later, if the size was > 0, accesses the first 
> element in openStreams.values. But the ConcurrentHashMap might be cleared by 
> another thread between getting its size and accessing it, resulting in an 
> exception when trying to call .head on an empty collection.
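> A minimal sketch of the race and of one way around it (a snapshot); this is an 
> illustration only, not the actual patch:
> {code}
> import java.util.concurrent.ConcurrentHashMap
> import scala.collection.JavaConverters._
>
> val openStreams = new ConcurrentHashMap[Long, Throwable]()
>
> // racy: size() and values() observe the map at two different points in time
> if (openStreams.size() > 0) {
>   val cause = openStreams.values().asScala.head  // NoSuchElementException if another thread cleared the map
> }
>
> // safer: take one snapshot and only inspect the snapshot
> val leaked = openStreams.values().asScala.toList
> leaked.headOption.foreach { cause =>
>   throw new IllegalStateException("There are open streams", cause)
> }
> {code}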



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-04-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969051#comment-15969051
 ] 

Apache Spark commented on SPARK-19716:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17639

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works: we extract the 
> `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<a: int, b: int, c: int>>}}, and we want to convert it to a 
> Dataset with {{case class ComplexData(arr: Seq[Data])}}, we will fail. 
> The reason is that, to allow compatible types (e.g. converting {{a: int}} to 
> {{case class A(a: Long)}}), we add a cast for each field, except for struct 
> type fields, because a struct is flexible and its number of columns can 
> mismatch. We should probably also skip the cast for array and map types.
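> A short spark-shell style sketch of the two cases above ({{spark}} is the active 
> session; this illustrates the report, not the fix):
> {code}
> import org.apache.spark.sql.functions.{array, struct}
> import spark.implicits._
>
> case class Data(a: Int, c: Int)
> case class ComplexData(arr: Seq[Data])
>
> val df = Seq((1, 2, 3)).toDF("a", "b", "c")
> df.as[Data].collect()            // works: a and c are resolved by name, b is ignored
>
> val nested = df.select(array(struct($"a", $"b", $"c")).as("arr"))
> nested.as[ComplexData].collect() // fails before the fix: the struct inside the array is not resolved by name
> {code}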



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-14 Thread Andrei Taleanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969042#comment-15969042
 ] 

Andrei Taleanu commented on SPARK-20323:


Ok, thank you. Unfortunately I can't even get at-least once - you can see that 
from the example I provided in the last comment. Anyway, I'll try that.

> Calling stop in a transform stage causes the app to hang
> 
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen but I've run 
> into this issue with the following code:
> {noformat}
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc  = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
> .map(seq => ssc.sparkContext.parallelize(seq))
> .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
> try {
>   if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>   else println("lucky bastard")
>   rdd
> } catch {
>   case e: Throwable =>
> println("stopping streaming context", e)
> ssc.stop(stopSparkContext = true, stopGracefully = false)
> throw e
> }
>   }
>   transformed.foreachRDD { rdd =>
> println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD 
> is processed), the spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the 
> app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case, is that normal? I 
> also asked this 
> [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
> but up to this point there's no answer pointing exactly to what happens, 
> only guidelines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969030#comment-15969030
 ] 

Sean Owen commented on SPARK-20323:
---

It's normal to continue processing if one batch fails. Achieving the semantics 
you want is up to your app and depends on your source and destination. I don't 
think you can get exactly-once semantics here because there's no way to make 
the DB write and the Kafka update atomic. You can get at-least-once or 
at-most-once semantics though.

But this is not the question here. You're just asking how to stop the context, 
and I think it's simply best to do so with a try-finally wrapping 
ssc.awaitTermination().
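
A minimal sketch of that shape (names are placeholders and the stream setup is 
elided):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("fail-fast").setMaster("local[*]")
val ssc  = new StreamingContext(conf, Seconds(1))
// ... build the DStream graph and its output operations here ...

ssc.start()
try {
  ssc.awaitTermination()  // an exception that failed a batch surfaces here
} finally {
  ssc.stop(stopSparkContext = true, stopGracefully = false)
}
{code}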

> Calling stop in a transform stage causes the app to hang
> 
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen but I've run 
> into this issue with the following code:
> {noformat}
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc  = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
> .map(seq => ssc.sparkContext.parallelize(seq))
> .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
> try {
>   if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>   else println("lucky bastard")
>   rdd
> } catch {
>   case e: Throwable =>
> println("stopping streaming context", e)
> ssc.stop(stopSparkContext = true, stopGracefully = false)
> throw e
> }
>   }
>   transformed.foreachRDD { rdd =>
> println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD 
> is processed), the spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the 
> app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case, is that normal? I 
> also asked this 
> [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
> but up to this point there's no answer pointing exactly to what happens, 
> only guidelines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-14 Thread Andrei Taleanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968969#comment-15968969
 ] 

Andrei Taleanu commented on SPARK-20323:


[~srowen] I see. Let me describe the problem better. Short version: I have 
a *streaming job*. Although a *batch fails, the processing continues* 
if I let Spark handle the thrown exceptions on its own. This translates to data 
loss and the loss of at-least-once semantics.

Detailed version: I originally started from an app we run on Spark 2.1.0 on top 
of Mesos w/ Hadoop 2.6, with checkpointing disabled (it's done "manually" as you'll 
see below). I tried narrowing it down as much as possible to reproduce a 
similar issue in local mode, just for illustration purposes (that's where 
the code I put in the issue description came from). Consider the following use case:
{noformat}
1) read data from a Kafka source
2) transform the dstream:
  a) get data from an external service to avoid too many calls from executors 
(might fail)
  b) broadcast the data
  c) map the RDD using the broadcast value
3) cache the transformed dstream
4) foreach RDD write cached data into a db (might fail)
5) foreach RDD:
  a) write cached data in Kafka (might fail)
  b) manually commit the new Kafka offsets (because I need a human readable 
format)
{noformat}

There are multiple points of failure here (e.g. 2.a), and what I need is to fail 
as soon as possible (see 5.b: it means data loss if anything prior to it failed in 
a micro-batch). Obviously manipulating the context in transform is 
wrong, and doing this in foreachRDD on the same thread is wrong as well (as 
pointed out by [~zsxwing] in SPARK-20321).

What's the recommended way to handle this? If I just let Spark handle 
exceptions on its own, it seems to ignore them (the 2.a case, for example) and 
continue processing. Since this means data loss, I need to avoid it (I need 
at-least-once guarantees).

Thanks again :)

> Calling stop in a transform stage causes the app to hang
> 
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen but I've run 
> into this issue with the following code:
> {noformat}
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc  = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
> .map(seq => ssc.sparkContext.parallelize(seq))
> .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
> try {
>   if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>   else println("lucky bastard")
>   rdd
> } catch {
>   case e: Throwable =>
> println("stopping streaming context", e)
> ssc.stop(stopSparkContext = true, stopGracefully = false)
> throw e
> }
>   }
>   transformed.foreachRDD { rdd =>
> println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD 
> is processed), the spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the 
> app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case, is that normal? I 
> also asked this 
> [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
> but up to this point there's no answer pointing exactly to what happens, 
> only guidelines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12717) pyspark broadcast fails when using multiple threads

2017-04-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968945#comment-15968945
 ] 

Maciej Bryński edited comment on SPARK-12717 at 4/14/17 12:12 PM:
--

Same problem with Python3.

CC: [~davies]


was (Author: maver1ck):
Same here.

CC: [~davies]

> pyspark broadcast fails when using multiple threads
> ---
>
> Key: SPARK-12717
> URL: https://issues.apache.org/jira/browse/SPARK-12717
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Linux, python 2.6 or python 2.7.
>Reporter: Edward Walker
>Priority: Critical
>
> The following multi-threaded program that uses broadcast variables 
> consistently throws exceptions like:  *Exception("Broadcast variable '18' not 
> loaded!",)* --- even when run with "--master local[10]".
> {code:title=bug_spark.py|borderStyle=solid}
> try:
>     import pyspark
> except:
>     pass
>
> from optparse import OptionParser
>
> def my_option_parser():
>     op = OptionParser()
>     op.add_option("--parallelism", dest="parallelism", type="int", default=20)
>     return op
>
> def do_process(x, w):
>     return x * w.value
>
> def func(name, rdd, conf):
>     new_rdd = rdd.map(lambda x: do_process(x, conf))
>     total = new_rdd.reduce(lambda x, y: x + y)
>     count = rdd.count()
>     print name, 1.0 * total / count
>
> if __name__ == "__main__":
>     import threading
>     op = my_option_parser()
>     options, args = op.parse_args()
>     sc = pyspark.SparkContext(appName="Buggy")
>     data_rdd = sc.parallelize(range(0, 1000), 1)
>     confs = [ sc.broadcast(i) for i in xrange(options.parallelism) ]
>     threads = [ threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]]) for i in xrange(options.parallelism) ]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()
> {code}
> Abridged run output:
> {code:title=abridge_run.txt|borderStyle=solid}
> % spark-submit --master local[10] bug_spark.py 

[jira] [Comment Edited] (SPARK-12717) pyspark broadcast fails when using multiple threads

2017-04-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968945#comment-15968945
 ] 

Maciej Bryński edited comment on SPARK-12717 at 4/14/17 12:10 PM:
--

Same here.

CC: [~davies]


was (Author: maver1ck):
Same here.

> pyspark broadcast fails when using multiple threads
> ---
>
> Key: SPARK-12717
> URL: https://issues.apache.org/jira/browse/SPARK-12717
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Linux, python 2.6 or python 2.7.
>Reporter: Edward Walker
>Priority: Critical
>
> The following multi-threaded program that uses broadcast variables 
> consistently throws exceptions like:  *Exception("Broadcast variable '18' not 
> loaded!",)* --- even when run with "--master local[10]".
> {code:title=bug_spark.py|borderStyle=solid}
> try:
>     import pyspark
> except:
>     pass
>
> from optparse import OptionParser
>
> def my_option_parser():
>     op = OptionParser()
>     op.add_option("--parallelism", dest="parallelism", type="int", default=20)
>     return op
>
> def do_process(x, w):
>     return x * w.value
>
> def func(name, rdd, conf):
>     new_rdd = rdd.map(lambda x: do_process(x, conf))
>     total = new_rdd.reduce(lambda x, y: x + y)
>     count = rdd.count()
>     print name, 1.0 * total / count
>
> if __name__ == "__main__":
>     import threading
>     op = my_option_parser()
>     options, args = op.parse_args()
>     sc = pyspark.SparkContext(appName="Buggy")
>     data_rdd = sc.parallelize(range(0, 1000), 1)
>     confs = [ sc.broadcast(i) for i in xrange(options.parallelism) ]
>     threads = [ threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]]) for i in xrange(options.parallelism) ]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()
> {code}
> Abridged run output:
> {code:title=abridge_run.txt|borderStyle=solid}
> % spark-submit --master local[10] bug_spark.py --parallelism 20
> [snip]
> 

[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968943#comment-15968943
 ] 

Sean Owen commented on SPARK-20323:
---

Not being able to use a context in a remote operation? I think it's kind of 
implicit because it doesn't make sense in the context of the Spark model, but I 
don't know if it's explicitly stated. It will throw an obvious error in most 
other cases that you try it.

The examples show stopping the context at the end of the program. They don't 
consistently show using a try-finally block for it, but that is good practice. 
You just do it outside of streaming operations.

> Calling stop in a transform stage causes the app to hang
> 
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen but I've run 
> into this issue with the following code:
> {noformat}
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc  = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
> .map(seq => ssc.sparkContext.parallelize(seq))
> .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
> try {
>   if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>   else println("lucky bastard")
>   rdd
> } catch {
>   case e: Throwable =>
> println("stopping streaming context", e)
> ssc.stop(stopSparkContext = true, stopGracefully = false)
> throw e
> }
>   }
>   transformed.foreachRDD { rdd =>
> println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD 
> is processed), the spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the 
> app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case, is that normal? I 
> also asked this 
> [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
> but up to this point there's no answer pointing exactly to what happens, 
> only guidelines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12717) pyspark broadcast fails when using multiple threads

2017-04-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968945#comment-15968945
 ] 

Maciej Bryński commented on SPARK-12717:


Same here.

> pyspark broadcast fails when using multiple threads
> ---
>
> Key: SPARK-12717
> URL: https://issues.apache.org/jira/browse/SPARK-12717
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Linux, python 2.6 or python 2.7.
>Reporter: Edward Walker
>Priority: Critical
>
> The following multi-threaded program that uses broadcast variables 
> consistently throws exceptions like:  *Exception("Broadcast variable '18' not 
> loaded!",)* --- even when run with "--master local[10]".
> {code:title=bug_spark.py|borderStyle=solid}
> try:
>     import pyspark
> except:
>     pass
>
> from optparse import OptionParser
>
> def my_option_parser():
>     op = OptionParser()
>     op.add_option("--parallelism", dest="parallelism", type="int", default=20)
>     return op
>
> def do_process(x, w):
>     return x * w.value
>
> def func(name, rdd, conf):
>     new_rdd = rdd.map(lambda x: do_process(x, conf))
>     total = new_rdd.reduce(lambda x, y: x + y)
>     count = rdd.count()
>     print name, 1.0 * total / count
>
> if __name__ == "__main__":
>     import threading
>     op = my_option_parser()
>     options, args = op.parse_args()
>     sc = pyspark.SparkContext(appName="Buggy")
>     data_rdd = sc.parallelize(range(0, 1000), 1)
>     confs = [ sc.broadcast(i) for i in xrange(options.parallelism) ]
>     threads = [ threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]]) for i in xrange(options.parallelism) ]
>     for t in threads:
>         t.start()
>     for t in threads:
>         t.join()
> {code}
> Abridged run output:
> {code:title=abridge_run.txt|borderStyle=solid}
> % spark-submit --master local[10] bug_spark.py --parallelism 20
> [snip]
> 16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
> 

[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-14 Thread Andrei Taleanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968940#comment-15968940
 ] 

Andrei Taleanu commented on SPARK-20323:


[~srowen] could you please give me a link that documents this bad practice? I 
understood and expected that this was an incorrect approach since 
transformations are lazy. However, I asked the question because of what I would 
say is a common problem: you have a streaming app, and when something goes wrong 
on either the driver or the executors you want to fail fast in order to avoid 
data loss / corruption. Stopping the context seems the most straightforward 
way, but doing so appears not to be that easy. So how should one handle this 
case? Thanks.

> Calling stop in a transform stage causes the app to hang
> 
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen but I've run 
> into this issue with the following code:
> {noformat}
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc  = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
> .map(seq => ssc.sparkContext.parallelize(seq))
> .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
> try {
>   if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>   else println("lucky bastard")
>   rdd
> } catch {
>   case e: Throwable =>
> println("stopping streaming context", e)
> ssc.stop(stopSparkContext = true, stopGracefully = false)
> throw e
> }
>   }
>   transformed.foreachRDD { rdd =>
> println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD 
> is processed), the spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the 
> app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case, is that normal? I 
> also asked this 
> [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
> but up to this point there's no answer pointing exactly to what happens, 
> only guidelines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10109) NPE when saving Parquet To HDFS

2017-04-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968934#comment-15968934
 ] 

Steve Loughran commented on SPARK-10109:


SPARK-20038 should stop the failure being so dramatic

> NPE when saving Parquet To HDFS
> ---
>
> Key: SPARK-10109
> URL: https://issues.apache.org/jira/browse/SPARK-10109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Sparc-ec2, standalone cluster on amazon
>Reporter: Virgil Palanciuc
>
> Very simple code, trying to save a dataframe.
> I get this in the driver:
> {quote}
> 15/08/19 11:21:41 INFO TaskSetManager: Lost task 9.2 in stage 217.0 (TID 
> 4748) on executor 172.xx.xx.xx: java.lang.NullPointerException (null) 
> and  (not for that task):
> 15/08/19 11:21:46 WARN TaskSetManager: Lost task 5.0 in stage 543.0 (TID 
> 5607, 172.yy.yy.yy): java.lang.NullPointerException
> at 
> parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
> at 
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
> at 
> parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
> at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:88)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer.clearOutputWriters(commands.scala:536)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer.abortTask(commands.scala:552)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:269)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> I get this in the executor log:
> {quote}
> 15/08/19 11:21:41 WARN DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on 
> /gglogs/2015-07-27/_temporary/_attempt_201508191119_0217_m_09_2/dpid=18432/pid=1109/part-r-9-46ac3a79-a95c-4d9c-a2f1-b3ee76f6a46c.snappy.parquet
>  File does not exist. Holder DFSClient_NONMAPREDUCE_1730998114_63 does not 
> have any open files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 

[jira] [Assigned] (SPARK-20318) Use Catalyst type for min/max in ColumnStat for ease of estimation

2017-04-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20318:
---

Assignee: Zhenhua Wang

> Use Catalyst type for min/max in ColumnStat for ease of estimation
> --
>
> Key: SPARK-20318
> URL: https://issues.apache.org/jira/browse/SPARK-20318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> Currently when estimating predicates like col > literal or col = literal, we 
> will update min or max in column stats based on literal value. However, 
> literal value is of Catalyst type (internal type), while min/max is of 
> external type. This causes unnecessary conversion when comparing them and 
> updating column stats.
> To solve this, we can use Catalyst type for min/max in ColumnStat for ease of 
> estimation. Note that the persistent form in metastore is still of external 
> type, so there's no inconsistency for statistics in metastore.
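> A small illustration of the internal/external gap (this uses an internal 
> converter purely for illustration; it is not part of the change itself):
> {code}
> import org.apache.spark.sql.catalyst.CatalystTypeConverters
>
> // external value, as min/max used to be stored in ColumnStat
> val externalMin = java.sql.Date.valueOf("2017-01-01")
>
> // Catalyst (internal) form, as a filter literal such as col > date '2017-01-01' carries it:
> // an Int counting days since the epoch
> val internalMin = CatalystTypeConverters.convertToCatalyst(externalMin)
>
> // keeping min/max in the internal form avoids converting on every comparison during estimation
> {code}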



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20318) Use Catalyst type for min/max in ColumnStat for ease of estimation

2017-04-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20318.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17630
[https://github.com/apache/spark/pull/17630]

> Use Catalyst type for min/max in ColumnStat for ease of estimation
> --
>
> Key: SPARK-20318
> URL: https://issues.apache.org/jira/browse/SPARK-20318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
> Fix For: 2.2.0
>
>
> Currently when estimating predicates like col > literal or col = literal, we 
> will update min or max in column stats based on literal value. However, 
> literal value is of Catalyst type (internal type), while min/max is of 
> external type. This causes unnecessary conversion when comparing them and 
> updating column stats.
> To solve this, we can use Catalyst type for min/max in ColumnStat for ease of 
> estimation. Note that the persistent form in metastore is still of external 
> type, so there's no inconsistency for statistics in metastore.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled

2017-04-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968882#comment-15968882
 ] 

Apache Spark commented on SPARK-20338:
--

User 'zuotingbing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17638

> Spaces in spark.eventLog.dir are not correctly handled
> --
>
> Key: SPARK-20338
> URL: https://issues.apache.org/jira/browse/SPARK-20338
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: zuotingbing
>Priority: Minor
>
> Set spark.eventLog.dir=/home/mr/event log and submit an app; we get the 
> following error:
> 2017-04-14 17:28:40,378 INFO org.apache.spark.SparkContext: Successfully 
> stopped SparkContext
> Exception in thread "main" ExitCodeException exitCode=1: chmod: cannot access 
> `/home/mr/event%20log/app-20170414172839-.inprogress': No such file or 
> directory
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
>   at org.apache.hadoop.util.Shell.run(Shell.java:478)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:831)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:814)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:712)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:506)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:125)
>   at org.apache.spark.SparkContext.(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:879)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:871)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:871)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20338:


Assignee: (was: Apache Spark)

> Spaces in spark.eventLog.dir are not correctly handled
> --
>
> Key: SPARK-20338
> URL: https://issues.apache.org/jira/browse/SPARK-20338
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: zuotingbing
>Priority: Minor
>
> Set spark.eventLog.dir=/home/mr/event log and submit an app; we get the 
> following error:
> 2017-04-14 17:28:40,378 INFO org.apache.spark.SparkContext: Successfully 
> stopped SparkContext
> Exception in thread "main" ExitCodeException exitCode=1: chmod: cannot access 
> `/home/mr/event%20log/app-20170414172839-.inprogress': No such file or 
> directory
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
>   at org.apache.hadoop.util.Shell.run(Shell.java:478)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:831)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:814)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:712)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:506)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:125)
>   at org.apache.spark.SparkContext.(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:879)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:871)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:871)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20338:


Assignee: Apache Spark

> Spaces in spark.eventLog.dir are not correctly handled
> --
>
> Key: SPARK-20338
> URL: https://issues.apache.org/jira/browse/SPARK-20338
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: zuotingbing
>Assignee: Apache Spark
>Priority: Minor
>
> Set spark.eventLog.dir=/home/mr/event log and submit an app; we get the 
> following error:
> 2017-04-14 17:28:40,378 INFO org.apache.spark.SparkContext: Successfully 
> stopped SparkContext
> Exception in thread "main" ExitCodeException exitCode=1: chmod: cannot access 
> `/home/mr/event%20log/app-20170414172839-.inprogress': No such file or 
> directory
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
>   at org.apache.hadoop.util.Shell.run(Shell.java:478)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:831)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:814)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:712)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:506)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:125)
>   at org.apache.spark.SparkContext.(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:879)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:871)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:871)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled

2017-04-14 Thread zuotingbing (JIRA)
zuotingbing created SPARK-20338:
---

 Summary: Spaces in spark.eventLog.dir are not correctly handled
 Key: SPARK-20338
 URL: https://issues.apache.org/jira/browse/SPARK-20338
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: zuotingbing
Priority: Minor


Set spark.eventLog.dir=/home/mr/event log and submit an app; we get the 
following error:
2017-04-14 17:28:40,378 INFO org.apache.spark.SparkContext: Successfully stopped 
SparkContext
Exception in thread "main" ExitCodeException exitCode=1: chmod: cannot access 
`/home/mr/event%20log/app-20170414172839-.inprogress': No such file or 
directory

at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:831)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:814)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:712)
at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:506)
at 
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:125)
at org.apache.spark.SparkContext.(SparkContext.scala:516)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:879)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:871)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:871)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20337:


Assignee: (was: Apache Spark)

> Support upgrade a jar dependency and don't restart SparkContext
> ---
>
> Key: SPARK-20337
> URL: https://issues.apache.org/jira/browse/SPARK-20337
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Priority: Minor
>
> Suppose that we need to upgrade a jar dependency and don't want to restart 
> {{SparkContext}}. Something like this:
> {code}
> sc.addJar("breeze-natives_2.11-0.12.jar")
> // do something
> sc.removeJar("spark://192.168.26.200:23420/jar/breeze-natives_2.11-0.12.jar")
> sc.addJar("breeze-natives_2.11-0.13.jar")
> // do something
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext

2017-04-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20337:


Assignee: Apache Spark

> Support upgrade a jar dependency and don't restart SparkContext
> ---
>
> Key: SPARK-20337
> URL: https://issues.apache.org/jira/browse/SPARK-20337
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Suppose that we need to upgrade a jar dependency and don't want to restart 
> {{SparkContext}}. Something like this:
> {code}
> sc.addJar("breeze-natives_2.11-0.12.jar")
> // do something
> sc.removeJar("spark://192.168.26.200:23420/jar/breeze-natives_2.11-0.12.jar")
> sc.addJar("breeze-natives_2.11-0.13.jar")
> // do something
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext

2017-04-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968868#comment-15968868
 ] 

Apache Spark commented on SPARK-20337:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/17637

> Support upgrade a jar dependency and don't restart SparkContext
> ---
>
> Key: SPARK-20337
> URL: https://issues.apache.org/jira/browse/SPARK-20337
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Priority: Minor
>
> Suppose that we need to upgrade a jar dependency and don't want to restart 
> {{SparkContext}}. Something like this:
> {code}
> sc.addJar("breeze-natives_2.11-0.12.jar")
> // do something
> sc.removeJar("spark://192.168.26.200:23420/jar/breeze-natives_2.11-0.12.jar")
> sc.addJar("breeze-natives_2.11-0.13.jar")
> // do something
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext

2017-04-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968843#comment-15968843
 ] 

Sean Owen commented on SPARK-20337:
---

I don't think this is a reasonable use case to support. Spark jobs are 
conceptually not long-lived, certainly not shells. Just restart. This raises 
all kinds of problems with references to instances of existing classes.

> Support upgrade a jar dependency and don't restart SparkContext
> ---
>
> Key: SPARK-20337
> URL: https://issues.apache.org/jira/browse/SPARK-20337
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Priority: Minor
>
> Suppose that we need to upgrade a jar dependency and don't want to restart 
> {{SparkContext}}. Something like this:
> {code}
> sc.addJar("breeze-natives_2.11-0.12.jar")
> // do something
> sc.removeJar("spark://192.168.26.200:23420/jar/breeze-natives_2.11-0.12.jar")
> sc.addJar("breeze-natives_2.11-0.13.jar")
> // do something
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20337) Support upgrade a jar dependency and don't restart SparkContext

2017-04-14 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-20337:
---

 Summary: Support upgrade a jar dependency and don't restart 
SparkContext
 Key: SPARK-20337
 URL: https://issues.apache.org/jira/browse/SPARK-20337
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Yuming Wang
Priority: Minor


Suppose that we need to upgrade a jar dependency and don't want to restart 
{{SparkContext}}. Something like this:
{code}
sc.addJar("breeze-natives_2.11-0.12.jar")
// do something
sc.removeJar("spark://192.168.26.200:23420/jar/breeze-natives_2.11-0.12.jar")
sc.addJar("breeze-natives_2.11-0.13.jar")
// do something
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app

2017-04-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20321.
---
Resolution: Not A Problem

> Spark UI cannot be shutdown in spark streaming app
> --
>
> Key: SPARK-20321
> URL: https://issues.apache.org/jira/browse/SPARK-20321
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> When an exception thrown in the transform stage is handled in foreachRDD and the 
> streaming context is forced to stop, the SparkUI appears to hang and 
> continually dumps the following logs in an infinite loop:
> {noformat}
> ...
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
> select, 0/0 selected
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop woken up from 
> select, 0/0 selected
> 2017-04-12 14:11:13,470 
> [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG 
> org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select
> ...
> {noformat}
> Unfortunately I don't have a minimal example that reproduces this issue but 
> here is what I can share:
> {noformat}
> val dstream = ... // pull data from Kafka
> val mapped = dstream.transform { rdd =>
>   val data = getData // perform a call that potentially throws an exception
>   // broadcast the data
>   // flatMap the RDD using the data
> }
> mapped.foreachRDD { rdd =>
>   try {
>     // write some data in a DB
>   } catch {
>     case t: Throwable =>
>       dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> mapped.foreachRDD { rdd =>
>   try {
>     // write data to Kafka
>     // manually checkpoint the Kafka offsets (because I need them in JSON format)
>   } catch {
>     case t: Throwable =>
>       dstream.context.stop(stopSparkContext = true, stopGracefully = false)
>   }
> }
> {noformat}
> The issue appears when stop is invoked. At the point when the SparkUI is stopped, 
> it enters that infinite loop. Initially I thought it was related to Jetty, as the 
> version used in the SparkUI had some bugs (e.g. [this 
> one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to 
> a more recent version (March 2017) and built Spark 2.1.0 with it, but 
> still got the error.
> I encountered this issue with Spark 2.1.0 built with Hadoop 2.6, running on top of 
> Mesos.
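
For reference, a minimal self-contained sketch of the reported pattern (an assumption-laden stand-in: a local queueStream replaces the Kafka source and a deliberate failure replaces the DB write, so it may or may not reproduce the SparkUI behaviour above):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object UiStopSketch extends App {
  val conf = new SparkConf().setAppName("ui-stop-sketch").setMaster("local[*]")
  val ssc  = new StreamingContext(conf, Seconds(1))

  // Stand-in for the Kafka DStream from the report.
  val rdds = (1 to 100).grouped(10).map(ssc.sparkContext.parallelize(_)).toSeq
  val dstream = ssc.queueStream(mutable.Queue[RDD[Int]](rdds: _*))

  // Stand-in for the transform that broadcasts data and flatMaps the RDD.
  val mapped = dstream.transform { rdd => rdd.map(_ + 1) }

  mapped.foreachRDD { rdd =>
    try {
      // Stand-in for the DB write; fail deliberately to exercise the stop path.
      if (rdd.count() > 5) throw new RuntimeException("db write failed")
    } catch {
      case t: Throwable =>
        // Same stop call as in the report.
        dstream.context.stop(stopSparkContext = true, stopGracefully = false)
    }
  }

  ssc.start()
  ssc.awaitTermination()
}
{code}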



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20323) Calling stop in a transform stage causes the app to hang

2017-04-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20323.
---
Resolution: Not A Problem

> Calling stop in a transform stage causes the app to hang
> 
>
> Key: SPARK-20323
> URL: https://issues.apache.org/jira/browse/SPARK-20323
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrei Taleanu
>
> I'm not sure if this is a bug or just the way it needs to happen, but I've run 
> into this issue with the following code:
> {noformat}
> import org.apache.spark.SparkConf
> import org.apache.spark.rdd.RDD
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> import scala.collection.mutable
> import scala.util.Random
>
> object ImmortalStreamingJob extends App {
>   val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]")
>   val ssc  = new StreamingContext(conf, Seconds(1))
>   val elems = (1 to 1000).grouped(10)
>     .map(seq => ssc.sparkContext.parallelize(seq))
>     .toSeq
>   val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*))
>   val transformed = stream.transform { rdd =>
>     try {
>       if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
>       else println("lucky bastard")
>       rdd
>     } catch {
>       case e: Throwable =>
>         println("stopping streaming context", e)
>         ssc.stop(stopSparkContext = true, stopGracefully = false)
>         throw e
>     }
>   }
>   transformed.foreachRDD { rdd =>
>     println(rdd.collect().mkString(","))
>   }
>   ssc.start()
>   ssc.awaitTermination()
> }
> {noformat}
> There are two things I can note here:
> * if the exception is thrown in the first transformation (when the first RDD 
> is processed), the spark context is stopped and the app dies
> * if the exception is thrown after at least one RDD has been processed, the 
> app hangs after printing the error message and never stops
> I think there's some sort of deadlock in the second case, is that normal? I 
> also asked this 
> [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624]
>  but up to this point there's no answer pointing exactly to what happens, 
> only guidelines.
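
One commonly suggested workaround (a sketch under assumptions, not a confirmed fix, reusing {{ssc}}, {{stream}} and the imports from the example above) is to only record the failure inside the {{transform}} closure and perform the stop from the main thread via {{awaitTerminationOrTimeout}}, instead of calling {{ssc.stop}} inside {{transform}}:

{code}
import java.util.concurrent.atomic.AtomicReference

// Shared flag: written from the DStream closure, read from the main thread.
val failure = new AtomicReference[Throwable](null)

val transformed = stream.transform { rdd =>
  try {
    if (Random.nextInt(6) == 5) throw new RuntimeException("boom")
    rdd
  } catch {
    case e: Throwable =>
      failure.set(e)   // record the error instead of stopping here
      throw e
  }
}

transformed.foreachRDD(rdd => println(rdd.collect().mkString(",")))

ssc.start()
try {
  // Poll from the main thread; stop the context outside of any DStream closure.
  while (!ssc.awaitTerminationOrTimeout(1000)) {
    if (failure.get() != null) {
      println(s"stopping streaming context: ${failure.get()}")
      ssc.stop(stopSparkContext = true, stopGracefully = false)
    }
  }
} catch {
  case e: Throwable =>
    // awaitTerminationOrTimeout may rethrow the streaming error itself.
    println(s"streaming computation failed: $e")
    ssc.stop(stopSparkContext = true, stopGracefully = false)
}
{code}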



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968753#comment-15968753
 ] 

Hyukjin Kwon commented on SPARK-20336:
--

Oh, and I would also appreciate it if you could test this with JSON, with 
{{wholeFile}} enabled. Both readers work similarly, so trying this out would help 
verify this case.
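
For example, something like the following could exercise the JSON path (the file name and contents here are only an assumption, mirroring the CSV sample):

{code}
// tmp.json: a single multi-line JSON document (the case wholeFile is meant for),
// containing the same non-ASCII strings as the CSV sample above.
// [{"col1": 1, "col3": "text"},
//  {"col1": 2, "col3": "テキスト"},
//  {"col1": 3, "col3": "텍스트"}]
spark.read.option("wholeFile", true).json("tmp.json").show()
{code}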

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used spark.read.csv() method with wholeFile=True option to load data that 
> has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> ++++
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> ++++
> {code}
> The result is same even if I use encoding="UTF-8" option with wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968749#comment-15968749
 ] 

Hyukjin Kwon commented on SPARK-20336:
--

I just tried to reproduce it as below:

Scala

{code}
scala> spark.read.option("wholeFile", true).option("header", 
true).csv("tmp.csv").show()
+++-+
|col1|col2| col3|
+++-+
|   1|   a| text|
|   2|   b| テキスト|
|   3|   c|  텍스트|
|   4|   d|text
テキスト
텍스트|
|   5|   e| last|
+++-+
{code}

Python

{code}
>>> spark.read.csv("tmp.csv", header=True, wholeFile=True).show()
+++-+
|col1|col2| col3|
+++-+
|   1|   a| text|
|   2|   b| テキスト|
|   3|   c|  텍스트|
|   4|   d|text
テキスト
텍스트|
|   5|   e| last|
+++-+
{code}

It seems to work fine. Would you mind double-checking the encoding of the 
file and your environment? I would like to verify this case.
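
One quick way to double-check the file itself (a sketch; the file name is just the sample name used above) is to attempt a strict UTF-8 decode of the raw bytes:

{code}
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}
import java.nio.file.{Files, Paths}

val bytes = Files.readAllBytes(Paths.get("test.encoding.csv"))
// Fail loudly on any byte sequence that is not valid UTF-8.
val decoder = StandardCharsets.UTF_8.newDecoder()
  .onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT)
try {
  decoder.decode(ByteBuffer.wrap(bytes))
  println("file decodes cleanly as UTF-8")
} catch {
  case e: CharacterCodingException =>
    println(s"file is not valid UTF-8: $e")
}
{code}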
  

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used spark.read.csv() method with wholeFile=True option to load data that 
> has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> ++++
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> ++++
> {code}
> The result is same even if I use encoding="UTF-8" option with wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20215) ReuseExchange is broken in SparkSQL

2017-04-14 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-20215:
---
Comment: was deleted

(was: Seems to be fixed in SPARK-20229)

> ReuseExchange is broken in SparkSQL
> --
>
> Key: SPARK-20215
> URL: https://issues.apache.org/jira/browse/SPARK-20215
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently, if we have a query like A join B union A join C ... with the same 
> join key, table A will be scanned multiple times in SQL. This is because the 
> MetastoreRelation is not shared by the two joins and the ExprIds are different, 
> so canonicalized in Expression is not able to unify them, and the two Exchanges 
> are not compatible and cannot be reused.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20215) ReuseExchange is broken in SparkSQL

2017-04-14 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968734#comment-15968734
 ] 

Zhan Zhang commented on SPARK-20215:


Seems to be fixed in SPARK-20229

> ReuseExchange is broken in SparkSQL
> --
>
> Key: SPARK-20215
> URL: https://issues.apache.org/jira/browse/SPARK-20215
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently, if we have a query like A join B union A join C ... with the same 
> join key, table A will be scanned multiple times in SQL. This is because the 
> MetastoreRelation is not shared by the two joins and the ExprIds are different, 
> so canonicalized in Expression is not able to unify them, and the two Exchanges 
> are not compatible and cannot be reused.
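
For reference, a minimal sketch of the query shape being described (table and column names are assumptions; the reporter's case involves Hive metastore tables), which can be used to check whether a ReusedExchange shows up in the physical plan:

{code}
// Three small tables sharing a join key.
val a = spark.range(100).selectExpr("id AS key", "id AS a_val")
val b = spark.range(100).selectExpr("id AS key", "id AS b_val")
val c = spark.range(100).selectExpr("id AS key", "id AS c_val")

// A join B union A join C, both joins on the same key, so the scan/Exchange
// of A could in principle be reused.
val q = a.join(b, "key").select("key")
  .union(a.join(c, "key").select("key"))

// Look for ReusedExchange (or its absence) in the physical plan.
q.explain()
{code}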



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-14 Thread HanCheol Cho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HanCheol Cho updated SPARK-20336:
-
Description: 
I used spark.read.csv() method with wholeFile=True option to load data that has 
multi-line records.
However, non-ASCII characters are not properly loaded.

The following is a sample data for test:
{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트"
5,e,last
{code}

When it is loaded without wholeFile=True option, non-ASCII characters are shown 
correctly although multi-line records are parsed incorrectly as follows:
{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
| 텍스트"|null|null|
|   5|   e|last|
++++
{code}

When wholeFile=True option is used, non-ASCII characters are broken as follows:
{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
wholeFile=True)
testdf_wholefile.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
|   5|   e|last|
++++
{code}

The result is same even if I use encoding="UTF-8" option with wholeFile=True.




  was:
I used spark.read.csv() method with wholeFile=True option to load data that has 
multi-line records.
However, non-ASCII characters are not properly loaded.

The following is a sample data for test:
{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트
5,e,last
{code}

When it is loaded without wholeFile=True option, non-ASCII characters are shown 
correctly although multi-line records are parsed incorrectly as follows:
{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
| 텍스트|null|null|
|   5|   e|last|
++++
{code}

When wholeFile=True option is used, non-ASCII characters are broken as follows:
{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
wholeFile=True)
testdf_wholefile.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
++++
{code}

The result is same even if I use encoding="UTF-8" option with wholeFile=True.





> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used spark.read.csv() method with wholeFile=True option to load data that 
> has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> ++++
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> ++++
> {code}
> The result is same even if I use encoding="UTF-8" option with wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-14 Thread HanCheol Cho (JIRA)
HanCheol Cho created SPARK-20336:


 Summary: spark.read.csv() with wholeFile=True option fails to read 
non ASCII unicode characters
 Key: SPARK-20336
 URL: https://issues.apache.org/jira/browse/SPARK-20336
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: Spark 2.2.0 (master branch is downloaded from Github)
PySpark
Reporter: HanCheol Cho


I used spark.read.csv() method with wholeFile=True option to load data that has 
multi-line records.
However, non-ASCII characters are not properly loaded.

The following is a sample data for test:
{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트
5,e,last
{code}

When it is loaded without wholeFile=True option, non-ASCII characters are shown 
correctly although multi-line records are parsed incorrectly as follows:
{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
| 텍스트|null|null|
|   5|   e|last|
++++
{code}

When wholeFile=True option is used, non-ASCII characters are broken as follows:
{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
wholeFile=True)
testdf_wholefile.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
++++
{code}

The result is same even if I use encoding="UTF-8" option with wholeFile=True.






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org