[jira] [Created] (SPARK-25111) increment kinesis client/producer lib versions & aws-sdk to match

2018-08-13 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-25111:
--

 Summary: increment kinesis client/producer lib versions & aws-sdk 
to match
 Key: SPARK-25111
 URL: https://issues.apache.org/jira/browse/SPARK-25111
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.4.0
Reporter: Steve Loughran


Move up to a more recent version of the Kinesis client library, matching the AWS 
SDK and producer versions.

Proposed: move from 1.11.76 to 1.11.271. This is what hadoop-aws has been 
shipping in 3.1 and it has been problem-free so far (stable, no random warning 
messages in logs, etc.). The hadoop-aws dependency is on the shaded AWS bundle 
rather than the Kinesis SDK used here, but at least it'll be the same version 
everywhere. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22974) CountVectorModel does not attach attributes to output column

2018-08-13 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-22974:
---

Assignee: Liang-Chi Hsieh

> CountVectorModel does not attach attributes to output column
> 
>
> Key: SPARK-22974
> URL: https://issues.apache.org/jira/browse/SPARK-22974
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: William Zhang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> If CountVectorizerModel transforms columns, the output columns will not have 
> attributes attached to them. If those columns are later used in the 
> Interaction transformer, an exception will be thrown:
> {quote}"org.apache.spark.SparkException: Vector attributes must be defined 
> for interaction."
> {quote}
> To reproduce it:
> {quote}import org.apache.spark.ml.feature._
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   (0, Array("a", "b", "c"), Array("1", "2")),
>   (1, Array("a", "b", "b", "c", "a", "d"),  Array("1", "2", "3"))
> )).toDF("id", "words", "nums")
> val cvModel: CountVectorizerModel = new CountVectorizer()
>   .setInputCol("nums")
>   .setOutputCol("features2")
>   .setVocabSize(4)
>   .setMinDF(0)
>   .fit(df)
> val cvm = new CountVectorizerModel(Array("a", "b", "c"))
>   .setInputCol("words")
>   .setOutputCol("features1")
>   
> val df1 = cvm.transform(df)
> val df2 = cvModel.transform(df1)
> val interaction = new Interaction().setInputCols(Array("features1", 
> "features2")).setOutputCol("features")
> val df3  = interaction.transform(df2)
> {quote}
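
For readers who want to verify the behavior, a minimal sketch (not part of the 
original report) that inspects the ML attribute metadata on the transformed 
column; it builds on the {{df1}} from the reproduction above and assumes 
{{AttributeGroup.fromStructField}} from {{org.apache.spark.ml.attribute}}:
{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup

// Before the fix the transformed column carries no attribute metadata, so the
// group resolves with no attributes; after the fix it should expose one
// attribute per vocabulary entry (an expectation, not verified here).
val group = AttributeGroup.fromStructField(df1.schema("features1"))
println(group.attributes.isDefined)
{code}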



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22974) CountVectorModel does not attach attributes to output column

2018-08-13 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-22974.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20313
[https://github.com/apache/spark/pull/20313]

> CountVectorModel does not attach attributes to output column
> 
>
> Key: SPARK-22974
> URL: https://issues.apache.org/jira/browse/SPARK-22974
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: William Zhang
>Priority: Major
> Fix For: 2.4.0
>
>
> If CountVectorizerModel transforms columns, the output columns will not have 
> attributes attached to them. If those columns are later used in the 
> Interaction transformer, an exception will be thrown:
> {quote}"org.apache.spark.SparkException: Vector attributes must be defined 
> for interaction."
> {quote}
> To reproduce it:
> {quote}import org.apache.spark.ml.feature._
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(
>   (0, Array("a", "b", "c"), Array("1", "2")),
>   (1, Array("a", "b", "b", "c", "a", "d"),  Array("1", "2", "3"))
> )).toDF("id", "words", "nums")
> val cvModel: CountVectorizerModel = new CountVectorizer()
>   .setInputCol("nums")
>   .setOutputCol("features2")
>   .setVocabSize(4)
>   .setMinDF(0)
>   .fit(df)
> val cvm = new CountVectorizerModel(Array("a", "b", "c"))
>   .setInputCol("words")
>   .setOutputCol("features1")
>   
> val df1 = cvm.transform(df)
> val df2 = cvModel.transform(df1)
> val interaction = new Interaction().setInputCols(Array("features1", 
> "features2")).setOutputCol("features")
> val df3  = interaction.transform(df2)
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-25104.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22094
[https://github.com/apache/spark/pull/22094]

> Validate user specified output schema
> -
>
> Key: SPARK-25104
> URL: https://issues.apache.org/jira/browse/SPARK-25104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> With the code changes in [https://github.com/apache/spark/pull/21847], Spark 
> can write out data as per a user-provided output schema.
> To make it more robust and user friendly, we should validate the Avro schema 
> before tasks are launched.
> Also, we should support outputting the logical decimal type as BYTES (by 
> default we output it as FIXED).
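
For illustration, a hedged sketch of the kind of user-specified output schema 
this validation targets, using the built-in Avro data source's {{avroSchema}} 
option (the schema string and output path below are made up for the example):
{code:scala}
// Write a DataFrame with a caller-supplied Avro schema; the point of this
// ticket is that a mismatched schema should be rejected before tasks launch.
val avroSchema =
  """{"type":"record","name":"Rec","fields":[{"name":"id","type":"long"}]}"""

spark.range(10)
  .write
  .format("avro")                    // built-in Avro source in Spark 2.4
  .option("avroSchema", avroSchema)  // user-specified output schema
  .save("/tmp/avro-out")             // hypothetical output path
{code}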



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-25104:
---

Assignee: Gengliang Wang

> Validate user specified output schema
> -
>
> Key: SPARK-25104
> URL: https://issues.apache.org/jira/browse/SPARK-25104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> With the code changes in [https://github.com/apache/spark/pull/21847], Spark 
> can write out data as per a user-provided output schema.
> To make it more robust and user friendly, we should validate the Avro schema 
> before tasks are launched.
> Also, we should support outputting the logical decimal type as BYTES (by 
> default we output it as FIXED).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579243#comment-16579243
 ] 

Wenchen Fan commented on SPARK-24771:
-

It's good to pay more attention to compatibility issues. I've added the 
release-notes label to this ticket and created 
https://issues.apache.org/jira/browse/SPARK-25110 to track the Flume streaming 
connector.

I'm not sure how many users use Avro with RDDs, but I feel it should be rare, 
as there was already a spark-avro package available for Spark SQL. Shall we 
send an email to the user/dev list?

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23050) Structured Streaming with S3 file source duplicates data because of eventual consistency.

2018-08-13 Thread bharath kumar avusherla (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579242#comment-16579242
 ] 

bharath kumar avusherla commented on SPARK-23050:
-

[~ste...@apache.org], I can start working on it.

> Structured Streaming with S3 file source duplicates data because of eventual 
> consistency.
> -
>
> Key: SPARK-23050
> URL: https://issues.apache.org/jira/browse/SPARK-23050
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yash Sharma
>Priority: Major
>
> Spark Structured streaming with S3 file source duplicates data because of 
> eventual consistency.
> Reproducing the scenario:
> - Structured streaming reading from S3 source. Writing back to S3.
> - Spark tries to commitTask on completion of a task, by verifying if all the 
> files have been written to Filesystem. 
> {{ManifestFileCommitProtocol.commitTask}}.
> - [Eventual consistency issue] Spark finds that the file is not present and 
> fails the task. {{org.apache.spark.SparkException: Task failed while writing 
> rows. No such file or directory 
> 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'}}
> - By this time S3 eventually gets the file.
> - Spark reruns the task and completes the task, but gets a new file name this 
> time. {{ManifestFileCommitProtocol.newTaskTempFile. 
> part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet.}}
> - Data is duplicated in the results: the same data is processed twice and 
> written to S3.
> - There is no data duplication if Spark is able to list all committed files 
> and all tasks succeed.
> Code:
> {code}
> query = selected_df.writeStream \
> .format("parquet") \
> .option("compression", "snappy") \
> .option("path", "s3://path/data/") \
> .option("checkpointLocation", "s3://path/checkpoint/") \
> .start()
> {code}
> Same sized duplicate S3 Files:
> {code}
> $ aws s3 ls s3://path/data/ | grep part-00256
> 2018-01-11 03:37:00  17070 
> part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet
> 2018-01-11 03:37:10  17070 
> part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet
> {code}
> Exception on S3 listing and task failure:
> {code}
> [Stage 5:>(277 + 100) / 
> 597]18/01/11 03:36:59 WARN TaskSetManager: Lost task 256.0 in stage 5.0 (TID  
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.io.FileNotFoundException: No such file or directory 
> 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:816)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:509)
>   at 
> org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol$$anonfun$4.apply(ManifestFileCommitProtocol.scala:109)
>   at 
> org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol$$anonfun$4.apply(ManifestFileCommitProtocol.scala:109)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitTask(ManifestFileCommitProtocol.scala:109)
>   at 
> 

[jira] [Created] (SPARK-25110) make sure Flume streaming connector works with Spark 2.4

2018-08-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25110:
---

 Summary: make sure Flume streaming connector works with Spark 2.4
 Key: SPARK-25110
 URL: https://issues.apache.org/jira/browse/SPARK-25110
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24771:

Labels: release-notes  (was: )

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579213#comment-16579213
 ] 

Marcelo Vanzin edited comment on SPARK-24771 at 8/14/18 3:46 AM:
-

The main problem pointed out in the original attempt is AVRO-1502; it means 
that people with code generated by Avro 1.7 might run into problems if Spark 
ships with Avro 1.8. That basically amounts to breaking binary compatibility, 
which we always try to avoid in minor releases.

It may be that it only applies in some specific situations, and it may be 
acceptable to just call it out in the release notes. But it would be nice to be 
sure, since this affects existing users of Avro - including the Flume streaming 
connector, which is still available in Spark 2.4...

(Edit: the Flume connector re-compiles the Avro interface during the Spark 
build, but it would be nice to check that the Flume libraries themselves don't 
use Avro for anything. And we should still check how many other users of Avro 
would be affected - e.g. people who use Avro directly in RDD code.)


was (Author: vanzin):
The main problem pointed out in the original attempt is AVRO-1502; it means 
that people with code generated by Avro 1.7 might run into problems if Spark 
ships with Avro 1.8. That basically amounts to breaking binary compatibility, 
which we always try to avoid in minor releases.

It may be that it only applies in some specific situations, and it may be 
acceptable to just call it out in the release notes. But it would be nice to be 
sure, since this affects existing users of Avro - including the Flume streaming 
connector, which is still available in Spark 2.4...

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579213#comment-16579213
 ] 

Marcelo Vanzin commented on SPARK-24771:


The main problem pointed out in the original attempt is AVRO-1502; it means 
that people with code generated by Avro 1.7 might run into problems if Spark 
ships with Avro 1.8. That basically amounts to breaking binary compatibility, 
which we always try to avoid in minor releases.

It may be that it only applies in some specific situations, and it may be 
acceptable to just call it out in the release notes. But it would be nice to be 
sure, since this affects existing users of Avro - including the Flume streaming 
connector, which is still available in Spark 2.4...

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25068) High-order function: exists(array, function) → boolean

2018-08-13 Thread Takuya Ueshin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579210#comment-16579210
 ] 

Takuya Ueshin commented on SPARK-25068:
---

I added this because I thought it was a missing primitive operation for 
arrays, with better performance, as you mentioned.
I'm not sure whether we need {{forAll}}, because generally we can rewrite 
{{forAll}} in terms of {{exists}}.
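
As a small illustration of that rewrite (a sketch assuming the SQL lambda 
syntax shipped for 2.4, not code from this ticket): {{forall(a, f)}} can be 
spelled as {{NOT exists(a, x -> NOT f(x))}}.
{code:scala}
// exists() as proposed here, and a "forall" expressed via exists + double negation.
spark.sql("SELECT exists(array(1, 2, 3), x -> x > 2)").show()           // true
spark.sql("SELECT NOT exists(array(1, 2, 3), x -> NOT (x > 0))").show() // true: every element > 0
{code}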

> High-order function: exists(array, function) → boolean
> -
>
> Key: SPARK-25068
> URL: https://issues.apache.org/jira/browse/SPARK-25068
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Tests if arrays have those elements for which function returns true.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect

2018-08-13 Thread Yuanbo Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated SPARK-25109:
---
Description: 
We use this code to read parquet files from HDFS:

spark.read.parquet('xxx')

and get error as below:

!WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png!

 

What we can get is that one of the replica block cannot be read for some 
reason, but spark python doesn't try to read another replica which can be read 
successfully. So the application fails after throwing exception.

When I use hadoop fs -text to read the file, I can get content correctly. It 
would be great that spark python can retry reading another replica block 
instead of failing.

 

  was:
We use this code to read parquet files from HDFS:

spark.read.parquet('xxx')

and get error as below:

!WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png!

 

What we can get is that one of the replica block cannot be read for some 
reason, but spark python doesn't try to read another replica which can be read 
successfully. So the application fails after throwing exception.

When I use hadoop fs -text to read the file, I can get content correctly. It 
would be great if spark python can retry reading another replica block instead 
of failing.

 


> spark python should retry reading another datanode if the first one fails to 
> connect
> 
>
> Key: SPARK-25109
> URL: https://issues.apache.org/jira/browse/SPARK-25109
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: 
> WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png
>
>
> We use this code to read parquet files from HDFS:
> spark.read.parquet('xxx')
> and get error as below:
> !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png!
>  
> What we can get is that one of the replica block cannot be read for some 
> reason, but spark python doesn't try to read another replica which can be 
> read successfully. So the application fails after throwing exception.
> When I use hadoop fs -text to read the file, I can get content correctly. It 
> would be great that spark python can retry reading another replica block 
> instead of failing.
>  
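
For context, a hypothetical application-level workaround sketch (not the 
PySpark-internal retry this ticket asks for): retry the read a few times when a 
transient error surfaces. The helper name, exception type, attempt count and 
backoff are all assumptions, and it only covers errors raised while the read is 
being set up.
{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.{Failure, Success, Try}

// Retry spark.read.parquet up to `attempts` times on IO failures.
def readWithRetry(spark: SparkSession, path: String, attempts: Int = 3): DataFrame = {
  def attempt(remaining: Int): DataFrame = Try(spark.read.parquet(path)) match {
    case Success(df) => df
    case Failure(e: java.io.IOException) if remaining > 1 =>
      Thread.sleep(1000L)  // crude fixed backoff before trying again
      attempt(remaining - 1)
    case Failure(e) => throw e
  }
  attempt(attempts)
}
{code}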



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect

2018-08-13 Thread Yuanbo Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated SPARK-25109:
---
Description: 
We use this code to read parquet files from HDFS:

spark.read.parquet('xxx')

and get error as below:

!WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png!

 

What we can get is that one of the replica block cannot be read for some 
reason, but spark python doesn't try to read another replica which can be read 
successfully. So the application fails after throwing exception.

When I use hadoop fs -text to read the file, I can get content correctly. It 
would be great if spark python can retry reading another replica block instead 
of failing.

 

  was:
We used this code to read parquet files from HDFS:

spark.read.parquet('xxx')

and got error as below:

 


> spark python should retry reading another datanode if the first one fails to 
> connect
> 
>
> Key: SPARK-25109
> URL: https://issues.apache.org/jira/browse/SPARK-25109
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: 
> WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png
>
>
> We use this code to read parquet files from HDFS:
> spark.read.parquet('xxx')
> and get error as below:
> !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png!
>  
> What we can get is that one of the replica block cannot be read for some 
> reason, but spark python doesn't try to read another replica which can be 
> read successfully. So the application fails after throwing exception.
> When I use hadoop fs -text to read the file, I can get content correctly. It 
> would be great if spark python can retry reading another replica block 
> instead of failing.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-08-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23308.
--
Resolution: Won't Fix

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, it totally ignores any kind 
> of RuntimeException or IOException, but some IOExceptions may happen even if 
> the file is not corrupted.
> One example is SocketTimeoutException, which can be retried to possibly fetch 
> the data; it does not mean the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
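
A minimal sketch of the distinction the report asks for (an illustration with a 
hypothetical helper, not the actual FileScanRDD change): treat a timeout as 
retryable instead of lumping it in with corruption.
{code:scala}
import java.io.IOException
import java.net.SocketTimeoutException

// Only swallow exceptions that plausibly indicate a corrupt file; surface or
// retry transient network errors such as socket timeouts.
def treatAsCorruptFile(e: Throwable): Boolean = e match {
  case _: SocketTimeoutException => false  // transient; worth retrying instead
  case _: IOException            => true   // current ignoreCorruptFiles behavior
  case _: RuntimeException       => true
  case _                         => false
}
{code}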



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation

2018-08-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24006.
--
Resolution: Won't Fix

I am resolving this on the assumption that there will be no further updates 
with actual numbers. Please reopen later if an actual gain from this is shown.

> ExecutorAllocationManager.onExecutorAdded is an O(n) operation
> --
>
> Key: SPARK-24006
> URL: https://issues.apache.org/jira/browse/SPARK-24006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Xianjin YE
>Priority: Major
>
> ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I believe it 
> will be a problem when scaling out with a large number of executors, as it 
> effectively makes adding N executors an O(N^2) operation.
>  
> I propose to invoke onExecutorIdle guarded by 
> {code:java}
> if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) {
>   // Since we only need to re-mark idle executors when at the lower bound
>   executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
> } else {
>   onExecutorIdle(executorId)
> }{code}
> cc [~zsxwing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25086) Incorrect Default Value For "escape" For CSV Files

2018-08-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25086.
--
Resolution: Duplicate

> Incorrect Default Value For "escape" For CSV Files
> --
>
> Key: SPARK-25086
> URL: https://issues.apache.org/jira/browse/SPARK-25086
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Wilcox
>Priority: Major
>
> The RFC for CSV files ([https://tools.ietf.org/html/rfc4180]) indicates that 
> the way that a double-quote is escaped is by preceding it with another 
> double-quote:
> {code:java}
> 7. If double-quotes are used to enclose fields, then a double-quote appearing 
> inside a field must be escaped by preceding it with another double quote. For 
> example: "aaa","b""bb","ccc"{code}
> Spark's default value for "escape" violates the RFC. I think we should fix the 
> default value to be {{"}}; those who want {{\}} as the escape character can 
> override it for non-RFC-conforming CSV files.
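
Until the default changes, a workaround sketch: set the escape character to a 
double quote explicitly so that RFC 4180 style {{""}} escaping is honored (the 
path and header option below are assumptions for the example).
{code:scala}
// Read an RFC 4180 style CSV where embedded quotes are doubled, e.g. "b""bb".
val df = spark.read
  .option("header", "true")  // assumes the file has a header row
  .option("quote", "\"")
  .option("escape", "\"")    // override Spark's default backslash escape
  .csv("/path/to/file.csv")  // hypothetical path
{code}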



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect

2018-08-13 Thread Yuanbo Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu updated SPARK-25109:
---
Attachment: WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png

> spark python should retry reading another datanode if the first one fails to 
> connect
> 
>
> Key: SPARK-25109
> URL: https://issues.apache.org/jira/browse/SPARK-25109
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: 
> WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png
>
>
> We used this code to read parquet files from HDFS:
> spark.read.parquet('xxx')
> and got error as below:
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579184#comment-16579184
 ] 

Imran Rashid commented on SPARK-24356:
--

Somewhat related to SPARK-24938 -- that explains why these buffers are even on 
the heap at all, as spark configures netty to use offheap buffers by default.

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Assignee: Misha Dmitriev
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-24356.01.patch, dup-file-strings-details.png
>
>
> I recently analyzed a heap dump of a YARN Node Manager that was suffering from 
> high GC pressure due to high object churn. The analysis was done with the 
> jxray tool ([http://www.jxray.com]), which checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> that 19.5% of memory is wasted due to duplicate strings. Of these duplicates, 
> more than half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}} in this case, it is suggested that 
> in the above code we create a File with a complete, normalized pathname that 
> has already been interned. This will prevent the code inside {{java.io.File}} 
> from modifying this string, so it will use the interned copy and pass it to 
> FileInputStream. Essentially, the current line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  
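
For context, a tiny illustration of what {{String.intern()}} buys here (plain 
JDK behavior, not the proposed Spark change itself; the path is made up):
{code:scala}
// Interning returns one canonical instance per string value, so equal path
// strings constructed independently end up sharing a single object.
val a = new String("/local/dir/0e/shuffle_0_0_0.data").intern()
val b = new String("/local/dir/0e/shuffle_0_0_0.data").intern()
println(a eq b)  // true: both references point at the interned copy
{code}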



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect

2018-08-13 Thread Yuanbo Liu (JIRA)
Yuanbo Liu created SPARK-25109:
--

 Summary: spark python should retry reading another datanode if the 
first one fails to connect
 Key: SPARK-25109
 URL: https://issues.apache.org/jira/browse/SPARK-25109
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Yuanbo Liu


We used this code to read parquet files from HDFS:

spark.read.parquet('xxx')

and got error as below:

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579181#comment-16579181
 ] 

Yuming Wang edited comment on SPARK-25051 at 8/14/18 3:13 AM:
--

Yes. The bug only exists in branch-2.3. I can reproduce it with:
{code}
val df1 = spark.range(4).selectExpr("id", "cast(id as string) as name")
val df2 = spark.range(3).selectExpr("id")
df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull).show
{code}


was (Author: q79969786):
Yes. The bug still exists. I can reproduce it with:

{code:scala}
val df1 = spark.range(4).selectExpr("id", "cast(id as string) as name")
val df2 = spark.range(3).selectExpr("id")
df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull).show
{code}


> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579181#comment-16579181
 ] 

Yuming Wang commented on SPARK-25051:
-

Yes. The bug still exists. I can reproduce it with:

{code:scala}
val df1 = spark.range(4).selectExpr("id", "cast(id as string) as name")
val df2 = spark.range(3).selectExpr("id")
df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull).show
{code}
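
For what it's worth, a hedged alternative for the intent of that query (the df1 
rows with no matching id in df2) that sidesteps referencing df2("id") after the 
join:
{code:scala}
// A left_anti join keeps exactly the df1 rows that have no matching id in df2,
// which is what the where(df2("id").isNull) pattern is trying to express.
val noMatch = df1.join(df2, Seq("id"), "left_anti")
noMatch.show()
{code}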


> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579177#comment-16579177
 ] 

Imran Rashid commented on SPARK-24938:
--

Yeah, that's about what I expected. It's worse than 16 MB per "service" in some 
cases, though -- it'll be 16 MB per netty thread, which will max out at 8. 
You'll see that a lot on the driver. So we could save 384 MB on the driver, and 
I think 128 MB in the external shuffle service (where only one service is 
active, I think).

Did you see a corresponding increase in the offheap pools?  I expected them to 
*not* grow (as spark actually only needs a tiny bit of space from these pools, 
so it should be able to find that space in the existing offheap pools).

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> When I added some instrumentation (using SPARK-24918 and 
> https://github.com/squito/spark-memory), we observed that netty uses a large 
> amount of onheap memory in its pools, in addition to the expected offheap 
> memory. We should figure out why it's using that memory, and whether it's 
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16 MB, which could lead to a 128 MB spike from an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread xuejianbest (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuejianbest updated SPARK-25108:

Description: 
The Dataset.show() method generates incorrect space padding since column name 
or column value has Unicode Character.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

df.show
/*
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

*/{code}
 

Before and after fix, see attached pictures please .
![show](/user/desktop/doge.png)

  was:
The Dataset.show() method generates incorrect space padding since column name 
or column value has Unicode Character.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

df.show
/*
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

*/{code}
 

Before and after fix, see attached pictures please .


> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-25108
> URL: https://issues.apache.org/jira/browse/SPARK-25108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: spark-shell on Xshell5
>Reporter: xuejianbest
>Priority: Critical
> Attachments: show.bmp
>
>
> The Dataset.show() method generates incorrect space padding since column name 
> or column value has Unicode Character.
> {code:java}
> val df = spark.createDataset(Seq(
> "γύρος",
> "pears",
> "linguiça",
> "xoriço",
> "hamburger",
> "éclair",
> "smørbrød",
> "spätzle",
> "包子",
> "jamón serrano",
> "pêches",
> "シュークリーム",
> "막걸리",
> "寿司",
> "おもち",
> "crème brûlée",
> "fideuà",
> "pâté",
> "お好み焼き")).toDF("value")
> df.show
> /*
> +-+
> | value|
> +-+
> | γύρος|
> | pears|
> | linguiça|
> | xoriço|
> | hamburger|
> | éclair|
> | smørbrød|
> | spätzle|
> | 包子|
> |jamón serrano|
> | pêches|
> | シュークリーム|
> | 막걸리|
> | 寿司|
> | おもち|
> | crème brûlée|
> | fideuà|
> | pâté|
> | お好み焼き|
> +-+
> */{code}
>  
> Before and after fix, see attached pictures please .
> ![show](/user/desktop/doge.png)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread xuejianbest (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuejianbest updated SPARK-25108:

Description: 
The Dataset.show() method generates incorrect space padding since column name 
or column value has Unicode Character.
{code:scala}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

df.show
/*
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

*/{code}
 

Before and after fix, see attached pictures please .
![show](https://issues.apache.org/jira/secure/attachment/12935462/show.bmp)

  was:
The Dataset.show() method generates incorrect space padding since column name 
or column value has Unicode Character.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

df.show
/*
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

*/{code}
 

Before and after fix, see attached pictures please .
![show](/user/desktop/doge.png)


> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-25108
> URL: https://issues.apache.org/jira/browse/SPARK-25108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: spark-shell on Xshell5
>Reporter: xuejianbest
>Priority: Critical
> Attachments: show.bmp
>
>
> The Dataset.show() method generates incorrect space padding since column name 
> or column value has Unicode Character.
> {code:scala}
> val df = spark.createDataset(Seq(
> "γύρος",
> "pears",
> "linguiça",
> "xoriço",
> "hamburger",
> "éclair",
> "smørbrød",
> "spätzle",
> "包子",
> "jamón serrano",
> "pêches",
> "シュークリーム",
> "막걸리",
> "寿司",
> "おもち",
> "crème brûlée",
> "fideuà",
> "pâté",
> "お好み焼き")).toDF("value")
> df.show
> /*
> +-+
> | value|
> +-+
> | γύρος|
> | pears|
> | linguiça|
> | xoriço|
> | hamburger|
> | éclair|
> | smørbrød|
> | spätzle|
> | 包子|
> |jamón serrano|
> | pêches|
> | シュークリーム|
> | 막걸리|
> | 寿司|
> | おもち|
> | crème brûlée|
> | fideuà|
> | pâté|
> | お好み焼き|
> +-+
> */{code}
>  
> Before and after fix, see attached pictures please .
> ![show](https://issues.apache.org/jira/secure/attachment/12935462/show.bmp)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25108:


Assignee: Apache Spark

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-25108
> URL: https://issues.apache.org/jira/browse/SPARK-25108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: spark-shell on Xshell5
>Reporter: xuejianbest
>Assignee: Apache Spark
>Priority: Critical
> Attachments: show.bmp
>
>
> The Dataset.show() method generates incorrect space padding since column name 
> or column value has Unicode Character.
> {code:java}
> val df = spark.createDataset(Seq(
> "γύρος",
> "pears",
> "linguiça",
> "xoriço",
> "hamburger",
> "éclair",
> "smørbrød",
> "spätzle",
> "包子",
> "jamón serrano",
> "pêches",
> "シュークリーム",
> "막걸리",
> "寿司",
> "おもち",
> "crème brûlée",
> "fideuà",
> "pâté",
> "お好み焼き")).toDF("value")
> df.show
> /*
> +-+
> | value|
> +-+
> | γύρος|
> | pears|
> | linguiça|
> | xoriço|
> | hamburger|
> | éclair|
> | smørbrød|
> | spätzle|
> | 包子|
> |jamón serrano|
> | pêches|
> | シュークリーム|
> | 막걸리|
> | 寿司|
> | おもち|
> | crème brûlée|
> | fideuà|
> | pâté|
> | お好み焼き|
> +-+
> */{code}
>  
> Before and after fix, see attached pictures please .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25108:


Assignee: (was: Apache Spark)

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-25108
> URL: https://issues.apache.org/jira/browse/SPARK-25108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: spark-shell on Xshell5
>Reporter: xuejianbest
>Priority: Critical
> Attachments: show.bmp
>
>
> The Dataset.show() method generates incorrect space padding since column name 
> or column value has Unicode Character.
> {code:java}
> val df = spark.createDataset(Seq(
> "γύρος",
> "pears",
> "linguiça",
> "xoriço",
> "hamburger",
> "éclair",
> "smørbrød",
> "spätzle",
> "包子",
> "jamón serrano",
> "pêches",
> "シュークリーム",
> "막걸리",
> "寿司",
> "おもち",
> "crème brûlée",
> "fideuà",
> "pâté",
> "お好み焼き")).toDF("value")
> df.show
> /*
> +-+
> | value|
> +-+
> | γύρος|
> | pears|
> | linguiça|
> | xoriço|
> | hamburger|
> | éclair|
> | smørbrød|
> | spätzle|
> | 包子|
> |jamón serrano|
> | pêches|
> | シュークリーム|
> | 막걸리|
> | 寿司|
> | おもち|
> | crème brûlée|
> | fideuà|
> | pâté|
> | お好み焼き|
> +-+
> */{code}
>  
> Before and after fix, see attached pictures please .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread xuejianbest (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuejianbest updated SPARK-25108:

Description: 
The Dataset.show() method generates incorrect space padding since column name 
or column value has Unicode Character.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

df.show
/*
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

*/{code}
 

Before and after fix, see attached pictures please .

  was:
The Dataset.show() method generates incorrect space padding since column name 
or column value has Unicode Character.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

df.show
/*
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

*/{code}
 

Before and after fix, please see attached pictures.


> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-25108
> URL: https://issues.apache.org/jira/browse/SPARK-25108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: spark-shell on Xshell5
>Reporter: xuejianbest
>Priority: Critical
> Attachments: show.bmp
>
>
> The Dataset.show() method generates incorrect space padding since column name 
> or column value has Unicode Character.
> {code:java}
> val df = spark.createDataset(Seq(
> "γύρος",
> "pears",
> "linguiça",
> "xoriço",
> "hamburger",
> "éclair",
> "smørbrød",
> "spätzle",
> "包子",
> "jamón serrano",
> "pêches",
> "シュークリーム",
> "막걸리",
> "寿司",
> "おもち",
> "crème brûlée",
> "fideuà",
> "pâté",
> "お好み焼き")).toDF("value")
> df.show
> /*
> +-+
> | value|
> +-+
> | γύρος|
> | pears|
> | linguiça|
> | xoriço|
> | hamburger|
> | éclair|
> | smørbrød|
> | spätzle|
> | 包子|
> |jamón serrano|
> | pêches|
> | シュークリーム|
> | 막걸리|
> | 寿司|
> | おもち|
> | crème brûlée|
> | fideuà|
> | pâté|
> | お好み焼き|
> +-+
> */{code}
>  
> Before and after fix, see attached pictures please .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread xuejianbest (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuejianbest updated SPARK-25108:

Description: 
The Dataset.show() method generates incorrect space padding since column name 
or column value has Unicode Character.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

df.show
/*
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

*/{code}
 

Before and after fix, please see attached pictures.

  was:
The Dataset.show() method generates incorrect space padding since column name 
or column value has Unicode Character.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

df.show
/*
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

*/{code}


> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-25108
> URL: https://issues.apache.org/jira/browse/SPARK-25108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: spark-shell on Xshell5
>Reporter: xuejianbest
>Priority: Critical
> Attachments: show.bmp
>
>
> The Dataset.show() method generates incorrect space padding since column name 
> or column value has Unicode Character.
> {code:java}
> val df = spark.createDataset(Seq(
> "γύρος",
> "pears",
> "linguiça",
> "xoriço",
> "hamburger",
> "éclair",
> "smørbrød",
> "spätzle",
> "包子",
> "jamón serrano",
> "pêches",
> "シュークリーム",
> "막걸리",
> "寿司",
> "おもち",
> "crème brûlée",
> "fideuà",
> "pâté",
> "お好み焼き")).toDF("value")
> df.show
> /*
> +-+
> | value|
> +-+
> | γύρος|
> | pears|
> | linguiça|
> | xoriço|
> | hamburger|
> | éclair|
> | smørbrød|
> | spätzle|
> | 包子|
> |jamón serrano|
> | pêches|
> | シュークリーム|
> | 막걸리|
> | 寿司|
> | おもち|
> | crème brûlée|
> | fideuà|
> | pâté|
> | お好み焼き|
> +-+
> */{code}
>  
> Before and after fix, please see attached pictures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread xuejianbest (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuejianbest updated SPARK-25108:

Environment: spark-shell on Xshell5  (was: spark-shell on Xshell)

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-25108
> URL: https://issues.apache.org/jira/browse/SPARK-25108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: spark-shell on Xshell5
>Reporter: xuejianbest
>Priority: Critical
> Attachments: show.bmp
>
>
> The Dataset.show() method generates incorrect space padding when a column name 
> or column value contains Unicode characters.
> {code:java}
> val df = spark.createDataset(Seq(
> "γύρος",
> "pears",
> "linguiça",
> "xoriço",
> "hamburger",
> "éclair",
> "smørbrød",
> "spätzle",
> "包子",
> "jamón serrano",
> "pêches",
> "シュークリーム",
> "막걸리",
> "寿司",
> "おもち",
> "crème brûlée",
> "fideuà",
> "pâté",
> "お好み焼き")).toDF("value")
> df.show
> /*
> +-+
> | value|
> +-+
> | γύρος|
> | pears|
> | linguiça|
> | xoriço|
> | hamburger|
> | éclair|
> | smørbrød|
> | spätzle|
> | 包子|
> |jamón serrano|
> | pêches|
> | シュークリーム|
> | 막걸리|
> | 寿司|
> | おもち|
> | crème brûlée|
> | fideuà|
> | pâté|
> | お好み焼き|
> +-+
> */{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread xuejianbest (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuejianbest updated SPARK-25108:

Attachment: show.bmp

> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-25108
> URL: https://issues.apache.org/jira/browse/SPARK-25108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: spark-shell on Xshell
>Reporter: xuejianbest
>Priority: Critical
> Attachments: show.bmp
>
>
> The Dataset.show() method generates incorrect space padding when a column name 
> or column value contains Unicode characters.
> {code:java}
> val df = spark.createDataset(Seq(
> "γύρος",
> "pears",
> "linguiça",
> "xoriço",
> "hamburger",
> "éclair",
> "smørbrød",
> "spätzle",
> "包子",
> "jamón serrano",
> "pêches",
> "シュークリーム",
> "막걸리",
> "寿司",
> "おもち",
> "crème brûlée",
> "fideuà",
> "pâté",
> "お好み焼き")).toDF("value")
> df.show
> /*
> +-+
> | value|
> +-+
> | γύρος|
> | pears|
> | linguiça|
> | xoriço|
> | hamburger|
> | éclair|
> | smørbrød|
> | spätzle|
> | 包子|
> |jamón serrano|
> | pêches|
> | シュークリーム|
> | 막걸리|
> | 寿司|
> | おもち|
> | crème brûlée|
> | fideuà|
> | pâté|
> | お好み焼き|
> +-+
> */{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread xuejianbest (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuejianbest updated SPARK-25108:

External issue URL:   (was: https://github.com/apache/spark/pull/22048)
   Description: 
The Dataset.show() method generates incorrect space padding when a column name 
or column value contains Unicode characters.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

df.show
/*
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

*/{code}

  was:
The Dataset.show() method generates incorrect space padding when a column name 
or column value contains Unicode characters.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

before:
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

after fix:
+--+
| value|
+--+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
| jamón serrano|
| pêches|
|シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+--+

{code}


> Dataset.show() generates incorrect padding for Unicode Character
> 
>
> Key: SPARK-25108
> URL: https://issues.apache.org/jira/browse/SPARK-25108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: spark-shell on Xshell
>Reporter: xuejianbest
>Priority: Critical
>
> The Dataset.show() method generates incorrect space padding when a column name 
> or column value contains Unicode characters.
> {code:java}
> val df = spark.createDataset(Seq(
> "γύρος",
> "pears",
> "linguiça",
> "xoriço",
> "hamburger",
> "éclair",
> "smørbrød",
> "spätzle",
> "包子",
> "jamón serrano",
> "pêches",
> "シュークリーム",
> "막걸리",
> "寿司",
> "おもち",
> "crème brûlée",
> "fideuà",
> "pâté",
> "お好み焼き")).toDF("value")
> df.show
> /*
> +-+
> | value|
> +-+
> | γύρος|
> | pears|
> | linguiça|
> | xoriço|
> | hamburger|
> | éclair|
> | smørbrød|
> | spätzle|
> | 包子|
> |jamón serrano|
> | pêches|
> | シュークリーム|
> | 막걸리|
> | 寿司|
> | おもち|
> | crème brûlée|
> | fideuà|
> | pâté|
> | お好み焼き|
> +-+
> */{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character

2018-08-13 Thread xuejianbest (JIRA)
xuejianbest created SPARK-25108:
---

 Summary: Dataset.show() generates incorrect padding for Unicode 
Character
 Key: SPARK-25108
 URL: https://issues.apache.org/jira/browse/SPARK-25108
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1, 2.3.0
 Environment: spark-shell on Xshell
Reporter: xuejianbest


The Dataset.show() method generates incorrect space padding when a column name 
or column value contains Unicode characters.
{code:java}
val df = spark.createDataset(Seq(
"γύρος",
"pears",
"linguiça",
"xoriço",
"hamburger",
"éclair",
"smørbrød",
"spätzle",
"包子",
"jamón serrano",
"pêches",
"シュークリーム",
"막걸리",
"寿司",
"おもち",
"crème brûlée",
"fideuà",
"pâté",
"お好み焼き")).toDF("value")

before:
+-+
| value|
+-+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
|jamón serrano|
| pêches|
| シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+-+

after fix:
+--+
| value|
+--+
| γύρος|
| pears|
| linguiça|
| xoriço|
| hamburger|
| éclair|
| smørbrød|
| spätzle|
| 包子|
| jamón serrano|
| pêches|
|シュークリーム|
| 막걸리|
| 寿司|
| おもち|
| crème brûlée|
| fideuà|
| pâté|
| お好み焼き|
+--+

{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579133#comment-16579133
 ] 

Michael Heuer commented on SPARK-24771:
---

I'm looking forward to testing this with 
[ADAM|https://github.com/bigdatagenomics/adam] and all of our downstream 
projects as part of the 2.4.0 release candidate process.  If it is worth my 
time doing so before then, please let me know.  Parquet + Avro is at the core 
of what we do, and having the 1.8 vs 1.7 internal conflict present in Spark 
resolved would be very welcome.

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24886) Increase Jenkins build time

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579131#comment-16579131
 ] 

Apache Spark commented on SPARK-24886:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22098

> Increase Jenkins build time
> ---
>
> Key: SPARK-24886
> URL: https://issues.apache.org/jira/browse/SPARK-24886
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, it looks like we hit the time limit from time to time. It seems 
> better to increase the time limit a bit.
> For instance, please see https://github.com/apache/spark/pull/21822



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579128#comment-16579128
 ] 

Sean Owen commented on SPARK-24771:
---

I confess I just don't know enough to have a strong opinion. A minor version 
upgrade isn't out of the question for a minor Spark upgrade. You are right that 
this is considered a non-core integration. It sounds like there are 
incompatibility issues. However that cuts two ways; some users are facing 
problems because Spark _isn't_ on 1.8.x. If spark-avro is already on 1.8, I can 
see the need for Spark to update as well. Not blessing this so much as saying I 
don't object.

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579126#comment-16579126
 ] 

Wenchen Fan commented on SPARK-24771:
-

cc [~r...@databricks.com] [~srowen]

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16617) Upgrade to Avro 1.8.x

2018-08-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579125#comment-16579125
 ] 

Wenchen Fan commented on SPARK-16617:
-

Sorry I missed this ticket, the upgrade is now done by 
https://issues.apache.org/jira/browse/SPARK-24771

> Upgrade to Avro 1.8.x
> -
>
> Key: SPARK-16617
> URL: https://issues.apache.org/jira/browse/SPARK-16617
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 2.1.0
>Reporter: Ben McCann
>Priority: Major
>
> Avro 1.8 makes Avro objects serializable so that you can easily have an RDD 
> containing Avro objects.
> See https://issues.apache.org/jira/browse/AVRO-1502



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579124#comment-16579124
 ] 

Wenchen Fan commented on SPARK-24771:
-

Sorry I was not aware of https://issues.apache.org/jira/browse/SPARK-16617 .

So we proposed to upgrade AVRO 2 years ago, and gave it up because it's not 
binary compatible and the benefit was limited.

I think things have changed now. This upgrade is super important to the AVRO 
data source, for date/timestamp/decimal support. Also, as people pointed out in 
https://issues.apache.org/jira/browse/SPARK-16617 , this is an important bug 
fix for using Parquet and AVRO together.

BTW I don't think the impact is that large. Spark doesn't have a stable API to 
plug in AVRO support, so AVRO users have to do some manual work to migrate to 
new Spark versions. As an example, I don't think the databricks spark-avro 
package can work with Spark 2.4 without modification.

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25028:
---

Assignee: Marco Gaido

> AnalyzePartitionCommand failed with NPE if value is null
> 
>
> Key: SPARK-25028
> URL: https://issues.apache.org/jira/browse/SPARK-25028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> on line 143: val partitionColumnValues = 
> partitionColumns.indices.map(r.get(_).toString)
> If the value is NULL the code will fail with NPE
> *sample:*
> {code:scala}
> val df = List((1, null , "first"), (2, null , "second")).toDF("index", 
> "name", "value").withColumn("name", $"name".cast("string"))
> df.write.partitionBy("name").saveAsTable("df13")
> spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS")
> {code}
> output:
> 2018-08-08 09:25:43 WARN  BaseSessionStateBuilder$$anon$1:66 - Max iterations 
> (2) reached for batch Resolution
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.Range.foreach(Range.scala:160)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
>   ... 49 elided
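
For reference, a null-safe variant of the failing line could look like the sketch below. This is only an illustration of the kind of guard needed, not necessarily the change that was actually merged; the Row contents and the default partition string are assumptions made for the example.

{code:scala}
import org.apache.spark.sql.Row

// Hypothetical stand-ins for the values in scope at line 143.
val partitionColumns = Seq("name")
val r: Row = Row.fromSeq(Seq(null))

// Assumption: render a NULL partition value with Hive's on-disk default name.
val defaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

// Null-safe replacement for partitionColumns.indices.map(r.get(_).toString):
val partitionColumnValues = partitionColumns.indices.map { i =>
  Option(r.get(i)).map(_.toString).getOrElse(defaultPartitionName)
}
{code}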



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25028.
-
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 22036
[https://github.com/apache/spark/pull/22036]

> AnalyzePartitionCommand failed with NPE if value is null
> 
>
> Key: SPARK-25028
> URL: https://issues.apache.org/jira/browse/SPARK-25028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Fix For: 2.4.0, 2.3.2
>
>
> on line 143: val partitionColumnValues = 
> partitionColumns.indices.map(r.get(_).toString)
> If the value is NULL the code will fail with NPE
> *sample:*
> {code:scala}
> val df = List((1, null , "first"), (2, null , "second")).toDF("index", 
> "name", "value").withColumn("name", $"name".cast("string"))
> df.write.partitionBy("name").saveAsTable("df13")
> spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS")
> {code}
> output:
> 2018-08-08 09:25:43 WARN  BaseSessionStateBuilder$$anon$1:66 - Max iterations 
> (2) reached for batch Resolution
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.Range.foreach(Range.scala:160)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142)
>   at 
> org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
>   ... 49 elided



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579042#comment-16579042
 ] 

Marcelo Vanzin commented on SPARK-24918:


I like the idea in general. On the implementation side, instead of 
{{spark.executor.plugins}}, how about using {{java.util.ServiceLoader}}? That's 
one less configuration needed to enable these plugins. The downside is that if 
the jar is visible to Spark, it will be invoked (so it becomes "opt out" 
instead of "opt in", if you want to add an option to disable specific plugins).

I thought about suggesting a new API in SparkContext to programmatically add 
plugins, but that might be too messy. Better to start simple.
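
To make the ServiceLoader idea concrete, a rough sketch follows. The ExecutorPlugin trait and its lifecycle methods are hypothetical here, since the actual API was still being decided on this ticket; the point is only that discovery needs no spark.executor.plugins setting.

{code:scala}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical plugin interface; the real names and lifecycle hooks were
// still under discussion when this comment was written.
trait ExecutorPlugin {
  def init(): Unit
  def shutdown(): Unit
}

object ExecutorPluginLoader {
  // Discover every implementation registered via a META-INF/services entry
  // on the executor's classpath -- no configuration key required.
  def loadAll(): Seq[ExecutorPlugin] =
    ServiceLoader.load(classOf[ExecutorPlugin]).asScala.toSeq

  // In this sketch, called once when the executor starts.
  def initAll(): Seq[ExecutorPlugin] = {
    val plugins = loadAll()
    plugins.foreach(_.init())
    plugins
  }
}
{code}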

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2018-08-13 Thread Nihar Sheth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579039#comment-16579039
 ] 

Nihar Sheth commented on SPARK-24938:
-

After running the tool on a very simple application both with and without that 
change, I saw 3 netty services drop from 16 MB to 0. They are:

netty-rpc-client-usedHeapMem

netty-blockTransfer-client-usedHeapMem

netty-external-shuffle-client-usedHeapMem

16 MB of on-heap memory was allocated for these three services over their 
lifetime without the change, but with the change it disappears in all 3 cases.

Does this sound like the sole source of this particular issue? Or would you 
expect more memory elsewhere to also be freed up?

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> We've observed that netty uses large amount of onheap memory in its pools, in 
> addition to the expected offheap memory when I added some instrumentation 
> (using SPARK-24918 and https://github.com/squito/spark-memory). We should 
> figure out why its using that memory, and whether its really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16MB which could lead to a 128 MB spike of an almost entirely unused pool.  
> Switching to requesting a buffer from the default pool would probably fix 
> this.
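
For readers not familiar with the allocator APIs involved, the sketch below contrasts the two allocation styles being discussed. It is an illustration only, not the actual MessageEncoder code; the header size is an arbitrary assumption, and whether the channel allocator is a dedicated pooled instance is likewise assumed from the description above.

{code:scala}
import io.netty.buffer.{ByteBuf, ByteBufAllocator}

object HeaderBuffers {
  val headerLength = 16  // arbitrary small header size, for illustration

  // Style referenced above: take a heap buffer from the channel's allocator;
  // if that allocator is a dedicated pooled instance, even small, bursty
  // requests can grow its heap arenas.
  def fromChannelAllocator(alloc: ByteBufAllocator): ByteBuf =
    alloc.heapBuffer(headerLength)

  // Alternative suggested in the description: request the buffer from the
  // default allocator instead, so a separate per-service pool does not have
  // to grow just for these short-lived headers.
  def fromDefaultAllocator(): ByteBuf =
    ByteBufAllocator.DEFAULT.buffer(headerLength)
}
{code}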



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-13 Thread MIK (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578752#comment-16578752
 ] 

MIK edited comment on SPARK-25051 at 8/13/18 11:04 PM:
---

Thanks [~yumwang], with 2.3.2-rc4 I am not getting the error now, but the result 
is not correct (getting 0 records):

+---+----+
| id|name|
+---+----+
+---+----+

The sample program should return 2 records:

+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+


was (Author: mik1007):
Thanks [~yumwang], with 2.3.2-rc4 the error is gone now but the result is not 
correct (getting 0 records):

+---+----+
| id|name|
+---+----+
+---+----+

The sample program should return 2 records:

+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2
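
Not a fix for the regression itself, but if the intent of the query is "rows of df1 with no match in df2", the same result set can be expressed as an anti-join, which avoids referencing df2("id") after the join has coalesced the key column. A sketch with made-up rows matching the schemas above:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("anti-join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical rows matching the schemas in the description.
val df1 = Seq((1, "2018-08-13"), (2, "2018-08-13"), (3, "2018-08-14")).toDF("id", "ts")
val df2 = Seq((2, "two", "US")).toDF("id", "name", "country")

// Keep df1 rows whose id does not appear in df2 (ids 1 and 3 here),
// without a where clause on df2("id") after the join.
val unmatched = df1.join(df2, Seq("id"), "left_anti")
unmatched.show()
{code}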



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-08-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579012#comment-16579012
 ] 

Marcelo Vanzin commented on SPARK-24787:


Is the slowness really caused by the use of hsync vs. hflush? I'd expect the 
flushing of the data, not the metadata update, to be the expensive part...

In any case, if you have any ideas, feel free to post a PR.
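
For context on the two calls being compared here, a minimal sketch of the HDFS Syncable API is below; the namenode URI and event log path are hypothetical.

{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical cluster URI and event log path, for illustration only.
val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
val out = fs.create(new Path("/tmp/eventlog-demo.inprogress"))

out.write("{\"Event\":\"SparkListenerApplicationStart\"}\n".getBytes("UTF-8"))

// hflush: push buffered bytes out to the datanodes so readers (e.g. the
// history server) can see them; it does not force the bytes to disk.
out.hflush()

// hsync: hflush plus syncing the data to disk on the datanodes -- the
// stronger, and typically slower, call discussed in this ticket.
out.hsync()

out.close()
{code}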

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files, allowing the history server to stay responsive.
> However, we have a production job that has 6 tasks per stage, and because 
> hsync is slow it starts dropping events, so the history server shows wrong 
> stats due to the dropped events.
> A viable solution is to not sync so frequently, or to make the sync frequency 
> configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8

2018-08-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579005#comment-16579005
 ] 

Marcelo Vanzin commented on SPARK-24771:


Hi guys, why was this accepted? It has been tried in the past and we 
re-targeted it to 3.0 because Avro 1.8 is not backwards compatible with Avro 
1.7:

https://issues.apache.org/jira/browse/SPARK-16617

In particular the discussion in the PR is helpful here:
https://github.com/apache/spark/pull/17163

> Upgrade AVRO version from 1.7.7 to 1.8
> --
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors

2018-08-13 Thread Karan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan updated SPARK-25107:
--
Description: 
I am in the process of upgrading Spark 1.6 to Spark 2.2.

I have a two-stage query and I am running it with hiveContext.

1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid 
ORDER BY date DESC) AS ro 
 FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid ,VC.recordid 
,VC.data,CASE 
 WHEN pcs.score BETWEEN PC.from AND PC.to 
 AND ((PC.csacnt IS NOT NULL AND CC.status = 4 
 AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL)) THEN 1 ELSE 0 END AS Flag 
 FROM maindata VC 
 INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid 
 INNER JOIN cnfgtable PC ON PC.subid = VC.subid 
 INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID 
 LEFT JOIN casetable CC ON CC.rowid = VC.rowid 
 LEFT JOIN prdcnfg mc ON mc.configID = PC.configID WHERE VC.date BETWEEN 
VPM.StartDate and VPM.EndDate) A 
 WHERE A.Flag =1").createOrReplaceTempView.("stage1")

2) hiveContext.sql("SELECT DISTINCT t1.ConfigID As cnfg_id ,vct.* 
 FROM stage1 t1 
 INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND T1.ConfigID = t2.ConfigID 
 INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID 
 INNER JOIN maindata vct on vct.recordid = t1.recordid
 WHERE t2.ro = PCR.datacount”)

The same query sequence works fine in Spark 1.6 but fails with the exception 
below in Spark 2.2. It throws the exception while parsing the 2nd query above.

 

{{+org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree:+}}
+CatalogRelation `database_name`.`{color:#f79232}maindata{color}`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506... 89 more 
fields|#1506... 89 more fields]+ 
 \{{ at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)}}
 \{{ at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)}}
 \{{ at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 

[jira] [Updated] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors

2018-08-13 Thread Karan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan updated SPARK-25107:
--
Description: 
I am in the process of upgrading Spark 1.6 to Spark 2.2.

I have a two-stage query and I am running it with hiveContext.

1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid 
ORDER BY date DESC) AS ro 
 FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid ,VC.recordid 
,VC.data,CASE 
 WHEN pcs.score BETWEEN PC.from AND PC.to 
 AND ((PC.csacnt IS NOT NULL AND CC.status = 4 
 AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL)) THEN 1 ELSE 0 END AS Flag 
 FROM maindata VC 
 INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid 
 INNER JOIN cnfgtable PC ON PC.subid = VC.subid 
 INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID 
 LEFT JOIN casetable CC ON CC.rowid = VC.rowid 
 LEFT JOIN prdcnfg mc ON mc.configID = PC.configID WHERE VC.date BETWEEN 
VPM.StartDate and VPM.EndDate) A 
 WHERE A.Flag =1").registerTempTable("stage1")

2) hiveContext.sql("SELECT DISTINCT t1.ConfigID As cnfg_id ,vct.* 
 FROM stage1 t1 
 INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND T1.ConfigID = t2.ConfigID 
 INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID 
 INNER JOIN maindata vct on vct.recordid = t1.recordid
 WHERE t2.ro = PCR.datacount”)

The same query sequence works fine in Spark 1.6 but fails with the exception 
below in Spark 2.2. It throws the exception while parsing the 2nd query above.

 

{{+org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree:+}}
 {{+CatalogRelation `database_name`.`{color:#f79232}maindata{color}`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506... 89 more 
fields|#1506... 89 more fields]+ }}
 \{{ at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)}}
 \{{ at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)}}
 \{{ at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 

[jira] [Updated] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors

2018-08-13 Thread Karan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan updated SPARK-25107:
--
Description: 
I am in the process of upgrading Spark 1.6 to Spark 2.2.

I have a two-stage query and I am running it with hiveContext.

{{1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid 
ORDER BY date DESC) AS ro }}
{{ FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid ,VC.recordid 
,VC.data,CASE }}
{{ WHEN pcs.score BETWEEN PC.from AND PC.to }}
{{ AND ((PC.csacnt IS NOT NULL AND CC.status = 4 }}
{{ AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL)) THEN 1 ELSE 0 END AS Flag 
}}
{{ FROM maindata VC }}
{{ INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid }}
{{ INNER JOIN cnfgtable PC ON PC.subid = VC.subid }}
{{ INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID }}
{{ LEFT JOIN casetable CC ON CC.rowid = VC.rowid }}
{{ LEFT JOIN prdcnfg mc ON mc.configID = PC.configID WHERE VC.date BETWEEN 
VPM.StartDate and VPM.EndDate) A }}
{{ WHERE A.Flag =1").registerTempTable("stage1")}}

{{2) hiveContext.sql("SELECT DISTINCT t1.ConfigID As cnfg_id ,vct.* }}
{{ FROM stage1 t1 }}
{{ INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND T1.ConfigID = t2.ConfigID }}
{{ INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID }}
{{ INNER JOIN maindata vct on vct.recordid = t1.recordid}}
{{ WHERE t2.ro = PCR.datacount”)}}

The same query sequence works fine in Spark 1.6 but fails with the exception 
below in Spark 2.2. It throws the exception while parsing the 2nd query above.

 

{{+org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree:+}}
 {{+CatalogRelation `database_name`.`{color:#f79232}maindata{color}`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506... 89 more 
fields|#1506... 89 more fields]+ }}
 \{{ at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)}}
 \{{ at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)}}
 \{{ at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ 

[jira] [Updated] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors

2018-08-13 Thread Karan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan updated SPARK-25107:
--
Description: 
I am in the process of upgrading Spark 1.6 to Spark 2.2.

I have a two-stage query and I am running it with hiveContext.

{\{1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, 
rowid ORDER BY date DESC) AS ro }}
 \{{ FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid ,VC.recordid 
,VC.data,CASE }}
 \{{ WHEN pcs.score BETWEEN PC.from AND PC.to }}
 \{{ AND ((PC.csacnt IS NOT NULL AND CC.status = 4 }}
 \{{ AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL)) THEN 1 ELSE 0 END AS 
Flag }}
 \{{ FROM maindata VC }}
 \{{ INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid }}
 \{{ INNER JOIN cnfgtable PC ON PC.subid = VC.subid }}
 \{{ INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID }}
 \{{ LEFT JOIN casetable CC ON CC.rowid = VC.rowid }}
 \{{ LEFT JOIN prdcnfg mc ON mc.configID = PC.configID WHERE VC.date BETWEEN 
VPM.StartDate and VPM.EndDate) A }}
 \{{ WHERE A.Flag =1").registerTempTable("stage1")}}

{\{2) hiveContext.sql("SELECT DISTINCT t1.ConfigID As cnfg_id ,vct.* }}
 \{{ FROM stage1 t1 }}
 \{{ INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND T1.ConfigID = t2.ConfigID 
}}
 \{{ INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID }}
 \{{ INNER JOIN maindata vct on vct.recordid = t1.recordid}}
 \{{ WHERE t2.ro = PCR.datacount”)}}

The same query sequence works fine in Spark 1.6 but fails with the exception 
below in Spark 2.2. It throws the exception while parsing the 2nd query above.

 

{{+org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree:+}}
 {{+CatalogRelation `database_name`.`{color:#f79232}maindata{color}`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506... 89 more 
fields|#1506... 89 more fields]+ }}
 \{{ at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)}}
 \{{ at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)}}
 \{{ at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)}}
 \{{ at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}}
 \{{ at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}}
 \{{ at 

[jira] [Updated] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors

2018-08-13 Thread Karan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan updated SPARK-25107:
--
Description: 
I am in the process of upgrading Spark 1.6 to Spark 2.2.

I have a two-stage query and I am running it with hiveContext.

{{1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid 
ORDER BY date DESC) AS ro }}
{{ FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid ,VC.recordid 
,VC.data,CASE }}
{{ WHEN pcs.score BETWEEN PC.from AND PC.to }}
{{ AND ((PC.csacnt IS NOT NULL AND CC.status = 4 }}
{{ AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL)) THEN 1 ELSE 0 END AS Flag 
}}
{{ FROM maindata VC }}
{{ INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid }}
{{ INNER JOIN cnfgtable PC ON PC.subid = VC.subid }}
{{ INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID }}
{{ LEFT JOIN casetable CC ON CC.rowid = VC.rowid }}
{{ LEFT JOIN prdcnfg mc ON mc.configID = PC.configID WHERE VC.date BETWEEN 
VPM.StartDate and VPM.EndDate) A }}
{{ WHERE A.Flag =1").registerTempTable("stage1")}}{{2) hiveContext.sql("SELECT 
DISTINCT t1.ConfigID As cnfg_id ,vct.* }}
{{ FROM stage1 t1 }}
{{ INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND T1.ConfigID = t2.ConfigID }}
{{ INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID }}
{{ INNER JOIN maindata vct on vct.recordid = t1.recordid}}
{{ WHERE t2.ro = PCR.datacount”)}}

The same query sequence works fine in Spark 1.6 but fails with the exception 
below in Spark 2.2. It throws the exception while parsing the 2nd query above.

 

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
CatalogRelation `ilink_perf_athenaprod`.`maindata`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506... 89 more fields]
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)

[jira] [Created] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors

2018-08-13 Thread Karan (JIRA)
Karan created SPARK-25107:
-

 Summary: Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: 
makeCopy, tree: CatalogRelation Errors
 Key: SPARK-25107
 URL: https://issues.apache.org/jira/browse/SPARK-25107
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.2.0
 Environment: Spark Version : 2.2.0.cloudera2
Reporter: Karan


I am in the process of upgrading Spark 1.6 to Spark 2.2.

I have a two-stage query that I am running with hiveContext.

1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid ORDER BY date DESC) AS ro
     FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid, VC.recordid, VC.data,
                  CASE WHEN pcs.score BETWEEN PC.from AND PC.to
                            AND ((PC.csacnt IS NOT NULL AND CC.status = 4 AND CC.cnclusion = mc.ca)
                                 OR (PC.csacnt IS NULL))
                       THEN 1 ELSE 0 END AS Flag
           FROM maindata VC
           INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid
           INNER JOIN cnfgtable PC ON PC.subid = VC.subid
           INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID
           LEFT JOIN casetable CC ON CC.rowid = VC.rowid
           LEFT JOIN prdcnfg mc ON mc.configID = PC.configID
           WHERE VC.date BETWEEN VPM.StartDate AND VPM.EndDate) A
     WHERE A.Flag = 1").createOrReplaceTempView("stage1")

2) hiveContext.sql("SELECT DISTINCT t1.ConfigID AS cnfg_id, vct.*
     FROM stage1 t1
     INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND t1.ConfigID = t2.ConfigID
     INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID
     INNER JOIN maindata vct ON vct.recordid = t1.recordid
     WHERE t2.ro = PCR.datacount")

The same query sequence works fine in Spark 1.6 but fails with the exception below in Spark 2.2. The exception is thrown while parsing the second query above.

 

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
CatalogRelation `ilink_perf_athenaprod`.`maindata`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506... 89 more fields]
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578962#comment-16578962
 ] 

Apache Spark commented on SPARK-18057:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22097

> Update structured streaming kafka from 0.10.0.1 to 2.0.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Assignee: Ted Yu
>Priority: Major
> Fix For: 2.4.0
>
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up

2018-08-13 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578937#comment-16578937
 ] 

Xiao Li commented on SPARK-24156:
-

[~tdas] Can we mark it done?

> Enable no-data micro batches for more eager streaming state clean up 
> -
>
> Key: SPARK-24156
> URL: https://issues.apache.org/jira/browse/SPARK-24156
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Currently, MicroBatchExecution in Structured Streaming runs batches only when 
> there is new data to process. This is sensible in most cases as we don't want 
> to unnecessarily use resources when there is nothing new to process. However, 
> in some cases of stateful streaming queries, this delays state clean up as 
> well as clean-up based output. For example, consider a streaming aggregation 
> query with watermark-based state cleanup. The watermark is updated after 
> every batch with new data completes. The updated value is used in the next 
> batch to clean up state, and output finalized aggregates in append mode. 
> However, if there is no data, then the next batch does not occur, and 
> cleanup/output gets delayed unnecessarily. This is true for all stateful 
> streaming operators - aggregation, deduplication, joins, mapGroupsWithState
> This issue tracks the work to enable no-data batches in MicroBatchExecution. 
> The major challenge is that all the tests of relevant stateful operations add 
> dummy data to force another batch for testing the state cleanup. So a lot of 
> the tests are going to be changed. So my plan is to enable no-data batches 
> for different stateful operators one at a time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578894#comment-16578894
 ] 

Imran Rashid commented on SPARK-24918:
--

With dynamic allocation you don't have a good place to run

{code}
df.mapPartitions { it =>
  MyResource.initIfNeeded()
  it.map(...)
}
{code}

Executors can come and go, so you can't ensure that code runs everywhere.  Even if you 
create "too many" tasks, your job could start out with a very small number 
of tasks for a while before ramping up.  So after you run your initialization 
with the added initResource code, many executors get torn down during the first 
part of the real job as they sit idle; then when the job ramps up, you get new 
executors, which never had your initialization run.  You'd have to put 
{{MyResource.initIfNeeded()}} inside *every* task.  (Note that for the debug 
use case, the initializer is totally unnecessary for the task to complete -- if 
the task actually depended on it, then of course you should have that logic in 
each task.)

I think there are a large class of users who can add "--conf 
spark.executor.plugins com.mycompany.WhizzBangDebugPlugin --jars 
whizzbangdebug.jar" to the command line arguments, that couldn't add in that 
code sample (even with static allocation).  They're not the ones that are 
*writing* the plugins, they just need to be able to enable it.

{quote}What do you do if init fails? retry or fail?{quote}

good question, Tom asked the same thing on the pr.  I suggested the executor 
just fails to start.  If a plugin wanted to be "safe", it could catch 
exceptions in its own initialization.
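
To make the shape of the proposal concrete, here is a rough sketch of what such a plugin could look like. This assumes the API ends up as a simple interface with init/shutdown hooks, which is still being discussed on the PR, so treat the interface name and signatures as placeholders rather than a final API.

{code}
// Hypothetical sketch only -- the interface name, method signatures and the
// spark.executor.plugins config are the shape under discussion, not a final API.
class WhizzBangDebugPlugin extends org.apache.spark.ExecutorPlugin {
  override def init(): Unit = {
    // One-time, per-executor setup: start an agent, open a metrics socket,
    // or call something like the MyResource.initIfNeeded() from the example above.
    // Throwing here would fail executor startup, so a "safe" plugin catches
    // its own exceptions.
    try {
      System.err.println("debug plugin initialized on " +
        java.net.InetAddress.getLocalHost.getHostName)
    } catch {
      case e: Exception => System.err.println(s"plugin init failed, continuing: $e")
    }
  }

  override def shutdown(): Unit = System.err.println("debug plugin stopped")
}
{code}

Enabling it is then just the command-line change mentioned above, something like {{--conf spark.executor.plugins=com.mycompany.WhizzBangDebugPlugin --jars whizzbangdebug.jar}}, with no change to the application code.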

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  Its hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how its versioned. 
>  Does it just get created when the executor starts, and thats it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, its still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25091) Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor memory

2018-08-13 Thread Yunling Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yunling Cai updated SPARK-25091:

Priority: Critical  (was: Major)

> Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor 
> memory
> 
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
>
> UNCACHE TABLE and CLEAR CACHE does not clean up executor memory.
> Through Spark UI, although in Storage, we see the cached table removed. In 
> Executor, the executors continue to hold the RDD and the memory is not 
> cleared. This results in huge waste in executor memory usage. As we call 
> CACHE TABLE, we run into issues where the cached tables are spilled to disk 
> instead of reclaiming the memory storage. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25091) Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor memory

2018-08-13 Thread Yunling Cai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yunling Cai updated SPARK-25091:

Component/s: (was: Spark Core)
 SQL

> Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor 
> memory
> 
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Major
>
> UNCACHE TABLE and CLEAR CACHE does not clean up executor memory.
> Through Spark UI, although in Storage, we see the cached table removed. In 
> Executor, the executors continue to hold the RDD and the memory is not 
> cleared. This results in huge waste in executor memory usage. As we call 
> CACHE TABLE, we run into issues where the cached tables are spilled to disk 
> instead of reclaiming the memory storage. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24736) --py-files not functional for non local URLs. It appears to pass non-local URL's into PYTHONPATH directly.

2018-08-13 Thread Ilan Filonenko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578882#comment-16578882
 ] 

Ilan Filonenko commented on SPARK-24736:


Until a resource-staging-server is set up, the URL will not resolve to the file location 
unless you use {{SparkFiles.get(file_name)}} in your application. As such, a URL passed 
to --py-files will remain unresolved. Thus, remote dependencies won't be supported by 
--py-files just yet, but we can support local files. 

> --py-files not functional for non local URLs. It appears to pass non-local 
> URL's into PYTHONPATH directly.
> --
>
> Key: SPARK-24736
> URL: https://issues.apache.org/jira/browse/SPARK-24736
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
> Environment: Recent 2.4.0 from master branch, submitted on Linux to a 
> KOPS Kubernetes cluster created on AWS.
>  
>Reporter: Jonathan A Weaver
>Priority: Minor
>
> My spark-submit
> bin/spark-submit \
>         --master 
> k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]
>  \
>         --deploy-mode cluster \
>         --name pytest \
>         --conf 
> spark.kubernetes.container.image=[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest]
>  \
>         --conf 
> [spark.kubernetes.driver.pod.name|http://spark.kubernetes.driver.pod.name/]=spark-pi-driver
>  \
>         --conf 
> spark.kubernetes.authenticate.submission.caCertFile=[cluster.ca|http://cluster.ca/]
>  \
>         --conf spark.kubernetes.authenticate.submission.oauthToken=$TOK \
>         --conf spark.kubernetes.authenticate.driver.oauthToken=$TOK \
> --py-files "[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]; \
> [https://s3.amazonaws.com/maxar-ids-fids/it.py]
>  
> *screw.zip is successfully downloaded and placed in SparkFIles.getRootPath()*
> 2018-07-01 07:33:43 INFO  SparkContext:54 - Added file 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] at 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] with timestamp 
> 1530430423297
> 2018-07-01 07:33:43 INFO  Utils:54 - Fetching 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] to 
> /var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240/fetchFileTemp1549645948768432992.tmp
> *I print out the  PYTHONPATH and PYSPARK_FILES environment variables from the 
> driver script:*
>      PYTHONPATH 
> /opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.7-src.zip:/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:*[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]*
>     PYSPARK_FILES [https://s3.amazonaws.com/maxar-ids-fids/screw.zip]
>  
> *I print out sys.path*
> ['/tmp/spark-fec3684b-8b63-4f43-91a4-2f2fa41a1914', 
> u'/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240',
>  '/opt/spark/python/lib/pyspark.zip', 
> '/opt/spark/python/lib/py4j-0.10.7-src.zip', 
> '/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar', 
> '/opt/spark/python/lib/py4j-*.zip', *'/opt/spark/work-dir/https', 
> '//[s3.amazonaws.com/maxar-ids-fids/screw.zip|http://s3.amazonaws.com/maxar-ids-fids/screw.zip]',*
>  '/usr/lib/python27.zip', '/usr/lib/python2.7', 
> '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', 
> '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
> '/usr/lib/python2.7/site-packages']
>  
> *URL from PYTHONFILES gets placed in sys.path verbatim with obvious results.*
>  
> *Dump of spark config from container.*
> Spark config dumped:
> [(u'spark.master', 
> u'k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]'),
>  (u'spark.kubernetes.authenticate.submission.oauthToken', 
> u''), 
> (u'spark.kubernetes.authenticate.driver.oauthToken', 
> u''), (u'spark.kubernetes.executor.podNamePrefix', 
> u'pytest-1530430411996'), (u'spark.kubernetes.memoryOverheadFactor', u'0.4'), 
> (u'spark.driver.blockManager.port', u'7079'), 
> (u'[spark.app.id|http://spark.app.id/]', u'spark-application-1530430424433'), 
> (u'[spark.app.name|http://spark.app.name/]', u'pytest'), 
> (u'[spark.executor.id|http://spark.executor.id/]', u'driver'), 
> (u'spark.driver.host', u'pytest-1530430411996-driver-svc.default.svc'), 
> (u'spark.kubernetes.container.image', 
> 

[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578872#comment-16578872
 ] 

Sean Owen commented on SPARK-24918:
---

This is just for per-executor initialization right? What's the issue with 
dynamic allocation – executors still start there, JVMs still initialize; how is 
it particularly hard?

What do you do if init fails? retry or fail?

Would SQL-only users meaningfully be able to use this if they don't know about 
code anyway?

Is turning on debug code not something for config options?

I guess I don't get why this still can't be solved by a static initializer. I'm 
not dead-set against this, just think it will add some complexity and not sure 
it gains a lot.
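
For readers following the thread, the "static initializer" approach being weighed here is roughly the sketch below. The {{MyResource}} helper is the user-defined example from earlier in this thread, not Spark API, and the body of the init is illustrative.

{code}
// Illustrative only: a lazily-initialized, per-JVM helper. The catch is that
// task code has to touch it on every executor, which dynamic allocation
// cannot guarantee.
object MyResource {
  @volatile private var started = false

  def initIfNeeded(): Unit = synchronized {
    if (!started) {
      // start an agent, open a connection, register metrics, ...
      started = true
    }
  }
}

// e.g. rdd.mapPartitions { it => MyResource.initIfNeeded(); it.map(identity) }
{code}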

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  Its hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how its versioned. 
>  Does it just get created when the executor starts, and thats it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, its still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23984) PySpark Bindings for K8S

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578871#comment-16578871
 ] 

Apache Spark commented on SPARK-23984:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22095

> PySpark Bindings for K8S
> 
>
> Key: SPARK-23984
> URL: https://issues.apache.org/jira/browse/SPARK-23984
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, PySpark
>Affects Versions: 2.3.0
>Reporter: Ilan Filonenko
>Priority: Major
> Fix For: 2.4.0
>
>
> This ticket is tracking the ongoing work of moving the upstream work from 
> [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python 
> bindings for Spark on Kubernetes. 
> The points of focus are: dependency management, increased non-JVM memory 
> overhead default values, and modified Docker images to include Python 
> Support. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578858#comment-16578858
 ] 

Imran Rashid edited comment on SPARK-24918 at 8/13/18 8:15 PM:
---

[~lucacanali] OK, I see the case for what you're proposing -- it's hard to set up 
that communication between the driver & executors without *some* initial setup 
message.

Still ... I'm a bit reluctant to include that now, until we see someone 
actually build something that uses it.  I realize you might be hesitant to do 
that until you know it can be built on a stable API, but I don't think we can 
get around that.


was (Author: irashid):
[~lucacanali] OK I see the case for what you're proposing -- its hard too setup 
that communication between the driver & executors without *some* initial setup 
message.

Still ... I'm a bit reluctant to include that now, until we see someone 
actually builds something that uses it.  I realizes you might be hesitant to do 
that until you know it can be built on a stable api, but I don't think we can 
get around that.

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  Its hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how its versioned. 
>  Does it just get created when the executor starts, and thats it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, its still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578858#comment-16578858
 ] 

Imran Rashid commented on SPARK-24918:
--

[~lucacanali] OK, I see the case for what you're proposing -- it's hard to set up 
that communication between the driver & executors without *some* initial setup 
message.

Still ... I'm a bit reluctant to include that now, until we see someone 
actually build something that uses it.  I realize you might be hesitant to do 
that until you know it can be built on a stable API, but I don't think we can 
get around that.

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  Its hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how its versioned. 
>  Does it just get created when the executor starts, and thats it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, its still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578856#comment-16578856
 ] 

Imran Rashid commented on SPARK-650:


Folks may be interested in SPARK-24918.  Perhaps one should be closed as a 
duplicate of the other, but for now there is some discussion on both, so I'll 
leave them open for the time being.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578853#comment-16578853
 ] 

Imran Rashid commented on SPARK-24918:
--

Ah, right, thanks [~vanzin], I knew I had seen this before.

[~srowen], you argued the most against SPARK-650 -- have I made the case here?  
I did indeed at first do exactly what you suggested, using a static 
initializer, but realized it was not great for a couple of very important 
reasons:

* dynamic allocation
* turning on a "debug" mode without any code changes (you'd be surprised how 
big a hurdle this is for something in production)
* "sql only" apps, where the end user barely knows anything about calling a 
mapPartitions function

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  Its hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how its versioned. 
>  Does it just get created when the executor starts, and thats it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, its still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-08-13 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578851#comment-16578851
 ] 

shane knapp commented on SPARK-25079:
-

question:  do we want to upgrade to 3.6 instead?

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22905) Fix ChiSqSelectorModel, GaussianMixtureModel save implementation for Row order issues

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578850#comment-16578850
 ] 

Apache Spark commented on SPARK-22905:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/22079

> Fix ChiSqSelectorModel, GaussianMixtureModel save implementation for Row 
> order issues
> -
>
> Key: SPARK-22905
> URL: https://issues.apache.org/jira/browse/SPARK-22905
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, in `ChiSqSelectorModel`, save:
> {code}
> spark.createDataFrame(dataArray).repartition(1).write...
> {code}
> The default partition number used by createDataFrame is "defaultParallelism",
> Current RoundRobinPartitioning won't guarantee that "repartition" generates 
> the same order as the local array. We need to fix it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25106) A new Kafka consumer gets created for every batch

2018-08-13 Thread Alexis Seigneurin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated SPARK-25106:
--
Description: 
I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2, and it does not 
appear either when I don't apply the UDF.

I suspect - although I did not go far enough to confirm it - that this issue is 
related to the improvement made in SPARK-23623.

  was:
I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2, and it does not 
appear either when I don't apply the UDF.


> A new Kafka consumer gets created for every batch
> -
>
> Key: SPARK-25106
> URL: https://issues.apache.org/jira/browse/SPARK-25106
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: console.txt
>
>
> I have a fairly simple piece of code that reads from Kafka, applies some 
> transformations - including applying a UDF - and writes the result to the 
> console. Every time a batch is created, a new consumer is created (and not 
> closed), eventually leading to a "too many open files" error.
> I created a test case, with the code available here: 
> [https://github.com/aseigneurin/spark-kafka-issue]
> To reproduce:
>  # Start Kafka and create a topic called "persons"
>  # Run "Producer" to generate data
>  # Run "Consumer"
> I am attaching the log where you can see a new consumer being initialized 
> between every batch.
> Please note this issue does *not* appear with Spark 2.2.2, and it does not 
> appear either when I don't apply the UDF.
> I suspect - although I did not go far enough to confirm it - that this issue 
> is related to the improvement made in SPARK-23623.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25106) A new Kafka consumer gets created for every batch

2018-08-13 Thread Alexis Seigneurin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated SPARK-25106:
--
Description: 
I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2, and it does not 
appear either when I don't apply the UDF.

  was:
I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2.


> A new Kafka consumer gets created for every batch
> -
>
> Key: SPARK-25106
> URL: https://issues.apache.org/jira/browse/SPARK-25106
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: console.txt
>
>
> I have a fairly simple piece of code that reads from Kafka, applies some 
> transformations - including applying a UDF - and writes the result to the 
> console. Every time a batch is created, a new consumer is created (and not 
> closed), eventually leading to a "too many open files" error.
> I created a test case, with the code available here: 
> [https://github.com/aseigneurin/spark-kafka-issue]
> To reproduce:
>  # Start Kafka and create a topic called "persons"
>  # Run "Producer" to generate data
>  # Run "Consumer"
> I am attaching the log where you can see a new consumer being initialized 
> between every batch.
> Please note this issue does *not* appear with Spark 2.2.2, and it does not 
> appear either when I don't apply the UDF.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25106) A new Kafka consumer gets created for every batch

2018-08-13 Thread Alexis Seigneurin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated SPARK-25106:
--
Attachment: console.txt

> A new Kafka consumer gets created for every batch
> -
>
> Key: SPARK-25106
> URL: https://issues.apache.org/jira/browse/SPARK-25106
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: console.txt
>
>
> I have a fairly simple piece of code that reads from Kafka, applies some 
> transformations - including applying a UDF - and writes the result to the 
> console. Every time a batch is created, a new consumer is created (and not 
> closed), eventually leading to a "too many open files" error.
> I created a test case, with the code available here: 
> [https://github.com/aseigneurin/spark-kafka-issue]
> To reproduce:
>  # Start Kafka and create a topic called "persons"
>  # Run "Producer" to generate data
>  # Run "Consumer"
> I am attaching the log where you can see a new consumer being initialized 
> between every batch.
> Please note this issue does *not* appear with Spark 2.2.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25106) A new Kafka consumer gets created for every batch

2018-08-13 Thread Alexis Seigneurin (JIRA)
Alexis Seigneurin created SPARK-25106:
-

 Summary: A new Kafka consumer gets created for every batch
 Key: SPARK-25106
 URL: https://issues.apache.org/jira/browse/SPARK-25106
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Alexis Seigneurin
 Attachments: console.txt

I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2.
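
For context, the query has roughly this shape. This is a minimal sketch only: the topic name comes from the steps above, while the UDF and column handling are illustrative placeholders; the actual test case is in the linked repository.

{code}
// Sketch of a Kafka -> UDF -> console structured streaming query of the kind
// described above; the UDF is illustrative, not the reporter's exact code.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("consumer-sketch").getOrCreate()
import spark.implicits._

val shout = udf((s: String) => if (s == null) null else s.toUpperCase)

val persons = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "persons")
  .load()
  .select(shout($"value".cast("string")).as("person"))

persons.writeStream
  .format("console")
  .start()
  .awaitTermination()
{code}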



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-08-13 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578789#comment-16578789
 ] 

Eyal Farago commented on SPARK-24410:
-

[~viirya], my bad :)

It seems there are two distinct issues here: one is the general behavior of 
join/aggregate over unions, the other is the guarantees of bucketed 
partitioning.

Looking more carefully at the results of your query, it seems that the two DFs 
are not co-partitioned (which is a bit surprising), so my apologies.

That said, there's a more general issue with pushing down shuffle-related 
operations over a union; do you guys think this deserves a separate 
issue?

 

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is a partially aggregated table and daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (like the join operator does).
> for example, for two bucketed tables a1,a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578767#comment-16578767
 ] 

Marcelo Vanzin commented on SPARK-24918:


For reference: this looks kinda similar to SPARK-650.

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  Its hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how its versioned. 
>  Does it just get created when the executor starts, and thats it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, its still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-13 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578766#comment-16578766
 ] 

kevin yu commented on SPARK-25105:
--

I will try to fix it. Thanks. Kevin

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "", line 1, in 
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-08-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578763#comment-16578763
 ] 

Liang-Chi Hsieh commented on SPARK-24410:
-

The above code shows that the two tables in the union results are located in 
logically different partitions, even if you know they might be physically 
co-partitioned. So we can't just get rid of the shuffle and expect correct 
results, because of `SparkContext.union`'s current implementation.

That is why cloud-fan suggested implementing Union with RDD.zip for certain 
cases, to preserve the children's output partitioning.

Although we can make Union smarter on its output partitioning, from the 
discussion you can see we might need to also consider parallelism and locality.
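
For illustration, here is a PySpark rendering of the same experiment (a sketch 
only; it assumes an active SparkSession named spark and the bucketed tables 
a1/a2 from the issue description below):

{code:lang=python}
from pyspark.sql.functions import spark_partition_id

# Tag each row with the partition id it sits in before the union.
df1 = spark.table("a1").select(spark_partition_id().alias("child_pid"), "key")
df2 = spark.table("a2").select(spark_partition_id().alias("child_pid"), "key")

# Union simply concatenates its children's partitions, so a key that lives in
# partition i of a1 and partition i of a2 ends up in two different output
# partitions of the union (i, and i offset by a1's partition count).
unioned = df1.union(df2).select(
    spark_partition_id().alias("union_pid"), "child_pid", "key")
unioned.show()
{code}

That doubling of logical partitions is why the shuffle above cannot simply be 
dropped today.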

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is of a partially aggregated table and daily 
> increments that we need to further aggregate. we do this by unioning the two 
> tables and aggregating again.
> we tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (like the join operator).
> for example, for two bucketed tables a1,a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> {code}
> but for aggregation after union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-13 Thread MIK (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578752#comment-16578752
 ] 

MIK edited comment on SPARK-25051 at 8/13/18 6:21 PM:
--

Thanks [~yumwang] , with 2.3.2-rc4 the error is gone now but the result is not 
correct (getting 0 records), 
+---+----+
| id|name|
+---+----+
+---+----+

The sample program should return 2 records.
+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+


was (Author: mik1007):
Thanks [~yumwang] , with 2.3.2-rc4 the error is gone now but the result is not 
correct (getting 0 records), 
+---+----+
| id|name|
+---+----+
+---+----+

The sample program should return 2 records.
+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-13 Thread MIK (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578752#comment-16578752
 ] 

MIK commented on SPARK-25051:
-

Thanks [~yumwang] , with 2.3.2-rc4 the error is gone now but the result is not 
correct (getting 0 records), 
+---+----+
| id|name|
+---+----+
+---+----+

The sample program should return 2 records.
+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23654) Cut jets3t as a dependency of spark-core

2018-08-13 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23654:
---
Summary: Cut jets3t as a dependency of spark-core  (was: Cut jets3t and 
bouncy castle as dependencies of spark-core)

> Cut jets3t as a dependency of spark-core
> 
>
> Key: SPARK-23654
> URL: https://issues.apache.org/jira/browse/SPARK-23654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>
> Spark core declares a dependency on Jets3t, which pulls in other cruft:
> # the hadoop-cloud module pulls in the hadoop-aws module with the 
> jets3t-compatible connectors, and the relevant dependencies: the spark-core 
> dependency is incomplete if that module isn't built, and superfluous or 
> inconsistent if it is.
> # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop 
> 3.x, in favour of connectors we're willing to maintain.
> JetS3t was wonderful when it came out, but now the amazon SDKs massively 
> exceed it in functionality, albeit at the expense of week-to-week stability 
> and JAR binary compatibility.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23654) Cut jets3t and bouncy castle as dependencies of spark-core

2018-08-13 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23654:
---
Summary: Cut jets3t and bouncy castle as dependencies of spark-core  (was: 
Cut jets3t as a dependency of spark-core)

> Cut jets3t and bouncy castle as dependencies of spark-core
> --
>
> Key: SPARK-23654
> URL: https://issues.apache.org/jira/browse/SPARK-23654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>
> Spark core declares a dependency on Jets3t, which pulls in other cruft:
> # the hadoop-cloud module pulls in the hadoop-aws module with the 
> jets3t-compatible connectors, and the relevant dependencies: the spark-core 
> dependency is incomplete if that module isn't built, and superfluous or 
> inconsistent if it is.
> # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop 
> 3.x, in favour of connectors we're willing to maintain.
> JetS3t was wonderful when it came out, but now the amazon SDKs massively 
> exceed it in functionality, albeit at the expense of week-to-week stability 
> and JAR binary compatibility.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578719#comment-16578719
 ] 

holdenk commented on SPARK-24735:
-

So [~bryanc], what do you think of adding an AggregatePythonUDF and using it for 
grouped_map / grouped_agg, so that they get treated the correct way by the Scala 
SQL engine?

> Improve exception when mixing up pandas_udf types
> -
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an 
> exception which is hard to understand.  It should tell the user that the UDF 
> type is wrong.  This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578710#comment-16578710
 ] 

holdenk commented on SPARK-24735:
-

I think we could do better than just improving the exception. If we look at the 
other aggregates in PySpark, calling them with select does the grouping for us:

{code:java}
>>> df.select(sumDistinct(df._1)).show()
+----------------+
|sum(DISTINCT _1)|
+----------------+
|            4950|
+----------------+
{code}
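
For contrast, a minimal sketch of how a GROUPED_MAP pandas_udf is meant to be 
invoked today, via groupBy().apply() rather than select() (it assumes a 
DataFrame df with columns id and v, and Arrow/pandas installed):

{code:lang=python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def identity(pdf):
    # pdf is a pandas.DataFrame holding all rows of one id group
    return pdf

# Using the udf in select() instead of this is exactly the mix-up that
# currently surfaces as the confusing error above.
df.groupBy("id").apply(identity).show()
{code}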

> Improve exception when mixing up pandas_udf types
> -
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an 
> exception which is hard to understand.  It should tell the user that the UDF 
> type is wrong.  This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578709#comment-16578709
 ] 

Liang-Chi Hsieh commented on SPARK-22347:
-

Agreed. Thanks [~rdblue]


> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-13 Thread holdenk (JIRA)
holdenk created SPARK-25105:
---

 Summary: Importing all of pyspark.sql.functions should bring 
PandasUDFType in as well
 Key: SPARK-25105
 URL: https://issues.apache.org/jira/browse/SPARK-25105
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: holdenk


 
{code:java}
>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
Traceback (most recent call last):
 File "", line 1, in 
NameError: name 'PandasUDFType' is not defined
 
{code}
When explicitly imported it works fine:
{code:java}
 
>>> from pyspark.sql.functions import PandasUDFType
>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
{code}
 

We just need to make sure it's included in __all__.
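
A tiny, self-contained illustration of why __all__ matters here (module and 
names are hypothetical stand-ins, not the actual pyspark source):

{code:lang=python}
# hypothetical_module.py
__all__ = ["pandas_udf"]        # "PandasUDFType" not listed -> not exported by *

class PandasUDFType(object):
    SCALAR = 200
    GROUPED_MAP = 201

def pandas_udf(f=None, returnType=None, functionType=None):
    return f

# In a client session:
#   from hypothetical_module import *
# binds pandas_udf but not PandasUDFType; adding "PandasUDFType" to __all__
# is the shape of the fix this issue asks for.
{code}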



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-24735:

Summary: Improve exception when mixing up pandas_udf types  (was: Improve 
exception when mixing pandas_udf types)

> Improve exception when mixing up pandas_udf types
> -
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an 
> exception which is hard to understand.  It should tell the user that the UDF 
> type is wrong.  This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25104:


Assignee: Apache Spark

> Validate user specified output schema
> -
>
> Key: SPARK-25104
> URL: https://issues.apache.org/jira/browse/SPARK-25104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> With the code changes in 
> [https://github.com/apache/spark/pull/21847], Spark can write out data as per a 
> user-provided output schema.
> To make it more robust and user friendly, we should validate the Avro schema 
> before tasks are launched.
> We should also support outputting the logical decimal type as BYTES (by default 
> we output it as FIXED).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25104:


Assignee: (was: Apache Spark)

> Validate user specified output schema
> -
>
> Key: SPARK-25104
> URL: https://issues.apache.org/jira/browse/SPARK-25104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> With the code changes in 
> [https://github.com/apache/spark/pull/21847], Spark can write out data as per a 
> user-provided output schema.
> To make it more robust and user friendly, we should validate the Avro schema 
> before tasks are launched.
> We should also support outputting the logical decimal type as BYTES (by default 
> we output it as FIXED).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578637#comment-16578637
 ] 

Apache Spark commented on SPARK-25104:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22094

> Validate user specified output schema
> -
>
> Key: SPARK-25104
> URL: https://issues.apache.org/jira/browse/SPARK-25104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> With the code changes in 
> [https://github.com/apache/spark/pull/21847], Spark can write out data as per a 
> user-provided output schema.
> To make it more robust and user friendly, we should validate the Avro schema 
> before tasks are launched.
> We should also support outputting the logical decimal type as BYTES (by default 
> we output it as FIXED).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24736) --py-files not functional for non local URLs. It appears to pass non-local URL's into PYTHONPATH directly.

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578624#comment-16578624
 ] 

holdenk commented on SPARK-24736:
-

cc [~ifilonenko]

> --py-files not functional for non local URLs. It appears to pass non-local 
> URL's into PYTHONPATH directly.
> --
>
> Key: SPARK-24736
> URL: https://issues.apache.org/jira/browse/SPARK-24736
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
> Environment: Recent 2.4.0 from master branch, submitted on Linux to a 
> KOPS Kubernetes cluster created on AWS.
>  
>Reporter: Jonathan A Weaver
>Priority: Minor
>
> My spark-submit
> bin/spark-submit \
>         --master 
> k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]
>  \
>         --deploy-mode cluster \
>         --name pytest \
>         --conf 
> spark.kubernetes.container.image=[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest]
>  \
>         --conf 
> [spark.kubernetes.driver.pod.name|http://spark.kubernetes.driver.pod.name/]=spark-pi-driver
>  \
>         --conf 
> spark.kubernetes.authenticate.submission.caCertFile=[cluster.ca|http://cluster.ca/]
>  \
>         --conf spark.kubernetes.authenticate.submission.oauthToken=$TOK \
>         --conf spark.kubernetes.authenticate.driver.oauthToken=$TOK \
> --py-files "[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]; \
> [https://s3.amazonaws.com/maxar-ids-fids/it.py]
>  
> *screw.zip is successfully downloaded and placed in SparkFiles.getRootPath()*
> 2018-07-01 07:33:43 INFO  SparkContext:54 - Added file 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] at 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] with timestamp 
> 1530430423297
> 2018-07-01 07:33:43 INFO  Utils:54 - Fetching 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] to 
> /var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240/fetchFileTemp1549645948768432992.tmp
> *I print out the  PYTHONPATH and PYSPARK_FILES environment variables from the 
> driver script:*
>      PYTHONPATH 
> /opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.7-src.zip:/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:*[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]*
>     PYSPARK_FILES [https://s3.amazonaws.com/maxar-ids-fids/screw.zip]
>  
> *I print out sys.path*
> ['/tmp/spark-fec3684b-8b63-4f43-91a4-2f2fa41a1914', 
> u'/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240',
>  '/opt/spark/python/lib/pyspark.zip', 
> '/opt/spark/python/lib/py4j-0.10.7-src.zip', 
> '/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar', 
> '/opt/spark/python/lib/py4j-*.zip', *'/opt/spark/work-dir/https', 
> '//[s3.amazonaws.com/maxar-ids-fids/screw.zip|http://s3.amazonaws.com/maxar-ids-fids/screw.zip]',*
>  '/usr/lib/python27.zip', '/usr/lib/python2.7', 
> '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', 
> '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
> '/usr/lib/python2.7/site-packages']
>  
> *URL from PYTHONFILES gets placed in sys.path verbatim with obvious results.*
>  
> *Dump of spark config from container.*
> Spark config dumped:
> [(u'spark.master', 
> u'k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]'),
>  (u'spark.kubernetes.authenticate.submission.oauthToken', 
> u''), 
> (u'spark.kubernetes.authenticate.driver.oauthToken', 
> u''), (u'spark.kubernetes.executor.podNamePrefix', 
> u'pytest-1530430411996'), (u'spark.kubernetes.memoryOverheadFactor', u'0.4'), 
> (u'spark.driver.blockManager.port', u'7079'), 
> (u'[spark.app.id|http://spark.app.id/]', u'spark-application-1530430424433'), 
> (u'[spark.app.name|http://spark.app.name/]', u'pytest'), 
> (u'[spark.executor.id|http://spark.executor.id/]', u'driver'), 
> (u'spark.driver.host', u'pytest-1530430411996-driver-svc.default.svc'), 
> (u'spark.kubernetes.container.image', 
> u'[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest'|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest']),
>  (u'spark.driver.port', u'7078'), 
> (u'spark.kubernetes.python.mainAppResource', 
> u'[https://s3.amazonaws.com/maxar-ids-fids/it.py']), 
> 
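
A small sketch of the mechanism as I read the sys.path output above (the paths 
are copied from the report; the split behaviour is standard CPython handling of 
the colon-separated PYTHONPATH on Linux):

{code:lang=python}
import os

pythonpath = ("/opt/spark/python/lib/pyspark.zip:"
              "https://s3.amazonaws.com/maxar-ids-fids/screw.zip")

# PYTHONPATH is split on os.pathsep (':' on Linux), so the URL's scheme
# separator is treated as a list delimiter.
print(pythonpath.split(os.pathsep))
# ['/opt/spark/python/lib/pyspark.zip', 'https',
#  '//s3.amazonaws.com/maxar-ids-fids/screw.zip']

# The leftover relative entry 'https' then appears resolved against the
# working directory, matching '/opt/spark/work-dir/https' in sys.path above.
{code}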

[jira] [Created] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25104:
--

 Summary: Validate user specified output schema
 Key: SPARK-25104
 URL: https://issues.apache.org/jira/browse/SPARK-25104
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang


With the code changes in 
[https://github.com/apache/spark/pull/21847], Spark can write out data as per a 
user-provided output schema.

To make it more robust and user friendly, we should validate the Avro schema 
before tasks are launched.

We should also support outputting the logical decimal type as BYTES (by default 
we output it as FIXED).
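
For context, a hedged sketch of the write path this validation would protect, 
assuming the built-in Avro source's avroSchema option and a DataFrame df whose 
amount column is decimal(10,2); the option name and decimal mapping here are 
illustrative, not the final API:

{code:lang=python}
avro_schema = """{
  "type": "record", "name": "rec", "fields": [
    {"name": "amount", "type": {"type": "bytes", "logicalType": "decimal",
                                "precision": 10, "scale": 2}}
  ]
}"""

(df.write
   .format("avro")
   .option("avroSchema", avro_schema)  # user-specified output schema
   .save("/tmp/avro_out"))             # hypothetical output path
{code}

Validating avro_schema against df's schema before tasks launch would surface 
mismatches on the driver instead of as task failures.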



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-13 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support ARROW-2141
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support SPARK-23555
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
> * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22347:

Fix Version/s: (was: 2.3.0)

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22347.
-
Resolution: Won't Fix

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.3.0
>
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-22347:
-

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.3.0
>
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578566#comment-16578566
 ] 

Wenchen Fan commented on SPARK-22347:
-

We changed our mind during code review and this JIRA is no longer valid, so we 
should mark it as won't fix. [~rdblue], thanks for pointing it out!

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.3.0
>
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25060) PySpark UDF in case statement is always run

2018-08-13 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-25060.
---
Resolution: Won't Fix

I'm closing this issue as "Won't Fix", the same as the issue this duplicates, 
SPARK-22347.

> PySpark UDF in case statement is always run
> ---
>
> Key: SPARK-25060
> URL: https://issues.apache.org/jira/browse/SPARK-25060
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Ryan Blue
>Priority: Major
>
> When evaluating a case statement with a python UDF, Spark will always run the 
> UDF even if the case doesn't use the branch with the UDF call. Here's a repro 
> case:
> {code:lang=python}
> from pyspark.sql.types import StringType
> def fail_if_x(s):
> assert s != 'x'
> return s
> spark.udf.register("fail_if_x", fail_if_x, StringType())
> df = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'str'])
> df.registerTempTable("data")
> spark.sql("select id, case when str <> 'x' then fail_if_x(str) else null end 
> from data").show()
> {code}
> This produces the following error:
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last): 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 189, in main 
> process() 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 184, in process 
> serializer.dump_stream(func(split_index, iterator), outfile) 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 104, in  
> func = lambda _, it: map(mapper, it) 
>   File "", line 1, in  
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 71, in  
> return lambda *a: f(*a) 
>   File "", line 4, in fail_if_x 
> AssertionError
> {code}
> This is because Python UDFs are extracted from expressions and run in the 
> BatchEvalPython node inserted as the child of the expression node:
> {code}
> == Physical Plan ==
> CollectLimit 21
> +- *Project [id#0L, CASE WHEN NOT (str#1 = x) THEN pythonUDF0#14 ELSE null 
> END AS CASE WHEN (NOT (str = x)) THEN fail_if_x(str) ELSE CAST(NULL AS 
> STRING) END#6]
>+- BatchEvalPython [fail_if_x(str#1)], [id#0L, str#1, pythonUDF0#14]
>   +- Scan ExistingRDD[id#0L,str#1]
> {code}
> This doesn't affect correctness, but the behavior doesn't match the Scala API 
> where case can be used to avoid passing data that will cause a UDF to fail 
> into the UDF.
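
Until the SQL-level behaviour changes, one workaround consistent with the 
explanation above (a sketch, not the project's fix) is to make the UDF itself 
tolerant of the values the CASE branch was meant to exclude, since the 
BatchEvalPython node sees every row:

{code:lang=python}
from pyspark.sql.types import StringType

def fail_if_x_safe(s):
    # Guard inside the UDF: the CASE WHEN does not shield it from 'x' rows.
    if s == 'x':
        return None
    return s

spark.udf.register("fail_if_x_safe", fail_if_x_safe, StringType())
# spark.sql("select id, case when str <> 'x' then fail_if_x_safe(str) "
#           "else null end from data").show()
{code}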



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25060) PySpark UDF in case statement is always run

2018-08-13 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578557#comment-16578557
 ] 

Ryan Blue commented on SPARK-25060:
---

Thanks, [~hyukjin.kwon], you're right that this is a duplicate. I've closed it.

> PySpark UDF in case statement is always run
> ---
>
> Key: SPARK-25060
> URL: https://issues.apache.org/jira/browse/SPARK-25060
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Ryan Blue
>Priority: Major
>
> When evaluating a case statement with a python UDF, Spark will always run the 
> UDF even if the case doesn't use the branch with the UDF call. Here's a repro 
> case:
> {code:lang=python}
> from pyspark.sql.types import StringType
> def fail_if_x(s):
> assert s != 'x'
> return s
> spark.udf.register("fail_if_x", fail_if_x, StringType())
> df = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'str'])
> df.registerTempTable("data")
> spark.sql("select id, case when str <> 'x' then fail_if_x(str) else null end 
> from data").show()
> {code}
> This produces the following error:
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last): 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 189, in main 
> process() 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 184, in process 
> serializer.dump_stream(func(split_index, iterator), outfile) 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 104, in  
> func = lambda _, it: map(mapper, it) 
>   File "", line 1, in  
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 71, in  
> return lambda *a: f(*a) 
>   File "", line 4, in fail_if_x 
> AssertionError
> {code}
> This is because Python UDFs are extracted from expressions and run in the 
> BatchEvalPython node inserted as the child of the expression node:
> {code}
> == Physical Plan ==
> CollectLimit 21
> +- *Project [id#0L, CASE WHEN NOT (str#1 = x) THEN pythonUDF0#14 ELSE null 
> END AS CASE WHEN (NOT (str = x)) THEN fail_if_x(str) ELSE CAST(NULL AS 
> STRING) END#6]
>+- BatchEvalPython [fail_if_x(str#1)], [id#0L, str#1, pythonUDF0#14]
>   +- Scan ExistingRDD[id#0L,str#1]
> {code}
> This doesn't affect correctness, but the behavior doesn't match the Scala API 
> where case can be used to avoid passing data that will cause a UDF to fail 
> into the UDF.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578549#comment-16578549
 ] 

Ryan Blue commented on SPARK-22347:
---

[~viirya], [~cloud_fan]: Is there any objection to changing the resolution of 
this issue to "Won't Fix" instead of "Fixed"? Just documenting the behavior is 
not a fix.

If I don't hear anything in the next day or so, I'll update it.

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.3.0
>
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-13 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support SPARK-23555
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support SPARK-23555
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
> * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25103) CompletionIterator may delay GC of completed resources

2018-08-13 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578519#comment-16578519
 ] 

Eyal Farago commented on SPARK-25103:
-

CC: [~cloud_fan], [~hvanhovell]

> CompletionIterator may delay GC of completed resources
> --
>
> Key: SPARK-25103
> URL: https://issues.apache.org/jira/browse/SPARK-25103
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0, 2.2.0, 2.3.0
>Reporter: Eyal Farago
>Priority: Major
>
> While working on SPARK-22713, I found (and partially fixed) a scenario in 
> which an iterator is already exhausted but still holds a reference to some 
> resources that could be GCed at this point.
> However, these resources can not be GCed because of this reference.
> The specific fix applied in SPARK-22713 was to wrap the iterator with a 
> CompletionIterator that cleans it up when exhausted; the thing is that it's 
> quite easy to get this wrong by closing over local variables or _this_ 
> reference in the cleanup function itself.
> I propose solving this by modifying CompletionIterator to discard references 
> to the wrapped iterator and cleanup function once exhausted.
>  
>  * a dive into the code showed that most CompletionIterators are eventually 
> used by 
> {code:java}
> org.apache.spark.scheduler.ShuffleMapTask#runTask{code}
> which does:
> {code:java}
> writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: 
> Product2[Any, Any]]]){code}
> looking at 
> {code:java}
> org.apache.spark.shuffle.ShuffleWriter#write{code}
> implementations, it seems all of them first exhaust the iterator and then 
> perform some kind of post-processing: i.e. merging spills, sorting, writing 
> partitions files and then concatenating them into a single file... bottom 
> line the Iterator may actually be 'sitting' for some time after being 
> exhausted.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25103) CompletionIterator may delay GC of completed resources

2018-08-13 Thread Eyal Farago (JIRA)
Eyal Farago created SPARK-25103:
---

 Summary: CompletionIterator may delay GC of completed resources
 Key: SPARK-25103
 URL: https://issues.apache.org/jira/browse/SPARK-25103
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0, 2.2.0, 2.1.0, 2.0.1
Reporter: Eyal Farago


While working on SPARK-22713, I found (and partially fixed) a scenario in which 
an iterator is already exhausted but still holds a reference to some resources 
that could be GCed at this point.

However, these resources can not be GCed because of this reference.

The specific fix applied in SPARK-22713 was to wrap the iterator with a 
CompletionIterator that cleans it up when exhausted; the thing is that it's quite 
easy to get this wrong by closing over local variables or _this_ reference in the 
cleanup function itself.

I propose solving this by modifying CompletionIterator to discard references to 
the wrapped iterator and cleanup function once exhausted.

 
 * A dive into the code showed that most CompletionIterators are eventually 
used by 
{code:java}
org.apache.spark.scheduler.ShuffleMapTask#runTask{code}
which does:

{code:java}
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: 
Product2[Any, Any]]]){code}

Looking at the 
{code:java}
org.apache.spark.shuffle.ShuffleWriter#write{code}
implementations, it seems all of them first exhaust the iterator and then 
perform some kind of post-processing, e.g. merging spills, sorting, writing 
partition files and then concatenating them into a single file. The bottom line 
is that the iterator may sit around for some time after being exhausted, as 
sketched below.
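
Schematically (stand-in code with hypothetical helper names, not actual 
ShuffleWriter internals):
{code:java}
// the pattern described above: the iterator is fully consumed up front, but
// stays reachable from this frame while the post-processing still runs
def write(records: Iterator[Product2[Any, Any]]): Unit = {
  val buffered = scala.collection.mutable.ArrayBuffer.empty[Product2[Any, Any]]
  while (records.hasNext) {
    buffered += records.next()   // `records` is exhausted after this loop
  }
  // stand-in for merging spills / sorting / writing and concatenating
  // partition files: during all of this, `records` is still live
  buffered.sortBy(_._1.hashCode()).foreach(_ => ())
}
{code}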

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-08-13 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578489#comment-16578489
 ] 

Eyal Farago commented on SPARK-24410:
-

[~viirya], I think your conclusion about co-partitioning is wrong. The 
following code segment from your comment:
{code:java}
val df1 = spark.table("a1").select(spark_partition_id(), $"key")
val df2 = spark.table("a2").select(spark_partition_id(), $"key")
df1.union(df2).select(spark_partition_id(), $"key").collect
{code}
This prints the partition ids as assigned by the union. Assuming the union 
simply concatenates the partitions of df1 and df2 and assigns them running ids, 
it makes sense that you'd get two partitions per key: one coming from df1 and 
the other from df2.

Applying this select on each dataframe separately, you'd get the exact same 
result, meaning a given key gets the same partition id in both dataframes.
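
A quick way to check that directly (just a sketch, assuming the same bucketed 
a1/a2 tables from the description below and a spark-shell style session):
{code:java}
import org.apache.spark.sql.functions.spark_partition_id
import spark.implicits._

// inspect partition assignment per table, without a union in between
val p1 = spark.table("a1").select(spark_partition_id().as("pid"), $"key")
val p2 = spark.table("a2").select(spark_partition_id().as("pid"), $"key")

// if the tables really are co-partitioned, every (pid, key) pair should show
// up in both outputs, i.e. both of these should come back empty
p1.except(p2).show()
p2.except(p1).show()
{code}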

I think the code fragment from your comment basically shows what's wrong with 
the current implementation of Union, not that we can't optimize unions of 
co-partitioned relations.

If Union were a bit more 'partitioning aware', it'd be able to identify that 
both children have the same partitioning scheme and 'inherit' it. As you 
showed, this may be a bit tricky since the same logical attribute has a 
different expression id in each child, but Union eventually maps these child 
attributes into a single output attribute, so this information can be used to 
resolve the partitioning columns and determine their equality.

Furthermore, Union being smarter about its output partitioning won't cut it on 
its own; a few rules have to be added/modified:

1. Applying an exchange on a union should sometimes be pushed down to the 
children: the children can be partitioned into those supporting the required 
partitioning and those that don't, the exchange can be applied to a union of 
the non-supporting children, and the result unioned with the rest of the 
children.
 2. A partial aggregate also has to be pushed down to the children, resulting 
in a union of partial aggregations; again, it's possible to partition the 
children according to their support of the required partitioning.
 3. A final aggregation over a union introduces an exchange which will then be 
pushed down to the children; the aggregation is then applied on top of the 
partitioning-aware union (think of the way PartitionerAwareUnionRDD handles 
partitioning).
 * 'partition the children' here means partitioning a collection by a predicate 
(scala.collection.TraversableLike#partition), as in the sketch after this list.
 * Other operators like join may require additional rules.
 * Some of these ideas were discussed offline with [~hvanhovell].
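
A rough sketch of that splitting step (the names here are placeholders, not 
actual planner APIs):
{code:java}
// split a union's children by whether they already satisfy the required
// partitioning, so an exchange only has to be applied to the remainder
// (plain Scala collection partition, nothing Spark-specific)
def splitChildren[T](children: Seq[T])(satisfiesPartitioning: T => Boolean): (Seq[T], Seq[T]) =
  children.partition(satisfiesPartitioning)

// hypothetical usage inside a planner rule:
// val (alreadyPartitioned, needsExchange) =
//   splitChildren(union.children)(childSatisfiesRequiredDistribution)
// val exchanged = applyExchange(unionOf(needsExchange))
// unionOf(alreadyPartitioned :+ exchanged)
{code}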

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use case we have is a partially aggregated table plus daily 
> increments that we need to aggregate further. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (the way the join operator does).
> for example, for two bucketed tables a1,a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write
>   .mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: 
