[jira] [Updated] (SPARK-26576) Broadcast hint not applied to partitioned table

2019-01-09 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-26576:
---
Summary: Broadcast hint not applied to partitioned table  (was: Broadcast 
hint not applied to partitioned Parquet table)

> Broadcast hint not applied to partitioned table
> ---
>
> Key: SPARK-26576
> URL: https://issues.apache.org/jira/browse/SPARK-26576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: John Zhuge
>Priority: Major
>
> The broadcast hint is not applied to a partitioned Parquet table. Below, 
> "SortMergeJoin" is chosen incorrectly and "ResolvedHint(broadcast)" is removed 
> in the Optimized Plan.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) 
> PARTITIONED BY (dateint INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_with_part`
> :  +- Relation[val#28,dateint#29] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_with_part`
>   +- Relation[val#32,dateint#33] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- SubqueryAlias `jzhuge`.`parquet_with_part`
>:  +- Relation[val#28,dateint#29] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_with_part`
>  +- Relation[val#32,dateint#33] parquet
> == Optimized Logical Plan ==
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- Project [val#28, dateint#29]
>:  +- Filter isnotnull(dateint#29)
>: +- Relation[val#28,dateint#29] parquet
>+- Project [val#32, dateint#33]
>   +- Filter isnotnull(dateint#33)
>  +- Relation[val#32,dateint#33] parquet
> == Physical Plan ==
> *(5) Project [dateint#29, val#28, val#32]
> +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner
>:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, 
> 500), coordinator[target post-shuffle partition size: 67108864]
>: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] 
> Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], 
> PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: 
> [], ReadSchema: struct
>+- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: 
> 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle 
> partition size: 67108864]
> {noformat}
> The broadcast hint is applied to a Parquet table without partitions. Below, 
> "BroadcastHashJoin" is chosen as expected.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint 
> INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_no_part`
> :  +- Relation[val#44,dateint#45] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_no_part`
>   +- Relation[val#50,dateint#51] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- SubqueryAlias `jzhuge`.`parquet_no_part`
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_no_part`
>  +- Relation[val#50,dateint#51] parquet
> == Optimized Logical Plan ==
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- Filter isnotnull(dateint#45)
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- Filter isnotnull(dateint#51)
>  +- Relation[val#50,dateint#51] parquet
> == Physical Plan ==
> *(2) Project [dateint#45, val#44, val#50]
> +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight
>:- *(2) Project [val#44, dateint#45]
>:  +- *(2) Filter isnotnull(dateint#45)
>: +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] 
> Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], 
> PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: 
> struct
>+- BroadcastExchange 

[jira] [Commented] (SPARK-26491) Use ConfigEntry for hardcoded configs for test categories.

2019-01-09 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16739167#comment-16739167
 ] 

Dongjoon Hyun commented on SPARK-26491:
---

The broken K8S integration compilation is fixed via 
https://github.com/apache/spark/pull/23505 .

> Use ConfigEntry for hardcoded configs for test categories.
> --
>
> Key: SPARK-26491
> URL: https://issues.apache.org/jira/browse/SPARK-26491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> Make the following hardcoded configs use ConfigEntry.
> {code}
> spark.test
> spark.testing
> {code}
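
For context, a minimal sketch (not the actual patch) of what such a ConfigEntry
definition typically looks like inside Spark's internal
org.apache.spark.internal.config package object; the key shown is one of the two
listed above, and the default value is only an assumption:

{code}
// Hedged sketch assuming Spark's internal ConfigBuilder API: declare the
// hardcoded "spark.testing" key as a typed ConfigEntry with a default value.
private[spark] val TESTING = ConfigBuilder("spark.testing")
  .booleanConf
  .createWithDefault(false)
{code}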



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-01-09 Thread deshanxiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16739083#comment-16739083
 ] 

deshanxiao commented on SPARK-26570:


[~hyukjin.kwon] OK, I will try it. Thank you!

> Out of memory when InMemoryFileIndex bulkListLeafFiles
> --
>
> Key: SPARK-26570
> URL: https://issues.apache.org/jira/browse/SPARK-26570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: deshanxiao
>Priority: Major
> Attachments: screenshot-1.png
>
>
> *bulkListLeafFiles* collects all FileStatus objects in memory for every query, 
> which may cause an OOM in the driver. I hit this problem on Spark 2.3.2. The 
> latest version may have the same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26586) Streaming queries should have isolated SparkSessions and confs

2019-01-09 Thread Mukul Murthy (JIRA)
Mukul Murthy created SPARK-26586:


 Summary: Streaming queries should have isolated SparkSessions and 
confs
 Key: SPARK-26586
 URL: https://issues.apache.org/jira/browse/SPARK-26586
 Project: Spark
  Issue Type: Bug
  Components: SQL, Structured Streaming
Affects Versions: 2.4.0, 2.3.0
Reporter: Mukul Murthy


When a stream is started, the stream's config is supposed to be frozen, and all 
batches run with the config captured at start time. However, due to a race 
condition in stream creation, updating a conf value in the active SparkSession 
immediately after starting a stream can lead to the stream picking up that 
updated value.

 

The problem is that when StreamingQueryManager creates a MicroBatchExecution 
(or ContinuousExecution), it passes in the shared SparkSession, and the session 
isn't cloned until StreamExecution.start() is called. 
DataStreamWriter.start() should not return until the SparkSession has been cloned.
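
A minimal reproduction sketch of that race, assuming a local session and the
built-in rate source; the conf key below is only illustrative, not taken from the
report:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "10")  // value meant to be frozen

val query = spark.readStream
  .format("rate")          // built-in test source
  .load()
  .writeStream
  .format("console")
  .start()

// Because the shared session is cloned only inside StreamExecution.start(),
// this update can race with stream startup and leak into the first batches.
spark.conf.set("spark.sql.shuffle.partitions", "1")
{code}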



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2019-01-09 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738955#comment-16738955
 ] 

Hyukjin Kwon commented on SPARK-26570:
--

Would you be able to test this in a newer version of Spark?

> Out of memory when InMemoryFileIndex bulkListLeafFiles
> --
>
> Key: SPARK-26570
> URL: https://issues.apache.org/jira/browse/SPARK-26570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: deshanxiao
>Priority: Major
> Attachments: screenshot-1.png
>
>
> *bulkListLeafFiles* collects all FileStatus objects in memory for every query, 
> which may cause an OOM in the driver. I hit this problem on Spark 2.3.2. The 
> latest version may have the same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26574.
--
Resolution: Invalid

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26574:
-
Fix Version/s: (was: 0.8.2)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: jenkins, Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>  Labels: PA
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend

2019-01-09 Thread Nagaram Prasad Addepally (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738956#comment-16738956
 ] 

Nagaram Prasad Addepally commented on SPARK-26585:
--

https://github.com/apache/spark/pull/23504

> [K8S] Add additional integration tests for K8s Scheduler Backend 
> -
>
> Key: SPARK-26585
> URL: https://issues.apache.org/jira/browse/SPARK-26585
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> I have reviewed the Kubernetes integration tests and found that the following 
> cases are missing for testing scheduler backend functionality:
>  * Run application with driver and executor image specified independently
>  * Request Pods with custom CPU and Limits
>  * Request Pods with custom Memory and memory overhead factor
>  * Request Pods with custom Memory and memory overhead
>  * Pods are relaunched on failures (as per 
> spark.kubernetes.executor.lostCheck.maxAttempts)
> Logging this JIRA to add these tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend

2019-01-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26585:


Assignee: (was: Apache Spark)

> [K8S] Add additional integration tests for K8s Scheduler Backend 
> -
>
> Key: SPARK-26585
> URL: https://issues.apache.org/jira/browse/SPARK-26585
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> I have reviewed the Kubernetes integration tests and found that the following 
> cases are missing for testing scheduler backend functionality:
>  * Run application with driver and executor image specified independently
>  * Request Pods with custom CPU and Limits
>  * Request Pods with custom Memory and memory overhead factor
>  * Request Pods with custom Memory and memory overhead
>  * Pods are relaunched on failures (as per 
> spark.kubernetes.executor.lostCheck.maxAttempts)
> Logging this JIRA to add these tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend

2019-01-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26585:


Assignee: Apache Spark

> [K8S] Add additional integration tests for K8s Scheduler Backend 
> -
>
> Key: SPARK-26585
> URL: https://issues.apache.org/jira/browse/SPARK-26585
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Nagaram Prasad Addepally
>Assignee: Apache Spark
>Priority: Major
>
> I have reviewed the Kubernetes integration tests and found that the following 
> cases are missing for testing scheduler backend functionality:
>  * Run application with driver and executor image specified independently
>  * Request Pods with custom CPU and Limits
>  * Request Pods with custom Memory and memory overhead factor
>  * Request Pods with custom Memory and memory overhead
>  * Pods are relaunched on failures (as per 
> spark.kubernetes.executor.lostCheck.maxAttempts)
> Logging this JIRA to add these tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26579) SparkML DecisionTree, how does the algorithm identify categorical features?

2019-01-09 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738954#comment-16738954
 ] 

Hyukjin Kwon commented on SPARK-26579:
--

Let's ask questions on the mailing list rather than filing a JIRA here. You could 
get a better answer there.

> SparkML DecisionTree, how does the algorithm identify categorical features?
> ---
>
> Key: SPARK-26579
> URL: https://issues.apache.org/jira/browse/SPARK-26579
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.4.0
> Environment: os: Centos7
> software: pyspark.
>Reporter: Xufeng Wang
>Priority: Major
>
> I am confused about the decision tree and other tree-based models. My current 
> project involves data with both nominal and continuous features. I converted 
> the nominal data to numeric values using the StringIndexer transformer from the 
> ml.feature module, then assembled all the feature values into a vector-type 
> column named "features". As far as I can tell, the feature vector contains only 
> double values.
> I kept getting the error that maxBins should be larger than the largest number 
> of categories among all categorical features. After correcting the maxBins 
> size, I still see some features (continuous from the beginning) with values 
> bigger than my maxBins size. Since the pipeline works with a correct maxBins 
> that is not bigger than some of the continuous values, the algorithm apparently 
> picks which features are categorical and which ones are continuous 
> automatically. But how does it figure out which is which, when all of the 
> features are doubles?
> Another question, if anyone can help: what kind of tree does the Spark decision 
> tree build? Is it CART or something else?
> Last question: what are the procedures for treating categorical features in 
> tree-based algorithms?
> Thank you in advance.
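
For reference, a hedged sketch of the pipeline described above; the column names
and the maxBins value are illustrative, not taken from the report:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Index the nominal column, then assemble indexed + continuous columns
// into a single "features" vector column.
val indexer = new StringIndexer()
  .setInputCol("nominal_col")
  .setOutputCol("nominal_idx")

val assembler = new VectorAssembler()
  .setInputCols(Array("nominal_idx", "continuous_col"))
  .setOutputCol("features")

// maxBins must be >= the largest number of categories among categorical features.
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxBins(64)

val pipeline = new Pipeline().setStages(Array(indexer, assembler, dt))
{code}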



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26579) SparkML DecisionTree, how does the algorithm identify categorical features?

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26579.
--
Resolution: Invalid

> SparkML DecisionTree, how does the algorithm identify categorical features?
> ---
>
> Key: SPARK-26579
> URL: https://issues.apache.org/jira/browse/SPARK-26579
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.4.0
> Environment: os: Centos7
> software: pyspark.
>Reporter: Xufeng Wang
>Priority: Major
>
> I am confused about the decision tree and other tree-based models. My current 
> project involves data with both nominal and continuous features. I converted 
> the nominal data to numeric values using the StringIndexer transformer from the 
> ml.feature module, then assembled all the feature values into a vector-type 
> column named "features". As far as I can tell, the feature vector contains only 
> double values.
> I kept getting the error that maxBins should be larger than the largest number 
> of categories among all categorical features. After correcting the maxBins 
> size, I still see some features (continuous from the beginning) with values 
> bigger than my maxBins size. Since the pipeline works with a correct maxBins 
> that is not bigger than some of the continuous values, the algorithm apparently 
> picks which features are categorical and which ones are continuous 
> automatically. But how does it figure out which is which, when all of the 
> features are doubles?
> Another question, if anyone can help: what kind of tree does the Spark decision 
> tree build? Is it CART or something else?
> Last question: what are the procedures for treating categorical features in 
> tree-based algorithms?
> Thank you in advance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26574:
-
Flags:   (was: Patch,Important)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26574:
-
External issue URL:   (was: https://pakegecloud.atlassian.net)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26574:
-
Labels:   (was: PA)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: jenkins, Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738952#comment-16738952
 ] 

Hyukjin Kwon commented on SPARK-26574:
--

Please fill in the JIRA description, and reopen.

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26574:
-
External issue ID:   (was: roufi...@rtat.net)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26574:
-
Component/s: (was: jenkins)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26574:
-
Shepherd:   (was: pakegecloud.atlassian.net)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: jenkins, Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>  Labels: PA
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26574:
-
Target Version/s:   (was: 2.4.0)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: jenkins, Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>  Labels: PA
> Fix For: 0.8.2
>
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26581) Spark Dataset write JSON with Multiline

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26581.
--
Resolution: Invalid

Also, the multiline concept is not applicable to the write side. Let's ask 
questions on the Spark mailing list before filing an issue.

> Spark Dataset write JSON with Multiline
> ---
>
> Key: SPARK-26581
> URL: https://issues.apache.org/jira/browse/SPARK-26581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Anil
>Priority: Major
>
> Hi,
> Spark currently can only write a JSON file with a single node. If I have 
> multiple lines or nodes, Spark writes the nodes with curly braces "\{ }" but 
> without a comma "," between the nodes, and there are no square brackets at the 
> start and end of the file. How can I achieve this? I am trying to write the 
> JSON file like:
> ds.write().format("JSON").option("multiline","true").save(path);
> Please help with this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26581) Spark Dataset write JSON with Multiline

2019-01-09 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738948#comment-16738948
 ] 

Hyukjin Kwon commented on SPARK-26581:
--

{{multiline}} is not supported as a write option. You can easily do it via manual 
conversion with the DataFrame APIs. For instance,

{code}
ds.toJSON.mapPartitions { iter =>
  // write [ for the first line, and ] for the last line
}.write.text("...")
{code}
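
A slightly fuller sketch of that manual conversion, assuming the whole result is
small enough to coalesce into a single partition; the output path is illustrative:

{code}
import spark.implicits._

ds.toJSON                      // Dataset[String], one JSON object per record
  .coalesce(1)                 // one partition so a single file holds the array
  .mapPartitions { iter =>
    // join the records with commas and wrap the partition in square brackets
    Iterator("[" + iter.mkString(",\n") + "]")
  }
  .write.text("/tmp/json-array-output")
{code}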

> Spark Dataset write JSON with Multiline
> ---
>
> Key: SPARK-26581
> URL: https://issues.apache.org/jira/browse/SPARK-26581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Anil
>Priority: Major
>
> Hi,
> Spark currently can only write a JSON file with a single node. If I have 
> multiple lines or nodes, Spark writes the nodes with curly braces "\{ }" but 
> without a comma "," between the nodes, and there are no square brackets at the 
> start and end of the file. How can I achieve this? I am trying to write the 
> JSON file like:
> ds.write().format("JSON").option("multiline","true").save(path);
> Please help with this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend

2019-01-09 Thread Nagaram Prasad Addepally (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738944#comment-16738944
 ] 

Nagaram Prasad Addepally commented on SPARK-26585:
--

I am working on adding these tests.

> [K8S] Add additional integration tests for K8s Scheduler Backend 
> -
>
> Key: SPARK-26585
> URL: https://issues.apache.org/jira/browse/SPARK-26585
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> I have reviewed the Kubernetes integration tests and found that the following 
> cases are missing for testing scheduler backend functionality:
>  * Run application with driver and executor image specified independently
>  * Request Pods with custom CPU and Limits
>  * Request Pods with custom Memory and memory overhead factor
>  * Request Pods with custom Memory and memory overhead
>  * Pods are relaunched on failures (as per 
> spark.kubernetes.executor.lostCheck.maxAttempts)
> Logging this JIRA to add these tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend

2019-01-09 Thread Nagaram Prasad Addepally (JIRA)
Nagaram Prasad Addepally created SPARK-26585:


 Summary: [K8S] Add additional integration tests for K8s Scheduler 
Backend 
 Key: SPARK-26585
 URL: https://issues.apache.org/jira/browse/SPARK-26585
 Project: Spark
  Issue Type: Test
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Nagaram Prasad Addepally


I have reviewed the Kubernetes integration tests and found that the following 
cases are missing for testing scheduler backend functionality:
 * Run application with driver and executor image specified independently
 * Request Pods with custom CPU and Limits
 * Request Pods with custom Memory and memory overhead factor
 * Request Pods with custom Memory and memory overhead
 * Pods are relaunched on failures (as per 
spark.kubernetes.executor.lostCheck.maxAttempts)

Logging this JIRA to add these tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10781) Allow certain number of failed tasks and allow job to succeed

2019-01-09 Thread nxet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738932#comment-16738932
 ] 

nxet edited comment on SPARK-10781 at 1/10/19 2:52 AM:
---

I met the same problem: some empty sequence files cause the whole job to fail, 
but MR can run normally (mapreduce.map.failures.maxpercent, 
mapreduce.reduce.failures.maxpercent). The following are my source files:

_116.1 M 348.3 M /20181226/1545753600402.lzo_deflate_
 _97.0 M 290.9 M /20181226/1545754236750.lzo_deflate_
 _113.3 M 339.8 M /20181226/1545754856515.lzo_deflate_
 _126.5 M 379.5 M /20181226/1545753600402.lzo_deflate_
 _92.9 M 278.6 M /20181226/1545754233009.lzo_deflate_
 _117.7 M 353.2 M /20181226/1545754850857.lzo_deflate_
 _0 M 0 M /20181226/1545755455381.lzo_deflate_
 _0 M 0 M /20181226/1545756056457.lzo_deflate_


was (Author: nxet):
I met the same problem as some empty sequence files cause the failure of the 
whole job,but by MR can run 
normally(mapreduce.map.failures.maxpercent,mapreduce.reduce.failures.maxpercent),the
 following is my source files:

_116.1 M  348.3 M  /20181226/1545753600402.lzo_deflate
97.0 M  290.9 M  /20181226/1545754236750.lzo_deflate
113.3 M  339.8 M  /20181226/1545754856515.lzo_deflate
126.5 M  379.5 M  /20181226/1545753600402.lzo_deflate
92.9 M  278.6 M  /20181226/1545754233009.lzo_deflate
117.7 M  353.2 M  /20181226/1545754850857.lzo_deflate
0 M  0 M  /20181226/1545755455381.lzo_deflate
0 M  0 M  /20181226/1545756056457.lzo_deflate_

> Allow certain number of failed tasks and allow job to succeed
> -
>
> Key: SPARK-10781
> URL: https://issues.apache.org/jira/browse/SPARK-10781
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: SPARK_10781_Proposed_Solution.pdf
>
>
> MapReduce has the configs mapreduce.map.failures.maxpercent and 
> mapreduce.reduce.failures.maxpercent, which allow a certain percent of 
> tasks to fail while the job still succeeds.
> This could be a useful feature in Spark as well, if a job doesn't need all the 
> tasks to be successful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10781) Allow certain number of failed tasks and allow job to succeed

2019-01-09 Thread nxet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738932#comment-16738932
 ] 

nxet commented on SPARK-10781:
--

I met the same problem: some empty sequence files cause the whole job to fail, 
but MR can run normally (mapreduce.map.failures.maxpercent, 
mapreduce.reduce.failures.maxpercent). The following are my source files:

_116.1 M  348.3 M  /20181226/1545753600402.lzo_deflate
97.0 M  290.9 M  /20181226/1545754236750.lzo_deflate
113.3 M  339.8 M  /20181226/1545754856515.lzo_deflate
126.5 M  379.5 M  /20181226/1545753600402.lzo_deflate
92.9 M  278.6 M  /20181226/1545754233009.lzo_deflate
117.7 M  353.2 M  /20181226/1545754850857.lzo_deflate
0 M  0 M  /20181226/1545755455381.lzo_deflate
0 M  0 M  /20181226/1545756056457.lzo_deflate_

> Allow certain number of failed tasks and allow job to succeed
> -
>
> Key: SPARK-10781
> URL: https://issues.apache.org/jira/browse/SPARK-10781
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Priority: Major
> Attachments: SPARK_10781_Proposed_Solution.pdf
>
>
> MapReduce has the configs mapreduce.map.failures.maxpercent and 
> mapreduce.reduce.failures.maxpercent, which allow a certain percent of 
> tasks to fail while the job still succeeds.
> This could be a useful feature in Spark as well, if a job doesn't need all the 
> tasks to be successful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26546) Caching of DateTimeFormatter

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26546.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23462
[https://github.com/apache/spark/pull/23462]

> Caching of DateTimeFormatter
> 
>
> Key: SPARK-26546
> URL: https://issues.apache.org/jira/browse/SPARK-26546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, an instance of java.time.format.DateTimeFormatter is built each 
> time a new instance of Iso8601DateFormatter or Iso8601TimestampFormatter 
> is created, which is a time-consuming operation because it has to parse the 
> timestamp/date patterns. It could be useful to create a cache with key = 
> (pattern, locale) and value = instance of java.time.format.DateTimeFormatter.
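
For illustration, a minimal sketch of such a cache (not Spark's actual
implementation), keyed by (pattern, locale):

{code}
import java.time.format.DateTimeFormatter
import java.util.Locale
import java.util.concurrent.ConcurrentHashMap

// Hypothetical memoization helper: build each DateTimeFormatter once per
// (pattern, locale) pair and reuse it afterwards.
object FormatterCache {
  private val cache = new ConcurrentHashMap[(String, Locale), DateTimeFormatter]()

  def get(pattern: String, locale: Locale): DateTimeFormatter =
    cache.computeIfAbsent((pattern, locale),
      key => DateTimeFormatter.ofPattern(key._1, key._2))
}

// Repeated lookups with the same key return the same instance.
val a = FormatterCache.get("yyyy-MM-dd'T'HH:mm:ss", Locale.US)
val b = FormatterCache.get("yyyy-MM-dd'T'HH:mm:ss", Locale.US)
assert(a eq b)
{code}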



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26546) Caching of DateTimeFormatter

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26546:


Assignee: Maxim Gekk

> Caching of DateTimeFormatter
> 
>
> Key: SPARK-26546
> URL: https://issues.apache.org/jira/browse/SPARK-26546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, an instance of java.time.format.DateTimeFormatter is built each 
> time a new instance of Iso8601DateFormatter or Iso8601TimestampFormatter 
> is created, which is a time-consuming operation because it has to parse the 
> timestamp/date patterns. It could be useful to create a cache with key = 
> (pattern, locale) and value = instance of java.time.format.DateTimeFormatter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26493) spark.sql.extensions should support multiple extensions

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26493.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23398
[https://github.com/apache/spark/pull/23398]

> spark.sql.extensions should support multiple extensions
> ---
>
> Key: SPARK-26493
> URL: https://issues.apache.org/jira/browse/SPARK-26493
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jamison Bennett
>Assignee: Jamison Bennett
>Priority: Minor
>  Labels: starter
> Fix For: 3.0.0
>
>
> The spark.sql.extensions configuration option should support multiple 
> extensions. It is currently possible to load multiple extensions using the 
> programmatic interface (e.g. 
> SparkSession.builder().master("..").withExtensions(sparkSessionExtensions1).withExtensions(sparkSessionExtensions2).getOrCreate()
>  ) but the same cannot currently be done with the command line options 
> without writing a wrapper extension that combines multiple extensions.
>  
> Allowing multiple spark.sql.extensions would allow the extensions to be 
> easily changed on the command line or via the configuration file. Multiple 
> extensions could be specified using a comma-separated list of class names. 
> Allowing multiple extensions should maintain backwards compatibility because 
> existing spark.sql.extensions configuration settings shouldn't contain a 
> comma because the value is a class name.
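
For illustration, a hedged sketch of the two approaches: the extension functions
below are placeholders, and the comma-separated configuration value reflects what
this issue proposes rather than behavior guaranteed on older releases:

{code}
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}

// Placeholder extensions; a real one would inject analyzer rules, parsers, etc.
val extensionsA: SparkSessionExtensions => Unit = _ => ()
val extensionsB: SparkSessionExtensions => Unit = _ => ()

// Programmatic interface: chaining withExtensions, as described above.
val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(extensionsA)
  .withExtensions(extensionsB)
  .getOrCreate()

// Proposed configuration form: a comma-separated list of extension class names,
// e.g. --conf spark.sql.extensions=com.example.ExtA,com.example.ExtB
// (class names here are hypothetical).
{code}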



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26493) spark.sql.extensions should support multiple extensions

2019-01-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26493:


Assignee: Jamison Bennett

> spark.sql.extensions should support multiple extensions
> ---
>
> Key: SPARK-26493
> URL: https://issues.apache.org/jira/browse/SPARK-26493
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jamison Bennett
>Assignee: Jamison Bennett
>Priority: Minor
>  Labels: starter
>
> The spark.sql.extensions configuration option should support multiple 
> extensions. It is currently possible to load multiple extensions using the 
> programmatic interface (e.g. 
> SparkSession.builder().master("..").withExtensions(sparkSessionExtensions1).withExtensions(sparkSessionExtensions2).getOrCreate()
>  ) but the same cannot currently be done with the command line options 
> without writing a wrapper extension that combines multiple extensions.
>  
> Allowing multiple spark.sql.extensions would allow the extensions to be 
> easily changed on the command line or via the configuration file. Multiple 
> extensions could be specified using a comma-separated list of class names. 
> Allowing multiple extensions should maintain backwards compatibility because 
> existing spark.sql.extensions configuration settings shouldn't contain a 
> comma because the value is a class name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26584) Remove `spark.sql.orc.copyBatchToSpark` internal configuration

2019-01-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26584:


Assignee: (was: Apache Spark)

> Remove `spark.sql.orc.copyBatchToSpark` internal configuration
> --
>
> Key: SPARK-26584
> URL: https://issues.apache.org/jira/browse/SPARK-26584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to remove an internal ORC configuration to simplify the code 
> path for Spark 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26584) Remove `spark.sql.orc.copyBatchToSpark` internal configuration

2019-01-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26584:


Assignee: Apache Spark

> Remove `spark.sql.orc.copyBatchToSpark` internal configuration
> --
>
> Key: SPARK-26584
> URL: https://issues.apache.org/jira/browse/SPARK-26584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> This issue aims to remove an internal ORC configuration to simplify the code 
> path for Spark 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26584) Remove `spark.sql.orc.copyBatchToSpark` internal configuration

2019-01-09 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-26584:
-

 Summary: Remove `spark.sql.orc.copyBatchToSpark` internal 
configuration
 Key: SPARK-26584
 URL: https://issues.apache.org/jira/browse/SPARK-26584
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue aims to remove an internal ORC configuration to simplify the code path 
for Spark 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module

2019-01-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26583:
--
Component/s: Build

> Add `paranamer` dependency to `core` module
> ---
>
> Key: SPARK-26583
> URL: https://issues.apache.org/jira/browse/SPARK-26583
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> With the Scala 2.12 profile, Spark applications fail while Spark itself is okay. 
> For example, our documented `SimpleApp` example compiles successfully but fails 
> at runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128.
> https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html
> {code}
> $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
> [INFO] my.test:simple:jar:1.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
> [INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module

2019-01-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26583:
--
Priority: Major  (was: Blocker)

> Add `paranamer` dependency to `core` module
> ---
>
> Key: SPARK-26583
> URL: https://issues.apache.org/jira/browse/SPARK-26583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> With the Scala 2.12 profile, Spark applications fail while Spark itself is okay. 
> For example, our documented `SimpleApp` example compiles successfully but fails 
> at runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128.
> https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html
> {code}
> $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
> [INFO] my.test:simple:jar:1.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
> [INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module

2019-01-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26583:
--
Target Version/s: 2.4.1, 3.0.0

> Add `paranamer` dependency to `core` module
> ---
>
> Key: SPARK-26583
> URL: https://issues.apache.org/jira/browse/SPARK-26583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> With the Scala 2.12 profile, Spark applications fail while Spark itself is okay. 
> For example, our documented `SimpleApp` example compiles successfully but fails 
> at runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128.
> https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html
> {code}
> $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
> [INFO] my.test:simple:jar:1.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
> [INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module

2019-01-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26583:
--
Description: 
With the Scala 2.12 profile, Spark applications fail while Spark itself is okay. 
For example, our documented `SimpleApp` example compiles successfully but fails at 
runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128.

https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html

{code}
$ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
[INFO] my.test:simple:jar:1.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
[INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
[INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
[INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
{code}

  was:
With Scala-2.12 profile, Spark application fails while Spark is okay. For 
example, our documented `SimpleApp` example fails because it doesn't use 
`paranamer 2.8` and hits SPARK-22128.

https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html

{code}
$ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
[INFO] my.test:simple:jar:1.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
[INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
[INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
[INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
{code}


> Add `paranamer` dependency to `core` module
> ---
>
> Key: SPARK-26583
> URL: https://issues.apache.org/jira/browse/SPARK-26583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> With the Scala 2.12 profile, Spark applications fail while Spark itself is okay. 
> For example, our documented `SimpleApp` example compiles successfully but fails 
> at runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128.
> https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html
> {code}
> $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
> [INFO] my.test:simple:jar:1.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
> [INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26583) Add `paranamer` to `core` module

2019-01-09 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738746#comment-16738746
 ] 

Dongjoon Hyun commented on SPARK-26583:
---

This is a minor issue, but it should be fixed before Spark 3.0.0 because it 
affects Spark applications.
For the Spark 2.4.0 Scala 2.12 artifacts, the situation is the same. Of course, 
users can add the `paranamer` dependency to their pom.

> Add `paranamer` to `core` module
> 
>
> Key: SPARK-26583
> URL: https://issues.apache.org/jira/browse/SPARK-26583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> With the Scala 2.12 profile, Spark applications fail while Spark itself is okay. 
> For example, our documented `SimpleApp` example fails because it doesn't use 
> `paranamer 2.8` and hits SPARK-22128.
> https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html
> {code}
> $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
> [INFO] my.test:simple:jar:1.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
> [INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module

2019-01-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26583:
--
Summary: Add `paranamer` dependency to `core` module  (was: Add `paranamer` 
to `core` module)

> Add `paranamer` dependency to `core` module
> ---
>
> Key: SPARK-26583
> URL: https://issues.apache.org/jira/browse/SPARK-26583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> With the Scala 2.12 profile, Spark applications fail while Spark itself is okay. 
> For example, our documented `SimpleApp` example fails because it doesn't use 
> `paranamer 2.8` and hits SPARK-22128.
> https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html
> {code}
> $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
> [INFO] my.test:simple:jar:1.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
> [INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26583) Add `paranamer` to `core` module

2019-01-09 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-26583:
-

 Summary: Add `paranamer` to `core` module
 Key: SPARK-26583
 URL: https://issues.apache.org/jira/browse/SPARK-26583
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 3.0.0
Reporter: Dongjoon Hyun


With the Scala 2.12 profile, Spark applications fail while Spark itself is okay. 
For example, our documented `SimpleApp` example fails because it doesn't use 
`paranamer 2.8` and hits SPARK-22128.

https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html

{code}
$ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
[INFO] my.test:simple:jar:1.0-SNAPSHOT
[INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
[INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
[INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
[INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26583) Add `paranamer` to `core` module

2019-01-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26583:
--
Priority: Blocker  (was: Major)

> Add `paranamer` to `core` module
> 
>
> Key: SPARK-26583
> URL: https://issues.apache.org/jira/browse/SPARK-26583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> With the Scala 2.12 profile, Spark applications fail while Spark itself is okay. 
> For example, our documented `SimpleApp` example fails because it doesn't use 
> `paranamer 2.8` and hits SPARK-22128.
> https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html
> {code}
> $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
> [INFO] my.test:simple:jar:1.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
> [INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
> {code}






[jira] [Updated] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table

2019-01-09 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-26576:
---
Affects Version/s: 2.2.2

> Broadcast hint not applied to partitioned Parquet table
> ---
>
> Key: SPARK-26576
> URL: https://issues.apache.org/jira/browse/SPARK-26576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: John Zhuge
>Priority: Major
>
> Broadcast hint is not applied to a partitioned Parquet table. Below, 
> "SortMergeJoin" is chosen incorrectly and "ResolvedHint(broadcast)" is removed 
> in the Optimized Plan.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) 
> PARTITIONED BY (dateint INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_with_part`
> :  +- Relation[val#28,dateint#29] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_with_part`
>   +- Relation[val#32,dateint#33] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- SubqueryAlias `jzhuge`.`parquet_with_part`
>:  +- Relation[val#28,dateint#29] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_with_part`
>  +- Relation[val#32,dateint#33] parquet
> == Optimized Logical Plan ==
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- Project [val#28, dateint#29]
>:  +- Filter isnotnull(dateint#29)
>: +- Relation[val#28,dateint#29] parquet
>+- Project [val#32, dateint#33]
>   +- Filter isnotnull(dateint#33)
>  +- Relation[val#32,dateint#33] parquet
> == Physical Plan ==
> *(5) Project [dateint#29, val#28, val#32]
> +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner
>:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, 
> 500), coordinator[target post-shuffle partition size: 67108864]
>: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] 
> Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], 
> PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: 
> [], ReadSchema: struct
>+- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: 
> 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle 
> partition size: 67108864]
> {noformat}
> Broadcast hint is applied to a Parquet table without partitions. Below, 
> "BroadcastHashJoin" is chosen as expected.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint 
> INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_no_part`
> :  +- Relation[val#44,dateint#45] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_no_part`
>   +- Relation[val#50,dateint#51] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- SubqueryAlias `jzhuge`.`parquet_no_part`
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_no_part`
>  +- Relation[val#50,dateint#51] parquet
> == Optimized Logical Plan ==
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- Filter isnotnull(dateint#45)
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- Filter isnotnull(dateint#51)
>  +- Relation[val#50,dateint#51] parquet
> == Physical Plan ==
> *(2) Project [dateint#45, val#44, val#50]
> +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight
>:- *(2) Project [val#44, dateint#45]
>:  +- *(2) Filter isnotnull(dateint#45)
>: +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] 
> Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], 
> PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: 
> struct
>+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, 
> true] as bigint)))
>   +- *(1) Project 

[jira] [Resolved] (SPARK-26448) retain the difference between 0.0 and -0.0

2019-01-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26448.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> retain the difference between 0.0 and -0.0
> --
>
> Key: SPARK-26448
> URL: https://issues.apache.org/jira/browse/SPARK-26448
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Created] (SPARK-26582) Update the optimizer rule NormalizeFloatingNumbers

2019-01-09 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26582:
---

 Summary: Update the optimizer rule NormalizeFloatingNumbers
 Key: SPARK-26582
 URL: https://issues.apache.org/jira/browse/SPARK-26582
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Dilip Biswal


See the discussion in 
https://github.com/apache/spark/pull/23388/files#r244421992






[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2019-01-09 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738622#comment-16738622
 ] 

Thomas Graves commented on SPARK-24374:
---

[~luzengxiang] Are you just saying that when Spark tries to kill the tasks 
running TensorFlow, they don't really get killed? If so, this could be a case of 
TensorFlow not exiting when the task is interrupted; take a look at 
spark.task.reaper.killTimeout.
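
For reference, a hedged sketch of how the task-reaper settings might be enabled (the timeout value here is an illustrative assumption, not a recommendation):

{code}
# Illustrative only: enable the task reaper so a killed task that never exits
# is eventually terminated after the configured timeout.
spark-submit \
  --conf spark.task.reaper.enabled=true \
  --conf spark.task.reaper.killTimeout=120s \
  ...
{code}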

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
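
For context, a minimal sketch (editorial illustration; partition count and workload are made up) of the barrier execution API this SPIP introduced: all tasks in a barrier stage are launched together and can synchronize explicitly.

{code}
import org.apache.spark.BarrierTaskContext

// Each task starts or coordinates its training worker, then waits at the
// barrier until every task in the stage has reached the same point.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)
rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  // ... launch the distributed training worker for this partition here ...
  ctx.barrier()
  iter
}.collect()
{code}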






[jira] [Resolved] (SPARK-26414) Race between SparkContext and YARN AM can cause NPE in UI setup code

2019-01-09 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26414.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 3.0.0

> Race between SparkContext and YARN AM can cause NPE in UI setup code
> 
>
> Key: SPARK-26414
> URL: https://issues.apache.org/jira/browse/SPARK-26414
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> There's a super narrow race between the SparkContext and the AM startup code:
> - SC starts the AM and waits for it to go into running state
> - AM goes into running state, unblocking SC
> - AM sends the AmIpFilter config to SC, which adds the filter to its list first 
> and the filter configs after that
> - unblocked SC is in the middle of setting up the UI and sees only the 
> filter, but not the configs
> Then you get this:
> {noformat}
> ERROR org.apache.spark.SparkContext  - Error initializing SparkContext.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.init(AmIpFilter.java:81)
>   at 
> org.spark_project.jetty.servlet.FilterHolder.initialize(FilterHolder.java:139)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:881)
>   at 
> org.spark_project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:349)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:778)
>   at 
> org.spark_project.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:262)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:520)
>   at 
> org.apache.spark.ui.WebUI$$anonfun$attachHandler$1.apply(WebUI.scala:96)
>   at 
> org.apache.spark.ui.WebUI$$anonfun$attachHandler$1.apply(WebUI.scala:96)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.ui.WebUI.attachHandler(WebUI.scala:96)
>   at 
> org.apache.spark.SparkContext$$anonfun$22$$anonfun$apply$8.apply(SparkContext.scala:522)
>   at 
> org.apache.spark.SparkContext$$anonfun$22$$anonfun$apply$8.apply(SparkContext.scala:522)
>   at scala.Option.foreach(Option.scala:257)
> {noformat}
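
The race described above boils down to the filter name and its configs being published in two separate steps. A generic sketch of the atomic-publication pattern (an editorial illustration only, not Spark's actual fix, which landed with SPARK-24522):

{code}
import java.util.concurrent.atomic.AtomicReference

// Publish a filter together with its parameters as one immutable value, so a
// concurrent reader can never observe the filter without its configuration.
final case class FilterSpec(className: String, params: Map[String, String])

class FilterRegistry {
  private val filters = new AtomicReference(Vector.empty[FilterSpec])

  def add(spec: FilterSpec): Unit = {
    filters.updateAndGet(current => current :+ spec)
  }

  def snapshot(): Vector[FilterSpec] = filters.get()
}
{code}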






[jira] [Commented] (SPARK-26414) Race between SparkContext and YARN AM can cause NPE in UI setup code

2019-01-09 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738580#comment-16738580
 ] 

Marcelo Vanzin commented on SPARK-26414:


I actually ended up fixing this in the change for SPARK-24522, since it 
required adding some thread-safety to the code in question here.

> Race between SparkContext and YARN AM can cause NPE in UI setup code
> 
>
> Key: SPARK-26414
> URL: https://issues.apache.org/jira/browse/SPARK-26414
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> There's a super narrow race between the SparkContext and the AM startup code:
> - SC starts the AM and waits for it to go into running state
> - AM goes into running state, unblocking SC
> - AM sends the AmIpFilter config to SC, which adds the filter to its list first 
> and the filter configs after that
> - unblocked SC is in the middle of setting up the UI and sees only the 
> filter, but not the configs
> Then you get this:
> {noformat}
> ERROR org.apache.spark.SparkContext  - Error initializing SparkContext.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.init(AmIpFilter.java:81)
>   at 
> org.spark_project.jetty.servlet.FilterHolder.initialize(FilterHolder.java:139)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:881)
>   at 
> org.spark_project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:349)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:778)
>   at 
> org.spark_project.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:262)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:520)
>   at 
> org.apache.spark.ui.WebUI$$anonfun$attachHandler$1.apply(WebUI.scala:96)
>   at 
> org.apache.spark.ui.WebUI$$anonfun$attachHandler$1.apply(WebUI.scala:96)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.ui.WebUI.attachHandler(WebUI.scala:96)
>   at 
> org.apache.spark.SparkContext$$anonfun$22$$anonfun$apply$8.apply(SparkContext.scala:522)
>   at 
> org.apache.spark.SparkContext$$anonfun$22$$anonfun$apply$8.apply(SparkContext.scala:522)
>   at scala.Option.foreach(Option.scala:257)
> {noformat}






[jira] [Resolved] (SPARK-25484) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark

2019-01-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25484.
---
   Resolution: Fixed
 Assignee: Peter Toth
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/22617

> Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark
> --
>
> Key: SPARK-25484
> URL: https://issues.apache.org/jira/browse/SPARK-25484
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.0.0
>
>
> Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to print the benchmark 
> output to a separate file.






[jira] [Comment Edited] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table

2019-01-09 Thread John Zhuge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737964#comment-16737964
 ] 

John Zhuge edited comment on SPARK-26576 at 1/9/19 5:12 PM:


No issue on the master branch. Please note "rightHint=(broadcast)" for the Join 
in Optimized Plan.
{noformat}
scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
df.join(broadcast(df), "dateint").explain(true))

== Parsed Logical Plan ==
'Join UsingJoin(Inner,List(dateint))
:- SubqueryAlias `jzhuge`.`parquet_with_part`
:  +- Relation[val#34,dateint#35] parquet
+- ResolvedHint (broadcast)
   +- SubqueryAlias `jzhuge`.`parquet_with_part`
  +- Relation[val#40,dateint#41] parquet

== Analyzed Logical Plan ==
dateint: int, val: string, val: string
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41)
   :- SubqueryAlias `jzhuge`.`parquet_with_part`
   :  +- Relation[val#34,dateint#35] parquet
   +- ResolvedHint (broadcast)
  +- SubqueryAlias `jzhuge`.`parquet_with_part`
 +- Relation[val#40,dateint#41] parquet

== Optimized Logical Plan ==
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast)
   :- Project [val#34, dateint#35]
   :  +- Filter isnotnull(dateint#35)
   : +- Relation[val#34,dateint#35] parquet
   +- Project [val#40, dateint#41]
  +- Filter isnotnull(dateint#41)
 +- Relation[val#40,dateint#41] parquet

== Physical Plan ==
*(2) Project [dateint#35, val#34, val#40]
+- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight
   :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#35)], PushedFilters: [], ReadSchema: struct
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, 
true] as bigint)))
  +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct
{noformat}
From a quick look at the source, EliminateResolvedHint pulls the broadcast hint 
into the Join and eliminates the ResolvedHint node. It runs before 
PruneFileSourcePartitions, so the above code in 
PhysicalOperation.collectProjectsAndFilters is never called on the master branch 
for the few cases I tried.
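
A minimal sketch (editorial illustration, not from the ticket) of how one might check whether the hint survives planning on a given branch, using only public APIs. It assumes the jzhuge.parquet_with_part table from the description exists and autoBroadcastJoinThreshold is disabled, as in the reproduction above.

{code}
import org.apache.spark.sql.functions.broadcast

val df = spark.table("jzhuge.parquet_with_part")
val joined = df.join(broadcast(df), "dateint")
// The physical plan should contain a BroadcastHashJoin if the hint was honored.
val physicalPlan = joined.queryExecution.executedPlan.toString
assert(physicalPlan.contains("BroadcastHashJoin"),
  s"broadcast hint was dropped for the partitioned table:\n$physicalPlan")
{code}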


was (Author: jzhuge):
No issue on the master branch. Please note "rightHint=(broadcast)" for the Join 
in Optimized Plan.
{noformat}
scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
df.join(broadcast(df), "dateint").explain(true))

== Parsed Logical Plan ==
'Join UsingJoin(Inner,List(dateint))
:- SubqueryAlias `jzhuge`.`parquet_with_part`
:  +- Relation[val#34,dateint#35] parquet
+- ResolvedHint (broadcast)
   +- SubqueryAlias `jzhuge`.`parquet_with_part`
  +- Relation[val#40,dateint#41] parquet

== Analyzed Logical Plan ==
dateint: int, val: string, val: string
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41)
   :- SubqueryAlias `jzhuge`.`parquet_with_part`
   :  +- Relation[val#34,dateint#35] parquet
   +- ResolvedHint (broadcast)
  +- SubqueryAlias `jzhuge`.`parquet_with_part`
 +- Relation[val#40,dateint#41] parquet

== Optimized Logical Plan ==
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast)
   :- Project [val#34, dateint#35]
   :  +- Filter isnotnull(dateint#35)
   : +- Relation[val#34,dateint#35] parquet
   +- Project [val#40, dateint#41]
  +- Filter isnotnull(dateint#41)
 +- Relation[val#40,dateint#41] parquet

== Physical Plan ==
*(2) Project [dateint#35, val#34, val#40]
+- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight
   :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#35)], PushedFilters: [], ReadSchema: struct
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, 
true] as bigint)))
  +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct
{noformat}
From a quick look at the source, EliminateResolvedHint pulls the broadcast hint 
into the Join and eliminates the ResolvedHint node. It runs before 
PruneFileSourcePartitions, so the above code in 
PhysicalOperation.collectProjectsAndFilters is never called on the master branch.

> Broadcast hint not applied to partitioned Parquet table
> 

[jira] [Assigned] (SPARK-26254) Move delegation token providers into a separate project

2019-01-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26254:


Assignee: Apache Spark

> Move delegation token providers into a separate project
> ---
>
> Key: SPARK-26254
> URL: https://issues.apache.org/jira/browse/SPARK-26254
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>
> There was a discussion in 
> [PR#22598|https://github.com/apache/spark/pull/22598] that there are several 
> provided dependencies inside the core project which shouldn't be there (for 
> example, Hive and Kafka). This JIRA is to solve that problem.






[jira] [Assigned] (SPARK-26254) Move delegation token providers into a separate project

2019-01-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26254:


Assignee: (was: Apache Spark)

> Move delegation token providers into a separate project
> ---
>
> Key: SPARK-26254
> URL: https://issues.apache.org/jira/browse/SPARK-26254
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There was a discussion in 
> [PR#22598|https://github.com/apache/spark/pull/22598] that there are several 
> provided dependencies inside the core project which shouldn't be there (for 
> example, Hive and Kafka). This JIRA is to solve that problem.






[jira] [Created] (SPARK-26581) Spark Dataset write JSON with Multiline

2019-01-09 Thread Anil (JIRA)
Anil created SPARK-26581:


 Summary: Spark Dataset write JSON with Multiline
 Key: SPARK-26581
 URL: https://issues.apache.org/jira/browse/SPARK-26581
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Anil


Hi,

Spark currently writes each JSON record as a single object per line. If I have 
multiple records, Spark writes the objects with curly braces "{ }" but without a 
comma "," between them, and there are no square brackets at the start and end of 
the file. How can I write the whole dataset as one JSON array? I am trying to 
write the JSON file like:

ds.write().format("JSON").option("multiline","true").save(path);

Please help with this.
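
For what it's worth, `multiLine` is a JSON read option; Spark's JSON writer emits JSON Lines (one object per line) by design. A hedged workaround sketch, only suitable for datasets small enough to collect to the driver (the `ds` dataset is the one from the snippet above; the output path is illustrative):

{code}
// Collect the rows as JSON strings and assemble a single JSON array manually.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

val jsonArray = ds.toJSON.collect().mkString("[", ",", "]")
Files.write(Paths.get("/tmp/output.json"), jsonArray.getBytes(StandardCharsets.UTF_8))
{code}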






[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark

2019-01-09 Thread Fabian Höring (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738190#comment-16738190
 ] 

Fabian Höring commented on SPARK-25433:
---

[~dongjoon] [~hyukjin.kwon]
If you think the blog post is interesting to other Spark users, maybe it can be 
shared somehow on the mailing list or with the watchers of this ticket and 
SPARK-13587. I didn't want to spam; that's why I have only referenced it here.

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is in my opinion the most elegant way to ship python code (better than 
> virtual env and conda)
> The problem why it doesn't work out of the box is that there can be only one 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  
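
The conda workflow described in the bullets above, as a hedged shell sketch (the archive name, paths, master, and the conda-pack tool choice are illustrative assumptions, not part of the ticket):

{code}
# Pack the local conda environment, ship it to each executor as an archive,
# and point PYSPARK_PYTHON at the unpacked interpreter.
conda pack -o environment.tar.gz

spark-submit \
  --master yarn --deploy-mode cluster \
  --archives environment.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  my_app.py
{code}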






[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark

2019-01-09 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738125#comment-16738125
 ] 

Dongjoon Hyun commented on SPARK-25433:
---

Thank you for the pointer, [~hyukjin.kwon] and [~fhoering].

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is, in my opinion, the most elegant way to ship Python code (better than 
> virtual env and conda).
> The reason it doesn't work out of the box is that there can be only a single 
> entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  






[jira] [Assigned] (SPARK-26580) remove Scala 2.11 hack for Scala UDF

2019-01-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26580:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove Scala 2.11 hack for Scala UDF
> 
>
> Key: SPARK-26580
> URL: https://issues.apache.org/jira/browse/SPARK-26580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26580) remove Scala 2.11 hack for Scala UDF

2019-01-09 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26580:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove Scala 2.11 hack for Scala UDF
> 
>
> Key: SPARK-26580
> URL: https://issues.apache.org/jira/browse/SPARK-26580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Updated] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2019-01-09 Thread Oscar Bonilla (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oscar Bonilla updated SPARK-26365:
--
Affects Version/s: 2.4.0

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Submit
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Oscar Bonilla
>Priority: Minor
>
> When launching apps using spark-submit in a kubernetes cluster, if the Spark 
> applications fails (returns exit code = 1 for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.






[jira] [Commented] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table

2019-01-09 Thread John Zhuge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737964#comment-16737964
 ] 

John Zhuge commented on SPARK-26576:


No issue on the master branch. Please note "rightHint=(broadcast)" for the Join 
in Optimized Plan.
{noformat}
scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
df.join(broadcast(df), "dateint").explain(true))

== Parsed Logical Plan ==
'Join UsingJoin(Inner,List(dateint))
:- SubqueryAlias `jzhuge`.`parquet_with_part`
:  +- Relation[val#34,dateint#35] parquet
+- ResolvedHint (broadcast)
   +- SubqueryAlias `jzhuge`.`parquet_with_part`
  +- Relation[val#40,dateint#41] parquet

== Analyzed Logical Plan ==
dateint: int, val: string, val: string
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41)
   :- SubqueryAlias `jzhuge`.`parquet_with_part`
   :  +- Relation[val#34,dateint#35] parquet
   +- ResolvedHint (broadcast)
  +- SubqueryAlias `jzhuge`.`parquet_with_part`
 +- Relation[val#40,dateint#41] parquet

== Optimized Logical Plan ==
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast)
   :- Project [val#34, dateint#35]
   :  +- Filter isnotnull(dateint#35)
   : +- Relation[val#34,dateint#35] parquet
   +- Project [val#40, dateint#41]
  +- Filter isnotnull(dateint#41)
 +- Relation[val#40,dateint#41] parquet

== Physical Plan ==
*(2) Project [dateint#35, val#34, val#40]
+- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight
   :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#35)], PushedFilters: [], ReadSchema: struct
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, 
true] as bigint)))
  +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct
{noformat}
From a quick look at the source, EliminateResolvedHint pulls the broadcast hint 
into the Join and eliminates the ResolvedHint node. It runs before 
PruneFileSourcePartitions, so the above code in 
PhysicalOperation.collectProjectsAndFilters is never called on the master branch.

> Broadcast hint not applied to partitioned Parquet table
> ---
>
> Key: SPARK-26576
> URL: https://issues.apache.org/jira/browse/SPARK-26576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: John Zhuge
>Priority: Major
>
> Broadcast hint is not applied to a partitioned Parquet table. Below, 
> "SortMergeJoin" is chosen incorrectly and "ResolvedHint(broadcast)" is removed 
> in the Optimized Plan.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) 
> PARTITIONED BY (dateint INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_with_part`
> :  +- Relation[val#28,dateint#29] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_with_part`
>   +- Relation[val#32,dateint#33] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- SubqueryAlias `jzhuge`.`parquet_with_part`
>:  +- Relation[val#28,dateint#29] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_with_part`
>  +- Relation[val#32,dateint#33] parquet
> == Optimized Logical Plan ==
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- Project [val#28, dateint#29]
>:  +- Filter isnotnull(dateint#29)
>: +- Relation[val#28,dateint#29] parquet
>+- Project [val#32, dateint#33]
>   +- Filter isnotnull(dateint#33)
>  +- Relation[val#32,dateint#33] parquet
> == Physical Plan ==
> *(5) Project [dateint#29, val#28, val#32]
> +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner
>:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, 
> 500), coordinator[target post-shuffle partition size: 67108864]
>: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] 
> Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], 
> PartitionCount: 0, PartitionFilters: