[jira] [Updated] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation

2018-09-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25450:

Description: 
The problem was caused by the PushProjectThroughUnion rule, which, when creating 
a new Project for each child of the Union, uses the same exprId for expressions 
at the same position. This is wrong because the expressions in each Union child 
are independent, and it can lead to a wrong result if another rule such as 
FoldablePropagation kicks in and treats two different expressions as the same.
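For illustration only (a sketch, not taken from this ticket): after PushProjectThroughUnion pushes a projection below a union, the new Project in each child must use fresh exprIds; otherwise a rule such as FoldablePropagation could treat the foldable expression of one child and the non-foldable expression of another as the same attribute. A hypothetical query shape where this matters:

{code:scala}
// Hypothetical sketch (table and column names are made up). The first union
// branch projects a foldable literal, the second a real column; the outer
// projection is what PushProjectThroughUnion pushes into each branch.
import org.apache.spark.sql.SparkSession

object Spark25450Sketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
  spark.range(3).createOrReplaceTempView("t1")
  spark.range(3).createOrReplaceTempView("t2")

  val df = spark.sql(
    """SELECT a + 1 AS x FROM (
      |  SELECT 1  AS a FROM t1
      |  UNION ALL
      |  SELECT id AS a FROM t2
      |) u""".stripMargin)

  df.explain(true) // the pushed-down projections in the two children should not share exprIds
  df.show()
}
{code}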


> PushProjectThroughUnion rule uses the same exprId for project expressions in 
> each Union child, causing mistakes in constant propagation
> ---
>
> Key: SPARK-25450
> URL: https://issues.apache.org/jira/browse/SPARK-25450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Major
>
> The problem was caused by the PushProjectThroughUnion rule, which, when 
> creating a new Project for each child of the Union, uses the same exprId for 
> expressions at the same position. This is wrong because the expressions in each 
> Union child are independent, and it can lead to a wrong result if another rule 
> such as FoldablePropagation kicks in and treats two different expressions as 
> the same.






[jira] [Commented] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation

2018-09-17 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618505#comment-16618505
 ] 

Xiao Li commented on SPARK-25450:
-

https://github.com/apache/spark/pull/22447

> PushProjectThroughUnion rule uses the same exprId for project expressions in 
> each Union child, causing mistakes in constant propagation
> ---
>
> Key: SPARK-25450
> URL: https://issues.apache.org/jira/browse/SPARK-25450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Major
>







[jira] [Comment Edited] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-17 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618495#comment-16618495
 ] 

Xiangrui Meng edited comment on SPARK-25321 at 9/18/18 5:21 AM:


[~WeichenXu123] Could you check whether mleap is compatible with the tree Node 
breaking changes? This line is relevant: 
https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/bundle/ops/classification/DecisionTreeClassifierOp.scala

If it is hard for MLeap to upgrade, we should revert the change in 2.4.

cc: [~hollinwilkins]


was (Author: mengxr):
[~WeichenXu123] Could you check whether mleap is compatible with the tree Node 
breaking changes? This line is relevant: 
https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/bundle/ops/classification/DecisionTreeClassifierOp.scala

If it is hard to make MLeap upgrade, we should revert the change.

cc: [~hollinwilkins]

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Commented] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-17 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618495#comment-16618495
 ] 

Xiangrui Meng commented on SPARK-25321:
---

[~WeichenXu123] Could you check whether mleap is compatible with the tree Node 
breaking changes? This line is relevant: 
https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/bundle/ops/classification/DecisionTreeClassifierOp.scala

If it is hard for MLeap to upgrade, we should revert the change.

cc: [~hollinwilkins]

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Commented] (SPARK-25230) Upper behavior incorrect for string contains "ß"

2018-09-17 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618473#comment-16618473
 ] 

Yuming Wang commented on SPARK-25230:
-

May be a JDK bug: [https://bugs.openjdk.java.net/browse/JDK-8186073]
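For reference, a quick JVM-level check (illustrative, not from the ticket): the JDK's locale-independent full case mapping turns the single character 'ß' into "SS", which matches what Spark's upper() returns, while the single-character mapping leaves it unchanged.

{code:scala}
// Plain JVM behaviour, independent of Spark:
object UpperSharpS extends App {
  println("Haßler".toUpperCase(java.util.Locale.ROOT)) // HASSLER (full case mapping: ß -> SS)
  println(Character.toUpperCase('ß'))                  // ß (no single-character uppercase form)
}
{code}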

 

> Upper behavior incorrect for string contains "ß"
> 
>
> Key: SPARK-25230
> URL: https://issues.apache.org/jira/browse/SPARK-25230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yuming Wang
>Priority: Major
> Attachments: MySQL.png, Oracle.png, Teradata.jpeg
>
>
> How to reproduce:
> {code:sql}
> spark-sql> SELECT upper('Haßler');
> HASSLER
> {code}
> Mainstream databases return {{HAßLER}}.
>  !MySQL.png!
>  
> This behavior may lead to data inconsistency:
> {code:sql}
> create temporary view SPARK_25230 as select * from values
>   ("Hassler"),
>   ("Haßler")
> as EMPLOYEE(name);
> select UPPER(name) from SPARK_25230 group by 1;
> -- result
> HASSLER{code}






[jira] [Resolved] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method

2018-09-17 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-25444.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.5.0

Issue resolved by pull request 22439
https://github.com/apache/spark/pull/22439

> Refactor GenArrayData.genCodeToCreateArrayData() method
> ---
>
> Key: SPARK-25444
> URL: https://issues.apache.org/jira/browse/SPARK-25444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.5.0
>
>
> {{GenArrayData.genCodeToCreateArrayData()}} generates Java code that builds a 
> temporary Java array in order to create an {{ArrayData}}. The temporary array 
> can be eliminated by using {{ArrayData.createArrayData}}.






[jira] [Updated] (SPARK-24360) Support Hive 3.1 metastore

2018-09-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24360:
--
Description: Hive 3.1.0 is released. This issue aims to support Hive 
Metastore 3.1.  (was: Hive 3.0.0 is released. This issue aims to support Hive 
Metastore 3.0.)

> Support Hive 3.1 metastore
> --
>
> Key: SPARK-24360
> URL: https://issues.apache.org/jira/browse/SPARK-24360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Hive 3.1.0 is released. This issue aims to support Hive Metastore 3.1.






[jira] [Updated] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation

2018-09-17 Thread Maryann Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated SPARK-25450:

Issue Type: Bug  (was: Improvement)

> PushProjectThroughUnion rule uses the same exprId for project expressions in 
> each Union child, causing mistakes in constant propagation
> ---
>
> Key: SPARK-25450
> URL: https://issues.apache.org/jira/browse/SPARK-25450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Major
>







[jira] [Created] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation

2018-09-17 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-25450:
---

 Summary: PushProjectThroughUnion rule uses the same exprId for 
project expressions in each Union child, causing mistakes in constant 
propagation
 Key: SPARK-25450
 URL: https://issues.apache.org/jira/browse/SPARK-25450
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maryann Xue









[jira] [Resolved] (SPARK-25443) fix issues when building docs with release scripts in docker

2018-09-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25443.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22438
[https://github.com/apache/spark/pull/22438]

> fix issues when building docs with release scripts in docker
> 
>
> Key: SPARK-25443
> URL: https://issues.apache.org/jira/browse/SPARK-25443
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Comment Edited] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-17 Thread Rong Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618213#comment-16618213
 ] 

Rong Tang edited comment on SPARK-25409 at 9/17/18 10:12 PM:
-

 Pull request created for it. [https://github.com/apache/spark/pull/22444]

 


was (Author: trjianjianjiao):
Create a pull request for it. [https://github.com/apache/spark/pull/22444]

 

> Speed up Spark History at start if there are tens of thousands of 
> applications.
> ---
>
> Key: SPARK-25409
> URL: https://issues.apache.org/jira/browse/SPARK-25409
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rong Tang
>Priority: Major
> Attachments: SPARK-25409.0001.patch
>
>
> We have a Spark history server storing 7 days of applications; it usually has 
> 10K to 20K attempts.
> We found that it can take hours at startup to load/replay the logs in the 
> event-logs folder, so newly finished applications may not show up for several 
> hours. So I made 2 improvements for it.
>  # As we run Spark on YARN, information about on-going applications can also be 
> seen via the resource manager, so I introduced a flag 
> spark.history.fs.load.incomplete that controls whether logs for incomplete 
> attempts are loaded.
>  # Incremental loading of applications. As noted, we have more than 10K 
> applications stored, and loading all of them the first time can take hours, so 
> I introduced a config spark.history.fs.appRefreshNum that says how many 
> applications to load in each pass, so the server gets a chance to check for the 
> latest updates between passes.
> Here are the benchmarks I did. Our system has about 1K incomplete applications 
> (they were not cleaned up for some reason, which is another issue I need to 
> investigate), and an application's log size can be gigabytes.
>  
> Not loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|All|No|13K|31 minutes|Yes|
>  
> Limiting how many applications are loaded per pass:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|3000|Yes|13K|42 minutes, except the last 1.6K (the last 1.6K attempts took an extremely long 2.5 hours)|No|
>  
> Limiting how many applications are loaded per pass, and not loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
> |2|3000|No|12K|17 minutes|10 minutes (41 minutes in total)|No|
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
> |2|3000|No|18.5K|20 minutes|18 minutes (2 hours 18 minutes in total)|No|
>  
>  
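For illustration only, the two settings proposed above could be supplied like any other History Server configuration; the key names come from the reporter's proposal and are not part of a released Spark version:

{noformat}
# spark-defaults.conf (sketch; these are the reporter's proposed keys, not released configs)
# skip loading logs of incomplete attempts
spark.history.fs.load.incomplete   false
# load at most 3000 applications per pass
spark.history.fs.appRefreshNum     3000
{noformat}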






[jira] [Commented] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-17 Thread Rong Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618213#comment-16618213
 ] 

Rong Tang commented on SPARK-25409:
---

Create a pull request for it. [https://github.com/apache/spark/pull/22444]

 

> Speed up Spark History at start if there are tens of thousands of 
> applications.
> ---
>
> Key: SPARK-25409
> URL: https://issues.apache.org/jira/browse/SPARK-25409
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rong Tang
>Priority: Major
> Attachments: SPARK-25409.0001.patch
>
>
> We have a Spark history server storing 7 days of applications; it usually has 
> 10K to 20K attempts.
> We found that it can take hours at startup to load/replay the logs in the 
> event-logs folder, so newly finished applications may not show up for several 
> hours. So I made 2 improvements for it.
>  # As we run Spark on YARN, information about on-going applications can also be 
> seen via the resource manager, so I introduced a flag 
> spark.history.fs.load.incomplete that controls whether logs for incomplete 
> attempts are loaded.
>  # Incremental loading of applications. As noted, we have more than 10K 
> applications stored, and loading all of them the first time can take hours, so 
> I introduced a config spark.history.fs.appRefreshNum that says how many 
> applications to load in each pass, so the server gets a chance to check for the 
> latest updates between passes.
> Here are the benchmarks I did. Our system has about 1K incomplete applications 
> (they were not cleaned up for some reason, which is another issue I need to 
> investigate), and an application's log size can be gigabytes.
>  
> Not loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|All|No|13K|31 minutes|Yes|
>  
> Limiting how many applications are loaded per pass:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|3000|Yes|13K|42 minutes, except the last 1.6K (the last 1.6K attempts took an extremely long 2.5 hours)|No|
>  
> Limiting how many applications are loaded per pass, and not loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
> |2|3000|No|12K|17 minutes|10 minutes (41 minutes in total)|No|
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
> |2|3000|No|18.5K|20 minutes|18 minutes (2 hours 18 minutes in total)|No|
>  
>  






[jira] [Created] (SPARK-25449) Don't send zero accumulators in heartbeats

2018-09-17 Thread Mukul Murthy (JIRA)
Mukul Murthy created SPARK-25449:


 Summary: Don't send zero accumulators in heartbeats
 Key: SPARK-25449
 URL: https://issues.apache.org/jira/browse/SPARK-25449
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Mukul Murthy


Heartbeats sent from executors to the driver every 10 seconds contain metrics 
and are generally on the order of a few KBs. However, for large jobs with lots 
of tasks, heartbeats can be on the order of tens of MBs, causing tasks to die 
with heartbeat failures. We can mitigate this by not sending zero metrics to 
the driver.
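A minimal sketch of the mitigation (illustrative; the helper below is hypothetical, not the actual patch): drop accumulators that still hold their zero value before their updates are put into a heartbeat.

{code:scala}
import org.apache.spark.util.AccumulatorV2

// Hypothetical helper: keep only accumulators that have actually recorded something,
// using AccumulatorV2.isZero to detect untouched (zero-valued) accumulators.
def nonZeroAccumulators(accums: Seq[AccumulatorV2[_, _]]): Seq[AccumulatorV2[_, _]] =
  accums.filterNot(_.isZero)
{code}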






[jira] [Commented] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-17 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618118#comment-16618118
 ] 

Joseph K. Bradley commented on SPARK-25321:
---

[~WeichenXu123] Have you been able to look into reverting those changes, or to 
discuss reverting them with [~mengxr]? Thanks!

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2018-09-17 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618104#comment-16618104
 ] 

Bruce Robbins commented on SPARK-22036:
---

[~mgaido] In this change, you modified how precision and scale are determined 
when literals are promoted to decimal. For example, before the change, an 
integer literal's precision and scale would be hardcoded to DecimalType(10, 0). 
After the change, it's based on the number of digits in the literal.

However, that new behavior for literals is not toggled by 
{{spark.sql.decimalOperations.allowPrecisionLoss}} like the other changes in 
behavior introduced by the PR.

As a result, there are cases where we see truncation and rounding in 2.3/2.4 
that we don't see in 2.2, and this change in behavior is not controllable via 
the configuration setting. For example:

In 2.2:
{noformat}
scala> sql("select 26393499451/(1e6 * 1000) as c1").printSchema
root
 |-- c1: decimal(27,13) (nullable = true) <== 13 decimal digits
scala> sql("select 26393499451/(1e6 * 1000) as c1").show
+------------+
|          c1|
+------------+
|26.393499451|
+------------+
{noformat}
In 2.3 and up:
{noformat}
scala> sql("set spark.sql.decimalOperations.allowPrecisionLoss").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.decimal...| true|
+--------------------+-----+
scala> sql("select 26393499451/(1e6 * 1000) as c1").printSchema
root
 |-- c1: decimal(12,7) (nullable = true)
scala> sql("select 26393499451/(1e6 * 1000) as c1").show
+----------+
|        c1|
+----------+
|26.3934995| <== result is truncated and rounded up.
+----------+
scala> sql("set spark.sql.decimalOperations.allowPrecisionLoss=false").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.decimal...|false|
+--------------------+-----+
scala> sql("select 26393499451/(1e6 * 1000) as c1").printSchema
root
 |-- c1: decimal(12,7) (nullable = true)
scala> sql("select 26393499451/(1e6 * 1000) as c1").show
+----------+
|        c1|
+----------+
|26.3934995| <== result is still truncated and rounded up.
+----------+
scala> 
{noformat}
I can force it to behave the old way, at least for this case, by explicitly 
casting the literal:
{noformat}
scala> sql("select 26393499451/(1e6 * cast(1000 as decimal(10, 0))) as c1").show
+------------+
|          c1|
+------------+
|26.393499451|
+------------+
{noformat}
Do you think it makes sense for 
{{spark.sql.decimalOperations.allowPrecisionLoss}} to also toggle how literal 
promotion happens (the old way vs. the new way)?

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> The multiplication of two BigDecimal numbers sometimes returns null. Here is 
> a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1))))
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}






[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks

2018-09-17 Thread Sandish Kumar HN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618098#comment-16618098
 ] 

Sandish Kumar HN commented on SPARK-20760:
--

I do see the issue in Spark 2.1.1 & 2.2.0, and I was able to replicate it with 
the above code snippet.

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
>Priority: Major
> Attachments: RDD Blocks .png, RDD blocks in spark 2.1.1.png, Storage 
> in spark 2.1.1.png
>
>
> Memory leak of RDD blocks in a long-running RDD process.
> We have a long-running application that does computations on RDDs, and we found 
> that the RDD blocks keep increasing on the Spark UI page. The RDD blocks and 
> memory usage do not match the cached RDDs and memory. It looks like Spark keeps 
> old RDDs in memory and never releases them, or never gets a chance to release 
> them. The job will eventually die of out of memory.
> In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same 
> issue in YARN cluster mode in both Kafka streaming and batch applications. The 
> issue in streaming is similar, although the RDD blocks seem to grow a bit more 
> slowly than in batch jobs.
> Below is the sample code; the issue is reproducible by just running it in 
> local mode.
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf().setAppName("test")
>     val sc = new SparkContext(conf)
>     run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}






[jira] [Commented] (SPARK-20074) Make buffer size in unsafe external sorter configurable

2018-09-17 Thread Kevin English (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618061#comment-16618061
 ] 

Kevin English commented on SPARK-20074:
---

I found this issue from external content indicating that this limits IO write 
block sizes, for instance for Parquet files following an N:1 repartitioning. Can 
someone confirm that being able to radically increase this value would reduce 
spilling when aggregating a large number of small files?

> Make buffer size in unsafe external sorter configurable
> ---
>
> Key: SPARK-20074
> URL: https://issues.apache.org/jira/browse/SPARK-20074
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Sital Kedia
>Priority: Major
>
> Currently, it is hardcoded to 32kb, see - 
> https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L123






[jira] [Resolved] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2018-09-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-16323.
---
Resolution: Fixed

Issue resolved by pull request 22395
[https://github.com/apache/spark/pull/22395]

> Avoid unnecessary cast when doing integral divide
> -
>
> Key: SPARK-16323
> URL: https://issues.apache.org/jira/browse/SPARK-16323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Sean Zhong
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.5.0
>
>
> This is a follow up of issue SPARK-15776
> *Problem:*
> For Integer divide operator div:
> {code}
> scala> spark.sql("select 6 div 3").explain(true)
> ...
> == Analyzed Logical Plan ==
> CAST((6 / 3) AS BIGINT): bigint
> Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
> 3) AS BIGINT)#5L]
> +- OneRowRelation$
> ...
> {code}
> For performance reasons, we should not do the unnecessary cast {{cast(xx as 
> double)}}.






[jira] [Assigned] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2018-09-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-16323:
-

Assignee: Marco Gaido

> Avoid unnecessary cast when doing integral divide
> -
>
> Key: SPARK-16323
> URL: https://issues.apache.org/jira/browse/SPARK-16323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Sean Zhong
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.5.0
>
>
> This is a follow up of issue SPARK-15776
> *Problem:*
> For Integer divide operator div:
> {code}
> scala> spark.sql("select 6 div 3").explain(true)
> ...
> == Analyzed Logical Plan ==
> CAST((6 / 3) AS BIGINT): bigint
> Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
> 3) AS BIGINT)#5L]
> +- OneRowRelation$
> ...
> {code}
> For performance reasons, we should not do the unnecessary cast {{cast(xx as 
> double)}}.






[jira] [Updated] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2018-09-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16323:
--
Affects Version/s: 2.5.0

> Avoid unnecessary cast when doing integral divide
> -
>
> Key: SPARK-16323
> URL: https://issues.apache.org/jira/browse/SPARK-16323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Sean Zhong
>Priority: Minor
> Fix For: 2.5.0
>
>
> This is a follow up of issue SPARK-15776
> *Problem:*
> For Integer divide operator div:
> {code}
> scala> spark.sql("select 6 div 3").explain(true)
> ...
> == Analyzed Logical Plan ==
> CAST((6 / 3) AS BIGINT): bigint
> Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
> 3) AS BIGINT)#5L]
> +- OneRowRelation$
> ...
> {code}
> For performance reasons, we should not do the unnecessary cast {{cast(xx as 
> double)}}.






[jira] [Updated] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2018-09-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16323:
--
Fix Version/s: 2.5.0

> Avoid unnecessary cast when doing integral divide
> -
>
> Key: SPARK-16323
> URL: https://issues.apache.org/jira/browse/SPARK-16323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Sean Zhong
>Priority: Minor
> Fix For: 2.5.0
>
>
> This is a follow up of issue SPARK-15776
> *Problem:*
> For Integer divide operator div:
> {code}
> scala> spark.sql("select 6 div 3").explain(true)
> ...
> == Analyzed Logical Plan ==
> CAST((6 / 3) AS BIGINT): bigint
> Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
> 3) AS BIGINT)#5L]
> +- OneRowRelation$
> ...
> {code}
> For performance reasons, we should not do the unnecessary cast {{cast(xx as 
> double)}}.






[jira] [Resolved] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-09-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25423.
---
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22435
[https://github.com/apache/spark/pull/22435]

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Maryann Xue
>Assignee: Yuming Wang
>Priority: Trivial
>  Labels: starter
> Fix For: 2.5.0
>
>







[jira] [Assigned] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25339:


Assignee: (was: Apache Spark)

> Refactor FilterPushdownBenchmark to use main method
> ---
>
> Key: SPARK-25339
> URL: https://issues.apache.org/jira/browse/SPARK-25339
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Wenchen commented on the PR: 
> https://github.com/apache/spark/pull/22336#issuecomment-418604019






[jira] [Commented] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617853#comment-16617853
 ] 

Apache Spark commented on SPARK-25339:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22443

> Refactor FilterPushdownBenchmark to use main method
> ---
>
> Key: SPARK-25339
> URL: https://issues.apache.org/jira/browse/SPARK-25339
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Wenchen commented on the PR: 
> https://github.com/apache/spark/pull/22336#issuecomment-418604019






[jira] [Assigned] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25339:


Assignee: Apache Spark

> Refactor FilterPushdownBenchmark to use main method
> ---
>
> Key: SPARK-25339
> URL: https://issues.apache.org/jira/browse/SPARK-25339
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Wenchen commented on the PR: 
> https://github.com/apache/spark/pull/22336#issuecomment-418604019






[jira] [Commented] (SPARK-23906) Add UDF trunc(numeric)

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617828#comment-16617828
 ] 

Apache Spark commented on SPARK-23906:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22419

> Add UDF trunc(numeric)
> --
>
> Key: SPARK-23906
> URL: https://issues.apache.org/jira/browse/SPARK-23906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-14582
> We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we 
> should introduce a new name or reuse {{trunc}} for truncating numbers.
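For context (an illustration of the Oracle/Hive-style semantics referenced by HIVE-14582, not an existing Spark function at the time of this ticket):

{noformat}
trunc(1234.567)      = 1234      -- scale defaults to 0
trunc(1234.567,  2)  = 1234.56   -- truncates, does not round
trunc(1234.567, -2)  = 1200      -- negative scale truncates left of the decimal point
{noformat}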






[jira] [Commented] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617831#comment-16617831
 ] 

Apache Spark commented on SPARK-25442:
--

User 'suryag10' has created a pull request for this issue:
https://github.com/apache/spark/pull/22433

> Support STS to run in K8S deployment with spark deployment mode as cluster
> --
>
> Key: SPARK-25442
> URL: https://issues.apache.org/jira/browse/SPARK-25442
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, SQL
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Suryanarayana Garlapati
>Priority: Major
>
> STS fails to start in Kubernetes deployments when the Spark deploy mode is 
> cluster. Support should be added to make it run in K8S deployments.






[jira] [Assigned] (SPARK-25430) Add map parameter for withColumnRenamed

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25430:


Assignee: (was: Apache Spark)

> Add map parameter for withColumnRenamed
> ---
>
> Key: SPARK-25430
> URL: https://issues.apache.org/jira/browse/SPARK-25430
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Goun Na
>Priority: Major
>
> The withColumnRenamed method should accept a Map parameter. It removes code 
> redundancy.
> {code:java}
> // example
> df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" 
> )){code}
> {code:java}
> // from abbr columns to desc columns
> val m = Map( "c1" -> "first_column", "c2" -> "second_column" )
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}
> It is useful for CJK users when they are working on analysis in notebook 
> environments such as Zeppelin, Databricks, or Apache Toree.
> {code:java}
> // for CJK users: once the dictionary is defined as a map, reuse the column
> // map to translate columns whenever report visualization is required
> val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") 
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}
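As an aside (not part of the proposal), the same effect can already be achieved with the existing single-column API by folding the map over the DataFrame:

{code:scala}
// Workaround with the existing API: apply withColumnRenamed once per map entry.
val m = Map("c1" -> "first_column", "c2" -> "second_column")
val renamed = m.foldLeft(df1) { case (df, (from, to)) => df.withColumnRenamed(from, to) }
{code}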






[jira] [Commented] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617832#comment-16617832
 ] 

Apache Spark commented on SPARK-25424:
--

User 'raghavgautam' has created a pull request for this issue:
https://github.com/apache/spark/pull/22414

> Window duration and slide duration with negative values should fail fast
> 
>
> Key: SPARK-25424
> URL: https://issues.apache.org/jira/browse/SPARK-25424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Raghav Kumar Gautam
>Priority: Major
> Fix For: 2.4.0
>
>
> In the TimeWindow class, the window duration and slide duration should not be 
> allowed to take negative values.
> Currently this behaviour is enforced by Catalyst. It could instead be enforced 
> by the constructor of TimeWindow, allowing it to fail fast (see the sketch 
> below).
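A minimal sketch of the fail-fast idea mentioned above (hypothetical, not the actual patch):

{code:scala}
// Hypothetical sketch: validate the durations eagerly when the expression is built.
case class TimeWindowSketch(windowDuration: Long, slideDuration: Long) {
  require(windowDuration > 0, s"The window duration ($windowDuration) must be greater than 0.")
  require(slideDuration > 0, s"The slide duration ($slideDuration) must be greater than 0.")
}
{code}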
> For example, the code below throws the following error. Note that the error is 
> produced at the time of the count() call instead of the window() call.
> {code:java}
> val df = spark.readStream
>   .format("rate")
>   .option("numPartitions", "2")
>   .option("rowsPerSecond", "10")
>   .load()
>   .filter("value % 20 == 0")
>   .withWatermark("timestamp", "10 seconds")
>   .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
>   .count()
> {code}
> Error:
> {code:java}
> cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data 
> type mismatch: The window duration (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
> -1000, 500, 0)' due to data type mismatch: The window duration 
> (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>   at 
> 

[jira] [Assigned] (SPARK-25440) Dump query execution info to a file

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25440:


Assignee: Apache Spark

> Dump query execution info to a file
> ---
>
> Key: SPARK-25440
> URL: https://issues.apache.org/jira/browse/SPARK-25440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The output of explain() doesn't contain full information and in some cases can 
> be truncated. Besides that, it saves the info to a string in memory, which can 
> cause OOM. This ticket aims to solve the problem by dumping info about the 
> query execution to a file. We need to add a new method to queryExecution.debug 
> which accepts a path to a file.
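For illustration, a hypothetical usage sketch of the proposed method (the exact name and signature are what this ticket would add, not an existing API; assuming a SparkSession named spark):

{code:scala}
// Sketch only: write the full query execution info to a file instead of
// building one large in-memory string.
val df = spark.range(100).selectExpr("id", "id * 2 AS doubled")
df.queryExecution.debug.toFile("/tmp/spark-25440-plan.txt") // proposed method; name may differ
{code}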






[jira] [Commented] (SPARK-23367) Include python document style checking

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617827#comment-16617827
 ] 

Apache Spark commented on SPARK-23367:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/22425

> Include python document style checking
> --
>
> Key: SPARK-23367
> URL: https://issues.apache.org/jira/browse/SPARK-23367
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Priority: Minor
>
> As per the discussion in [PR#20378|https://github.com/apache/spark/pull/20378], 
> this JIRA is to include Python doc style checking in Spark.






[jira] [Assigned] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25442:


Assignee: Apache Spark

> Support STS to run in K8S deployment with spark deployment mode as cluster
> --
>
> Key: SPARK-25442
> URL: https://issues.apache.org/jira/browse/SPARK-25442
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, SQL
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Suryanarayana Garlapati
>Assignee: Apache Spark
>Priority: Major
>
> STS fails to start in Kubernetes deployments when the Spark deploy mode is 
> cluster. Support should be added to make it run in K8S deployments.






[jira] [Assigned] (SPARK-25430) Add map parameter for withColumnRenamed

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25430:


Assignee: Apache Spark

> Add map parameter for withColumnRenamed
> ---
>
> Key: SPARK-25430
> URL: https://issues.apache.org/jira/browse/SPARK-25430
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Goun Na
>Assignee: Apache Spark
>Priority: Major
>
> The withColumnRenamed method should accept a Map parameter. It removes code 
> redundancy.
> {code:java}
> // example
> df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" 
> )){code}
> {code:java}
> // from abbr columns to desc columns
> val m = Map( "c1" -> "first_column", "c2" -> "second_column" )
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}
> It is useful for CJK users when they are working on analysis in notebook 
> environments such as Zeppelin, Databricks, or Apache Toree.
> {code:java}
> // for CJK users: once the dictionary is defined as a map, reuse the column
> // map to translate columns whenever report visualization is required
> val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") 
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}






[jira] [Assigned] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25429:


Assignee: Apache Spark

> SparkListenerBus inefficient due to 
> 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
> 
>
> Key: SPARK-25429
> URL: https://issues.apache.org/jira/browse/SPARK-25429
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> private def updateStageMetrics(
>   stageId: Int,
>   attemptId: Int,
>   taskId: Long,
>   accumUpdates: Seq[AccumulableInfo],
>   succeeded: Boolean): Unit = {
> Option(stageMetrics.get(stageId)).foreach { metrics =>
>   if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) {
> return
>   }
>   val oldTaskMetrics = metrics.taskMetrics.get(taskId)
>   if (oldTaskMetrics != null && oldTaskMetrics.succeeded) {
> return
>   }
>   val updates = accumUpdates
> .filter { acc => acc.update.isDefined && 
> metrics.accumulatorIds.contains(acc.id) }
> .sortBy(_.id)
>   if (updates.isEmpty) {
> return
>   }
>   val ids = new Array[Long](updates.size)
>   val values = new Array[Long](updates.size)
>   updates.zipWithIndex.foreach { case (acc, idx) =>
> ids(idx) = acc.id
> // In a live application, accumulators have Long values, but when 
> reading from event
> // logs, they have String values. For now, assume all accumulators 
> are Long and covert
> // accordingly.
> values(idx) = acc.update.get match {
>   case s: String => s.toLong
>   case l: Long => l
>   case o => throw new IllegalArgumentException(s"Unexpected: $o")
> }
>   }
>   // TODO: storing metrics by task ID can cause metrics for the same task 
> index to be
>   // counted multiple times, for example due to speculation or 
> re-attempts.
>   metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, 
> succeeded))
> }
>   }
> {code}
> The check 'metrics.accumulatorIds.contains(acc.id)' is inefficient when a large 
> SQL application generates many accumulators, because Array#contains is a linear 
> scan.
> In practice, the application may time out while quitting and be killed by the 
> RM in YARN mode.
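A minimal sketch of the obvious fix (illustrative, not the actual patch): keep the accumulator ids in a hash-based set so the membership check is O(1) instead of a linear Array#contains scan.

{code:scala}
// Illustrative sketch, reusing the names from the snippet above:
val accumulatorIdSet: Set[Long] = metrics.accumulatorIds.toSet // build once per stage
val updates = accumUpdates.filter { acc =>
  acc.update.isDefined && accumulatorIdSet.contains(acc.id)
}
{code}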






[jira] [Commented] (SPARK-25440) Dump query execution info to a file

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617830#comment-16617830
 ] 

Apache Spark commented on SPARK-25440:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22429

> Dump query execution info to a file
> ---
>
> Key: SPARK-25440
> URL: https://issues.apache.org/jira/browse/SPARK-25440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The output of explain() doesn't contain full information and in some cases can 
> be truncated. Besides that, it saves the info to a string in memory, which can 
> cause OOM. This ticket aims to solve the problem by dumping info about the 
> query execution to a file. We need to add a new method to queryExecution.debug 
> which accepts a path to a file.






[jira] [Assigned] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25442:


Assignee: (was: Apache Spark)

> Support STS to run in K8S deployment with spark deployment mode as cluster
> --
>
> Key: SPARK-25442
> URL: https://issues.apache.org/jira/browse/SPARK-25442
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, SQL
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Suryanarayana Garlapati
>Priority: Major
>
> STS fails to start in Kubernetes deployments when the Spark deploy mode is 
> cluster. Support should be added to make it run in K8S deployments.






[jira] [Assigned] (SPARK-25440) Dump query execution info to a file

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25440:


Assignee: (was: Apache Spark)

> Dump query execution info to a file
> ---
>
> Key: SPARK-25440
> URL: https://issues.apache.org/jira/browse/SPARK-25440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The output of explain() doesn't contain full information and in some cases can 
> be truncated. Besides that, it saves the info to a string in memory, which can 
> cause OOM. This ticket aims to solve the problem by dumping info about the 
> query execution to a file. We need to add a new method to queryExecution.debug 
> which accepts a path to a file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25429:


Assignee: (was: Apache Spark)

> SparkListenerBus inefficient due to 
> 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
> 
>
> Key: SPARK-25429
> URL: https://issues.apache.org/jira/browse/SPARK-25429
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Priority: Major
>
> {code:java}
> private def updateStageMetrics(
>   stageId: Int,
>   attemptId: Int,
>   taskId: Long,
>   accumUpdates: Seq[AccumulableInfo],
>   succeeded: Boolean): Unit = {
> Option(stageMetrics.get(stageId)).foreach { metrics =>
>   if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) {
> return
>   }
>   val oldTaskMetrics = metrics.taskMetrics.get(taskId)
>   if (oldTaskMetrics != null && oldTaskMetrics.succeeded) {
> return
>   }
>   val updates = accumUpdates
> .filter { acc => acc.update.isDefined && 
> metrics.accumulatorIds.contains(acc.id) }
> .sortBy(_.id)
>   if (updates.isEmpty) {
> return
>   }
>   val ids = new Array[Long](updates.size)
>   val values = new Array[Long](updates.size)
>   updates.zipWithIndex.foreach { case (acc, idx) =>
> ids(idx) = acc.id
> // In a live application, accumulators have Long values, but when 
> reading from event
> // logs, they have String values. For now, assume all accumulators 
> are Long and convert
> // accordingly.
> values(idx) = acc.update.get match {
>   case s: String => s.toLong
>   case l: Long => l
>   case o => throw new IllegalArgumentException(s"Unexpected: $o")
> }
>   }
>   // TODO: storing metrics by task ID can cause metrics for the same task 
> index to be
>   // counted multiple times, for example due to speculation or 
> re-attempts.
>   metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, 
> succeeded))
> }
>   }
> {code}
> In 'metrics.accumulatorIds.contains(acc.id)', if a large SQL application generates 
> many accumulators, using Array#contains is inefficient.
> In practice, the application may time out on shutdown and be killed by the RM in YARN 
> mode.
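
As an illustration, a minimal sketch of the suggested change: keep the accumulator ids in a Set[Long] so the membership test is a hash lookup rather than an array scan (the StageAccumInfo shape below is a simplified stand-in, not Spark's actual class):

{code:scala}
// Sketch only: Set-based membership test for accumulator ids.
case class StageAccumInfo(attemptId: Int, accumulatorIds: Set[Long])

def relevantUpdates(
    info: StageAccumInfo,
    accumUpdates: Seq[(Long, Option[Any])]): Seq[(Long, Option[Any])] = {
  // Set#contains is O(1) on average, so filtering many accumulator updates
  // no longer rescans the whole id collection for every update.
  accumUpdates.filter { case (id, update) =>
    update.isDefined && info.accumulatorIds.contains(id)
  }
}
{code}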



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23906) Add UDF trunc(numeric)

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23906:


Assignee: Yuming Wang  (was: Apache Spark)

> Add UDF trunc(numeric)
> --
>
> Key: SPARK-23906
> URL: https://issues.apache.org/jira/browse/SPARK-23906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-14582
> We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we 
> should introduce a new name or reuse {{trunc}} for truncating numbers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617825#comment-16617825
 ] 

Apache Spark commented on SPARK-25429:
--

User 'hellodengfei' has created a pull request for this issue:
https://github.com/apache/spark/pull/22420

> SparkListenerBus inefficient due to 
> 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
> 
>
> Key: SPARK-25429
> URL: https://issues.apache.org/jira/browse/SPARK-25429
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Priority: Major
>
> {code:java}
> private def updateStageMetrics(
>   stageId: Int,
>   attemptId: Int,
>   taskId: Long,
>   accumUpdates: Seq[AccumulableInfo],
>   succeeded: Boolean): Unit = {
> Option(stageMetrics.get(stageId)).foreach { metrics =>
>   if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) {
> return
>   }
>   val oldTaskMetrics = metrics.taskMetrics.get(taskId)
>   if (oldTaskMetrics != null && oldTaskMetrics.succeeded) {
> return
>   }
>   val updates = accumUpdates
> .filter { acc => acc.update.isDefined && 
> metrics.accumulatorIds.contains(acc.id) }
> .sortBy(_.id)
>   if (updates.isEmpty) {
> return
>   }
>   val ids = new Array[Long](updates.size)
>   val values = new Array[Long](updates.size)
>   updates.zipWithIndex.foreach { case (acc, idx) =>
> ids(idx) = acc.id
> // In a live application, accumulators have Long values, but when 
> reading from event
> // logs, they have String values. For now, assume all accumulators 
> are Long and convert
> // accordingly.
> values(idx) = acc.update.get match {
>   case s: String => s.toLong
>   case l: Long => l
>   case o => throw new IllegalArgumentException(s"Unexpected: $o")
> }
>   }
>   // TODO: storing metrics by task ID can cause metrics for the same task 
> index to be
>   // counted multiple times, for example due to speculation or 
> re-attempts.
>   metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, 
> succeeded))
> }
>   }
> {code}
> In 'metrics.accumulatorIds.contains(acc.id)', if a large SQL application generates 
> many accumulators, using Array#contains is inefficient.
> In practice, the application may time out on shutdown and be killed by the RM in YARN 
> mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25303:


Assignee: (was: Apache Spark)

> A DStream that is checkpointed should allow its parent(s) to be removed and 
> not persisted
> -
>
> Key: SPARK-25303
> URL: https://issues.apache.org/jira/browse/SPARK-25303
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Nikunj Bansal
>Priority: Major
>  Labels: Streaming, streaming
>
> A checkpointed DStream is supposed to cut the lineage to its parent(s) such 
> that any persisted RDDs for the parent(s) are removed. However, combined with 
> the issue in SPARK-25302, they result in the Input Stream RDDs being 
> persisted a lot longer than they are actually required.
> See also related bug SPARK-25302.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25424:


Assignee: (was: Apache Spark)

> Window duration and slide duration with negative values should fail fast
> 
>
> Key: SPARK-25424
> URL: https://issues.apache.org/jira/browse/SPARK-25424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Raghav Kumar Gautam
>Priority: Major
> Fix For: 2.4.0
>
>
> In the TimeWindow class, window duration and slide duration should not be allowed 
> to take negative values.
> Currently this behaviour is enforced by Catalyst. It could instead be enforced by the 
> constructor of TimeWindow, allowing it to fail fast.
> For example, the code below throws the following error. Note that the error is 
> produced at the time of the count() call instead of the window() call.
> {code:java}
> val df = spark.readStream
>   .format("rate")
>   .option("numPartitions", "2")
>   .option("rowsPerSecond", "10")
>   .load()
>   .filter("value % 20 == 0")
>   .withWatermark("timestamp", "10 seconds")
>   .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
>   .count()
> {code}
> Error:
> {code:java}
> cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data 
> type mismatch: The window duration (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
> -1000, 500, 0)' due to data type mismatch: The window duration 
> (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
>   at 
> 

[jira] [Assigned] (SPARK-23906) Add UDF trunc(numeric)

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23906:


Assignee: Apache Spark  (was: Yuming Wang)

> Add UDF trunc(numeric)
> --
>
> Key: SPARK-23906
> URL: https://issues.apache.org/jira/browse/SPARK-23906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-14582
> We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we 
> should introduce a new name or reuse {{trunc}} for truncating numbers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617829#comment-16617829
 ] 

Apache Spark commented on SPARK-25303:
--

User 'nikunjb' has created a pull request for this issue:
https://github.com/apache/spark/pull/22424

> A DStream that is checkpointed should allow its parent(s) to be removed and 
> not persisted
> -
>
> Key: SPARK-25303
> URL: https://issues.apache.org/jira/browse/SPARK-25303
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Nikunj Bansal
>Priority: Major
>  Labels: Streaming, streaming
>
> A checkpointed DStream is supposed to cut the lineage to its parent(s) such 
> that any persisted RDDs for the parent(s) are removed. However, combined with 
> the issue in SPARK-25302, they result in the Input Stream RDDs being 
> persisted a lot longer than they are actually required.
> See also related bug SPARK-25302.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25430) Add map parameter for withColumnRenamed

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617824#comment-16617824
 ] 

Apache Spark commented on SPARK-25430:
--

User 'goungoun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22428

> Add map parameter for withColumnRenamed
> ---
>
> Key: SPARK-25430
> URL: https://issues.apache.org/jira/browse/SPARK-25430
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Goun Na
>Priority: Major
>
> The withColumnRenamed method should also accept a map parameter. It removes code 
> redundancy.
> {code:java}
> // example
> df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" 
> )){code}
> {code:java}
> // from abbr columns to desc columns
> val m = Map( "c1" -> "first_column", "c2" -> "second_column" )
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}
> It is useful for CJK users when they are working on analysis in notebook 
> environments such as Zeppelin, Databricks, and Apache Toree. 
> {code:java}
> // For CJK users: once the dictionary is defined in a map, reuse the column map to 
> // translate columns whenever report visualization is required
> val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") 
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}
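
A minimal sketch of how such an overload could be built on top of the existing single-column API (the Map-taking variant itself is only the proposal here):

{code:scala}
// Sketch only: fold a rename map over the existing one-column method.
import org.apache.spark.sql.DataFrame

def withColumnsRenamed(df: DataFrame, renames: Map[String, String]): DataFrame =
  renames.foldLeft(df) { case (current, (oldName, newName)) =>
    // withColumnRenamed is a no-op when oldName is absent from the schema,
    // so unknown keys in the map are silently ignored.
    current.withColumnRenamed(oldName, newName)
  }
{code}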



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25303:


Assignee: Apache Spark

> A DStream that is checkpointed should allow its parent(s) to be removed and 
> not persisted
> -
>
> Key: SPARK-25303
> URL: https://issues.apache.org/jira/browse/SPARK-25303
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Nikunj Bansal
>Assignee: Apache Spark
>Priority: Major
>  Labels: Streaming, streaming
>
> A checkpointed DStream is supposed to cut the lineage to its parent(s) such 
> that any persisted RDDs for the parent(s) are removed. However, combined with 
> the issue in SPARK-25302, they result in the Input Stream RDDs being 
> persisted a lot longer than they are actually required.
> See also related bug SPARK-25302.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25424:


Assignee: Apache Spark

> Window duration and slide duration with negative values should fail fast
> 
>
> Key: SPARK-25424
> URL: https://issues.apache.org/jira/browse/SPARK-25424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Raghav Kumar Gautam
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> In the TimeWindow class, window duration and slide duration should not be allowed 
> to take negative values.
> Currently this behaviour is enforced by Catalyst. It could instead be enforced by the 
> constructor of TimeWindow, allowing it to fail fast.
> For example, the code below throws the following error. Note that the error is 
> produced at the time of the count() call instead of the window() call.
> {code:java}
> val df = spark.readStream
>   .format("rate")
>   .option("numPartitions", "2")
>   .option("rowsPerSecond", "10")
>   .load()
>   .filter("value % 20 == 0")
>   .withWatermark("timestamp", "10 seconds")
>   .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
>   .count()
> {code}
> Error:
> {code:java}
> cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data 
> type mismatch: The window duration (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
> -1000, 500, 0)' due to data type mismatch: The window duration 
> (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>   at 
> 

[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617823#comment-16617823
 ] 

Apache Spark commented on SPARK-25433:
--

User 'fhoering' has created a pull request for this issue:
https://github.com/apache/spark/pull/22422

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is in my opinion the most elegant way to ship python code (better than 
> virtual env and conda)
> The reason it doesn't work out of the box is that there can be only a 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25423:


Assignee: Apache Spark  (was: Yuming Wang)

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25433:


Assignee: (was: Apache Spark)

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is in my opinion the most elegant way to ship python code (better than 
> virtual env and conda)
> The reason it doesn't work out of the box is that there can be only a 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25302) ReducedWindowedDStream not using checkpoints for reduced RDDs

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25302:


Assignee: Apache Spark

> ReducedWindowedDStream not using checkpoints for reduced RDDs
> -
>
> Key: SPARK-25302
> URL: https://issues.apache.org/jira/browse/SPARK-25302
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Nikunj Bansal
>Assignee: Apache Spark
>Priority: Major
>  Labels: Streaming, streaming
>
> When using reduceByKeyAndWindow() with an inverse reduce function, it 
> eventually creates a ReducedWindowedDStream. This class creates a 
> reducedDStream but only persists it and does not checkpoint it. The result is 
> that it ends up using cached RDDs and does not cut lineage to the input 
> DStream, resulting in the input RDDs eventually being cached for much longer than 
> they are needed. 
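
For reference, a minimal sketch of the usage pattern this report concerns: an inverse-reduce windowed aggregation, which internally builds a ReducedWindowedDStream (the input source and paths below are placeholders):

{code:scala}
// Sketch only: inverse-reduce windowing over a pair DStream.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("windowed-counts").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/checkpoints") // required when an inverse reduce function is used

val counts = ssc.socketTextStream("localhost", 9999)
  .map(word => (word, 1L))
  // The inverse function (_ - _) is what routes through ReducedWindowedDStream.
  .reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))

counts.print()
ssc.start()
ssc.awaitTermination()
{code}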



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25423:


Assignee: Yuming Wang  (was: Apache Spark)

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Maryann Xue
>Assignee: Yuming Wang
>Priority: Trivial
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25302) ReducedWindowedDStream not using checkpoints for reduced RDDs

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617822#comment-16617822
 ] 

Apache Spark commented on SPARK-25302:
--

User 'nikunjb' has created a pull request for this issue:
https://github.com/apache/spark/pull/22423

> ReducedWindowedDStream not using checkpoints for reduced RDDs
> -
>
> Key: SPARK-25302
> URL: https://issues.apache.org/jira/browse/SPARK-25302
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Nikunj Bansal
>Priority: Major
>  Labels: Streaming, streaming
>
> When using reduceByKeyAndWindow() with an inverse reduce function, it 
> eventually creates a ReducedWindowedDStream. This class creates a 
> reducedDStream but only persists it and does not checkpoint it. The result is 
> that it ends up using cached RDDs and does not cut lineage to the input 
> DStream, resulting in the input RDDs eventually being cached for much longer than 
> they are needed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25433:


Assignee: Apache Spark

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Assignee: Apache Spark
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is in my opinion the most elegant way to ship python code (better than 
> virtual env and conda)
> The reason it doesn't work out of the box is that there can be only a 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25302) ReducedWindowedDStream not using checkpoints for reduced RDDs

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25302:


Assignee: (was: Apache Spark)

> ReducedWindowedDStream not using checkpoints for reduced RDDs
> -
>
> Key: SPARK-25302
> URL: https://issues.apache.org/jira/browse/SPARK-25302
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Nikunj Bansal
>Priority: Major
>  Labels: Streaming, streaming
>
> When using reduceByKeyAndWindow() with an inverse reduce function, it 
> eventually creates a ReducedWindowedDStream. This class creates a 
> reducedDStream but only persists it and does not checkpoint it. The result is 
> that it ends up using cached RDDs and does not cut lineage to the input 
> DStream, resulting in the input RDDs eventually being cached for much longer than 
> they are needed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24800) Refactor Avro Serializer and Deserializer

2018-09-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24800:

Description: Currently, in the Avro data source module, the Avro Deserializer 
converts input Avro-format data to Row and then converts the Row to 
InternalRow. The Avro Serializer converts InternalRow to Row and then outputs 
Avro-format data. To improve performance, we need to make a direct 
conversion between InternalRow and Avro-format data.

> Refactor Avro Serializer and Deserializer
> -
>
> Key: SPARK-24800
> URL: https://issues.apache.org/jira/browse/SPARK-24800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, in the Avro data source module, the Avro Deserializer converts input 
> Avro-format data to Row and then converts the Row to InternalRow. The Avro 
> Serializer converts InternalRow to Row and then outputs Avro-format data. To 
> improve performance, we need to make a direct conversion between 
> InternalRow and Avro-format data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617617#comment-16617617
 ] 

Apache Spark commented on SPARK-25291:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22415

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
> -
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> SecretsTestSuite shows flakiness in how executor 
> memory is set: 
> Run SparkPi with env and mount secrets. *** FAILED ***
>  "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> This happens when run with default settings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25291:


Assignee: (was: Apache Spark)

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
> -
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> SecretsTestSuite shows flakiness in how executor 
> memory is set: 
> Run SparkPi with env and mount secrets. *** FAILED ***
>  "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> This happens when run with default settings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25291:


Assignee: Apache Spark

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
> -
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Assignee: Apache Spark
>Priority: Major
>
> SecretsTestSuite shows flakiness in how executor 
> memory is set: 
> Run SparkPi with env and mount secrets. *** FAILED ***
>  "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> This happens when run with default settings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23829) spark-sql-kafka source in spark 2.3 causes reading stream failure frequently

2018-09-17 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617594#comment-16617594
 ] 

Gabor Somogyi commented on SPARK-23829:
---

In 2.4 it's fixed, as it uses 2.0.0. I think an upgrade will solve this issue 
(if the description about versions is correct).

> spark-sql-kafka source in spark 2.3 causes reading stream failure frequently
> 
>
> Key: SPARK-23829
> URL: https://issues.apache.org/jira/browse/SPARK-23829
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Norman Bai
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark 2.3 provides a source "spark-sql-kafka-0-10_2.11".
>  
> When I wanted to read from my kafka-0.10.2.1 cluster, it threw the error 
> "*java.util.concurrent.TimeoutException: Cannot fetch record  for offset 
> in 12000 milliseconds*" frequently, and the job thus failed.
>  
> I searched on Google & Stack Overflow for a while, and found many other people 
> who got this exception too, and nobody gave an answer.
>  
> I debugged the source code and found nothing, but I guess it's because of the 
> kafka-clients dependency that spark-sql-kafka-0-10_2.11 is using.
>  
> {code:java}
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
>   <version>2.3.0</version>
>   <exclusions>
>     <exclusion>
>       <artifactId>kafka-clients</artifactId>
>       <groupId>org.apache.kafka</groupId>
>     </exclusion>
>   </exclusions>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.2.1</version>
> </dependency>
> {code}
> I excluded it in Maven, added another version, reran the code, and 
> now it works.
>  
> I guess something is wrong with kafka-clients 0.10.0.1 working with 
> Kafka 0.10.2.1, or with more Kafka versions. 
>  
> Hope for an explanation.
> Here is the error stack.
> {code:java}
> [ERROR] 2018-03-30 13:34:11,404 [stream execution thread for [id = 
> 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
> b3e18aa6-358f-43f6-a077-e34db0822df6]] 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution logError - Query 
> [id = 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
> b3e18aa6-358f-43f6-a077-e34db0822df6] terminated with error
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 
> (TID 6, localhost, executor driver): java.util.concurrent.TimeoutException: 
> Cannot fetch record for offset 6481521 in 12 milliseconds
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:230)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
> at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
> at 
> 

[jira] [Commented] (SPARK-25426) Remove the duplicate fallback logic in UnsafeProjection

2018-09-17 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617587#comment-16617587
 ] 

Li Yuanjian commented on SPARK-25426:
-

Resolved by https://github.com/apache/spark/pull/22417.

> Remove the duplicate fallback logic in UnsafeProjection
> ---
>
> Key: SPARK-25426
> URL: https://issues.apache.org/jira/browse/SPARK-25426
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25448) [Spark Job History] Job Staged page shows 1000 Jobs only

2018-09-17 Thread ABHISHEK KUMAR GUPTA (JIRA)
ABHISHEK KUMAR GUPTA created SPARK-25448:


 Summary: [Spark Job History] Job Staged page shows 1000 Jobs only
 Key: SPARK-25448
 URL: https://issues.apache.org/jira/browse/SPARK-25448
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1
 Environment: Server OS :-SUSE 11
No. of Cluster Node:- 6
Spark Version:- 2.3.1
Reporter: ABHISHEK KUMAR GUPTA


1. Configure spark.ui.retainedJobs = 10 in the spark-defaults.conf file of the Job 
History server.
2. Submit 1 lakh (100,000) jobs from Beeline.
3. Go to the application ID from the Job History page's "Incomplete Application" link.
4. The Jobs tab will list only a maximum of 1000 jobs under the application.
Actual output:
Completed Jobs: 24952, only showing 952

The Staged page should list all completed jobs, in this case 24952.
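
For context, the property referenced in step 1 is the standard UI retention setting (default 1000); below is a minimal sketch of setting it programmatically, with illustrative values only:

{code:scala}
// Sketch only: raising the UI retention limits when building a SparkConf.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "25000")   // default is 1000, which caps the Jobs page
  .set("spark.ui.retainedStages", "25000") // default is 1000 as well
{code}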



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-09-17 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617473#comment-16617473
 ] 

Wenchen Fan edited comment on SPARK-23580 at 9/17/18 12:57 PM:
---

I'm re-targeting to `2.5.0`. There are more tickets coming: SafeProjection with 
fallback, Predicate with fallback, Ordering with fallback, etc.


was (Author: cloud_fan):
I'm adding `2.5.0` as a target version. There are more tickets coming: 
SafeProjection with fallback, Predicate with fallback, Ordering with fallback, 
etc.  

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does 
> not work, or blows past the JVM class limits. We currently cannot gracefully 
> fall back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. This 
> work can be divided into two main areas:
> - Add interpreted versions for all dataset related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.
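
A minimal sketch of the graceful-fallback shape this umbrella aims for, with placeholder types rather than Spark's actual factory classes:

{code:scala}
// Sketch only: try the code-generated implementation first and fall back to
// an interpreted one if compilation fails (e.g. the generated class is too large).
trait Projection {
  def apply(row: Seq[Any]): Seq[Any]
}

def createWithFallback(
    codegen: () => Projection,
    interpreted: () => Projection): Projection = {
  try {
    codegen()
  } catch {
    // Real code would narrow this to compilation-related exceptions.
    case _: Exception => interpreted()
  }
}
{code}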



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-09-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23580:

Target Version/s: 2.5.0  (was: 2.4.0, 2.5.0)

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does 
> not work, or blows past the JVM class limits. We currently cannot gracefully 
> fall back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. This 
> work can be divided into two main areas:
> - Add interpreted versions for all dataset related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-09-17 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617473#comment-16617473
 ] 

Wenchen Fan commented on SPARK-23580:
-

I'm adding `2.5.0` as a target version. There are more tickets coming: 
SafeProjection with fallback, Predicate with fallback, Ordering with fallback, 
etc.  

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does 
> not work, or blows past the JVM class limits. We currently cannot gracefully 
> fall back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. This 
> work can be divided into two main areas:
> - Add interpreted versions for all dataset related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-09-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23580:

Target Version/s: 2.4.0, 2.5.0  (was: 2.4.0, 3.0.0)

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does 
> not work, or blows past the JVM class limits. We currently cannot gracefully 
> fall back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. This 
> work can be divided into two main areas:
> - Add interpreted versions for all dataset related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-09-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23580:

Target Version/s: 2.4.0, 3.0.0  (was: 3.0.0)

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does 
> not work, or blows past the JVM class limits. We currently cannot gracefully 
> fall back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. This 
> work can be divided into two main areas:
> - Add interpreted versions for all dataset related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-09-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23580:

Target Version/s: 3.0.0  (was: 2.4.0)

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does not 
> work or blows past the JVM class limits; we currently cannot gracefully fall 
> back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. This 
> work can be divided into two main areas:
> - Add interpreted versions for all dataset related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25447) Support JSON options by schema_of_json

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617462#comment-16617462
 ] 

Apache Spark commented on SPARK-25447:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22442

> Support JSON options by schema_of_json
> --
>
> Key: SPARK-25447
> URL: https://issues.apache.org/jira/browse/SPARK-25447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The function schema_of_json doesn't currently accept any options, but options 
> can impact schema inference. It needs to support the same options that 
> from_json() uses for schema inference. Here are examples of options that 
> could impact schema inference:
> * primitivesAsString
> * prefersDecimal
> * allowComments
> * allowUnquotedFieldNames
> * allowSingleQuotes
> * allowNumericLeadingZeros
> * allowNonNumericNumbers
> * allowBackslashEscapingAnyCharacter
> * allowUnquotedControlChars
> Below is a possible signature:
> {code:scala}
> def schema_of_json(e: Column, options: java.util.Map[String, String]): Column
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25447) Support JSON options by schema_of_json

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617461#comment-16617461
 ] 

Apache Spark commented on SPARK-25447:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22442

> Support JSON options by schema_of_json
> --
>
> Key: SPARK-25447
> URL: https://issues.apache.org/jira/browse/SPARK-25447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The function schema_of_json doesn't currently accept any options, but options 
> can impact schema inference. It needs to support the same options that 
> from_json() uses for schema inference. Here are examples of options that 
> could impact schema inference:
> * primitivesAsString
> * prefersDecimal
> * allowComments
> * allowUnquotedFieldNames
> * allowSingleQuotes
> * allowNumericLeadingZeros
> * allowNonNumericNumbers
> * allowBackslashEscapingAnyCharacter
> * allowUnquotedControlChars
> Below is a possible signature:
> {code:scala}
> def schema_of_json(e: Column, options: java.util.Map[String, String]): Column
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25447) Support JSON options by schema_of_json

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25447:


Assignee: Apache Spark

> Support JSON options by schema_of_json
> --
>
> Key: SPARK-25447
> URL: https://issues.apache.org/jira/browse/SPARK-25447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The function schema_of_json doesn't currently accept any options, but options 
> can impact schema inference. It needs to support the same options that 
> from_json() uses for schema inference. Here are examples of options that 
> could impact schema inference:
> * primitivesAsString
> * prefersDecimal
> * allowComments
> * allowUnquotedFieldNames
> * allowSingleQuotes
> * allowNumericLeadingZeros
> * allowNonNumericNumbers
> * allowBackslashEscapingAnyCharacter
> * allowUnquotedControlChars
> Below is a possible signature:
> {code:scala}
> def schema_of_json(e: Column, options: java.util.Map[String, String]): Column
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25447) Support JSON options by schema_of_json

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25447:


Assignee: (was: Apache Spark)

> Support JSON options by schema_of_json
> --
>
> Key: SPARK-25447
> URL: https://issues.apache.org/jira/browse/SPARK-25447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The function schema_of_json doesn't currently accept any options, but options 
> can impact schema inference. It needs to support the same options that 
> from_json() uses for schema inference. Here are examples of options that 
> could impact schema inference:
> * primitivesAsString
> * prefersDecimal
> * allowComments
> * allowUnquotedFieldNames
> * allowSingleQuotes
> * allowNumericLeadingZeros
> * allowNonNumericNumbers
> * allowBackslashEscapingAnyCharacter
> * allowUnquotedControlChars
> Below is a possible signature:
> {code:scala}
> def schema_of_json(e: Column, options: java.util.Map[String, String]): Column
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25447) Support JSON options by schema_of_json

2018-09-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25447:
--

 Summary: Support JSON options by schema_of_json
 Key: SPARK-25447
 URL: https://issues.apache.org/jira/browse/SPARK-25447
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The function schema_of_json doesn't currently accept any options, but options 
can impact schema inference. It needs to support the same options that 
from_json() uses for schema inference. Here are examples of options that could 
impact schema inference:
* primitivesAsString
* prefersDecimal
* allowComments
* allowUnquotedFieldNames
* allowSingleQuotes
* allowNumericLeadingZeros
* allowNonNumericNumbers
* allowBackslashEscapingAnyCharacter
* allowUnquotedControlChars

Below is a possible signature:
{code:scala}
def schema_of_json(e: Column, options: java.util.Map[String, String]): Column
{code}
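
As a rough illustration of the proposal, the sketch below builds an options map in the 
shape the proposed signature expects and contrasts it with the existing no-options 
variant. The call to the options overload is commented out because that overload does 
not exist yet, and the printed schema string is only an approximation.

{code:scala}
import java.util.{HashMap => JHashMap}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, schema_of_json}

object SchemaOfJsonOptionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("schema_of_json options sketch")
      .getOrCreate()

    // Options map in the shape the proposed signature expects.
    val options = new JHashMap[String, String]()
    options.put("allowSingleQuotes", "true")
    options.put("prefersDecimal", "true")

    // Proposed overload (does not exist yet), shown for illustration only:
    // spark.range(1).select(schema_of_json(lit("{'a': 1.2}"), options)).show()

    // Existing variant without options, for contrast.
    spark.range(1)
      .select(schema_of_json(lit("""{"a": 1.2}""")))
      .show(truncate = false) // prints something like struct<a:double>

    spark.stop()
  }
}
{code}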



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25431) Fix function examples and unify the format of the example results.

2018-09-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25431:
-
Fix Version/s: 2.4.1

> Fix function examples and unify the format of the example results.
> --
>
> Key: SPARK-25431
> URL: https://issues.apache.org/jira/browse/SPARK-25431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Minor
> Fix For: 3.0.0, 2.4.1
>
>
> There are some mistakes in the examples of newly added functions. Also, the 
> format of the example results is not unified. We should fix and unify them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25431) Fix function examples and unify the format of the example results.

2018-09-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25431.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22437
[https://github.com/apache/spark/pull/22437]

> Fix function examples and unify the format of the example results.
> --
>
> Key: SPARK-25431
> URL: https://issues.apache.org/jira/browse/SPARK-25431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Minor
> Fix For: 3.0.0
>
>
> There are some mistakes in the examples of newly added functions. Also, the 
> format of the example results is not unified. We should fix and unify them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25374) SafeProjection supports fallback to an interpreted mode

2018-09-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617453#comment-16617453
 ] 

Liang-Chi Hsieh commented on SPARK-25374:
-

I do think so.

> SafeProjection supports fallback to an interpreted mode
> ---
>
> Key: SPARK-25374
> URL: https://issues.apache.org/jira/browse/SPARK-25374
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-23711, UnsafeProjection supports fallback to an interpreted mode. 
> SafeProjection needs to support it, too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25446) Add schema_of_json() to R

2018-09-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25446:
--

 Summary: Add schema_of_json() to R
 Key: SPARK-25446
 URL: https://issues.apache.org/jira/browse/SPARK-25446
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The function schema_of_json() is exposed in Scala/Java and Python but not in R. 
It needs to be added to R as well. The function declaration can be found here: 
https://github.com/apache/spark/blob/d749d034a80f528932f613ac97f13cfb99acd207/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3612
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25374) SafeProjection supports fallback to an interpreted mode

2018-09-17 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617437#comment-16617437
 ] 

Takeshi Yamamuro commented on SPARK-25374:
--

I do not have a strong opinion, though I feel it is too late to push this into 
2.4.

> SafeProjection supports fallback to an interpreted mode
> ---
>
> Key: SPARK-25374
> URL: https://issues.apache.org/jira/browse/SPARK-25374
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-23711, UnsafeProjection supports fallback to an interpreted mode. 
> SafeProjection needs to support it, too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25443) fix issues when building docs with release scripts in docker

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617428#comment-16617428
 ] 

Apache Spark commented on SPARK-25443:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22438

> fix issues when building docs with release scripts in docker
> 
>
> Key: SPARK-25443
> URL: https://issues.apache.org/jira/browse/SPARK-25443
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25443) fix issues when building docs with release scripts in docker

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25443:


Assignee: Wenchen Fan  (was: Apache Spark)

> fix issues when building docs with release scripts in docker
> 
>
> Key: SPARK-25443
> URL: https://issues.apache.org/jira/browse/SPARK-25443
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25443) fix issues when building docs with release scripts in docker

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25443:


Assignee: Apache Spark  (was: Wenchen Fan)

> fix issues when building docs with release scripts in docker
> 
>
> Key: SPARK-25443
> URL: https://issues.apache.org/jira/browse/SPARK-25443
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25443) fix issues when building docs with release scripts in docker

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617427#comment-16617427
 ] 

Apache Spark commented on SPARK-25443:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22438

> fix issues when building docs with release scripts in docker
> 
>
> Key: SPARK-25443
> URL: https://issues.apache.org/jira/browse/SPARK-25443
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing PYSPARK_PYTHON env 
variable should already work.
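
As a rough sketch of the workflow above, the settings could look like the following, 
expressed here as Spark conf entries in Scala for illustration (normally they would be 
passed as spark-submit --conf flags); the archive path and the "environment" alias are 
placeholders, and a YARN deployment is assumed.

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: placeholder paths, YARN deployment assumed.
object CondaEnvShippingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // Ship the packed conda env to every container; "#environment" unpacks it
      // under the alias "environment" in the container working directory.
      .set("spark.yarn.dist.archives", "hdfs:///tmp/myenv.zip#environment")
      // Point both the application master and the executors at the shipped interpreter.
      .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "./environment/bin/python")
      .set("spark.executorEnv.PYSPARK_PYTHON", "./environment/bin/python")

    // Print the equivalent spark-submit flags.
    conf.getAll.foreach { case (k, v) => println(s"--conf $k=$v") }
  }
}
{code}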

I also have seen this 
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the packages on 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing the PYSPARK_PYTHON should 
already work.

I also have seen this 
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the packages on 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to 

[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors using [PEX|https://github.com/pantsbuild/pex] 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing PYSPARK_PYTHON env 
variable should already work.

I also have seen this 
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the packages on 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where pex comes in. It is a nice way to create a single executable 
zip file with all dependencies included. You have the pex command line tool to 
build your package and when it is built you are sure it works. This is in my 
opinion the most elegant way to ship python code (better than virtual env and 
conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing PYSPARK_PYTHON env 
variable should already work.

I also have seen this 
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the packages on 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 

[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing the PYSPARK_PYTHON should 
already work.

I also have seen this 
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the packages on 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing the PYSPARK_PYTHON should 
already work.

I also have seen this 
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the package from 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to provide 

[jira] [Commented] (SPARK-25431) Fix function examples and unify the format of the example results.

2018-09-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617418#comment-16617418
 ] 

Apache Spark commented on SPARK-25431:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22437

> Fix function examples and unify the format of the example results.
> --
>
> Key: SPARK-25431
> URL: https://issues.apache.org/jira/browse/SPARK-25431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Minor
>
> There are some mistakes in the examples of newly added functions. Also, the 
> format of the example results is not unified. We should fix and unify them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25431) Fix function examples and unify the format of the example results.

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25431:


Assignee: Apache Spark  (was: Takuya Ueshin)

> Fix function examples and unify the format of the example results.
> --
>
> Key: SPARK-25431
> URL: https://issues.apache.org/jira/browse/SPARK-25431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Minor
>
> There are some mistakes in the examples of newly added functions. Also, the 
> format of the example results is not unified. We should fix and unify them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25431) Fix function examples and unify the format of the example results.

2018-09-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25431:


Assignee: Takuya Ueshin  (was: Apache Spark)

> Fix function examples and unify the format of the example results.
> --
>
> Key: SPARK-25431
> URL: https://issues.apache.org/jira/browse/SPARK-25431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Minor
>
> There are some mistakes in the examples of newly added functions. Also, the 
> format of the example results is not unified. We should fix and unify them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing the PYSPARK_PYTHON should 
already work.

I also have seen this 
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the package from 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing the PYSPARK_PYTHON should 
already work.

I also have seen this 
[blogpost.|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the package from 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to provide 

[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing the PYSPARK_PYTHON should 
already work.

I also have seen this 
[blogpost.|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the package from 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time.)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing the PYSPARK_PYTHON should 
already work.

I also have seen this 
[blogpost.|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the package from 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to 

[jira] [Comment Edited] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617399#comment-16617399
 ] 

Fabian Höring edited comment on SPARK-25433 at 9/17/18 11:40 AM:
-

[~hyukjin.kwon] I changed the description of the ticket including links to 
existing attempts.


was (Author: fhoering):
[~hyukjin.kwon] I changed the description of the ticket including link to 
existing attempts.

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors. 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing the PYSPARK_PYTHON 
> should already work.
> I also have seen this 
> [blogpost.|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the package from 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
> nice way to create a single executable zip file with all dependencies 
> included. You have the pex command line tool to build your package and when 
> it is built you are sure it works. This is in my opinion the most elegant way 
> to ship python code (better than virtual env and conda)
> The reason it doesn't work out of the box is that there can be only a 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617399#comment-16617399
 ] 

Fabian Höring commented on SPARK-25433:
---

[~hyukjin.kwon] I changed the description of the ticket including link to 
existing attempts.

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors. 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time.)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing the PYSPARK_PYTHON 
> should already work.
> I also have seen this 
> [blogpost.|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the package from 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
> nice way to create a single executable zip file with all dependencies 
> included. You have the pex command line tool to build your package and when 
> it is built you are sure it works. This is in my opinion the most elegant way 
> to ship python code (better than virtual env and conda)
> The reason it doesn't work out of the box is that there can be only a 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
 (disadvantages are that you have a separate conda package repo and ship the 
python interpreter all the time.)

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing the PYSPARK_PYTHON should 
already work.

I also have seen this 
[blogpost.|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the package from 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587 
ticket to provide nice entry points to spark-submit and SparkContext but 
zipping your local virtual env and then just changing the PYSPARK_PYTHON should 
already work.

I also have seen this 
[blogpost.|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
 But recreating the virtual env each time doesn't seem to be a very scalable 
solution. If you have hundreds of executors it will retrieve the package from 
each executor and recreate your virtual environment each time. Same problem 
with this proposal SPARK-16367 from what I understood.

Another problem with virtual env is that your local environment is not easily 
shippable to another machine. In particular there is the relocatable option 
(see 
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
 
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
 which makes it very complicated for the user to ship the virtual env and be 
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a 
nice way to create a single executable zip file with all dependencies included. 
You have the pex command line tool to build your package and when it is built 
you are sure it works. This is in my opinion the most elegant way to ship 
python code (better than virtual env and conda)

The reason it doesn't work out of the box is that there can be only a 
single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
to the pex files doesn't work. You can nevertheless tune the env variable 
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
 and runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 


> Add support for 

[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment
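
As a concrete sketch of this workflow (illustrative only, assuming a YARN
deployment and an archive built beforehand with conda pack; the names
environment.tar.gz and environment are placeholders):

{code:python}
import os
from pyspark.sql import SparkSession

# The executors unpack the shipped archive into a directory called
# "environment" and must use the interpreter inside it. The variable has to
# be set before the SparkContext is created (assumes YARN client mode).
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    .appName("conda-packed-env-sketch")
    # Ship the packed conda environment to every executor; the "#environment"
    # suffix is the directory name it is extracted to in each container.
    .config("spark.yarn.dist.archives", "environment.tar.gz#environment")
    .getOrCreate()
)

# Tasks now run inside the shipped environment on the executors.
print(spark.sparkContext.parallelize(range(4)).map(lambda x: x * x).collect())
{code}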

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.
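
For concreteness, the naive configuration would look roughly like this on
YARN (a sketch only; my_app.pex is a placeholder, and as explained above this
alone does not work out of the box, which is what the linked PR addresses):

{code:python}
import os
from pyspark.sql import SparkSession

# Naive approach: point the executors' Python at the pex file itself.
# This has to happen before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "./my_app.pex"

spark = (
    SparkSession.builder
    .appName("pyspark-pex-sketch")
    # Ship the pex file into each executor's working directory (YARN).
    .config("spark.yarn.dist.files", "my_app.pex")
    .getOrCreate()
)
{code}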

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 


> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: 

[jira] [Resolved] (SPARK-25427) Add BloomFilter creation test cases

2018-09-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25427.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22418
[https://github.com/apache/spark/pull/22418]

> Add BloomFilter creation test cases
> ---
>
> Key: SPARK-25427
> URL: https://issues.apache.org/jira/browse/SPARK-25427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark supports BloomFilter creation for ORC files. This issue aims to add 
> test coverage to prevent regressions like SPARK-12417
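
For reference, requesting an ORC bloom filter from Spark looks roughly like
this (a sketch; the column name and output path are placeholders):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-bloom-filter-sketch").getOrCreate()

df = spark.range(1000).withColumnRenamed("id", "user_id")

# Ask the ORC writer to build a bloom filter on the "user_id" column;
# the orc.* options are passed through to the ORC file writer.
(df.write
   .option("orc.bloom.filter.columns", "user_id")
   .option("orc.bloom.filter.fpp", "0.05")
   .orc("/tmp/orc_with_bloom_filter"))
{code}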



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 


> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: 

[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 


> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> 

[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 


> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: 

[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 


> Add support for PEX in PySpark
> --
>
> 

[jira] [Updated] (SPARK-25433) Add support for PEX in PySpark

2018-09-17 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated SPARK-25433:
--
Description: 
The goal of this ticket is to ship and use custom code inside the spark 
executors. 

This currently works fine with 
[conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:

Basically the workflow is
 * to zip the local conda environment ([conda 
pack|https://github.com/conda/conda-pack] also works)
 * ship it to each executor as an archive
 * modify PYSPARK_PYTHON to the local conda environment

I think it can work the same way with virtual env. There is the SPARK-13587
ticket to provide nice entry points to spark-submit and SparkContext, but
zipping your local virtual env and then just changing PYSPARK_PYTHON should
already work.

I have also seen this
[blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
But recreating the virtual env each time doesn't seem to be a very scalable
solution: if you have hundreds of executors, each one will retrieve the
packages and recreate the virtual environment every time. The same problem
applies to the proposal in SPARK-16367, from what I understood.

Another problem with virtual env is that your local environment is not easily
shippable to another machine. In particular there is the relocatable option
(see
[https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
[https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
which makes it very complicated for the user to ship the virtual env and be
sure it works.

And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
nice way to create a single executable zip file with all dependencies
included. You have the pex command line tool to build your package, and once
it is built you are sure it works. This is, in my opinion, the most elegant
way to ship Python code (better than virtual env and conda).

The reason it doesn't work out of the box is that there can be only one entry
point. So just shipping the pex file and setting PYSPARK_PYTHON to it doesn't
work. You can nevertheless tune the
[PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
environment variable at runtime to provide different entry points.

PR: [https://github.com/apache/spark/pull/22422/files]

 

 

 

  was:
This has been partly discussed in SPARK-13587.

I would like to provision the executors with a PEX package. I created a PR
with the minimal necessary changes in PythonWorkerFactory.

PR: [https://github.com/apache/spark/pull/22422/files]

To run it, one needs to set the PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON
variables to the pex file and upload the pex file to the executors via
sparkContext.addFile or by setting the spark.yarn.dist.files/spark.files
config properties.

It is also necessary to set the PEX_ROOT environment variable: by default,
pex tries to access /home/.pex inside the executors, and this fails.

Ideally, as this configuration is quite cumbersome, it would be interesting to
also add a --pexFile parameter to SparkContext and spark-submit in order to
directly provide a pex file and have everything else handled. Please tell me
what you think of this.
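
Purely as an illustration of the configuration described above (a sketch, not
an excerpt from the PR; my_app.pex and the PEX_ROOT value are placeholders):

{code:python}
import os
from pyspark import SparkConf, SparkContext

# The executors run their Python workers out of the pex file. PYSPARK_PYTHON
# must be set before the SparkContext is created; PYSPARK_DRIVER_PYTHON would
# normally be exported before launching the driver itself.
os.environ["PYSPARK_PYTHON"] = "./my_app.pex"

conf = (
    SparkConf()
    .setAppName("pex-provisioning-sketch")
    # By default pex uses /home/.pex, which fails inside the executors (see
    # above), so point PEX_ROOT at the container working directory instead.
    .set("spark.executorEnv.PEX_ROOT", "./.pex")
)

sc = SparkContext(conf=conf)

# Alternative to spark.yarn.dist.files: distribute the pex file explicitly.
sc.addFile("my_app.pex")
{code}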

 

 


> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors. 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]:
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext, but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON should 
> already work.
> I have also seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution: if you have hundreds of executors, each one will retrieve the 
> packages and recreate the virtual environment every time. Same problem 
> with [this proposal|https://issues.apache.org/jira/browse/SPARK-16367] 
