[jira] [Resolved] (SPARK-27349) Dealing with TimeVars removed in Hive 2.x
[ https://issues.apache.org/jira/browse/SPARK-27349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27349. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > Dealing with TimeVars removed in Hive 2.x > - > > Key: SPARK-27349 > URL: https://issues.apache.org/jira/browse/SPARK-27349 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > {{hive.stats.jdbc.timeout}} and {{hive.stats.retries.wait}} were removed by > HIVE-12164. We need to deal with this change when upgrading the built-in Hive to > 2.3.4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
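The compatibility pattern this sub-task needs can be sketched in a few lines: a config constant that exists in Hive 1.x is simply absent in Hive 2.x, so any lookup has to be defensive and fall back to a default rather than assume the ConfVar is present. The sketch below is illustrative plain Python, not Spark's actual (Scala) implementation; the class and attribute names are hypothetical stand-ins.

```python
# Hedged sketch of handling a removed config constant (SPARK-27349).
# Hive 1.x exposed a ConfVar for "hive.stats.jdbc.timeout"; HIVE-12164
# removed it in 2.x, so code must tolerate its absence.

class HiveConf1x:
    # hypothetical stand-in for the Hive 1.x ConfVars enum
    STATS_JDBC_TIMEOUT = "hive.stats.jdbc.timeout"

class HiveConf2x:
    # HIVE-12164 removed the stats JDBC/retry time vars in Hive 2.x
    pass

def time_var(conf_cls, name, default):
    """Look up a possibly-removed config constant, falling back to a default."""
    return getattr(conf_cls, name, default)

print(time_var(HiveConf1x, "STATS_JDBC_TIMEOUT", "30s"))  # hive.stats.jdbc.timeout
print(time_var(HiveConf2x, "STATS_JDBC_TIMEOUT", "30s"))  # 30s
```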
[jira] [Updated] (SPARK-27383) Avoid using hard-coded jar names in Hive tests
[ https://issues.apache.org/jira/browse/SPARK-27383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27383: Description: Avoid using hard-coded jar names ({{hive-contrib-0.13.1.jar}} and {{hive-hcatalog-core-0.13.1.jar}}) in Hive tests. This makes it easy to change when upgrading the built-in Hive to 2.3.4. > Avoid using hard-coded jar names in Hive tests > -- > > Key: SPARK-27383 > URL: https://issues.apache.org/jira/browse/SPARK-27383 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Avoid using hard-coded jar names ({{hive-contrib-0.13.1.jar}} and > {{hive-hcatalog-core-0.13.1.jar}}) in Hive tests. This makes it easy to > change when upgrading the built-in Hive to 2.3.4.
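The idea behind this sub-task can be sketched as: match the artifact by a version-agnostic pattern instead of pinning the full file name, so a Hive upgrade needs no test edits. The helper below is a hypothetical illustration in Python (the actual Spark tests are Scala), with a throwaway directory standing in for the test lib dir.

```python
import glob
import os
import re
import tempfile

def find_hive_jar(lib_dir, artifact):
    """Locate a jar by artifact name regardless of its version suffix.

    Hypothetical helper: tests match e.g. 'hive-contrib-*.jar' instead of
    hard-coding 'hive-contrib-0.13.1.jar'.
    """
    matches = glob.glob(os.path.join(lib_dir, "%s-*.jar" % artifact))
    # Keep only names whose suffix looks like a version number.
    pattern = re.compile(r"%s-\d+(\.\d+)*\.jar$" % re.escape(artifact))
    matches = [m for m in matches if pattern.search(os.path.basename(m))]
    if len(matches) != 1:
        raise RuntimeError("expected exactly one %s jar, found %r" % (artifact, matches))
    return matches[0]

# Demo with a throwaway directory containing a fake jar.
lib = tempfile.mkdtemp()
open(os.path.join(lib, "hive-contrib-2.3.4.jar"), "w").close()
print(find_hive_jar(lib, "hive-contrib"))  # ...hive-contrib-2.3.4.jar
```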
[jira] [Created] (SPARK-27383) Avoid using hard-coded jar names in Hive tests
Yuming Wang created SPARK-27383: --- Summary: Avoid using hard-coded jar names in Hive tests Key: SPARK-27383 URL: https://issues.apache.org/jira/browse/SPARK-27383 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang
[jira] [Commented] (SPARK-27363) Mesos support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809511#comment-16809511 ] Xiangrui Meng commented on SPARK-27363: --- [~felixcheung] [~srowen] Anyone you'd recommend to lead the design and implementation for Mesos? > Mesos support for GPU-aware scheduling > -- > > Key: SPARK-27363 > URL: https://issues.apache.org/jira/browse/SPARK-27363 > Project: Spark > Issue Type: Story > Components: Mesos >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement Mesos support for GPU-aware scheduling.
[jira] [Updated] (SPARK-27363) Mesos support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27363: -- Description: Design and implement Mesos support for GPU-aware scheduling. > Mesos support for GPU-aware scheduling > -- > > Key: SPARK-27363 > URL: https://issues.apache.org/jira/browse/SPARK-27363 > Project: Spark > Issue Type: Story > Components: Mesos >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement Mesos support for GPU-aware scheduling.
[jira] [Updated] (SPARK-27362) Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27362: -- Shepherd: Yinan Li > Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27362 > URL: https://issues.apache.org/jira/browse/SPARK-27362 > Project: Spark > Issue Type: Story > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement k8s support for GPU-aware scheduling.
[jira] [Updated] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27365: -- Description: Upgrade Spark Jenkins to install GPU cards and run GPU integration tests triggered by "GPU" in PRs. cc: [~afeng] [~shaneknapp] was:Upgrade Spark Jenkins to install GPU cards and run GPU integration tests triggered by "GPU" in PRs. > Spark Jenkins supports testing GPU-aware scheduling features > > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Upgrade Spark Jenkins to install GPU cards and run GPU integration tests > triggered by "GPU" in PRs. > cc: [~afeng] [~shaneknapp]
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Description: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.2 GPU support. was: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.2 GPU support. * Dynamic allocation. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * YARN can pass GPU info to Spark executor. > * Integrate with YARN 3.2 GPU support.
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Description: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.2 GPU support. * Dynamic allocation. was: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.x GPU support. * Dynamic allocation. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * YARN can pass GPU info to Spark executor. > * Integrate with YARN 3.2 GPU support. > * Dynamic allocation.
[jira] [Updated] (SPARK-27377) Upgrade YARN to 3.1.2+ to support GPU
[ https://issues.apache.org/jira/browse/SPARK-27377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27377: -- Description: This task should be covered by SPARK-23710. Just a placeholder here. > Upgrade YARN to 3.1.2+ to support GPU > - > > Key: SPARK-27377 > URL: https://issues.apache.org/jira/browse/SPARK-27377 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > This task should be covered by SPARK-23710. Just a placeholder here.
[jira] [Updated] (SPARK-23710) Upgrade the built-in Hive to 2.3.4 for hadoop-3.2
[ https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23710: Summary: Upgrade the built-in Hive to 2.3.4 for hadoop-3.2 (was: Upgrade the built-in Hive to 2.3.4 for hadoop-3.1) > Upgrade the built-in Hive to 2.3.4 for hadoop-3.2 > - > > Key: SPARK-23710 > URL: https://issues.apache.org/jira/browse/SPARK-23710 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Critical > > Spark fails to run on Hadoop 3.x because Hive's ShimLoader considers Hadoop > 3.x to be an unknown Hadoop version; see SPARK-18673 and HIVE-16081 for more > details. So we need to upgrade the built-in Hive for Hadoop 3.x. This is an > umbrella JIRA to track this upgrade. > > *Upgrade Plan*: > # SPARK-27054 Remove the Calcite dependency. This can avoid some jar > conflicts. > # SPARK-23749 Replace built-in Hive API (isSub/toKryo) and remove > OrcProto.Type usage > # SPARK-27158, SPARK-27130 Update dev/* to support dynamically changing profiles > when testing > # Fix the ORC dependency conflict so that tests pass on Hive 1.2.1 and > compilation passes on Hive 2.3.4 > # Add an empty hive-thriftserverV2 module, so we can run all test cases > in the next step > # Make the Hadoop-3.1 build with Hive 2.3.4 pass tests > # Adapt hive-thriftserverV2 from hive-thriftserver using Hive 2.3.4's > [TCLIService.thrift|https://github.com/apache/hive/blob/rel/release-2.3.4/service-rpc/if/TCLIService.thrift] > > I have completed the [initial > work|https://github.com/apache/spark/pull/24044] and plan to finish this > upgrade step by step. > >
[jira] [Updated] (SPARK-27382) Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27382: -- Priority: Minor (was: Trivial) > Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite > -- > > Key: SPARK-27382 > URL: https://issues.apache.org/jira/browse/SPARK-27382 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.4.2, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor >
[jira] [Created] (SPARK-27382) Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite
Dongjoon Hyun created SPARK-27382: - Summary: Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite Key: SPARK-27382 URL: https://issues.apache.org/jira/browse/SPARK-27382 Project: Spark Issue Type: Task Components: SQL, Tests Affects Versions: 2.4.2, 3.0.0 Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-27024) Executor interface for cluster managers to support GPU resources
[ https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27024: -- Issue Type: Story (was: Task) > Executor interface for cluster managers to support GPU resources > > > Key: SPARK-27024 > URL: https://issues.apache.org/jira/browse/SPARK-27024 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Thomas Graves >Priority: Major > > The executor interface shall deal with the resources allocated to the > executor by cluster managers (Standalone, YARN, Kubernetes), so the Spark > executor doesn't need to be involved in GPU discovery and allocation, which > shall be handled by the cluster managers. However, an executor needs to sync with > the driver to expose available resources to support task scheduling.
[jira] [Updated] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24615: -- Epic Name: GPU-aware Scheduling (was: Support GPU Scheduling) > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Xingbo Jiang >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them, and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap in unifying big data and AI workloads > and making life simpler for end users. > To make Spark aware of GPUs, we shall make two major changes at a high level: > * At the cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update the scheduler to understand the GPUs available > on executors and the GPU requirements of user tasks, and to assign GPUs to tasks properly. 
> Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have the necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark.
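The scheduler change described above (executors advertise GPU addresses, tasks declare GPU requirements, the scheduler hands out concrete addresses) can be sketched as a toy in plain Python. This is not Spark's actual scheduler code, just the core bookkeeping idea from the SPIP; all names are illustrative.

```python
def assign_gpus(executors, tasks):
    """Toy GPU-slot-aware placement.

    executors: {executor_id: [free GPU addresses]}
    tasks: [(task_id, gpus_needed)]
    Returns {task_id: (executor_id, [assigned addresses])} for tasks that
    could be placed; tasks that fit nowhere are simply left out.
    """
    placement = {}
    for task_id, need in tasks:
        for exec_id, free in executors.items():
            if len(free) >= need:
                placement[task_id] = (exec_id, free[:need])
                executors[exec_id] = free[need:]  # mark addresses as taken
                break
    return placement

executors = {"exec-1": ["0", "1"], "exec-2": ["0"]}
tasks = [("t1", 2), ("t2", 1), ("t3", 1)]  # t3 cannot be placed
print(assign_gpus(executors, tasks))
# {'t1': ('exec-1', ['0', '1']), 't2': ('exec-2', ['0'])}
```

Note the key design point the SPIP calls out: tasks receive concrete GPU addresses, not just counts, so the assignment can be passed down to user code (e.g. to pin a deep-learning framework to a specific device).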
[jira] [Created] (SPARK-27380) Get and install GPU cards to Jenkins machines
Xiangrui Meng created SPARK-27380: - Summary: Get and install GPU cards to Jenkins machines Key: SPARK-27380 URL: https://issues.apache.org/jira/browse/SPARK-27380 Project: Spark Issue Type: Sub-task Components: jenkins Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Created] (SPARK-27381) Design: Spark Jenkins supports GPU integration tests
Xiangrui Meng created SPARK-27381: - Summary: Design: Spark Jenkins supports GPU integration tests Key: SPARK-27381 URL: https://issues.apache.org/jira/browse/SPARK-27381 Project: Spark Issue Type: Sub-task Components: jenkins Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27365: -- Description: Upgrade Spark Jenkins to install GPU cards and run GPU integration tests triggered by "GPU" in PRs. > Spark Jenkins supports testing GPU-aware scheduling features > > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Upgrade Spark Jenkins to install GPU cards and run GPU integration tests > triggered by "GPU" in PRs.
[jira] [Created] (SPARK-27379) YARN passes GPU info to Spark executor
Xiangrui Meng created SPARK-27379: - Summary: YARN passes GPU info to Spark executor Key: SPARK-27379 URL: https://issues.apache.org/jira/browse/SPARK-27379 Project: Spark Issue Type: Sub-task Components: Spark Core, YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Created] (SPARK-27378) spark-submit requests GPUs in YARN mode
Xiangrui Meng created SPARK-27378: - Summary: spark-submit requests GPUs in YARN mode Key: SPARK-27378 URL: https://issues.apache.org/jira/browse/SPARK-27378 Project: Spark Issue Type: Sub-task Components: Spark Submit, YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Created] (SPARK-27377) Upgrade YARN to 3.1.2+ to support GPU
Xiangrui Meng created SPARK-27377: - Summary: Upgrade YARN to 3.1.2+ to support GPU Key: SPARK-27377 URL: https://issues.apache.org/jira/browse/SPARK-27377 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Description: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.x GPU support. * Dynamic allocation. was: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN GPU support. * Dynamic allocation. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * YARN can pass GPU info to Spark executor. > * Integrate with YARN 3.x GPU support. > * Dynamic allocation.
[jira] [Created] (SPARK-27376) Design: YARN supports Spark GPU-aware scheduling
Xiangrui Meng created SPARK-27376: - Summary: Design: YARN supports Spark GPU-aware scheduling Key: SPARK-27376 URL: https://issues.apache.org/jira/browse/SPARK-27376 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Description: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN GPU support. * Dynamic allocation. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * YARN can pass GPU info to Spark executor. > * Integrate with YARN GPU support. > * Dynamic allocation.
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform(df)
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Summary: cache not working after discretizer.fit(df).transform(df) (was: cache not working after discretizer.fit(df).transform operation) > cache not working after discretizer.fit(df).transform(df) > - > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. > If the cache works, col(r1) should be equal to col(r2) in the output of dfj.show(). > However, after fitting and transforming the DataFrame with the discretizer, col(r1) and col(r2) > are different. > > {noformat} > from pyspark.ml.feature import QuantileDiscretizer > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType > spark.catalog.clearCache() > import random > random.seed(123) > @udf(IntegerType()) > def ri(): > return random.choice([1,2,3,4,5,6,7,8,9]) > df = spark.range(100).repartition("id") > # remove the discretizer part and col(r1) will be equal to col(r2) > discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", > outputCol="quantileNo") > df = discretizer.fit(df).transform(df) > # if we add the following line to copy df, col(r1) will also become equal to > col(r2) > # df = df.rdd.toDF() > df = df.withColumn("r", ri()).cache() > df1 = df.withColumnRenamed("r", "r1") > df2 = df.withColumnRenamed("r", "r2") > df1.join(df2, "id").explain() > dfj = df1.join(df2, "id") > dfj.select("id", "r1", "r2").show(5) > > The result is shown below; we see that col(r1) and col(r2) are different. > The physical plan shows that the cache() is missed in the join operation. > To avoid it, I can either add df = df.rdd.toDF() before creating df1 and df2, or > remove the discretizer fit and transform; either way, col(r1) and col(r2) become > identical. 
> > == Physical Plan == > *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, > r2#15649] > +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight > :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS > quantileNo#15622, pythonUDF0#15661 AS r1#15645] > : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] > : +- Exchange hashpartitioning(id#15612L, 24) > : +- *(1) Range (0, 100, step=1, splits=6) > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])) > +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS > quantileNo#15653, pythonUDF0#15662 AS r2#15649] > +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] > +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) > +---+---+---+ > | id| r1| r2| > +---+---+---+ > | 28| 9| 3| > | 30| 3| 6| > | 88| 1| 9| > | 67| 3| 3| > | 66| 1| 5| > +---+---+---+ > only showing top 5 rows > > {noformat}
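The essence of the bug report above is easier to see without a Spark cluster: when a nondeterministic function is re-evaluated once per consumer instead of being materialized once, two consumers of "the same" column diverge. The toy below is plain Python, not Spark code; `nondeterministic_udf` is a hypothetical stand-in for the `ri()` UDF, and `lru_cache` stands in for what `.cache()` is expected to guarantee.

```python
from functools import lru_cache
import itertools

counter = itertools.count(1)

def nondeterministic_udf(i):
    """Stand-in for the nondeterministic UDF ri(): a fresh value per call."""
    return next(counter)

# "Cache missed" (the reported bug): each join branch re-evaluates the
# UDF per row, so r1 and r2 disagree even though both are derived from
# the same column.
r1 = [nondeterministic_udf(i) for i in range(3)]
r2 = [nondeterministic_udf(i) for i in range(3)]
uncached_equal = (r1 == r2)   # False: [1, 2, 3] vs [4, 5, 6]

# "Cache hit" (what .cache() should guarantee): evaluate once, and every
# downstream consumer reads the materialized values.
memoized = lru_cache(maxsize=None)(nondeterministic_udf)
r1 = [memoized(i) for i in range(3)]
r2 = [memoized(i) for i in range(3)]
cached_equal = (r1 == r2)     # True: both read [7, 8, 9]

print(uncached_equal, cached_equal)  # False True
```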
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. If cache works, col(r1) should be equal to col(r2) in the output dfj.show(). However, after using discretizer fit and transform DF, col(r1) and col(r2) are different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} was: Below gives an example. If cache works, col(r1) should be equal to col(r2) in the output dfj.show(). However, after using discretizer fit and transform DF, col(r1) and col(r2) are different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. 
To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} > cache not working after discretizer.fit(df).transform operation > --- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. > If cache works,
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. If cache works, col(r1) should be equal to col(r2) in the output dfj.show(). However, after using discretizer fit and transform DF, col(r1) and col(r2) are different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} was: Below gives an example. If cache works, col(r1) in the output should be equal to col(r2). However, after using discretizer fit and transform DF, col(r1) and col(r2) becomes different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. 
To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} > cache not working after discretizer.fit(df).transform operation > --- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. > If cache works,
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. If cache works, col(r1) in the output should be equal to col(r2). However, after using discretizer fit and transform DF, col(r1) and col(r2) becomes different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} was: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. 
On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} > cache not working after discretizer.fit(df).transform operation > --- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. > If cache
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} was: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. 
On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows > cache not working after discretizer.fit(df).transform operation > --- > > Key:
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows was: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. 
However, after using discretizer fit and transformation DF, col(r1) and col(r2) become different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows > cache not working after discretizer.fit(df).transform operation > 
--- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 >
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Summary: cache not working after discretizer.fit(df).transform operation (was: cache not working after discretizer.fit(df).transform) > cache not working after discretizer.fit(df).transform operation > --- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. col(r1) should be equal to col(r2) if cache operation > works. However, after using discretizer fit and transformation DF, col(r1) > and col(r2) becomes different > > > spark.catalog.clearCache() > import random > random.seed(123) > @udf(IntegerType()) > def ri(): > return random.choice([1,2,3,4,5,6,7,8,9]) > df = spark.range(100).repartition("id") > #remove discretizer part, col(r1) will be equal to col(r2) > discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", > outputCol="quantileNo") > df = discretizer.fit(df).transform(df) > df = df.withColumn("r", ri()).cache() > df1 = df.withColumnRenamed("r", "r1") > df2 = df.withColumnRenamed("r", "r2") > df1.join(df2, "id").explain() > dfj = df1.join(df2, "id") > dfj.select("id", "r1", "r2").show(5) > > The result is shown as below, we see that col(r1) and col(r2) are different. > The physical plan of join operation shows that the cache() is missed. On the > other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, > or if we remove discretizer fit and transformation, col(r1) and col(r2) > become the same. 
> > == Physical Plan == > *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, > r2#15649] > +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight > :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS > quantileNo#15622, pythonUDF0#15661 AS r1#15645] > : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] > : +- Exchange hashpartitioning(id#15612L, 24) > : +- *(1) Range (0, 100, step=1, splits=6) > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])) > +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS > quantileNo#15653, pythonUDF0#15662 AS r2#15649] > +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] > +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) > +---+---+---+ > | id| r1| r2| > +---+---+---+ > | 28| 9| 3| > | 30| 3| 6| > | 88| 1| 9| > | 67| 3| 3| > | 66| 1| 5| > +---+---+---+ > only showing top 5 rows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows was: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri():( return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. 
On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows > cache not working after discretizer.fit(df).transform > - > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. col(r1) should be equal to col(r2) if cache operation > works. However, after using discretizer fit and transformation DF, col(r1) > and col(r2) becomes different > > > spark.catalog.clearCache() > import random > random.seed(123) >
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Summary: cache not working after discretizer.fit(df).transform (was: cache not working after call discretizer.fit(df).transform) > cache not working after discretizer.fit(df).transform > - > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. col(r1) should be equal to col(r2) if cache operation > works. However, after using discretizer fit and transformation DF, col(r1) > and col(r2) becomes different > > > spark.catalog.clearCache() > import random > random.seed(123) > @udf(IntegerType()) > def ri():( > return random.choice([1,2,3,4,5,6,7,8,9]) > df = spark.range(100).repartition("id") > #remove discretizer part, col(r1) will be equal to col(r2) > discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", > outputCol="quantileNo") > df = discretizer.fit(df).transform(df) > df = df.withColumn("r", ri()).cache() > df1 = df.withColumnRenamed("r", "r1") > df2 = df.withColumnRenamed("r", "r2") > df1.join(df2, "id").explain() > dfj = df1.join(df2, "id") > dfj.select("id", "r1", "r2").show(5) > > The result is shown as below, we see that col(r1) and col(r2) are different. > The physical plan of join operation shows that the cache() is missed. On the > other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, > or if we remove discretizer fit and transformation, col(r1) and col(r2) > become the same. 
> > == Physical Plan == > *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, > r2#15649] > +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight > :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS > quantileNo#15622, pythonUDF0#15661 AS r1#15645] > : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] > : +- Exchange hashpartitioning(id#15612L, 24) > : +- *(1) Range (0, 100, step=1, splits=6) > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])) > +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS > quantileNo#15653, pythonUDF0#15662 AS r2#15649] > +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] > +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) > +---+---+---+ > | id| r1| r2| > +---+---+---+ > | 28| 9| 3| > | 30| 3| 6| > | 88| 1| 9| > | 67| 3| 3| > | 66| 1| 5| > +---+---+---+ > only showing top 5 rows
[jira] [Created] (SPARK-27375) cache not working after call discretizer.fit(df).transform
Zhenyi Lin created SPARK-27375: -- Summary: cache not working after call discretizer.fit(df).transform Key: SPARK-27375 URL: https://issues.apache.org/jira/browse/SPARK-27375 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 2.3.0 Reporter: Zhenyi Lin Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) become different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows
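For readers outside the thread, the failure mode above can be mimicked without Spark at all. The sketch below is illustrative only: ri_column is a hypothetical stand-in for the ri() UDF, not Spark internals. When the cached result is not reused, each branch of the self-join re-evaluates a non-deterministic UDF and the two copies of the column disagree; a materialized (cached) result is read identically by both branches.

```python
import random

random.seed(123)

# Hypothetical stand-in for the non-deterministic ri() UDF from the report:
# every evaluation draws a fresh random value per row.
def ri_column(n_rows=100):
    return [random.choice([1, 2, 3, 4, 5, 6, 7, 8, 9]) for _ in range(n_rows)]

# Cache missed: each side of the self-join re-runs the UDF independently,
# so the two "copies" of the column can disagree, as in the bug report.
r1 = ri_column()
r2 = ri_column()

# Cache hit: the column is materialized once; both sides read the same values.
cached = ri_column()
c1, c2 = cached, cached

print(c1 == c2)  # True
print(r1 == r2)  # almost surely False
```

Persisting the DataFrame is meant to play the role of cached here, which is why a missed cache() in the physical plan shows up as r1 differing from r2.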
[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
[ https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809463#comment-16809463 ] Mahima Khatri commented on SPARK-27298: --- Sure, I will try the code with 2.3.3 and check.
> Dataset except operation gives different results(dataset count) on Spark
> 2.3.0 Windows and Spark 2.3.0 Linux environment
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Mahima Khatri
> Priority: Major
> Labels: data-loss
> Attachments: Console-Result-Windows.txt, console-result-LinuxonVM.txt, customer.csv, pom.xml
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
>
> public class ExcludeInTesting {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("ExcludeInTesting")
>         .config("spark.some.config.option", "some-value")
>         .getOrCreate();
>     Dataset<Row> dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
>         .option("header", "true")
>         .option("delimiter", "|")
>         .option("inferSchema", "true")
>         //.load("E:/resources/customer.csv"); // local; below path for VM
>         .load("/home/myproject/bda/home/bin/customer.csv");
>     dataReadFromCSV.printSchema();
>     dataReadFromCSV.show();
>     // Adding an extra step of saving to db and then loading it again
>     dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
>     Dataset<Row> dataLoaded = spark.sql("select * from customer");
>     // Gender EQ M
>     Column genderCol = dataLoaded.col("Gender");
>     Dataset<Row> onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
>     //Dataset<Row> onlyMaleDS = spark.sql("select count(*) from customer where Gender='M'");
>     onlyMaleDS.show();
>     System.out.println("The count of Male customers is :" + onlyMaleDS.count());
>     System.out.println("*");
>     // Income in the list
>     Object[] valuesArray = new Object[5];
>     valuesArray[0] = 503.65;
>     valuesArray[1] = 495.54;
>     valuesArray[2] = 486.82;
>     valuesArray[3] = 481.28;
>     valuesArray[4] = 479.79;
>     Column incomeCol = dataLoaded.col("Income");
>     Dataset<Row> incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) valuesArray));
>     System.out.println("The count of customers satisfying Income is :" + incomeMatchingSet.count());
>     System.out.println("*");
>     Dataset<Row> maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
>     System.out.println("The count of final customers is :" + maleExcptIncomeMatch.count());
>     System.out.println("*");
>   }
> }
> {code}
> When the above code is executed on Spark 2.3.0, it gives the below different results:
> *Windows*: The code gives the correct count of the dataset, 148237.
> *Linux*: The code gives a different {color:#172b4d}count of the dataset, 129532.{color}
>
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used (attached)
> 3. Windows spec: Windows 10, 64-bit OS
> 4. Linux spec (running on Oracle VM VirtualBox)
> Specifications: \{as captured from Vbox.log}
> 00:00:26.112908 VMMDev: Guest Additions information report: Version 5.0.32 r112930 '5.0.32_Ubuntu'
> 00:00:26.112996 VMMDev: Guest Additions information report: Interface = 0x00010004 osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
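{{Dataset.except}} is a distinct set difference: the result keeps the rows of the left dataset that do not occur in the right one, with duplicates removed. Any platform-dependent difference in how rows are parsed or compared (the Income column is floating point) changes membership, and therefore the count. A minimal, self-contained sketch of those semantics with plain Java collections (illustrative only, not the Spark implementation):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ExceptSemantics {
    // Distinct set difference, mirroring Dataset.except: elements of `left`
    // that do not appear in `right`, with duplicates collapsed.
    static <T> Set<T> except(Collection<T> left, Collection<T> right) {
        Set<T> result = new LinkedHashSet<>(left);   // de-duplicates the left side
        result.removeAll(new HashSet<>(right));      // drops rows present in the right side
        return result;
    }

    public static void main(String[] args) {
        List<Double> maleIncomes = Arrays.asList(503.65, 100.0, 200.0, 503.65);
        List<Double> matchedIncomes = Arrays.asList(503.65, 495.54);
        // 503.65 is removed (it occurs on the right); its duplicate does not
        // resurface, because except has distinct semantics.
        System.out.println(except(maleIncomes, matchedIncomes)); // [100.0, 200.0]
    }
}
```

If CSV inference produced a value such as 503.6500000000001 on one platform and exactly 503.65 on the other, the corresponding row would survive the difference on the first platform only — which is the kind of divergence worth checking between the two environments.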
[jira] [Assigned] (SPARK-27338) Deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
[ https://issues.apache.org/jira/browse/SPARK-27338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27338: --- Assignee: Venkata krishnan Sowrirajan > Deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator > - > > Key: SPARK-27338 > URL: https://issues.apache.org/jira/browse/SPARK-27338 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Major > > We saw a deadlock similar to > https://issues.apache.org/jira/browse/SPARK-26265 happening between > TaskMemoryManager and UnsafeExternalSorter$SpillableIterator. > Jstack output: > jstack information as follows: > {code:java} > Found one Java-level deadlock: > = > "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python": > waiting to lock monitor 0x7fce56409088 (object 0x0005700a2f98, a > org.apache.spark.memory.TaskMemoryManager), > which is held by "Executor task launch worker for task 2203" > "Executor task launch worker for task 2203": > waiting to lock monitor 0x007cd878 (object 0x0005701a0eb0, a > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator), > which is held by "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python" > Java stack information for the threads listed above: > === > "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python": > at > org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:334) > - waiting to lock <0x0005700a2f98> (a > org.apache.spark.memory.TaskMemoryManager) > at > org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.access$1100(UnsafeExternalSorter.java:48) > at >
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:583) > - locked <0x0005701a0eb0> (a > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:187) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:174) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$2.hasNext(WholeStageCodegenExec.scala:638) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1073) > at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1089) > at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1127) > at > scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at > 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2067) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194) > "Executor task launch worker for task 2203": > at >
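The jstack above is the classic lock-ordering deadlock: the stdout writer holds the {{SpillableIterator}} monitor and waits for the {{TaskMemoryManager}} monitor, while the task thread holds them in the opposite order. A self-contained sketch of the same shape (the two plain objects below merely stand in for the Spark classes, which are not used here), with {{ThreadMXBean}} detecting the cycle:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;

public class DeadlockDemo {
    static Thread locker(Object first, Object second, CountDownLatch bothHeld) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHeld.countDown();
                try {
                    bothHeld.await();         // wait until both threads hold their first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) { }     // blocks forever: the other thread holds it
            }
        });
        t.setDaemon(true);                    // do not keep the JVM alive
        return t;
    }

    // Returns how many threads the JVM reports as deadlocked (expected: 2).
    static int deadlockedCount() throws InterruptedException {
        Object memoryManager = new Object();      // stand-in for TaskMemoryManager
        Object spillableIterator = new Object();  // stand-in for SpillableIterator
        CountDownLatch bothHeld = new CountDownLatch(2);
        locker(memoryManager, spillableIterator, bothHeld).start();
        locker(spillableIterator, memoryManager, bothHeld).start();  // opposite order
        long[] ids = null;
        for (int i = 0; i < 50 && ids == null; i++) {  // poll until the cycle is visible
            Thread.sleep(100);
            ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        }
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(deadlockedCount() + " threads deadlocked");
    }
}
```

The usual fix for this pattern — as in the linked SPARK-26265 — is to make both code paths take the locks in one agreed order, or to move the work that needs the second lock outside the first monitor.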
[jira] [Resolved] (SPARK-27338) Deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
[ https://issues.apache.org/jira/browse/SPARK-27338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27338. - Resolution: Fixed Fix Version/s: 2.3.4 2.4.2 3.0.0 Issue resolved by pull request 24265 [https://github.com/apache/spark/pull/24265] > Deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator > - > > Key: SPARK-27338 > URL: https://issues.apache.org/jira/browse/SPARK-27338 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.0.0, 2.4.2, 2.3.4 > > > We saw a deadlock similar to > https://issues.apache.org/jira/browse/SPARK-26265 happening between > TaskMemoryManager and UnsafeExternalSorter$SpillableIterator. > Jstack output: > jstack information as follows: > {code:java} > Found one Java-level deadlock: > = > "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python": > waiting to lock monitor 0x7fce56409088 (object 0x0005700a2f98, a > org.apache.spark.memory.TaskMemoryManager), > which is held by "Executor task launch worker for task 2203" > "Executor task launch worker for task 2203": > waiting to lock monitor 0x007cd878 (object 0x0005701a0eb0, a > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator), > which is held by "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python" > Java stack information for the threads listed above: > === > "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python": > at > org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:334) > - waiting to lock <0x0005700a2f98> (a > org.apache.spark.memory.TaskMemoryManager) > at > org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) > at >
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.access$1100(UnsafeExternalSorter.java:48) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:583) > - locked <0x0005701a0eb0> (a > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:187) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:174) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$2.hasNext(WholeStageCodegenExec.scala:638) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1073) > at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1089) > at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1127) > at > scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2067) >
[jira] [Updated] (SPARK-27366) Spark scheduler internal changes to support GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27366: -- Description: Update Spark job scheduler to support accelerator resource requests submitted at application level. > Spark scheduler internal changes to support GPU scheduling > -- > > Key: SPARK-27366 > URL: https://issues.apache.org/jira/browse/SPARK-27366 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xingbo Jiang >Priority: Major > > Update Spark job scheduler to support accelerator resource requests submitted > at application level. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27374) Fetch assigned resources from TaskContext
Xiangrui Meng created SPARK-27374: - Summary: Fetch assigned resources from TaskContext Key: SPARK-27374 URL: https://issues.apache.org/jira/browse/SPARK-27374 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27365: -- Summary: Spark Jenkins supports testing GPU-aware scheduling features (was: Spark Jenkins to support testing GPU-aware scheduling features) > Spark Jenkins supports testing GPU-aware scheduling features > > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27365) Spark Jenkins to support testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27365: -- Summary: Spark Jenkins to support testing GPU-aware scheduling features (was: Spark Jenkins setup to support GPU-aware scheduling) > Spark Jenkins to support testing GPU-aware scheduling features > -- > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27373) Design: Kubernetes support for GPU-aware scheduling
Xiangrui Meng created SPARK-27373: - Summary: Design: Kubernetes support for GPU-aware scheduling Key: SPARK-27373 URL: https://issues.apache.org/jira/browse/SPARK-27373 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27362) Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27362: -- Description: Design and implement k8s support for GPU-aware scheduling. > Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27362 > URL: https://issues.apache.org/jira/browse/SPARK-27362 > Project: Spark > Issue Type: Story > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement k8s support for GPU-aware scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27360: -- Description: Design and implement standalone manager support for GPU-aware scheduling: 1. static conf to describe resources 2. spark-submit to request resources 3. auto discovery of GPUs 4. executor process isolation was: Design work for the standalone manager to support GPU-aware scheduling: 1. static conf to describe resources 2. spark-submit to request resources 2. auto discovery of GPUs 3. executor process isolation > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design and implement standalone manager support for GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 3. auto discovery of GPUs > 4. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27372) Standalone executor process-level isolation to support GPU scheduling
Xiangrui Meng created SPARK-27372: - Summary: Standalone executor process-level isolation to support GPU scheduling Key: SPARK-27372 URL: https://issues.apache.org/jira/browse/SPARK-27372 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng As an admin, I can configure standalone to have multiple executor processes on the same worker node and processes are configured via cgroups so they only have access to assigned GPUs. So I don't need to worry about resource contention between processes on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27371) Standalone worker can auto discover GPUs
Xiangrui Meng created SPARK-27371: - Summary: Standalone worker can auto discover GPUs Key: SPARK-27371 URL: https://issues.apache.org/jira/browse/SPARK-27371 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng As an admin, I can let Spark Standalone worker automatically discover GPUs installed on worker nodes. So I don't need to manually configure them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27370) spark-submit requests GPUs in standalone mode
Xiangrui Meng created SPARK-27370: - Summary: spark-submit requests GPUs in standalone mode Key: SPARK-27370 URL: https://issues.apache.org/jira/browse/SPARK-27370 Project: Spark Issue Type: Sub-task Components: Spark Core, Spark Submit Affects Versions: 3.0.0 Reporter: Xiangrui Meng As a user, I can use spark-submit to request GPUs per task in standalone mode when I submit a Spark application. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27360: -- Description: Design work for the standalone manager to support GPU-aware scheduling: 1. static conf to describe resources 2. spark-submit to request resources 3. auto discovery of GPUs 4. executor process isolation was: Design work for the standalone manager to support GPU-aware scheduling: 1. static conf to list GPU devices 2. auto discovery of GPUs 3. executor process isolation > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design work for the standalone manager to support GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 3. auto discovery of GPUs > 4. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-27216. -- Resolution: Fixed Assignee: Lantao Jin > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But > RoaringBitmap-0.5.11 couldn't be serialized/deserialized with the unsafe KryoSerializer. > We can use the UT below to reproduce it: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrade to the latest version, 0.7.45, to fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
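The failure mode here is byte-order sensitivity: Kryo's unsafe input/output classes read and write in the platform's native byte order, unlike the stream-based path, so a serde implementation that assumes one fixed order sees its bytes permuted. A minimal, self-contained illustration of that class of bug with plain {{java.nio}} (illustrative only — not the actual RoaringBitmap or Kryo code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderMismatch {
    // Write with one byte order, read with another: the round trip "succeeds"
    // without any error, but yields a different value -- the same silent
    // corruption you get when big-endian serde meets native-order (unsafe) serde.
    static int roundTrip(int value, ByteOrder writeOrder, ByteOrder readOrder) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES).order(writeOrder);
        buf.putInt(value);
        buf.flip();                          // rewind for reading
        return buf.order(readOrder).getInt();
    }

    public static void main(String[] args) {
        int v = 1787;  // the bit value added to the bitmap in the report's test
        System.out.println(roundTrip(v, ByteOrder.BIG_ENDIAN, ByteOrder.BIG_ENDIAN));    // intact: 1787
        System.out.println(roundTrip(v, ByteOrder.BIG_ENDIAN, ByteOrder.LITTLE_ENDIAN)); // garbled
    }
}
```

This is why the safe {{KryoSerializer}} round trip in the quoted test passes while the {{spark.kryo.unsafe}} one fails, and why the fix is a library upgrade rather than a Spark-side change.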
[jira] [Created] (SPARK-27369) Standalone support static conf to describe GPU resources
Xiangrui Meng created SPARK-27369: - Summary: Standalone support static conf to describe GPU resources Key: SPARK-27369 URL: https://issues.apache.org/jira/browse/SPARK-27369 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809415#comment-16809415 ] Imran Rashid commented on SPARK-27216: -- Fixed by https://github.com/apache/spark/pull/24264 in master / 3.0.0 > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But > RoaringBitmap-0.5.11 couldn't be serialized/deserialized with the unsafe KryoSerializer. > We can use the UT below to reproduce it: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrade to the latest version, 0.7.45, to fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-27216: - Fix Version/s: 3.0.0 > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But > RoaringBitmap-0.5.11 couldn't be serialized/deserialized with the unsafe KryoSerializer. > We can use the UT below to reproduce it: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrade to the latest version, 0.7.45, to fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27368) Design: Standalone supports GPU scheduling
Xiangrui Meng created SPARK-27368: - Summary: Design: Standalone supports GPU scheduling Key: SPARK-27368 URL: https://issues.apache.org/jira/browse/SPARK-27368 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27360: - Assignee: Xiangrui Meng > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design work for the standalone manager to support GPU-aware scheduling: > 1. static conf to list GPU devices > 2. auto discovery of GPUs > 3. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27005) Design sketch for SPIP discussion: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27005. --- Resolution: Done Fix Version/s: 3.0.0 I'm closing the ticket since the SPIP passed the vote. I created stories for each major sub-project, and we will finalize the design there. > Design sketch for SPIP discussion: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > Fix For: 3.0.0 > > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27367) Faster RoaringBitmap Serialization with v0.8.0
Imran Rashid created SPARK-27367: Summary: Faster RoaringBitmap Serialization with v0.8.0 Key: SPARK-27367 URL: https://issues.apache.org/jira/browse/SPARK-27367 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Imran Rashid RoaringBitmap 0.8.0 adds faster serde, but also requires us to change how we call the serde routines slightly to take advantage of it. This is probably a worthwhile optimization, as every shuffle map task with a large # of partitions generates these bitmaps, and the driver especially has to deserialize many of these messages. See * https://github.com/apache/spark/pull/24264#issuecomment-479675572 * https://github.com/RoaringBitmap/RoaringBitmap/pull/325 * https://github.com/RoaringBitmap/RoaringBitmap/issues/319 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
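The linked RoaringBitmap PR adds ByteBuffer-oriented serialization, so the "change how we call the serde routines" part means switching call sites from the byte-at-a-time {{DataOutput}} shape to bulk buffer writes. A hedged, self-contained sketch of the two call shapes using a plain long[] payload (illustrative only — method names are made up, not the RoaringBitmap API):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class SerdeStyles {
    // Stream style: each word funnels through a DataOutput method call,
    // one bounds check and byte shuffle at a time.
    static byte[] viaStream(long[] words) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (long w : words) out.writeLong(w);
        return bos.toByteArray();
    }

    // Buffer style: one bulk put into a pre-sized ByteBuffer -- the call
    // shape the faster buffer-oriented serde exposes.
    static byte[] viaBuffer(long[] words) {
        ByteBuffer buf = ByteBuffer.allocate(words.length * Long.BYTES);
        buf.asLongBuffer().put(words);   // bulk copy through a long view
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        long[] bitmapWords = {0x1L, 0xFF00FF00L};
        // Both produce identical big-endian bytes; only the call shape
        // (and its cost) differs, so switching APIs is safe for the format.
        System.out.println(java.util.Arrays.equals(viaStream(bitmapWords), viaBuffer(bitmapWords)));
    }
}
```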
[jira] [Updated] (SPARK-27005) Design sketch for SPIP discussion: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27005: -- Summary: Design sketch for SPIP discussion: Accelerator-aware scheduling (was: Design sketch: Accelerator-aware scheduling) > Design sketch for SPIP discussion: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27366) Spark scheduler internal changes to support GPU scheduling
Xiangrui Meng created SPARK-27366: - Summary: Spark scheduler internal changes to support GPU scheduling Key: SPARK-27366 URL: https://issues.apache.org/jira/browse/SPARK-27366 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Xingbo Jiang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27005) Design sketch: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27005: - Assignee: Xingbo Jiang > Design sketch: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27364: -- Description: Design and implement: * General guidelines for cluster managers to understand resource requests at application start. The concrete conf/param will be covered in the design of each cluster manager. * APIs to fetch assigned resources from task context. > User-facing APIs for GPU-aware scheduling > - > > Key: SPARK-27364 > URL: https://issues.apache.org/jira/browse/SPARK-27364 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > > Design and implement: > * General guidelines for cluster managers to understand resource requests at > application start. The concrete conf/param will be covered in the design of each > cluster manager. > * APIs to fetch assigned resources from task context. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27364: -- Summary: User-facing APIs for GPU-aware scheduling (was: Design: User-facing APIs for GPU-aware scheduling) > User-facing APIs for GPU-aware scheduling > - > > Key: SPARK-27364 > URL: https://issues.apache.org/jira/browse/SPARK-27364 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Summary: YARN support for GPU-aware scheduling (was: Design: YARN support for GPU-aware scheduling) > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27363) Mesos support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27363: -- Summary: Mesos support for GPU-aware scheduling (was: Design: Mesos support for GPU-aware scheduling) > Mesos support for GPU-aware scheduling > -- > > Key: SPARK-27363 > URL: https://issues.apache.org/jira/browse/SPARK-27363 > Project: Spark > Issue Type: Story > Components: Mesos >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27362) Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27362: -- Summary: Kubernetes support for GPU-aware scheduling (was: Design: Kubernetes support for GPU-aware scheduling) > Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27362 > URL: https://issues.apache.org/jira/browse/SPARK-27362 > Project: Spark > Issue Type: Story > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27024) Executor interface for cluster managers to support GPU resources
[ https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27024: -- Summary: Executor interface for cluster managers to support GPU resources (was: Design: Executor interface for cluster managers to support GPU resources) > Executor interface for cluster managers to support GPU resources > > > Key: SPARK-27024 > URL: https://issues.apache.org/jira/browse/SPARK-27024 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Thomas Graves >Priority: Major > > The executor interface shall deal with the resources allocated to the > executor by cluster managers (Standalone, YARN, Kubernetes), so the Spark > executor does not need to be involved in GPU discovery and allocation, which > shall be handled by the cluster managers. However, an executor needs to sync > with the driver to expose its available resources to support task scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
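The executor-driver sync described in SPARK-27024 above can be sketched as a registration handshake. The message shape and names below are invented for illustration; they are not Spark's actual RPC protocol:

```python
# Hypothetical sketch of the executor-driver sync: on startup the executor
# reports the resources its cluster manager allocated to it, and the driver
# records them so the scheduler can match tasks to executors.
def make_register_message(executor_id, resources):
    # resources: mapping of resource name -> list of device addresses
    return {
        "executor_id": executor_id,
        "resources": {name: list(addrs) for name, addrs in resources.items()},
    }

class DriverBackend:
    def __init__(self):
        self.executor_resources = {}

    def handle_register(self, msg):
        # Record what each executor has, for later task placement decisions.
        self.executor_resources[msg["executor_id"]] = msg["resources"]

driver = DriverBackend()
driver.handle_register(make_register_message("exec-1", {"gpu": ["0", "1"]}))
```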
[jira] [Updated] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27360: -- Summary: Standalone cluster mode support for GPU-aware scheduling (was: Design: Standalone cluster mode support for GPU-aware scheduling) > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design work for the standalone manager to support GPU-aware scheduling: > 1. static conf to list GPU devices > 2. auto discovery of GPUs > 3. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
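Item 2 of the standalone design work above (auto discovery of GPUs) typically means running a user-provided discovery script and parsing its output. The JSON shape below is an assumption made for illustration, not necessarily the format the design settles on:

```python
import json

# Illustrative sketch of GPU auto discovery for the standalone manager:
# a worker invokes a discovery script and parses its stdout into a
# resource name plus a list of device addresses. The JSON format here
# is an assumption for the sketch.
def parse_discovery_output(raw: str):
    data = json.loads(raw)
    return data["name"], list(data["addresses"])

# Example output a discovery script might print on a 2-GPU machine.
name, addrs = parse_discovery_output('{"name": "gpu", "addresses": ["0", "1"]}')
```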
[jira] [Created] (SPARK-27365) Spark Jenkins setup to support GPU-aware scheduling
Xiangrui Meng created SPARK-27365: - Summary: Spark Jenkins setup to support GPU-aware scheduling Key: SPARK-27365 URL: https://issues.apache.org/jira/browse/SPARK-27365 Project: Spark Issue Type: Story Components: jenkins Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27024) Design: Executor interface for cluster managers to support GPU resources
[ https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27024: -- Summary: Design: Executor interface for cluster managers to support GPU resources (was: Design executor interface to support GPU resources) > Design: Executor interface for cluster managers to support GPU resources > > > Key: SPARK-27024 > URL: https://issues.apache.org/jira/browse/SPARK-27024 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Thomas Graves >Priority: Major > > The executor interface shall deal with the resources allocated to the > executor by cluster managers (Standalone, YARN, Kubernetes), so the Spark > executor does not need to be involved in GPU discovery and allocation, which > shall be handled by the cluster managers. However, an executor needs to sync > with the driver to expose its available resources to support task scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27364) Design: User-facing APIs for GPU-aware scheduling
Xiangrui Meng created SPARK-27364: - Summary: Design: User-facing APIs for GPU-aware scheduling Key: SPARK-27364 URL: https://issues.apache.org/jira/browse/SPARK-27364 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Thomas Graves -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27363) Design: Mesos support for GPU-aware scheduling
Xiangrui Meng created SPARK-27363: - Summary: Design: Mesos support for GPU-aware scheduling Key: SPARK-27363 URL: https://issues.apache.org/jira/browse/SPARK-27363 Project: Spark Issue Type: Story Components: Mesos Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27362) Design: Kubernetes support for GPU-aware scheduling
Xiangrui Meng created SPARK-27362: - Summary: Design: Kubernetes support for GPU-aware scheduling Key: SPARK-27362 URL: https://issues.apache.org/jira/browse/SPARK-27362 Project: Spark Issue Type: Story Components: Kubernetes Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27361) Design: YARN support for GPU-aware scheduling
Xiangrui Meng created SPARK-27361: - Summary: Design: YARN support for GPU-aware scheduling Key: SPARK-27361 URL: https://issues.apache.org/jira/browse/SPARK-27361 Project: Spark Issue Type: Story Components: YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27024) Design executor interface to support GPU resources
[ https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27024: - Assignee: Thomas Graves > Design executor interface to support GPU resources > -- > > Key: SPARK-27024 > URL: https://issues.apache.org/jira/browse/SPARK-27024 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Thomas Graves >Priority: Major > > The executor interface shall deal with the resources allocated to the > executor by cluster managers (Standalone, YARN, Kubernetes), so the Spark > executor does not need to be involved in GPU discovery and allocation, which > shall be handled by the cluster managers. However, an executor needs to sync > with the driver to expose its available resources to support task scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27360) Design: Standalone cluster mode support for GPU-aware scheduling
Xiangrui Meng created SPARK-27360: - Summary: Design: Standalone cluster mode support for GPU-aware scheduling Key: SPARK-27360 URL: https://issues.apache.org/jira/browse/SPARK-27360 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Design work for the standalone manager to support GPU-aware scheduling: 1. static conf to list GPU devices 2. auto discovery of GPUs 3. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27359) Joins on some array functions can be optimized
Nikolas Vanderhoof created SPARK-27359: -- Summary: Joins on some array functions can be optimized Key: SPARK-27359 URL: https://issues.apache.org/jira/browse/SPARK-27359 Project: Spark Issue Type: Improvement Components: Optimizer, SQL Affects Versions: 3.0.0 Reporter: Nikolas Vanderhoof I encounter these cases frequently, and implemented the optimization manually (as shown here). If others experience this as well, perhaps it would be good to add appropriate tree transformations into catalyst. I can create some rough draft implementations but expect I will need assistance when it comes to resolving the generating expressions in the logical plan. h1. Case 1 A join like this: {code:scala} left.join( right, arrays_overlap(left("a"), right("b")) // Creates a cartesian product in the logical plan ) {code} will produce the same results as: {code:scala} { val leftPrime = left.withColumn("exploded_a", explode(col("a"))) val rightPrime = right.withColumn("exploded_b", explode(col("b"))) leftPrime.join( rightPrime, leftPrime("exploded_a") === rightPrime("exploded_b") // Equijoin doesn't produce cartesian product ).drop("exploded_a", "exploded_b").distinct } {code} h1. Case 2 A join like this: {code:scala} left.join( right, array_contains(left("arr"), right("value")) // Cartesian product in logical plan ) {code} will produce the same results as: {code:scala} { val leftPrime = left.withColumn("exploded_arr", explode(col("arr"))) leftPrime.join( right, leftPrime("exploded_arr") === right("value") // Fast equijoin ).drop("exploded_arr").distinct } {code}
h1. Case 3 A join like this: {code:scala} left.join( right, array_contains(right("arr"), left("value")) // Cartesian product in logical plan ) {code} will produce the same results as: {code:scala} { val rightPrime = right.withColumn("exploded_arr", explode(col("arr"))) left.join( rightPrime, left("value") === rightPrime("exploded_arr") // Fast equijoin ).drop("exploded_arr").distinct } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
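The soundness of the Case 1 rewrite in SPARK-27359 above can be checked with a plain-Python model, independent of Spark: joining rows on array overlap yields the same pairs as exploding both arrays, equijoining on the exploded elements, and de-duplicating. Rows are modeled as tuples of array values for brevity:

```python
from itertools import product

# Slow plan: cartesian product of the two sides, filtered by overlap.
def join_arrays_overlap(left, right):
    return {(a, b) for a, b in product(left, right) if set(a) & set(b)}

# Fast plan: explode both sides, equijoin on elements, then distinct.
def join_via_explode(left, right):
    pairs = set()
    for a in left:
        for elem in a:          # explode the left array
            for b in right:
                if elem in b:   # equijoin: exploded_a == exploded_b
                    pairs.add((a, b))  # distinct via set semantics
    return pairs

left = [(1, 2), (3,)]
right = [(2, 4), (3, 5), (6,)]
assert join_arrays_overlap(left, right) == join_via_explode(left, right)
```

The `distinct` step is essential: when arrays share more than one element, the exploded equijoin emits the same row pair multiple times.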
[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809321#comment-16809321 ] Huon Wilson commented on SPARK-27278: - Hm, I'm not sure I understand why only half of this optimisation is being removed. This bug represents a missed case in an existing optimisation, and so if this is being reverted, shouldn't the whole existing optimisation (that is, the {{GetMapValue(CreateMap(elems), key)}} case in {{SimplifyExtractValueOps}}) also be removed? The code-size concerns apply equally well to that as to the new case here. > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Major > > With a map that isn't constant-foldable, Spark will optimise an access to a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses, in the code generation: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting in > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select.
The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
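The rewrite quoted at the end of the SPARK-27278 description above, {{GetMapValue(CreateMap(elems), key)}} => {{CaseKeyWhen(key, elems)}}, can be modeled in a few lines of plain Python to show why the two plans agree: a runtime scan of the map's key array and an unrolled chain of equality checks return the same value for every key.

```python
# Un-optimised plan: scan the map's key array at runtime.
def get_map_value(pairs, key):
    for k, v in pairs:
        if k == key:
            return v
    return None  # a missing key yields NULL

# Rewritten plan: CASE WHEN key = k1 THEN v1 WHEN key = k2 THEN v2 ... END,
# modelled as the chain of ifs the generated code unrolls into.
def case_key_when(pairs, key):
    if not pairs:
        return None
    (k, v), rest = pairs[0], pairs[1:]
    return v if key == k else case_key_when(rest, key)

pairs = [(1, "one"), (2, "two")]
# Both plans agree on present and missing keys alike.
assert all(get_map_value(pairs, k) == case_key_when(pairs, k) for k in range(4))
```

The unrolled form wins because each comparison is a branch on an already-loaded value, with no per-lookup array traversal or type dispatch; the trade-off raised in the comment above is that unrolling grows the generated code with the size of the map.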
[jira] [Commented] (SPARK-27299) Design: Property graph construction, save/load, and query APIs
[ https://issues.apache.org/jira/browse/SPARK-27299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809291#comment-16809291 ] Xiangrui Meng commented on SPARK-27299: --- [~mju] Could you post the draft design doc to JIRA? > Design: Property graph construction, save/load, and query APIs > -- > > Key: SPARK-27299 > URL: https://issues.apache.org/jira/browse/SPARK-27299 > Project: Spark > Issue Type: Story > Components: Graph >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Martin Junghanns >Priority: Major > > Design doc for property graph and Cypher queries. > * Construct a property graph. > * How nodes and relationships map to DataFrames > * Save/load. > * Cypher query. > * Support Scala/Python/Java. > * Dependencies > * Test > Out of scope: > * Graph algorithms. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27358) Update jquery to 1.12.x to address CVE
[ https://issues.apache.org/jira/browse/SPARK-27358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27358: -- Priority: Major (was: Minor) Description: jquery 1.11.1 is affected by a CVE: https://www.cvedetails.com/cve/CVE-2016-7103/ Note that I do not know whether this actually manifests as a security problem for Spark. But, we can easily update to 1.12.4 (latest 1.x version) to resolve it. (Note that https://www.cvedetails.com/cve/CVE-2015-9251/ seems to have been fixed in 1.12 but then unfixed, so this may require a much bigger jump to jquery 3.x if it's a problem; leaving that until later.) Along the way we will want to update jquery datatables to 1.10.18 to match jquery 1.12.4. Relatedly, jquery mustache 0.8.1 also has a CVE: https://snyk.io/test/npm/mustache/0.8.2 I propose to update to 2.3.12 (latest 2.x) to resolve it. Although targeted for 3.0, I believe this is back-port-able to 2.4.x if needed, assuming we find no UI issues. was: jquery 1.11.1 is affected by a CVE: https://www.cvedetails.com/cve/CVE-2016-7103/ We can easily update to 1.12.4 (latest 1.x version) to resolve it. (Note that https://www.cvedetails.com/cve/CVE-2015-9251/ seems to have been fixed in 1.12 but then unfixed, so this may require a much bigger jump to jquery 3.x if it's a problem; leaving that until later.) Along the way we will want to update jquery datatables to 1.10.18 to match jquery 1.12.4. Relatedly, jquery mustache 0.8.1 also has a CVE: https://snyk.io/test/npm/mustache/0.8.2 I propose to update to 2.3.12 (latest 2.x) to resolve it. Although targeted for 3.0, I believe this is back-port-able to 2.4.x if needed, assuming we find no UI issues. 
> Update jquery to 1.12.x to address CVE > -- > > Key: SPARK-27358 > URL: https://issues.apache.org/jira/browse/SPARK-27358 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > jquery 1.11.1 is affected by a CVE: > https://www.cvedetails.com/cve/CVE-2016-7103/ > Note that I do not know whether this actually manifests as a security problem > for Spark. But, we can easily update to 1.12.4 (latest 1.x version) to > resolve it. > (Note that https://www.cvedetails.com/cve/CVE-2015-9251/ seems to have been > fixed in 1.12 but then unfixed, so this may require a much bigger jump to > jquery 3.x if it's a problem; leaving that until later.) > Along the way we will want to update jquery datatables to 1.10.18 to match > jquery 1.12.4. > Relatedly, jquery mustache 0.8.1 also has a CVE: > https://snyk.io/test/npm/mustache/0.8.2 > I propose to update to 2.3.12 (latest 2.x) to resolve it. > Although targeted for 3.0, I believe this is back-port-able to 2.4.x if > needed, assuming we find no UI issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27358) Update jquery to 1.12.x to address CVE
Sean Owen created SPARK-27358: - Summary: Update jquery to 1.12.x to address CVE Key: SPARK-27358 URL: https://issues.apache.org/jira/browse/SPARK-27358 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Sean Owen Assignee: Sean Owen jquery 1.11.1 is affected by a CVE: https://www.cvedetails.com/cve/CVE-2016-7103/ We can easily update to 1.12.4 (latest 1.x version) to resolve it. (Note that https://www.cvedetails.com/cve/CVE-2015-9251/ seems to have been fixed in 1.12 but then unfixed, so this may require a much bigger jump to jquery 3.x if it's a problem; leaving that until later.) Along the way we will want to update jquery datatables to 1.10.18 to match jquery 1.12.4. Relatedly, jquery mustache 0.8.1 also has a CVE: https://snyk.io/test/npm/mustache/0.8.2 I propose to update to 2.3.12 (latest 2.x) to resolve it. Although targeted for 3.0, I believe this is back-port-able to 2.4.x if needed, assuming we find no UI issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27357) Cast timestamps to/from dates independently from time zones
[ https://issues.apache.org/jira/browse/SPARK-27357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-27357: --- Summary: Cast timestamps to/from dates independently from time zones (was: Convert timestamps to/from dates independently from time zones) > Cast timestamps to/from dates independently from time zones > --- > > Key: SPARK-27357 > URL: https://issues.apache.org/jira/browse/SPARK-27357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Both of Catalyst's types TIMESTAMP and DATE internally represent time > intervals since the epoch in the UTC time zone. The TIMESTAMP type holds the > number of microseconds since the epoch, and DATE the number of days since the > epoch (00:00:00 on 1 January 1970). As a consequence, the conversion between > them should be independent of the session or local time zone. This ticket > aims to fix the current behavior and make the conversion independent of time > zones. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27357) Convert timestamps to/from dates independently from time zones
Maxim Gekk created SPARK-27357: -- Summary: Convert timestamps to/from dates independently from time zones Key: SPARK-27357 URL: https://issues.apache.org/jira/browse/SPARK-27357 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Both of Catalyst's types TIMESTAMP and DATE internally represent time intervals since the epoch in the UTC time zone. The TIMESTAMP type holds the number of microseconds since the epoch, and DATE the number of days since the epoch (00:00:00 on 1 January 1970). As a consequence, the conversion between them should be independent of the session or local time zone. This ticket aims to fix the current behavior and make the conversion independent of time zones. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
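With the internal representations described in SPARK-27357 above (TIMESTAMP as microseconds since the epoch, DATE as days since 1970-01-01), the time-zone-independent cast reduces to pure integer arithmetic. A minimal sketch:

```python
# Sketch of a time-zone-independent TIMESTAMP <-> DATE cast, assuming the
# internal representations described above: both are offsets from the
# epoch in UTC, so no session or local time zone enters the computation.
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000  # 86_400_000_000

def timestamp_to_date(micros: int) -> int:
    # Floor division also handles pre-epoch timestamps correctly:
    # -1 microsecond falls on day -1 (1969-12-31), not day 0.
    return micros // MICROS_PER_DAY

def date_to_timestamp(days: int) -> int:
    # A date casts to the timestamp at 00:00:00 UTC of that day.
    return days * MICROS_PER_DAY

# 1970-01-02T12:00:00 UTC is day 1, regardless of any session time zone.
assert timestamp_to_date(36 * 60 * 60 * 1_000_000) == 1
```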
[jira] [Resolved] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27278. --- Resolution: Later We can revisit this issue and the solution when we add official "loop unrolling" support for array/map operations. cc [~smilegator] and [~cloud_fan]. > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Assignee: Marco Gaido >Priority: Major > > With a map that isn't constant-foldable, spark will optimise an access to a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses, in the code generation: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 
1 : 0); > /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting doing > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select. 
The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27278: - Assignee: (was: Marco Gaido) > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Major > > With a map that isn't constant-foldable, spark will optimise an access to a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses, in the code generation: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 
1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting doing > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select. The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-27278. - > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-27278: --- This is reverted via [https://github.com/apache/spark/commit/b51763612a71e214ed94ac5fb6843cf03d00f9f8] to be safe. Please see the discussion on the PR, [https://github.com/apache/spark/pull/24223] . > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27278: -- Fix Version/s: (was: 3.0.0) > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Assignee: Marco Gaido >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27356) File source V2: return actual schema in method `FileScan.readSchema`
Gengliang Wang created SPARK-27356: -- Summary: File source V2: return actual schema in method `FileScan.readSchema` Key: SPARK-27356 URL: https://issues.apache.org/jira/browse/SPARK-27356 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang The method `Scan.readSchema` returns the actual schema of this data source scan. In the current file source V2 framework, the schema is not returned correctly. The result should be `readDataSchema + partitionSchema`, while the framework returns the schema that is pushed down in `SupportsPushDownRequiredColumns.requiredSchema`. This is normally OK, but if there are overlapping columns between `dataSchema` and `partitionSchema`, the result of a row-based scan will be wrong. The actual schema should be `dataSchema - overlapSchema + partitionSchema`, which is different from the pushed-down `requiredSchema`. This PR is to: 1. Bug fix: fix the corner case where `dataSchema` overlaps with `partitionSchema`. 2. Improvement: prune partition column values if some of the partition columns are not required. 3. Behavior change: to keep it simple, the schema of `FileTable` is `dataSchema - overlapSchema + partitionSchema`, instead of mixing the data schema and partition schema (see `PartitioningUtils.mergeDataAndPartitionSchema`). For example, if the data schema is [a,b,c] and the partition schema is [b,d], then in V1 the schema of `HadoopFsRelation` is [a, b, c, d], while in file source V2 the schema of `FileTable` is [a, c, b, d]. Putting all the partition columns at the end of the table schema is more reasonable. Also, when there is a `select *` operation and no schema pruning, the schemas of `FileTable` and `FileScan` still match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
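The schema arithmetic `dataSchema - overlapSchema + partitionSchema` described above can be sketched in plain Python, using column-name lists in place of Spark `StructType`s. The helper name below is hypothetical, not Spark code:

```python
# Sketch of the schema computation described in SPARK-27356, with plain
# column-name lists standing in for StructTypes (illustrative helper only).

def file_table_schema(data_schema, partition_schema):
    """dataSchema - overlapSchema + partitionSchema: drop data columns that
    the partition schema also provides, then append all partition columns."""
    overlap = set(partition_schema)
    return [c for c in data_schema if c not in overlap] + partition_schema

# The example from the description: data schema [a,b,c], partition schema [b,d]
print(file_table_schema(["a", "b", "c"], ["b", "d"]))  # ['a', 'c', 'b', 'd']
```

This reproduces the [a, c, b, d] ordering the description gives for file source V2, with partition columns at the end.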
[jira] [Comment Edited] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806307#comment-16806307 ] belvey edited comment on SPARK-13510 at 4/3/19 3:07 PM: [~shenhong] , hello hongsheng, I am using spark2.0 facing the same issue, I am not sure if it's merged into spark2. it's very kind for you to post your pr. edit: I found that solution had already been added to spark2.3 and later. i am not sure if is hong shen's pr , but the solution is similar to what hong shen's said. And for spark2.3 and later we can use "spark.maxRemoteBlockSizeFetchToMem" to control the max block size allowed for shuffle fetching data that catched in memory, it's default value is (Interger.max-512) bytes. was (Author: belvey): [~shenhong] , hello hongsheng, I am using spark2.0 facing the same issue, I am not sure if it's merged into spark2. it's very kind for you to post your pr. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen >Priority: Major > Attachments: spark-13510.diff > > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. 
> {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at 
java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffling a big block (like 1G), the task will allocate > the same amount of memory, so it will easily throw "FetchFailedException: Direct buffer > memory". > If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In the MapReduce shuffle, it first judges whether the block can be cached in > memory, but Spark doesn't. > If the block is larger than what we can cache in memory, we should write it to disk. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
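For Spark 2.3 and later, the mitigation the commenter mentions is the `spark.maxRemoteBlockSizeFetchToMem` setting; blocks above the threshold are streamed to disk instead of fetched into a direct buffer. The threshold value below is illustrative (per the comment, the default is roughly Integer.MAX_VALUE - 512 bytes, so fetch-to-memory is effectively unlimited unless lowered):

```properties
# spark-defaults.conf (illustrative value): shuffle blocks larger than this
# are written to disk on fetch instead of being buffered in memory, avoiding
# the direct-buffer OOM described above.
spark.maxRemoteBlockSizeFetchToMem  200m
```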
[jira] [Created] (SPARK-27355) Make query execution more sensitive to late or lost epoch messages
Genmao Yu created SPARK-27355: - Summary: Make query execution more sensitive to late or lost epoch messages Key: SPARK-27355 URL: https://issues.apache.org/jira/browse/SPARK-27355 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Genmao Yu In SPARK-23503, we enforce sequencing of committed epochs for Continuous Execution. In case a message for epoch n is lost and epoch (n + 1) is ready for commit before epoch n is, epoch (n + 1) will wait for epoch n to be committed first. In the extreme case, we will wait for `epochBacklogQueueSize` (1 by default) epochs and then fail. There is no need to wait for such a long time before the query fails; we can make the condition more sensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
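The in-order epoch commit and backlog failure described above can be sketched as a toy Python model. The class and names below are illustrative, not the actual `ContinuousExecution` code:

```python
# Toy model (not Spark code) of the epoch sequencing in SPARK-23503: a commit
# for epoch n+1 must wait for epoch n, and the query fails once the backlog of
# blocked epochs exceeds a limit (modeling `epochBacklogQueueSize`).

class EpochTracker:
    def __init__(self, backlog_limit):
        self.backlog_limit = backlog_limit
        self.next_epoch = 0   # next epoch allowed to commit
        self.waiting = {}     # epochs ready to commit but blocked on a predecessor

    def commit(self, epoch):
        """Record that `epoch` is ready; return the epochs actually committed."""
        self.waiting[epoch] = True
        committed = []
        while self.next_epoch in self.waiting:   # commit strictly in order
            del self.waiting[self.next_epoch]
            committed.append(self.next_epoch)
            self.next_epoch += 1
        if len(self.waiting) > self.backlog_limit:
            raise RuntimeError("epoch backlog exceeded; failing the query")
        return committed

tracker = EpochTracker(backlog_limit=2)
print(tracker.commit(1))  # [] -- epoch 1 waits for the lost/late epoch 0
print(tracker.commit(0))  # [0, 1] -- epoch 0 arrives and unblocks epoch 1
```

The issue's proposal amounts to detecting a lost epoch-0 message earlier than the point where the backlog limit is finally exceeded.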
[jira] [Created] (SPARK-27354) Add a new empty hive-thriftserver module for Hive 2.3.4
Yuming Wang created SPARK-27354: --- Summary: Add a new empty hive-thriftserver module for Hive 2.3.4 Key: SPARK-27354 URL: https://issues.apache.org/jira/browse/SPARK-27354 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang When we upgrade the built-in Hive to 2.3.4, the current {{hive-thriftserver}} module is not compatible, because of Hive changes such as: # HIVE-12442 HiveServer2: Refactor/repackage HiveServer2's Thrift code so that it can be used in the tasks # HIVE-12237 Use slf4j as logging facade # HIVE-13169 HiveServer2: Support delegation token based connection when using http transport So we should add a new {{hive-thriftserver}} module for Hive 2.3.4: 1. Add a new empty module for Hive 2.3.4 named {{hive-thriftserverV2}}. 2. Ensure {{hive-thriftserver}} is only activated when testing with hadoop-2.7. 3. Ensure {{hive-thriftserverV2}} is only activated when testing with hadoop-3.2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27353) PySpark Row __repr__ bug
Ihor Bobak created SPARK-27353: -- Summary: PySpark Row __repr__ bug Key: SPARK-27353 URL: https://issues.apache.org/jira/browse/SPARK-27353 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Reporter: Ihor Bobak The Row class has this implementation of __repr__: def __repr__(self): """Printable representation of Row used in Python REPL.""" if hasattr(self, "__fields__"): return "Row(%s)" % ", ".join("%s=%r" % (k, v) for k, v in zip(self.__fields__, tuple(self))) else: return "<Row(%s)>" % ", ".join(self) The last line fails when you have a datetime.date instance in a row: TypeError Traceback (most recent call last) in 2 print(*row.values) 3 df_row = Row(*row.values) > 4 print(repr(df_row)) 5 break 6 E:\spark\spark-2.3.2-bin-without-hadoop\python\pyspark\sql\types.py in __repr__(self) 1579 for k, v in zip(self.__fields__, tuple(self))) 1580 else: -> 1581 return "<Row(%s)>" % ", ".join(self) 1582 1583 TypeError: sequence item 0: expected str instance, datetime.date found -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
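The failure and an obvious fix can be reproduced standalone: `", ".join(self)` requires every element to be a `str`, so converting elements with `str()` first avoids the TypeError. The class below is a simplified stand-in mirroring the shape of `pyspark.sql.types.Row.__repr__`, not the exact Spark patch:

```python
import datetime

# Simplified stand-in for pyspark.sql.types.Row (not the actual Spark patch).
# The buggy version did: return "<Row(%s)>" % ", ".join(self)
# which raises TypeError for any non-str element (e.g. datetime.date).
class Row(tuple):
    def __repr__(self):
        if hasattr(self, "__fields__"):
            return "Row(%s)" % ", ".join(
                "%s=%r" % (k, v) for k, v in zip(self.__fields__, tuple(self)))
        # Fix: stringify each element before joining.
        return "<Row(%s)>" % ", ".join(str(v) for v in self)

row = Row((datetime.date(2019, 4, 3), 42))
print(repr(row))  # <Row(2019-04-03, 42)>
```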
[jira] [Commented] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808588#comment-16808588 ] Wenchen Fan commented on SPARK-17592: - Shall we close it? It seems not worth making behavior changes at this point, just to be consistent with Hive. > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seems to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble). > I'm not sure there is any ISO standard for this; MySQL has the same behavior > as Hive, while PostgreSQL performs a string.toInt and throws a > NumberFormatException. > Personally I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
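The floor-vs-round difference the reporter describes can be reproduced in plain Python for the non-negative inputs shown. The `hive_cast`/`spark_cast` helpers below are hypothetical models of each engine's behavior as stated in the ticket, not engine code:

```python
import math
from decimal import Decimal, ROUND_HALF_UP

# Plain-Python models (illustrative, per the ticket's description) of the two
# behaviors for non-negative inputs:
#   Hive:  cast(s as int) ~ floor(double(s))
#   Spark: cast(s as int) ~ round-half-up(double(s))

def hive_cast(s):
    return math.floor(float(s))

def spark_cast(s):
    # Python's built-in round() rounds halves to even, so use Decimal
    # with ROUND_HALF_UP to model Spark's rounding.
    return int(Decimal(s).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

for s in ("0.4", "0.5", "0.6"):
    print(s, hive_cast(s), spark_cast(s))
# 0.4 -> Hive 0, Spark 0
# 0.5 -> Hive 0, Spark 1
# 0.6 -> Hive 0, Spark 1
```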
[jira] [Resolved] (SPARK-27344) Support the LocalDate and Instant classes in Java Bean encoders
[ https://issues.apache.org/jira/browse/SPARK-27344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27344. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24273 [https://github.com/apache/spark/pull/24273] > Support the LocalDate and Instant classes in Java Bean encoders > --- > > Key: SPARK-27344 > URL: https://issues.apache.org/jira/browse/SPARK-27344 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > - Check that Java Bean encoders support java.time.LocalDate and > java.time.Instant. Write a test for that. > - Update the comment: > https://github.com/apache/spark/pull/24249/files#diff-3e88c21c9270fef6eaf6f0e64ed81f27R152 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27344) Support the LocalDate and Instant classes in Java Bean encoders
[ https://issues.apache.org/jira/browse/SPARK-27344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27344: --- Assignee: Maxim Gekk > Support the LocalDate and Instant classes in Java Bean encoders > --- > > Key: SPARK-27344 > URL: https://issues.apache.org/jira/browse/SPARK-27344 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > - Check that Java Bean encoders support java.time.LocalDate and > java.time.Instant. Write a test for that. > - Update the comment: > https://github.com/apache/spark/pull/24249/files#diff-3e88c21c9270fef6eaf6f0e64ed81f27R152 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.
[ https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808530#comment-16808530 ] Martin Studer commented on SPARK-19860: --- We're observing the same issue with pyspark 2.3.0. It happens on an inner join of two data frames which have one single column in common (the join column). If I rename one of the columns as mentioned by [~wuchang1989] and then use a join expression the join succeeds. Isolating the problem seems difficult as it happens only in the context of a larger pipeline. > DataFrame join get conflict error if two frames has a same name column. > --- > > Key: SPARK-19860 > URL: https://issues.apache.org/jira/browse/SPARK-19860 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: wuchang >Priority: Major > > {code} > >>> print df1.collect() > [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', > in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), > Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', > in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), > Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', > in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), > Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', > in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), > Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', > in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), > Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', > in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), > Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', > in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), > Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', > in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), > Row(fdate=u'20170301', 
in_amount1=10159653)] > >>> print df2.collect() > [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', > in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), > Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', > in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), > Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', > in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), > Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', > in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), > Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', > in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), > Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', > in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), > Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', > in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), > Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', > in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), > Row(fdate=u'20170301', in_amount2=9475418)] > >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner') > 2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially > true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases. 
> Traceback (most recent call last): > File "", line 1, in > File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join > jdf = self._jdf.join(other._jdf, on._jc, how) > File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco > raise AnalysisException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.AnalysisException: u" > Failure when resolving conflicting references in Join: > 'Join Inner, (fdate#42 = fdate#42) > :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) > as int) AS in_amount1#97] > : +- Filter (inorout#44 = A) > : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, > fdate#42] > :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && > (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201))) > : +- SubqueryAlias history_transfer_v > : +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, > fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, > bankwaterid#48, waterid#49, waterstate#50, source#51] > : +- SubqueryAlias history_transfer > :+- >
[jira] [Commented] (SPARK-27335) cannot collect() from Correlation.corr
[ https://issues.apache.org/jira/browse/SPARK-27335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808442#comment-16808442 ] Natalino Busa commented on SPARK-27335: --- Unfortunately this error does not appear all the time. I have been unable to pinpoint when exactly happens it has something to do with how the SQLcontext and the SparkSession are initialized. However, I have discovered that starting SparkSession first and recreating the Singleton SQLcontext right after always works. > cannot collect() from Correlation.corr > -- > > Key: SPARK-27335 > URL: https://issues.apache.org/jira/browse/SPARK-27335 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: Natalino Busa >Priority: Major > > reproducing the bug from the example in the documentation: > > > {code:java} > import pyspark > from pyspark.ml.linalg import Vectors > from pyspark.ml.stat import Correlation > spark = pyspark.sql.SparkSession.builder.getOrCreate() > dataset = [[Vectors.dense([1, 0, 0, -2])], > [Vectors.dense([4, 5, 0, 3])], > [Vectors.dense([6, 7, 0, 8])], > [Vectors.dense([9, 0, 0, 1])]] > dataset = spark.createDataFrame(dataset, ['features']) > df = Correlation.corr(dataset, 'features', 'pearson') > df.collect() > > {code} > This produces the following stack trace: > > {code:java} > --- > AttributeErrorTraceback (most recent call last) > in () > 11 dataset = spark.createDataFrame(dataset, ['features']) > 12 df = Correlation.corr(dataset, 'features', 'pearson') > ---> 13 df.collect() > /opt/spark/python/pyspark/sql/dataframe.py in collect(self) > 530 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 531 """ > --> 532 with SCCallSiteSync(self._sc) as css: > 533 sock_info = self._jdf.collectToPython() > 534 return list(_load_from_socket(sock_info, > BatchedSerializer(PickleSerializer( > /opt/spark/python/pyspark/traceback_utils.py in __enter__(self) > 70 def __enter__(self): > 71 if SCCallSiteSync._spark_stack_depth == 0: > ---> 72 
self._context._jsc.setCallSite(self._call_site) > 73 SCCallSiteSync._spark_stack_depth += 1 > 74 > AttributeError: 'NoneType' object has no attribute 'setCallSite'{code} > > > Analysis: > Somehow the dataframe properties `df.sql_ctx.sparkSession._jsparkSession` > and `spark._jsparkSession` do not match the ones available in the active > Spark session. > The following code fixes the problem (I hope this helps you narrow down > the root cause) > > {code:java} > df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession > df._sc = spark._sc > df.collect() > >>> [Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.0556, nan, 0.4005, > >>> 0.0556, 1.0, nan, 0.9136, nan, nan, 1.0, nan, 0.4005, 0.9136, nan, 1.0], > >>> False))]{code} >
[jira] [Updated] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!
[ https://issues.apache.org/jira/browse/SPARK-27352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuan Yifan updated SPARK-27352: --- Description: Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source community in China, focusing on Big Data and AI. Recently, we have been making progress on translating Spark documents. - [Source Of Document|https://github.com/apachecn/spark-doc-zh] - [Document Preview|http://spark.apachecn.org/] There are several reasons: *1. The English level of many Chinese users is not very good.* *2. Network problems, you know (China's magic network)!* *3. Online blogs are very messy.* We are very willing to do some Chinese localization for your project. If possible, please give us some authorization. Yifan Yuan from Apache CN You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] for more details was: Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source community in China, focusing on Big Data and AI. Recently, we have been making progress on translating HBase documents. - [Source Of Document|https://github.com/apachecn/spark-doc-zh] - [Document Preview|http://spark.apachecn.org/] There are several reasons: *1. The English level of many Chinese users is not very good.* *2. Network problems, you know (China's magic network)!* *3. Online blogs are very messy.* We are very willing to do some Chinese localization for your project. If possible, please give us some authorization. Yifan Yuan from Apache CN You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] for more details > Apply for translation of the Chinese version, I hope to get authorization! 
> --- > > Key: SPARK-27352 > URL: https://issues.apache.org/jira/browse/SPARK-27352 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Yuan Yifan >Priority: Minor > > Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source > community in China, focusing on Big Data and AI. > Recently, we have been making progress on translating Spark documents. > - [Source Of Document|https://github.com/apachecn/spark-doc-zh] > - [Document Preview|http://spark.apachecn.org/] > There are several reasons: > *1. The English level of many Chinese users is not very good.* > *2. Network problems, you know (China's magic network)!* > *3. Online blogs are very messy.* > We are very willing to do some Chinese localization for your project. If > possible, please give us some authorization. > Yifan Yuan from Apache CN > You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] > for more details
[jira] [Created] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!
Yuan Yifan created SPARK-27352: -- Summary: Apply for translation of the Chinese version, I hope to get authorization! Key: SPARK-27352 URL: https://issues.apache.org/jira/browse/SPARK-27352 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 2.4.0 Reporter: Yuan Yifan Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source community in China, focusing on Big Data and AI. Recently, we have been making progress on translating HBase documents. - [Source Of Document|https://github.com/apachecn/spark-doc-zh] - [Document Preview|http://spark.apachecn.org/] There are several reasons: *1. The English level of many Chinese users is not very good.* *2. Network problems, you know (China's magic network)!* *3. Online blogs are very messy.* We are very willing to do some Chinese localization for your project. If possible, please give us some authorization. Yifan Yuan from Apache CN You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] for more details
[jira] [Created] (SPARK-27351) Wrong outputRows estimation after AggregateEstimation with only null value column
peng bo created SPARK-27351: --- Summary: Wrong outputRows estimation after AggregateEstimation with only null value column Key: SPARK-27351 URL: https://issues.apache.org/jira/browse/SPARK-27351 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.1 Reporter: peng bo The upper bound on the number of group-by output rows is the product of the distinct counts of the group-by columns. However, a column containing only null values has a distinct count of 0, so the product, and thus the estimated output row count, becomes 0, which is incorrect. Ex: col1 (distinct: 2, rowCount 2) col2 (distinct: 0, rowCount 2) group by col1, col2 Actual: output rows: 0 Expected: output rows: 2 {code:java} var outputRows: BigInt = agg.groupingExpressions.foldLeft(BigInt(1))( (res, expr) => res * childStats.attributeStats(expr.asInstanceOf[Attribute]).distinctCount) {code}
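The estimation flaw above can be sketched in plain Python (the function name and shape are hypothetical, not Spark's actual API): multiplying the raw distinct counts zeroes the estimate as soon as one group-by column is all-null, whereas clamping each factor to at least 1, one possible fix, preserves the contribution of the other columns:

```python
def estimate_output_rows(distinct_counts):
    """Upper bound on group-by output rows: the product of the per-column
    distinct counts. Clamping each factor to at least 1 prevents a column
    containing only nulls (distinct count 0) from zeroing the estimate."""
    rows = 1
    for count in distinct_counts:
        rows *= max(count, 1)
    return rows

# The report's example: col1 has 2 distinct values, col2 contains only nulls.
naive = 2 * 0                          # current behavior: estimate collapses to 0
fixed = estimate_output_rows([2, 0])   # clamped estimate: 2
print(naive, fixed)  # 0 2
```

The clamp models the fact that an all-null column still contributes one group (the null group) under SQL GROUP BY semantics; whether the actual SPARK-27351 patch takes this exact approach is not stated in the report.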