[jira] [Resolved] (SPARK-27349) Dealing with TimeVars removed in Hive 2.x
[ https://issues.apache.org/jira/browse/SPARK-27349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27349. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > Dealing with TimeVars removed in Hive 2.x > - > > Key: SPARK-27349 > URL: https://issues.apache.org/jira/browse/SPARK-27349 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > {{hive.stats.jdbc.timeout}} and {{hive.stats.retries.wait}} were removed by > HIVE-12164. We need to deal with this change when upgrading the built-in Hive to > 2.3.4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
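The compatibility pattern this sub-task needs can be sketched in a few lines: a config constant that exists in Hive 1.x is simply absent in Hive 2.x, so any lookup has to be defensive and fall back to a default rather than assume the ConfVar is present. The sketch below is illustrative plain Python, not Spark's actual (Scala) implementation; the class and attribute names are hypothetical stand-ins.

```python
# Hedged sketch of handling a removed config constant (SPARK-27349).
# Hive 1.x exposed a ConfVar for "hive.stats.jdbc.timeout"; HIVE-12164
# removed it in 2.x, so code must tolerate its absence.

class HiveConf1x:
    # hypothetical stand-in for the Hive 1.x ConfVars enum
    STATS_JDBC_TIMEOUT = "hive.stats.jdbc.timeout"

class HiveConf2x:
    # HIVE-12164 removed the stats JDBC/retry time vars in Hive 2.x
    pass

def time_var(conf_cls, name, default):
    """Look up a possibly-removed config constant, falling back to a default."""
    return getattr(conf_cls, name, default)

print(time_var(HiveConf1x, "STATS_JDBC_TIMEOUT", "30s"))  # hive.stats.jdbc.timeout
print(time_var(HiveConf2x, "STATS_JDBC_TIMEOUT", "30s"))  # 30s
```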
[jira] [Updated] (SPARK-27383) Avoid using hard-coded jar names in Hive tests
[ https://issues.apache.org/jira/browse/SPARK-27383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27383: Description: Avoid using hard-coded jar names ({{hive-contrib-0.13.1.jar}} and {{hive-hcatalog-core-0.13.1.jar}}) in Hive tests. This makes it easy to change when upgrading the built-in Hive to 2.3.4. > Avoid using hard-coded jar names in Hive tests > -- > > Key: SPARK-27383 > URL: https://issues.apache.org/jira/browse/SPARK-27383 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Avoid using hard-coded jar names ({{hive-contrib-0.13.1.jar}} and > {{hive-hcatalog-core-0.13.1.jar}}) in Hive tests. This makes it easy to > change when upgrading the built-in Hive to 2.3.4.
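The idea behind this sub-task can be sketched as: match the artifact by a version-agnostic pattern instead of pinning the full file name, so a Hive upgrade needs no test edits. The helper below is a hypothetical illustration in Python (the actual Spark tests are Scala), with a throwaway directory standing in for the test lib dir.

```python
import glob
import os
import re
import tempfile

def find_hive_jar(lib_dir, artifact):
    """Locate a jar by artifact name regardless of its version suffix.

    Hypothetical helper: tests match e.g. 'hive-contrib-*.jar' instead of
    hard-coding 'hive-contrib-0.13.1.jar'.
    """
    matches = glob.glob(os.path.join(lib_dir, "%s-*.jar" % artifact))
    # Keep only names whose suffix looks like a version number.
    pattern = re.compile(r"%s-\d+(\.\d+)*\.jar$" % re.escape(artifact))
    matches = [m for m in matches if pattern.search(os.path.basename(m))]
    if len(matches) != 1:
        raise RuntimeError("expected exactly one %s jar, found %r" % (artifact, matches))
    return matches[0]

# Demo with a throwaway directory containing a fake jar.
lib = tempfile.mkdtemp()
open(os.path.join(lib, "hive-contrib-2.3.4.jar"), "w").close()
print(find_hive_jar(lib, "hive-contrib"))  # ...hive-contrib-2.3.4.jar
```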
[jira] [Created] (SPARK-27383) Avoid using hard-coded jar names in Hive tests
Yuming Wang created SPARK-27383: --- Summary: Avoid using hard-coded jar names in Hive tests Key: SPARK-27383 URL: https://issues.apache.org/jira/browse/SPARK-27383 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang
[jira] [Commented] (SPARK-27363) Mesos support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809511#comment-16809511 ] Xiangrui Meng commented on SPARK-27363: --- [~felixcheung] [~srowen] Anyone you'd recommend to lead the design and implementation for Mesos? > Mesos support for GPU-aware scheduling > -- > > Key: SPARK-27363 > URL: https://issues.apache.org/jira/browse/SPARK-27363 > Project: Spark > Issue Type: Story > Components: Mesos >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement Mesos support for GPU-aware scheduling.
[jira] [Updated] (SPARK-27363) Mesos support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27363: -- Description: Design and implement Mesos support for GPU-aware scheduling. > Mesos support for GPU-aware scheduling > -- > > Key: SPARK-27363 > URL: https://issues.apache.org/jira/browse/SPARK-27363 > Project: Spark > Issue Type: Story > Components: Mesos >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement Mesos support for GPU-aware scheduling.
[jira] [Updated] (SPARK-27362) Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27362: -- Shepherd: Yinan Li > Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27362 > URL: https://issues.apache.org/jira/browse/SPARK-27362 > Project: Spark > Issue Type: Story > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement k8s support for GPU-aware scheduling.
[jira] [Updated] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27365: -- Description: Upgrade Spark Jenkins to install GPU cards and run GPU integration tests triggered by "GPU" in PRs. cc: [~afeng] [~shaneknapp] was:Upgrade Spark Jenkins to install GPU cards and run GPU integration tests triggered by "GPU" in PRs. > Spark Jenkins supports testing GPU-aware scheduling features > > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Upgrade Spark Jenkins to install GPU cards and run GPU integration tests > triggered by "GPU" in PRs. > cc: [~afeng] [~shaneknapp]
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Description: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.2 GPU support. was: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.2 GPU support. * Dynamic allocation. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * YARN can pass GPU info to Spark executor. > * Integrate with YARN 3.2 GPU support.
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Description: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.2 GPU support. * Dynamic allocation. was: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.x GPU support. * Dynamic allocation. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * YARN can pass GPU info to Spark executor. > * Integrate with YARN 3.2 GPU support. > * Dynamic allocation.
[jira] [Updated] (SPARK-27377) Upgrade YARN to 3.1.2+ to support GPU
[ https://issues.apache.org/jira/browse/SPARK-27377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27377: -- Description: This task should be covered by SPARK-23710. Just a placeholder here. > Upgrade YARN to 3.1.2+ to support GPU > - > > Key: SPARK-27377 > URL: https://issues.apache.org/jira/browse/SPARK-27377 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > This task should be covered by SPARK-23710. Just a placeholder here.
[jira] [Updated] (SPARK-23710) Upgrade the built-in Hive to 2.3.4 for hadoop-3.2
[ https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23710: Summary: Upgrade the built-in Hive to 2.3.4 for hadoop-3.2 (was: Upgrade the built-in Hive to 2.3.4 for hadoop-3.1) > Upgrade the built-in Hive to 2.3.4 for hadoop-3.2 > - > > Key: SPARK-23710 > URL: https://issues.apache.org/jira/browse/SPARK-23710 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Critical > > Spark fails to run on Hadoop 3.x because Hive's ShimLoader considers Hadoop > 3.x to be an unknown Hadoop version; see SPARK-18673 and HIVE-16081 for more > details. So we need to upgrade the built-in Hive for Hadoop 3.x. This is an > umbrella JIRA to track this upgrade. > > *Upgrade Plan*: > # SPARK-27054 Remove the Calcite dependency. This can avoid some jar > conflicts. > # SPARK-23749 Replace built-in Hive API (isSub/toKryo) and remove > OrcProto.Type usage > # SPARK-27158, SPARK-27130 Update dev/* to support dynamically changing profiles > when testing > # Fix the ORC dependency conflict so that tests pass on Hive 1.2.1 and > compilation passes on Hive 2.3.4 > # Add an empty hive-thriftserverV2 module, so we can run all test cases > in the next step > # Make the Hadoop-3.1 build with Hive 2.3.4 pass tests > # Adapt hive-thriftserverV2 from hive-thriftserver using Hive 2.3.4's > [TCLIService.thrift|https://github.com/apache/hive/blob/rel/release-2.3.4/service-rpc/if/TCLIService.thrift] > > I have completed the [initial > work|https://github.com/apache/spark/pull/24044] and plan to finish this > upgrade step by step. > >
[jira] [Updated] (SPARK-27382) Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27382: -- Priority: Minor (was: Trivial) > Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite > -- > > Key: SPARK-27382 > URL: https://issues.apache.org/jira/browse/SPARK-27382 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 2.4.2, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor >
[jira] [Created] (SPARK-27382) Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite
Dongjoon Hyun created SPARK-27382: - Summary: Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite Key: SPARK-27382 URL: https://issues.apache.org/jira/browse/SPARK-27382 Project: Spark Issue Type: Task Components: SQL, Tests Affects Versions: 2.4.2, 3.0.0 Reporter: Dongjoon Hyun
[jira] [Updated] (SPARK-27024) Executor interface for cluster managers to support GPU resources
[ https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27024: -- Issue Type: Story (was: Task) > Executor interface for cluster managers to support GPU resources > > > Key: SPARK-27024 > URL: https://issues.apache.org/jira/browse/SPARK-27024 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Thomas Graves >Priority: Major > > The executor interface shall deal with the resources allocated to the > executor by cluster managers (Standalone, YARN, Kubernetes), so the Spark > executor doesn't need to be involved in GPU discovery and allocation, which > shall be handled by the cluster managers. However, an executor needs to sync with > the driver to expose available resources to support task scheduling.
[jira] [Updated] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-24615: -- Epic Name: GPU-aware Scheduling (was: Support GPU Scheduling) > SPIP: Accelerator-aware task scheduling for Spark > - > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Xingbo Jiang >Priority: Major > Labels: Hydrogen, SPIP > Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, > SPIP_ Accelerator-aware scheduling.pdf > > > (The JIRA received a major update on 2019/02/28. Some comments were based on > an earlier version. Please ignore them. New comments start at > [#comment-16778026].) > h2. Background and Motivation > GPUs and other accelerators have been widely used for accelerating special > workloads, e.g., deep learning and signal processing. While users from the AI > community use GPUs heavily, they often need Apache Spark to load and process > large datasets and to handle complex data scenarios like streaming. YARN and > Kubernetes already support GPUs in their recent releases. Although Spark > supports those two cluster managers, Spark itself is not aware of GPUs > exposed by them, and hence Spark cannot properly request GPUs and schedule > them for users. This leaves a critical gap in unifying big data and AI workloads > and making life simpler for end users. > To make Spark aware of GPUs, we shall make two major changes at a high level: > * At the cluster manager level, we update or upgrade cluster managers to include > GPU support. Then we expose user interfaces for Spark to request GPUs from > them. > * Within Spark, we update the scheduler to understand the GPUs available > on executors and the GPU requirements of user tasks, and to assign GPUs to tasks properly. 
> Based on the work done in YARN and Kubernetes to support GPUs and some > offline prototypes, we could have the necessary features implemented in the next > major release of Spark. You can find a detailed scoping doc here, where we > listed user stories and their priorities. > h2. Goals > * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes. > * No regression on scheduler performance for normal jobs. > h2. Non-goals > * Fine-grained scheduling within one GPU card. > ** We treat one GPU card and its memory together as a non-divisible unit. > * Support TPU. > * Support Mesos. > * Support Windows. > h2. Target Personas > * Admins who need to configure clusters to run Spark with GPU nodes. > * Data scientists who need to build DL applications on Spark. > * Developers who need to integrate DL features on Spark.
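The scheduler change described above (executors advertise GPU addresses, tasks declare GPU requirements, the scheduler hands out concrete addresses) can be sketched as a toy in plain Python. This is not Spark's actual scheduler code, just the core bookkeeping idea from the SPIP; all names are illustrative.

```python
def assign_gpus(executors, tasks):
    """Toy GPU-slot-aware placement.

    executors: {executor_id: [free GPU addresses]}
    tasks: [(task_id, gpus_needed)]
    Returns {task_id: (executor_id, [assigned addresses])} for tasks that
    could be placed; tasks that fit nowhere are simply left out.
    """
    placement = {}
    for task_id, need in tasks:
        for exec_id, free in executors.items():
            if len(free) >= need:
                placement[task_id] = (exec_id, free[:need])
                executors[exec_id] = free[need:]  # mark addresses as taken
                break
    return placement

executors = {"exec-1": ["0", "1"], "exec-2": ["0"]}
tasks = [("t1", 2), ("t2", 1), ("t3", 1)]  # t3 cannot be placed
print(assign_gpus(executors, tasks))
# {'t1': ('exec-1', ['0', '1']), 't2': ('exec-2', ['0'])}
```

Note the key design point the SPIP calls out: tasks receive concrete GPU addresses, not just counts, so the assignment can be passed down to user code (e.g. to pin a deep-learning framework to a specific device).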
[jira] [Created] (SPARK-27380) Get and install GPU cards to Jenkins machines
Xiangrui Meng created SPARK-27380: - Summary: Get and install GPU cards to Jenkins machines Key: SPARK-27380 URL: https://issues.apache.org/jira/browse/SPARK-27380 Project: Spark Issue Type: Sub-task Components: jenkins Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Created] (SPARK-27381) Design: Spark Jenkins supports GPU integration tests
Xiangrui Meng created SPARK-27381: - Summary: Design: Spark Jenkins supports GPU integration tests Key: SPARK-27381 URL: https://issues.apache.org/jira/browse/SPARK-27381 Project: Spark Issue Type: Sub-task Components: jenkins Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27365: -- Description: Upgrade Spark Jenkins to install GPU cards and run GPU integration tests triggered by "GPU" in PRs. > Spark Jenkins supports testing GPU-aware scheduling features > > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Upgrade Spark Jenkins to install GPU cards and run GPU integration tests > triggered by "GPU" in PRs.
[jira] [Created] (SPARK-27379) YARN passes GPU info to Spark executor
Xiangrui Meng created SPARK-27379: - Summary: YARN passes GPU info to Spark executor Key: SPARK-27379 URL: https://issues.apache.org/jira/browse/SPARK-27379 Project: Spark Issue Type: Sub-task Components: Spark Core, YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Created] (SPARK-27378) spark-submit requests GPUs in YARN mode
Xiangrui Meng created SPARK-27378: - Summary: spark-submit requests GPUs in YARN mode Key: SPARK-27378 URL: https://issues.apache.org/jira/browse/SPARK-27378 Project: Spark Issue Type: Sub-task Components: Spark Submit, YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Created] (SPARK-27377) Upgrade YARN to 3.1.2+ to support GPU
Xiangrui Meng created SPARK-27377: - Summary: Upgrade YARN to 3.1.2+ to support GPU Key: SPARK-27377 URL: https://issues.apache.org/jira/browse/SPARK-27377 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Description: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN 3.x GPU support. * Dynamic allocation. was: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN GPU support. * Dynamic allocation. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * YARN can pass GPU info to Spark executor. > * Integrate with YARN 3.x GPU support. > * Dynamic allocation.
[jira] [Created] (SPARK-27376) Design: YARN supports Spark GPU-aware scheduling
Xiangrui Meng created SPARK-27376: - Summary: Design: YARN supports Spark GPU-aware scheduling Key: SPARK-27376 URL: https://issues.apache.org/jira/browse/SPARK-27376 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Description: Design and implement YARN support for GPU-aware scheduling: * User can request GPU resources at Spark application level. * YARN can pass GPU info to Spark executor. * Integrate with YARN GPU support. * Dynamic allocation. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * YARN can pass GPU info to Spark executor. > * Integrate with YARN GPU support. > * Dynamic allocation.
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform(df)
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Summary: cache not working after discretizer.fit(df).transform(df) (was: cache not working after discretizer.fit(df).transform operation) > cache not working after discretizer.fit(df).transform(df) > - > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. > If the cache works, col(r1) should be equal to col(r2) in the output of dfj.show(). > However, after fitting and transforming the DataFrame with the discretizer, col(r1) and col(r2) > are different. > > {noformat} > from pyspark.ml.feature import QuantileDiscretizer > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType > spark.catalog.clearCache() > import random > random.seed(123) > @udf(IntegerType()) > def ri(): > return random.choice([1,2,3,4,5,6,7,8,9]) > df = spark.range(100).repartition("id") > # remove the discretizer part and col(r1) will be equal to col(r2) > discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", > outputCol="quantileNo") > df = discretizer.fit(df).transform(df) > # if we add the following line to copy df, col(r1) will also become equal to > col(r2) > # df = df.rdd.toDF() > df = df.withColumn("r", ri()).cache() > df1 = df.withColumnRenamed("r", "r1") > df2 = df.withColumnRenamed("r", "r2") > df1.join(df2, "id").explain() > dfj = df1.join(df2, "id") > dfj.select("id", "r1", "r2").show(5) > > The result is shown below; we see that col(r1) and col(r2) are different. > The physical plan shows that the cache() is missed in the join operation. > To avoid it, I can either add df = df.rdd.toDF() before creating df1 and df2, or > remove the discretizer fit and transform; either way, col(r1) and col(r2) become > identical. 
> > == Physical Plan == > *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, > r2#15649] > +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight > :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS > quantileNo#15622, pythonUDF0#15661 AS r1#15645] > : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] > : +- Exchange hashpartitioning(id#15612L, 24) > : +- *(1) Range (0, 100, step=1, splits=6) > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])) > +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS > quantileNo#15653, pythonUDF0#15662 AS r2#15649] > +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] > +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) > +---+---+---+ > | id| r1| r2| > +---+---+---+ > | 28| 9| 3| > | 30| 3| 6| > | 88| 1| 9| > | 67| 3| 3| > | 66| 1| 5| > +---+---+---+ > only showing top 5 rows > > {noformat}
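The essence of the bug report above is easier to see without a Spark cluster: when a nondeterministic function is re-evaluated once per consumer instead of being materialized once, two consumers of "the same" column diverge. The toy below is plain Python, not Spark code; `nondeterministic_udf` is a hypothetical stand-in for the `ri()` UDF, and `lru_cache` stands in for what `.cache()` is expected to guarantee.

```python
from functools import lru_cache
import itertools

counter = itertools.count(1)

def nondeterministic_udf(i):
    """Stand-in for the nondeterministic UDF ri(): a fresh value per call."""
    return next(counter)

# "Cache missed" (the reported bug): each join branch re-evaluates the
# UDF per row, so r1 and r2 disagree even though both are derived from
# the same column.
r1 = [nondeterministic_udf(i) for i in range(3)]
r2 = [nondeterministic_udf(i) for i in range(3)]
uncached_equal = (r1 == r2)   # False: [1, 2, 3] vs [4, 5, 6]

# "Cache hit" (what .cache() should guarantee): evaluate once, and every
# downstream consumer reads the materialized values.
memoized = lru_cache(maxsize=None)(nondeterministic_udf)
r1 = [memoized(i) for i in range(3)]
r2 = [memoized(i) for i in range(3)]
cached_equal = (r1 == r2)     # True: both read [7, 8, 9]

print(uncached_equal, cached_equal)  # False True
```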
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. If cache works, col(r1) should be equal to col(r2) in the output dfj.show(). However, after using discretizer fit and transform DF, col(r1) and col(r2) are different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} was: Below gives an example. If cache works, col(r1) should be equal to col(r2) in the output dfj.show(). However, after using discretizer fit and transform DF, col(r1) and col(r2) are different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. 
To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} > cache not working after discretizer.fit(df).transform operation > --- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. > If cache works,
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. If cache works, col(r1) should be equal to col(r2) in the output dfj.show(). However, after using discretizer fit and transform DF, col(r1) and col(r2) are different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} was: Below gives an example. If cache works, col(r1) in the output should be equal to col(r2). However, after using discretizer fit and transform DF, col(r1) and col(r2) becomes different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. 
To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} > cache not working after discretizer.fit(df).transform operation > --- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. > If cache works,
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. If cache works, col(r1) in the output should be equal to col(r2). However, after using discretizer fit and transform DF, col(r1) and col(r2) becomes different. {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan shows that the cache() is missed in join operation. To avoid it, I either add df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become identical. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} was: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. 
On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} > cache not working after discretizer.fit(df).transform operation > --- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. > If cache
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different {noformat} spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows {noformat} was: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. 
On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows > cache not working after discretizer.fit(df).transform operation > --- > > Key:
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) # if we add following 1 line copy df, col(r1) will also become equal to col(r2) # df = df.rdd.toDF() df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows was: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. 
However, after using discretizer fit and transformation DF, col(r1) and col(r2) become different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows > cache not working after discretizer.fit(df).transform operation > 
--- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 >
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform operation
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Summary: cache not working after discretizer.fit(df).transform operation (was: cache not working after discretizer.fit(df).transform) > cache not working after discretizer.fit(df).transform operation > --- > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. col(r1) should be equal to col(r2) if cache operation > works. However, after using discretizer fit and transformation DF, col(r1) > and col(r2) becomes different > > > spark.catalog.clearCache() > import random > random.seed(123) > @udf(IntegerType()) > def ri(): > return random.choice([1,2,3,4,5,6,7,8,9]) > df = spark.range(100).repartition("id") > #remove discretizer part, col(r1) will be equal to col(r2) > discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", > outputCol="quantileNo") > df = discretizer.fit(df).transform(df) > df = df.withColumn("r", ri()).cache() > df1 = df.withColumnRenamed("r", "r1") > df2 = df.withColumnRenamed("r", "r2") > df1.join(df2, "id").explain() > dfj = df1.join(df2, "id") > dfj.select("id", "r1", "r2").show(5) > > The result is shown as below, we see that col(r1) and col(r2) are different. > The physical plan of join operation shows that the cache() is missed. On the > other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, > or if we remove discretizer fit and transformation, col(r1) and col(r2) > become the same. 
> > == Physical Plan == > *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, > r2#15649] > +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight > :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS > quantileNo#15622, pythonUDF0#15661 AS r1#15645] > : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] > : +- Exchange hashpartitioning(id#15612L, 24) > : +- *(1) Range (0, 100, step=1, splits=6) > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])) > +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS > quantileNo#15653, pythonUDF0#15662 AS r2#15649] > +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] > +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) > +---+---+---+ > | id| r1| r2| > +---+---+---+ > | 28| 9| 3| > | 30| 3| 6| > | 88| 1| 9| > | 67| 3| 3| > | 66| 1| 5| > +---+---+---+ > only showing top 5 rows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Description: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows was: Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) becomes different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri():( return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. 
On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. == Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows > cache not working after discretizer.fit(df).transform > - > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. col(r1) should be equal to col(r2) if cache operation > works. However, after using discretizer fit and transformation DF, col(r1) > and col(r2) becomes different > > > spark.catalog.clearCache() > import random > random.seed(123) >
[jira] [Updated] (SPARK-27375) cache not working after discretizer.fit(df).transform
[ https://issues.apache.org/jira/browse/SPARK-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenyi Lin updated SPARK-27375: --- Summary: cache not working after discretizer.fit(df).transform (was: cache not working after call discretizer.fit(df).transform) > cache not working after discretizer.fit(df).transform > - > > Key: SPARK-27375 > URL: https://issues.apache.org/jira/browse/SPARK-27375 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.3.0 >Reporter: Zhenyi Lin >Priority: Major > > Below gives an example. col(r1) should be equal to col(r2) if cache operation > works. However, after using discretizer fit and transformation DF, col(r1) > and col(r2) becomes different > > > spark.catalog.clearCache() > import random > random.seed(123) > @udf(IntegerType()) > def ri():( > return random.choice([1,2,3,4,5,6,7,8,9]) > df = spark.range(100).repartition("id") > #remove discretizer part, col(r1) will be equal to col(r2) > discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", > outputCol="quantileNo") > df = discretizer.fit(df).transform(df) > df = df.withColumn("r", ri()).cache() > df1 = df.withColumnRenamed("r", "r1") > df2 = df.withColumnRenamed("r", "r2") > df1.join(df2, "id").explain() > dfj = df1.join(df2, "id") > dfj.select("id", "r1", "r2").show(5) > > The result is shown as below, we see that col(r1) and col(r2) are different. > The physical plan of join operation shows that the cache() is missed. On the > other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, > or if we remove discretizer fit and transformation, col(r1) and col(r2) > become the same. 
> > == Physical Plan == > *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, > r2#15649] > +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight > :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS > quantileNo#15622, pythonUDF0#15661 AS r1#15645] > : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] > : +- Exchange hashpartitioning(id#15612L, 24) > : +- *(1) Range (0, 100, step=1, splits=6) > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])) > +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS > quantileNo#15653, pythonUDF0#15662 AS r2#15649] > +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] > +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) > +---+---+---+ > | id| r1| r2| > +---+---+---+ > | 28| 9| 3| > | 30| 3| 6| > | 88| 1| 9| > | 67| 3| 3| > | 66| 1| 5| > +---+---+---+ > only showing top 5 rows
[jira] [Created] (SPARK-27375) cache not working after call discretizer.fit(df).transform
Zhenyi Lin created SPARK-27375: -- Summary: cache not working after call discretizer.fit(df).transform Key: SPARK-27375 URL: https://issues.apache.org/jira/browse/SPARK-27375 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 2.3.0 Reporter: Zhenyi Lin Below gives an example. col(r1) should be equal to col(r2) if cache operation works. However, after using discretizer fit and transformation DF, col(r1) and col(r2) become different spark.catalog.clearCache() import random random.seed(123) @udf(IntegerType()) def ri(): return random.choice([1,2,3,4,5,6,7,8,9]) df = spark.range(100).repartition("id") #remove discretizer part, col(r1) will be equal to col(r2) discretizer = QuantileDiscretizer(numBuckets=3, inputCol="id", outputCol="quantileNo") df = discretizer.fit(df).transform(df) df = df.withColumn("r", ri()).cache() df1 = df.withColumnRenamed("r", "r1") df2 = df.withColumnRenamed("r", "r2") df1.join(df2, "id").explain() dfj = df1.join(df2, "id") dfj.select("id", "r1", "r2").show(5) The result is shown as below, we see that col(r1) and col(r2) are different. The physical plan of join operation shows that the cache() is missed. On the other hand, if we add one row df = df.rdd.toDF() before creating df1 and df2, or if we remove discretizer fit and transformation, col(r1) and col(r2) become the same. 
== Physical Plan == *(4) Project [id#15612L, quantileNo#15622, r1#15645, quantileNo#15653, r2#15649] +- *(4) BroadcastHashJoin [id#15612L], [id#15655L], Inner, BuildRight :- *(4) Project [id#15612L, UDF:bucketizer_0(cast(id#15612L as double)) AS quantileNo#15622, pythonUDF0#15661 AS r1#15645] : +- BatchEvalPython [ri()], [id#15612L, pythonUDF0#15661] : +- Exchange hashpartitioning(id#15612L, 24) : +- *(1) Range (0, 100, step=1, splits=6) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *(3) Project [id#15655L, UDF:bucketizer_0(cast(id#15655L as double)) AS quantileNo#15653, pythonUDF0#15662 AS r2#15649] +- BatchEvalPython [ri()], [id#15655L, pythonUDF0#15662] +- ReusedExchange [id#15655L], Exchange hashpartitioning(id#15612L, 24) +---+---+---+ | id| r1| r2| +---+---+---+ | 28| 9| 3| | 30| 3| 6| | 88| 1| 9| | 67| 3| 3| | 66| 1| 5| +---+---+---+ only showing top 5 rows
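For readers outside the thread, the failure mode above can be mimicked without Spark at all. The sketch below is illustrative only: ri_column is a hypothetical stand-in for the ri() UDF, not Spark internals. When the cached result is not reused, each branch of the self-join re-evaluates a non-deterministic UDF and the two copies of the column disagree; a materialized (cached) result is read identically by both branches.

```python
import random

random.seed(123)

# Hypothetical stand-in for the non-deterministic ri() UDF from the report:
# every evaluation draws a fresh random value per row.
def ri_column(n_rows=100):
    return [random.choice([1, 2, 3, 4, 5, 6, 7, 8, 9]) for _ in range(n_rows)]

# Cache missed: each side of the self-join re-runs the UDF independently,
# so the two "copies" of the column can disagree, as in the bug report.
r1 = ri_column()
r2 = ri_column()

# Cache hit: the column is materialized once; both sides read the same values.
cached = ri_column()
c1, c2 = cached, cached

print(c1 == c2)  # True
print(r1 == r2)  # almost surely False
```

Persisting the DataFrame is meant to play the role of cached here, which is why a missed cache() in the physical plan shows up as r1 differing from r2.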
[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
[ https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809463#comment-16809463 ] Mahima Khatri commented on SPARK-27298: --- Sure, I will try the code with 2.3.3 and check.
> Dataset except operation gives different results(dataset count) on Spark
> 2.3.0 Windows and Spark 2.3.0 Linux environment
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Mahima Khatri
> Priority: Major
> Labels: data-loss
> Attachments: Console-Result-Windows.txt, console-result-LinuxonVM.txt, customer.csv, pom.xml
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
>
> public class ExcludeInTesting {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("ExcludeInTesting")
>         .config("spark.some.config.option", "some-value")
>         .getOrCreate();
>     Dataset<Row> dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
>         .option("header", "true")
>         .option("delimiter", "|")
>         .option("inferSchema", "true")
>         //.load("E:/resources/customer.csv"); // local; below path for VM
>         .load("/home/myproject/bda/home/bin/customer.csv");
>     dataReadFromCSV.printSchema();
>     dataReadFromCSV.show();
>     // Adding an extra step of saving to db and then loading it again
>     dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
>     Dataset<Row> dataLoaded = spark.sql("select * from customer");
>     // Gender EQ M
>     Column genderCol = dataLoaded.col("Gender");
>     Dataset<Row> onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
>     //Dataset<Row> onlyMaleDS = spark.sql("select count(*) from customer where Gender='M'");
>     onlyMaleDS.show();
>     System.out.println("The count of Male customers is :" + onlyMaleDS.count());
>     System.out.println("*");
>     // Income in the list
>     Object[] valuesArray = new Object[5];
>     valuesArray[0] = 503.65;
>     valuesArray[1] = 495.54;
>     valuesArray[2] = 486.82;
>     valuesArray[3] = 481.28;
>     valuesArray[4] = 479.79;
>     Column incomeCol = dataLoaded.col("Income");
>     Dataset<Row> incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) valuesArray));
>     System.out.println("The count of customers satisfying Income is :" + incomeMatchingSet.count());
>     System.out.println("*");
>     Dataset<Row> maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
>     System.out.println("The count of final customers is :" + maleExcptIncomeMatch.count());
>     System.out.println("*");
>   }
> }
> {code}
> When the above code is executed on Spark 2.3.0, it gives the below different results:
> *Windows*: The code gives the correct count of the dataset, 148237.
> *Linux*: The code gives a different {color:#172b4d}count of the dataset, 129532.{color}
>
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used (attached)
> 3. Windows spec: Windows 10, 64-bit OS
> 4. Linux spec (running on Oracle VM VirtualBox)
> Specifications: \{as captured from Vbox.log}
> 00:00:26.112908 VMMDev: Guest Additions information report: Version 5.0.32 r112930 '5.0.32_Ubuntu'
> 00:00:26.112996 VMMDev: Guest Additions information report: Interface = 0x00010004 osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
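{{Dataset.except}} is a distinct set difference: the result keeps the rows of the left dataset that do not occur in the right one, with duplicates removed. Any platform-dependent difference in how rows are parsed or compared (the Income column is floating point) changes membership, and therefore the count. A minimal, self-contained sketch of those semantics with plain Java collections (illustrative only, not the Spark implementation):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ExceptSemantics {
    // Distinct set difference, mirroring Dataset.except: elements of `left`
    // that do not appear in `right`, with duplicates collapsed.
    static <T> Set<T> except(Collection<T> left, Collection<T> right) {
        Set<T> result = new LinkedHashSet<>(left);   // de-duplicates the left side
        result.removeAll(new HashSet<>(right));      // drops rows present in the right side
        return result;
    }

    public static void main(String[] args) {
        List<Double> maleIncomes = Arrays.asList(503.65, 100.0, 200.0, 503.65);
        List<Double> matchedIncomes = Arrays.asList(503.65, 495.54);
        // 503.65 is removed (it occurs on the right); its duplicate does not
        // resurface, because except has distinct semantics.
        System.out.println(except(maleIncomes, matchedIncomes)); // [100.0, 200.0]
    }
}
```

If CSV inference produced a value such as 503.6500000000001 on one platform and exactly 503.65 on the other, the corresponding row would survive the difference on the first platform only — which is the kind of divergence worth checking between the two environments.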
[jira] [Assigned] (SPARK-27338) Deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
[ https://issues.apache.org/jira/browse/SPARK-27338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27338: --- Assignee: Venkata krishnan Sowrirajan > Deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator > - > > Key: SPARK-27338 > URL: https://issues.apache.org/jira/browse/SPARK-27338 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Major > > We saw a deadlock similar to > https://issues.apache.org/jira/browse/SPARK-26265 happening between > TaskMemoryManager and UnsafeExternalSorter$SpillableIterator. > Jstack output: > jstack information as follows: > {code:java} > Found one Java-level deadlock: > = > "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python": > waiting to lock monitor 0x7fce56409088 (object 0x0005700a2f98, a > org.apache.spark.memory.TaskMemoryManager), > which is held by "Executor task launch worker for task 2203" > "Executor task launch worker for task 2203": > waiting to lock monitor 0x007cd878 (object 0x0005701a0eb0, a > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator), > which is held by "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python" > Java stack information for the threads listed above: > === > "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python": > at > org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:334) > - waiting to lock <0x0005700a2f98> (a > org.apache.spark.memory.TaskMemoryManager) > at > org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.access$1100(UnsafeExternalSorter.java:48) > at >
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:583) > - locked <0x0005701a0eb0> (a > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:187) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:174) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$2.hasNext(WholeStageCodegenExec.scala:638) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1073) > at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1089) > at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1127) > at > scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at > 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2067) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194) > "Executor task launch worker for task 2203": > at >
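The jstack above is the classic lock-ordering deadlock: the stdout writer holds the {{SpillableIterator}} monitor and waits for the {{TaskMemoryManager}} monitor, while the task thread holds them in the opposite order. A self-contained sketch of the same shape (the two plain objects below merely stand in for the Spark classes, which are not used here), with {{ThreadMXBean}} detecting the cycle:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;

public class DeadlockDemo {
    static Thread locker(Object first, Object second, CountDownLatch bothHeld) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHeld.countDown();
                try {
                    bothHeld.await();         // wait until both threads hold their first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) { }     // blocks forever: the other thread holds it
            }
        });
        t.setDaemon(true);                    // do not keep the JVM alive
        return t;
    }

    // Returns how many threads the JVM reports as deadlocked (expected: 2).
    static int deadlockedCount() throws InterruptedException {
        Object memoryManager = new Object();      // stand-in for TaskMemoryManager
        Object spillableIterator = new Object();  // stand-in for SpillableIterator
        CountDownLatch bothHeld = new CountDownLatch(2);
        locker(memoryManager, spillableIterator, bothHeld).start();
        locker(spillableIterator, memoryManager, bothHeld).start();  // opposite order
        long[] ids = null;
        for (int i = 0; i < 50 && ids == null; i++) {  // poll until the cycle is visible
            Thread.sleep(100);
            ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        }
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(deadlockedCount() + " threads deadlocked");
    }
}
```

The usual fix for this pattern — as in the linked SPARK-26265 — is to make both code paths take the locks in one agreed order, or to move the work that needs the second lock outside the first monitor.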
[jira] [Resolved] (SPARK-27338) Deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
[ https://issues.apache.org/jira/browse/SPARK-27338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27338. - Resolution: Fixed Fix Version/s: 2.3.4 2.4.2 3.0.0 Issue resolved by pull request 24265 [https://github.com/apache/spark/pull/24265] > Deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator > - > > Key: SPARK-27338 > URL: https://issues.apache.org/jira/browse/SPARK-27338 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.0.0, 2.4.2, 2.3.4 > > > We saw a deadlock similar to > https://issues.apache.org/jira/browse/SPARK-26265 happening between > TaskMemoryManager and UnsafeExternalSorter$SpillableIterator. > Jstack output: > jstack information as follows: > {code:java} > Found one Java-level deadlock: > = > "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python": > waiting to lock monitor 0x7fce56409088 (object 0x0005700a2f98, a > org.apache.spark.memory.TaskMemoryManager), > which is held by "Executor task launch worker for task 2203" > "Executor task launch worker for task 2203": > waiting to lock monitor 0x007cd878 (object 0x0005701a0eb0, a > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator), > which is held by "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python" > Java stack information for the threads listed above: > === > "stdout writer for > /usr/lib/envs/env-1923-ver-1755-a-4.2.9-py-3.5.3/bin/python": > at > org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:334) > - waiting to lock <0x0005700a2f98> (a > org.apache.spark.memory.TaskMemoryManager) > at > org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) > at >
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.access$1100(UnsafeExternalSorter.java:48) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:583) > - locked <0x0005701a0eb0> (a > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:187) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:174) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$2.hasNext(WholeStageCodegenExec.scala:638) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1073) > at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1089) > at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1127) > at > scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224) > at > org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2067) >
[jira] [Updated] (SPARK-27366) Spark scheduler internal changes to support GPU scheduling
[ https://issues.apache.org/jira/browse/SPARK-27366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27366: -- Description: Update Spark job scheduler to support accelerator resource requests submitted at application level. > Spark scheduler internal changes to support GPU scheduling > -- > > Key: SPARK-27366 > URL: https://issues.apache.org/jira/browse/SPARK-27366 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xingbo Jiang >Priority: Major > > Update Spark job scheduler to support accelerator resource requests submitted > at application level. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27374) Fetch assigned resources from TaskContext
Xiangrui Meng created SPARK-27374: - Summary: Fetch assigned resources from TaskContext Key: SPARK-27374 URL: https://issues.apache.org/jira/browse/SPARK-27374 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27365: -- Summary: Spark Jenkins supports testing GPU-aware scheduling features (was: Spark Jenkins to support testing GPU-aware scheduling features) > Spark Jenkins supports testing GPU-aware scheduling features > > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27365) Spark Jenkins to support testing GPU-aware scheduling features
[ https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27365: -- Summary: Spark Jenkins to support testing GPU-aware scheduling features (was: Spark Jenkins setup to support GPU-aware scheduling) > Spark Jenkins to support testing GPU-aware scheduling features > -- > > Key: SPARK-27365 > URL: https://issues.apache.org/jira/browse/SPARK-27365 > Project: Spark > Issue Type: Story > Components: jenkins >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27373) Design: Kubernetes support for GPU-aware scheduling
Xiangrui Meng created SPARK-27373: - Summary: Design: Kubernetes support for GPU-aware scheduling Key: SPARK-27373 URL: https://issues.apache.org/jira/browse/SPARK-27373 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27362) Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27362: -- Description: Design and implement k8s support for GPU-aware scheduling. > Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27362 > URL: https://issues.apache.org/jira/browse/SPARK-27362 > Project: Spark > Issue Type: Story > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design and implement k8s support for GPU-aware scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27360: -- Description: Design and implement standalone manager support for GPU-aware scheduling: 1. static conf to describe resources 2. spark-submit to request resources 3. auto discovery of GPUs 4. executor process isolation was: Design work for the standalone manager to support GPU-aware scheduling: 1. static conf to describe resources 2. spark-submit to request resources 2. auto discovery of GPUs 3. executor process isolation > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design and implement standalone manager support for GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 3. auto discovery of GPUs > 4. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27372) Standalone executor process-level isolation to support GPU scheduling
Xiangrui Meng created SPARK-27372: - Summary: Standalone executor process-level isolation to support GPU scheduling Key: SPARK-27372 URL: https://issues.apache.org/jira/browse/SPARK-27372 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng As an admin, I can configure standalone to have multiple executor processes on the same worker node and processes are configured via cgroups so they only have access to assigned GPUs. So I don't need to worry about resource contention between processes on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27371) Standalone worker can auto discover GPUs
Xiangrui Meng created SPARK-27371: - Summary: Standalone worker can auto discover GPUs Key: SPARK-27371 URL: https://issues.apache.org/jira/browse/SPARK-27371 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng As an admin, I can let Spark Standalone worker automatically discover GPUs installed on worker nodes. So I don't need to manually configure them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27370) spark-submit requests GPUs in standalone mode
Xiangrui Meng created SPARK-27370: - Summary: spark-submit requests GPUs in standalone mode Key: SPARK-27370 URL: https://issues.apache.org/jira/browse/SPARK-27370 Project: Spark Issue Type: Sub-task Components: Spark Core, Spark Submit Affects Versions: 3.0.0 Reporter: Xiangrui Meng As a user, I can use spark-submit to request GPUs per task in standalone mode when I submit a Spark application. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27360: -- Description: Design work for the standalone manager to support GPU-aware scheduling: 1. static conf to describe resources 2. spark-submit to request resources 3. auto discovery of GPUs 4. executor process isolation was: Design work for the standalone manager to support GPU-aware scheduling: 1. static conf to list GPU devices 2. auto discovery of GPUs 3. executor process isolation > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design work for the standalone manager to support GPU-aware scheduling: > 1. static conf to describe resources > 2. spark-submit to request resources > 3. auto discovery of GPUs > 4. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-27216. -- Resolution: Fixed Assignee: Lantao Jin > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But > RoaringBitmap-0.5.11 couldn't be serialized/deserialized with the unsafe KryoSerializer. > We can use the UT below to reproduce it: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrade to the latest version, 0.7.45, to fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
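The failure mode here is byte-order sensitivity: Kryo's unsafe input/output classes read and write in the platform's native byte order, unlike the stream-based path, so a serde implementation that assumes one fixed order sees its bytes permuted. A minimal, self-contained illustration of that class of bug with plain {{java.nio}} (illustrative only — not the actual RoaringBitmap or Kryo code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderMismatch {
    // Write with one byte order, read with another: the round trip "succeeds"
    // without any error, but yields a different value -- the same silent
    // corruption you get when big-endian serde meets native-order (unsafe) serde.
    static int roundTrip(int value, ByteOrder writeOrder, ByteOrder readOrder) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES).order(writeOrder);
        buf.putInt(value);
        buf.flip();                          // rewind for reading
        return buf.order(readOrder).getInt();
    }

    public static void main(String[] args) {
        int v = 1787;  // the bit value added to the bitmap in the report's test
        System.out.println(roundTrip(v, ByteOrder.BIG_ENDIAN, ByteOrder.BIG_ENDIAN));    // intact: 1787
        System.out.println(roundTrip(v, ByteOrder.BIG_ENDIAN, ByteOrder.LITTLE_ENDIAN)); // garbled
    }
}
```

This is why the safe {{KryoSerializer}} round trip in the quoted test passes while the {{spark.kryo.unsafe}} one fails, and why the fix is a library upgrade rather than a Spark-side change.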
[jira] [Created] (SPARK-27369) Standalone support static conf to describe GPU resources
Xiangrui Meng created SPARK-27369: - Summary: Standalone support static conf to describe GPU resources Key: SPARK-27369 URL: https://issues.apache.org/jira/browse/SPARK-27369 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809415#comment-16809415 ] Imran Rashid commented on SPARK-27216: -- Fixed by https://github.com/apache/spark/pull/24264 in master / 3.0.0 > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But > RoaringBitmap-0.5.11 couldn't be serialized/deserialized with the unsafe KryoSerializer. > We can use the UT below to reproduce it: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrade to the latest version, 0.7.45, to fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27216) Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
[ https://issues.apache.org/jira/browse/SPARK-27216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-27216: - Fix Version/s: 3.0.0 > Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue > - > > Key: SPARK-27216 > URL: https://issues.apache.org/jira/browse/SPARK-27216 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.0, 3.0.0 >Reporter: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But > RoaringBitmap-0.5.11 couldn't be serialized/deserialized with the unsafe KryoSerializer. > We can use the UT below to reproduce it: > {code} > test("kryo serialization with RoaringBitmap") { > val bitmap = new RoaringBitmap > bitmap.add(1787) > val safeSer = new KryoSerializer(conf).newInstance() > val bitmap2 : RoaringBitmap = > safeSer.deserialize(safeSer.serialize(bitmap)) > assert(bitmap2.equals(bitmap)) > conf.set("spark.kryo.unsafe", "true") > val unsafeSer = new KryoSerializer(conf).newInstance() > val bitmap3 : RoaringBitmap = > unsafeSer.deserialize(unsafeSer.serialize(bitmap)) > assert(bitmap3.equals(bitmap)) // this will fail > } > {code} > Upgrade to the latest version, 0.7.45, to fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27368) Design: Standalone supports GPU scheduling
Xiangrui Meng created SPARK-27368: - Summary: Design: Standalone supports GPU scheduling Key: SPARK-27368 URL: https://issues.apache.org/jira/browse/SPARK-27368 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27360: - Assignee: Xiangrui Meng > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Design work for the standalone manager to support GPU-aware scheduling: > 1. static conf to list GPU devices > 2. auto discovery of GPUs > 3. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27005) Design sketch for SPIP discussion: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27005. --- Resolution: Done Fix Version/s: 3.0.0 I'm closing the ticket since the SPIP passed the vote. I created stories for each major sub-project, and we will finalize the design there. > Design sketch for SPIP discussion: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > Fix For: 3.0.0 > > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27367) Faster RoaringBitmap Serialization with v0.8.0
Imran Rashid created SPARK-27367: Summary: Faster RoaringBitmap Serialization with v0.8.0 Key: SPARK-27367 URL: https://issues.apache.org/jira/browse/SPARK-27367 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Imran Rashid RoaringBitmap 0.8.0 adds faster serde, but also requires us to change how we call the serde routines slightly to take advantage of it. This is probably a worthwhile optimization, as every shuffle map task with a large # of partitions generates these bitmaps, and the driver especially has to deserialize many of these messages. See * https://github.com/apache/spark/pull/24264#issuecomment-479675572 * https://github.com/RoaringBitmap/RoaringBitmap/pull/325 * https://github.com/RoaringBitmap/RoaringBitmap/issues/319 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
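The linked RoaringBitmap PR adds ByteBuffer-oriented serialization, so the "change how we call the serde routines" part means switching call sites from the byte-at-a-time {{DataOutput}} shape to bulk buffer writes. A hedged, self-contained sketch of the two call shapes using a plain long[] payload (illustrative only — method names are made up, not the RoaringBitmap API):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class SerdeStyles {
    // Stream style: each word funnels through a DataOutput method call,
    // one bounds check and byte shuffle at a time.
    static byte[] viaStream(long[] words) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (long w : words) out.writeLong(w);
        return bos.toByteArray();
    }

    // Buffer style: one bulk put into a pre-sized ByteBuffer -- the call
    // shape the faster buffer-oriented serde exposes.
    static byte[] viaBuffer(long[] words) {
        ByteBuffer buf = ByteBuffer.allocate(words.length * Long.BYTES);
        buf.asLongBuffer().put(words);   // bulk copy through a long view
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        long[] bitmapWords = {0x1L, 0xFF00FF00L};
        // Both produce identical big-endian bytes; only the call shape
        // (and its cost) differs, so switching APIs is safe for the format.
        System.out.println(java.util.Arrays.equals(viaStream(bitmapWords), viaBuffer(bitmapWords)));
    }
}
```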
[jira] [Updated] (SPARK-27005) Design sketch for SPIP discussion: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27005: -- Summary: Design sketch for SPIP discussion: Accelerator-aware scheduling (was: Design sketch: Accelerator-aware scheduling) > Design sketch for SPIP discussion: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27366) Spark scheduler internal changes to support GPU scheduling
Xiangrui Meng created SPARK-27366: - Summary: Spark scheduler internal changes to support GPU scheduling Key: SPARK-27366 URL: https://issues.apache.org/jira/browse/SPARK-27366 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Xingbo Jiang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27005) Design sketch: Accelerator-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27005: - Assignee: Xingbo Jiang > Design sketch: Accelerator-aware scheduling > --- > > Key: SPARK-27005 > URL: https://issues.apache.org/jira/browse/SPARK-27005 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Xingbo Jiang >Priority: Major > > This task is to outline a design sketch for the accelerator-aware scheduling > SPIP discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27364: -- Description: Design and implement: * General guidelines for cluster managers to understand resource requests at application start. The concrete conf/param will be covered in the design of each cluster manager. * APIs to fetch assigned resources from task context. > User-facing APIs for GPU-aware scheduling > - > > Key: SPARK-27364 > URL: https://issues.apache.org/jira/browse/SPARK-27364 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > > Design and implement: > * General guidelines for cluster managers to understand resource requests at > application start. The concrete conf/param will be covered in the design of each > cluster manager. > * APIs to fetch assigned resources from task context. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27364) User-facing APIs for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27364: -- Summary: User-facing APIs for GPU-aware scheduling (was: Design: User-facing APIs for GPU-aware scheduling) > User-facing APIs for GPU-aware scheduling > - > > Key: SPARK-27364 > URL: https://issues.apache.org/jira/browse/SPARK-27364 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27361: -- Summary: YARN support for GPU-aware scheduling (was: Design: YARN support for GPU-aware scheduling) > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27363) Mesos support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27363: -- Summary: Mesos support for GPU-aware scheduling (was: Design: Mesos support for GPU-aware scheduling) > Mesos support for GPU-aware scheduling > -- > > Key: SPARK-27363 > URL: https://issues.apache.org/jira/browse/SPARK-27363 > Project: Spark > Issue Type: Story > Components: Mesos >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27362) Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27362: -- Summary: Kubernetes support for GPU-aware scheduling (was: Design: Kubernetes support for GPU-aware scheduling) > Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27362 > URL: https://issues.apache.org/jira/browse/SPARK-27362 > Project: Spark > Issue Type: Story > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27024) Executor interface for cluster managers to support GPU resources
[ https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27024: -- Summary: Executor interface for cluster managers to support GPU resources (was: Design: Executor interface for cluster managers to support GPU resources) > Executor interface for cluster managers to support GPU resources > > > Key: SPARK-27024 > URL: https://issues.apache.org/jira/browse/SPARK-27024 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Thomas Graves >Priority: Major > > The executor interface shall deal with the resources allocated to the > executor by cluster managers (Standalone, YARN, Kubernetes), so the Spark > executor does not need to be involved in GPU discovery and allocation, which > shall be handled by the cluster managers. However, an executor needs to sync > with the driver to expose its available resources to support task scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
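The executor-driver sync described in SPARK-27024 above can be sketched as a registration handshake. The message shape and names below are invented for illustration; they are not Spark's actual RPC protocol:

```python
# Hypothetical sketch of the executor-driver sync: on startup the executor
# reports the resources its cluster manager allocated to it, and the driver
# records them so the scheduler can match tasks to executors.
def make_register_message(executor_id, resources):
    # resources: mapping of resource name -> list of device addresses
    return {
        "executor_id": executor_id,
        "resources": {name: list(addrs) for name, addrs in resources.items()},
    }

class DriverBackend:
    def __init__(self):
        self.executor_resources = {}

    def handle_register(self, msg):
        # Record what each executor has, for later task placement decisions.
        self.executor_resources[msg["executor_id"]] = msg["resources"]

driver = DriverBackend()
driver.handle_register(make_register_message("exec-1", {"gpu": ["0", "1"]}))
```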
[jira] [Updated] (SPARK-27360) Standalone cluster mode support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27360: -- Summary: Standalone cluster mode support for GPU-aware scheduling (was: Design: Standalone cluster mode support for GPU-aware scheduling) > Standalone cluster mode support for GPU-aware scheduling > > > Key: SPARK-27360 > URL: https://issues.apache.org/jira/browse/SPARK-27360 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Design work for the standalone manager to support GPU-aware scheduling: > 1. static conf to list GPU devices > 2. auto discovery of GPUs > 3. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
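Item 2 of the standalone design work above (auto discovery of GPUs) typically means running a user-provided discovery script and parsing its output. The JSON shape below is an assumption made for illustration, not necessarily the format the design settles on:

```python
import json

# Illustrative sketch of GPU auto discovery for the standalone manager:
# a worker invokes a discovery script and parses its stdout into a
# resource name plus a list of device addresses. The JSON format here
# is an assumption for the sketch.
def parse_discovery_output(raw: str):
    data = json.loads(raw)
    return data["name"], list(data["addresses"])

# Example output a discovery script might print on a 2-GPU machine.
name, addrs = parse_discovery_output('{"name": "gpu", "addresses": ["0", "1"]}')
```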
[jira] [Created] (SPARK-27365) Spark Jenkins setup to support GPU-aware scheduling
Xiangrui Meng created SPARK-27365: - Summary: Spark Jenkins setup to support GPU-aware scheduling Key: SPARK-27365 URL: https://issues.apache.org/jira/browse/SPARK-27365 Project: Spark Issue Type: Story Components: jenkins Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27024) Design: Executor interface for cluster managers to support GPU resources
[ https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-27024: -- Summary: Design: Executor interface for cluster managers to support GPU resources (was: Design executor interface to support GPU resources) > Design: Executor interface for cluster managers to support GPU resources > > > Key: SPARK-27024 > URL: https://issues.apache.org/jira/browse/SPARK-27024 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Thomas Graves >Priority: Major > > The executor interface shall deal with the resources allocated to the > executor by cluster managers (Standalone, YARN, Kubernetes), so the Spark > executor does not need to be involved in GPU discovery and allocation, which > shall be handled by the cluster managers. However, an executor needs to sync > with the driver to expose its available resources to support task scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27364) Design: User-facing APIs for GPU-aware scheduling
Xiangrui Meng created SPARK-27364: - Summary: Design: User-facing APIs for GPU-aware scheduling Key: SPARK-27364 URL: https://issues.apache.org/jira/browse/SPARK-27364 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Assignee: Thomas Graves -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27363) Design: Mesos support for GPU-aware scheduling
Xiangrui Meng created SPARK-27363: - Summary: Design: Mesos support for GPU-aware scheduling Key: SPARK-27363 URL: https://issues.apache.org/jira/browse/SPARK-27363 Project: Spark Issue Type: Story Components: Mesos Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27362) Design: Kubernetes support for GPU-aware scheduling
Xiangrui Meng created SPARK-27362: - Summary: Design: Kubernetes support for GPU-aware scheduling Key: SPARK-27362 URL: https://issues.apache.org/jira/browse/SPARK-27362 Project: Spark Issue Type: Story Components: Kubernetes Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27361) Design: YARN support for GPU-aware scheduling
Xiangrui Meng created SPARK-27361: - Summary: Design: YARN support for GPU-aware scheduling Key: SPARK-27361 URL: https://issues.apache.org/jira/browse/SPARK-27361 Project: Spark Issue Type: Story Components: YARN Affects Versions: 3.0.0 Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27024) Design executor interface to support GPU resources
[ https://issues.apache.org/jira/browse/SPARK-27024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-27024: - Assignee: Thomas Graves > Design executor interface to support GPU resources > -- > > Key: SPARK-27024 > URL: https://issues.apache.org/jira/browse/SPARK-27024 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Assignee: Thomas Graves >Priority: Major > > The executor interface shall deal with the resources allocated to the > executor by cluster managers (Standalone, YARN, Kubernetes), so the Spark > executor does not need to be involved in GPU discovery and allocation, which > shall be handled by the cluster managers. However, an executor needs to sync > with the driver to expose its available resources to support task scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27360) Design: Standalone cluster mode support for GPU-aware scheduling
Xiangrui Meng created SPARK-27360: - Summary: Design: Standalone cluster mode support for GPU-aware scheduling Key: SPARK-27360 URL: https://issues.apache.org/jira/browse/SPARK-27360 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 3.0.0 Reporter: Xiangrui Meng Design work for the standalone manager to support GPU-aware scheduling: 1. static conf to list GPU devices 2. auto discovery of GPUs 3. executor process isolation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27359) Joins on some array functions can be optimized
Nikolas Vanderhoof created SPARK-27359: -- Summary: Joins on some array functions can be optimized Key: SPARK-27359 URL: https://issues.apache.org/jira/browse/SPARK-27359 Project: Spark Issue Type: Improvement Components: Optimizer, SQL Affects Versions: 3.0.0 Reporter: Nikolas Vanderhoof I encounter these cases frequently, and implemented the optimization manually (as shown here). If others experience this as well, perhaps it would be good to add appropriate tree transformations into catalyst. I can create some rough draft implementations but expect I will need assistance when it comes to resolving the generating expressions in the logical plan. h1. Case 1 A join like this: {code:scala} left.join( right, arrays_overlap(left("a"), right("b")) // Creates a cartesian product in the logical plan ) {code} will produce the same results as: {code:scala} { val leftPrime = left.withColumn("exploded_a", explode(col("a"))) val rightPrime = right.withColumn("exploded_b", explode(col("b"))) leftPrime.join( rightPrime, leftPrime("exploded_a") === rightPrime("exploded_b") // Equijoin doesn't produce cartesian product ).drop("exploded_a", "exploded_b").distinct } {code} h1. Case 2 A join like this: {code:scala} left.join( right, array_contains(left("arr"), right("value")) // Cartesian product in logical plan ) {code} will produce the same results as: {code:scala} { val leftPrime = left.withColumn("exploded_arr", explode(col("arr"))) leftPrime.join( right, leftPrime("exploded_arr") === right("value") // Fast equijoin ).drop("exploded_arr").distinct } {code}
h1. Case 3 A join like this: {code:scala} left.join( right, array_contains(right("arr"), left("value")) // Cartesian product in logical plan ) {code} will produce the same results as: {code:scala} { val rightPrime = right.withColumn("exploded_arr", explode(col("arr"))) left.join( rightPrime, left("value") === rightPrime("exploded_arr") // Fast equijoin ).drop("exploded_arr").distinct } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
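The soundness of the Case 1 rewrite in SPARK-27359 above can be checked with a plain-Python model, independent of Spark: joining rows on array overlap yields the same pairs as exploding both arrays, equijoining on the exploded elements, and de-duplicating. Rows are modeled as tuples of array values for brevity:

```python
from itertools import product

# Slow plan: cartesian product of the two sides, filtered by overlap.
def join_arrays_overlap(left, right):
    return {(a, b) for a, b in product(left, right) if set(a) & set(b)}

# Fast plan: explode both sides, equijoin on elements, then distinct.
def join_via_explode(left, right):
    pairs = set()
    for a in left:
        for elem in a:          # explode the left array
            for b in right:
                if elem in b:   # equijoin: exploded_a == exploded_b
                    pairs.add((a, b))  # distinct via set semantics
    return pairs

left = [(1, 2), (3,)]
right = [(2, 4), (3, 5), (6,)]
assert join_arrays_overlap(left, right) == join_via_explode(left, right)
```

The `distinct` step is essential: when arrays share more than one element, the exploded equijoin emits the same row pair multiple times.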
[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809321#comment-16809321 ] Huon Wilson commented on SPARK-27278: - Hm, I'm not sure I understand why only half of this optimisation is being removed. This bug represents a missed case in an existing optimisation, and so if this is being reverted, shouldn't the whole existing optimisation (that is, the {{GetMapValue(CreateMap(elems), key)}} case in {{SimplifyExtractValueOps}}) also be removed? The code-size concerns apply equally well to that as to the new case here. > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Major > > With a map that isn't constant-foldable, Spark will optimise an access to a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses, in the code generation: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting in > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select.
The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
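The rewrite quoted at the end of the SPARK-27278 description above, {{GetMapValue(CreateMap(elems), key)}} => {{CaseKeyWhen(key, elems)}}, can be modeled in a few lines of plain Python to show why the two plans agree: a runtime scan of the map's key array and an unrolled chain of equality checks return the same value for every key.

```python
# Un-optimised plan: scan the map's key array at runtime.
def get_map_value(pairs, key):
    for k, v in pairs:
        if k == key:
            return v
    return None  # a missing key yields NULL

# Rewritten plan: CASE WHEN key = k1 THEN v1 WHEN key = k2 THEN v2 ... END,
# modelled as the chain of ifs the generated code unrolls into.
def case_key_when(pairs, key):
    if not pairs:
        return None
    (k, v), rest = pairs[0], pairs[1:]
    return v if key == k else case_key_when(rest, key)

pairs = [(1, "one"), (2, "two")]
# Both plans agree on present and missing keys alike.
assert all(get_map_value(pairs, k) == case_key_when(pairs, k) for k in range(4))
```

The unrolled form wins because each comparison is a branch on an already-loaded value, with no per-lookup array traversal or type dispatch; the trade-off raised in the comment above is that unrolling grows the generated code with the size of the map.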
[jira] [Commented] (SPARK-27299) Design: Property graph construction, save/load, and query APIs
[ https://issues.apache.org/jira/browse/SPARK-27299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809291#comment-16809291 ] Xiangrui Meng commented on SPARK-27299: --- [~mju] Could you post the draft design doc to JIRA? > Design: Property graph construction, save/load, and query APIs > -- > > Key: SPARK-27299 > URL: https://issues.apache.org/jira/browse/SPARK-27299 > Project: Spark > Issue Type: Story > Components: Graph >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Martin Junghanns >Priority: Major > > Design doc for property graph and Cypher queries. > * Construct a property graph. > * How nodes and relationships map to DataFrames > * Save/load. > * Cypher query. > * Support Scala/Python/Java. > * Dependencies > * Test > Out of scope: > * Graph algorithms. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27358) Update jquery to 1.12.x to address CVE
[ https://issues.apache.org/jira/browse/SPARK-27358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27358: -- Priority: Major (was: Minor) Description: jquery 1.11.1 is affected by a CVE: https://www.cvedetails.com/cve/CVE-2016-7103/ Note that I do not know whether this actually manifests as a security problem for Spark. But, we can easily update to 1.12.4 (latest 1.x version) to resolve it. (Note that https://www.cvedetails.com/cve/CVE-2015-9251/ seems to have been fixed in 1.12 but then unfixed, so this may require a much bigger jump to jquery 3.x if it's a problem; leaving that until later.) Along the way we will want to update jquery datatables to 1.10.18 to match jquery 1.12.4. Relatedly, jquery mustache 0.8.1 also has a CVE: https://snyk.io/test/npm/mustache/0.8.2 I propose to update to 2.3.12 (latest 2.x) to resolve it. Although targeted for 3.0, I believe this is back-port-able to 2.4.x if needed, assuming we find no UI issues. was: jquery 1.11.1 is affected by a CVE: https://www.cvedetails.com/cve/CVE-2016-7103/ We can easily update to 1.12.4 (latest 1.x version) to resolve it. (Note that https://www.cvedetails.com/cve/CVE-2015-9251/ seems to have been fixed in 1.12 but then unfixed, so this may require a much bigger jump to jquery 3.x if it's a problem; leaving that until later.) Along the way we will want to update jquery datatables to 1.10.18 to match jquery 1.12.4. Relatedly, jquery mustache 0.8.1 also has a CVE: https://snyk.io/test/npm/mustache/0.8.2 I propose to update to 2.3.12 (latest 2.x) to resolve it. Although targeted for 3.0, I believe this is back-port-able to 2.4.x if needed, assuming we find no UI issues. 
> Update jquery to 1.12.x to address CVE > -- > > Key: SPARK-27358 > URL: https://issues.apache.org/jira/browse/SPARK-27358 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > > jquery 1.11.1 is affected by a CVE: > https://www.cvedetails.com/cve/CVE-2016-7103/ > Note that I do not know whether this actually manifests as a security problem > for Spark. But, we can easily update to 1.12.4 (latest 1.x version) to > resolve it. > (Note that https://www.cvedetails.com/cve/CVE-2015-9251/ seems to have been > fixed in 1.12 but then unfixed, so this may require a much bigger jump to > jquery 3.x if it's a problem; leaving that until later.) > Along the way we will want to update jquery datatables to 1.10.18 to match > jquery 1.12.4. > Relatedly, jquery mustache 0.8.1 also has a CVE: > https://snyk.io/test/npm/mustache/0.8.2 > I propose to update to 2.3.12 (latest 2.x) to resolve it. > Although targeted for 3.0, I believe this is back-port-able to 2.4.x if > needed, assuming we find no UI issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27358) Update jquery to 1.12.x to address CVE
Sean Owen created SPARK-27358: - Summary: Update jquery to 1.12.x to address CVE Key: SPARK-27358 URL: https://issues.apache.org/jira/browse/SPARK-27358 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Sean Owen Assignee: Sean Owen jquery 1.11.1 is affected by a CVE: https://www.cvedetails.com/cve/CVE-2016-7103/ We can easily update to 1.12.4 (latest 1.x version) to resolve it. (Note that https://www.cvedetails.com/cve/CVE-2015-9251/ seems to have been fixed in 1.12 but then unfixed, so this may require a much bigger jump to jquery 3.x if it's a problem; leaving that until later.) Along the way we will want to update jquery datatables to 1.10.18 to match jquery 1.12.4. Relatedly, jquery mustache 0.8.1 also has a CVE: https://snyk.io/test/npm/mustache/0.8.2 I propose to update to 2.3.12 (latest 2.x) to resolve it. Although targeted for 3.0, I believe this is back-port-able to 2.4.x if needed, assuming we find no UI issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27357) Cast timestamps to/from dates independently from time zones
[ https://issues.apache.org/jira/browse/SPARK-27357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-27357: --- Summary: Cast timestamps to/from dates independently from time zones (was: Convert timestamps to/from dates independently from time zones) > Cast timestamps to/from dates independently from time zones > --- > > Key: SPARK-27357 > URL: https://issues.apache.org/jira/browse/SPARK-27357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Both of Catalyst's types TIMESTAMP and DATE internally represent time > intervals since the epoch in the UTC time zone. The TIMESTAMP type holds the > number of microseconds since the epoch, and DATE the number of days since the > epoch (00:00:00 on 1 January 1970). As a consequence, the conversion between > them should be independent of the session or local time zone. This ticket > aims to fix the current behavior and make the conversion independent of time > zones. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27357) Convert timestamps to/from dates independently from time zones
Maxim Gekk created SPARK-27357: -- Summary: Convert timestamps to/from dates independently from time zones Key: SPARK-27357 URL: https://issues.apache.org/jira/browse/SPARK-27357 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Both of Catalyst's types TIMESTAMP and DATE internally represent time intervals since the epoch in the UTC time zone. The TIMESTAMP type holds the number of microseconds since the epoch, and DATE the number of days since the epoch (00:00:00 on 1 January 1970). As a consequence, the conversion between them should be independent of the session or local time zone. This ticket aims to fix the current behavior and make the conversion independent of time zones. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
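With the internal representations described in SPARK-27357 above (TIMESTAMP as microseconds since the epoch, DATE as days since 1970-01-01), the time-zone-independent cast reduces to pure integer arithmetic. A minimal sketch:

```python
# Sketch of a time-zone-independent TIMESTAMP <-> DATE cast, assuming the
# internal representations described above: both are offsets from the
# epoch in UTC, so no session or local time zone enters the computation.
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000  # 86_400_000_000

def timestamp_to_date(micros: int) -> int:
    # Floor division also handles pre-epoch timestamps correctly:
    # -1 microsecond falls on day -1 (1969-12-31), not day 0.
    return micros // MICROS_PER_DAY

def date_to_timestamp(days: int) -> int:
    # A date casts to the timestamp at 00:00:00 UTC of that day.
    return days * MICROS_PER_DAY

# 1970-01-02T12:00:00 UTC is day 1, regardless of any session time zone.
assert timestamp_to_date(36 * 60 * 60 * 1_000_000) == 1
```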
[jira] [Resolved] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27278. --- Resolution: Later We can revisit this issue and the solution when we add official "loop unrolling" support for array/map operations. cc [~smilegator] and [~cloud_fan]. > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Assignee: Marco Gaido >Priority: Major > > With a map that isn't constant-foldable, spark will optimise an access to a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses, in the code generation: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 
1 : 0); > /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting doing > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select. 
The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27278: - Assignee: (was: Marco Gaido) > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Major > > With a map that isn't constant-foldable, spark will optimise an access to a > series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as > "x").explain > == Physical Plan == > *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L > as int) = 2) THEN id#180L END AS x#182L] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > This results in an efficient series of ifs and elses, in the code generation: > {code:java} > /* 037 */ boolean project_isNull_3 = false; > /* 038 */ int project_value_3 = -1; > /* 039 */ if (!false) { > /* 040 */ project_value_3 = (int) project_expr_0_0; > /* 041 */ } > /* 042 */ > /* 043 */ boolean project_value_2 = false; > /* 044 */ project_value_2 = project_value_3 == 1; > /* 045 */ if (!false && project_value_2) { > /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0); > /* 047 */ project_project_value_1_0 = 1L; > /* 048 */ continue; > /* 049 */ } > /* 050 */ > /* 051 */ boolean project_isNull_8 = false; > /* 052 */ int project_value_8 = -1; > /* 053 */ if (!false) { > /* 054 */ project_value_8 = (int) project_expr_0_0; > /* 055 */ } > /* 056 */ > /* 057 */ boolean project_value_7 = false; > /* 058 */ project_value_7 = project_value_8 == 2; > /* 059 */ if (!false && project_value_7) { > /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 
1 : 0); > /* 061 */ project_project_value_1_0 = project_expr_0_0; > /* 062 */ continue; > /* 063 */ } > {code} > If the map can be constant folded, the constant folding happens first, and > the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting doing > a map traversal and more dynamic checks: > {code:none} > scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as > "x").explain > == Physical Plan == > *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197] > +- *(1) Range (0, 1000, step=1, splits=12) > {code} > The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which > is what is stored in the {{Literal}} form of the {{map(...)}} expression in > that select. The code generated is less efficient, since it has to do a > manual dynamic traversal of the map's array of keys, with type casts etc.: > {code:java} > /* 099 */ int project_index_0 = 0; > /* 100 */ boolean project_found_0 = false; > /* 101 */ while (project_index_0 < project_length_0 && > !project_found_0) { > /* 102 */ final int project_key_0 = > project_keys_0.getInt(project_index_0); > /* 103 */ if (project_key_0 == project_value_2) { > /* 104 */ project_found_0 = true; > /* 105 */ } else { > /* 106 */ project_index_0++; > /* 107 */ } > /* 108 */ } > /* 109 */ > /* 110 */ if (!project_found_0) { > /* 111 */ project_isNull_0 = true; > /* 112 */ } else { > /* 113 */ project_value_0 = > project_values_0.getInt(project_index_0); > /* 114 */ } > {code} > It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't > handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form: > {code:scala} > case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-27278. - > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-27278: --- This is reverted via [https://github.com/apache/spark/commit/b51763612a71e214ed94ac5fb6843cf03d00f9f8] to be safe. Please see the discussion on the PR, [https://github.com/apache/spark/pull/24223] . > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not
[ https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27278: -- Fix Version/s: (was: 3.0.0) > Optimize GetMapValue when the map is a foldable and the key is not > -- > > Key: SPARK-27278 > URL: https://issues.apache.org/jira/browse/SPARK-27278 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 >Reporter: Huon Wilson >Assignee: Marco Gaido >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27356) File source V2: return actual schema in method `FileScan.readSchema`
Gengliang Wang created SPARK-27356: -- Summary: File source V2: return actual schema in method `FileScan.readSchema` Key: SPARK-27356 URL: https://issues.apache.org/jira/browse/SPARK-27356 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang The method `Scan.readSchema` returns the actual schema of this data source scan. In the current file source V2 framework, the schema is not returned correctly. The result should be `readDataSchema + partitionSchema`, while the framework returns the schema that is pushed down in `SupportsPushDownRequiredColumns.requiredSchema`. This is normally OK, but if there are overlapping columns between `dataSchema` and `partitionSchema`, the result of a row-based scan will be wrong. The actual schema should be `dataSchema - overlapSchema + partitionSchema`, which is different from the pushed-down `requiredSchema`. This PR is to: 1. Bug fix: fix the corner case where `dataSchema` overlaps with `partitionSchema`. 2. Improvement: prune partition column values if some of the partition columns are not required. 3. Behavior change: to keep it simple, the schema of `FileTable` is `dataSchema - overlapSchema + partitionSchema`, instead of mixing the data schema and partition schema (see `PartitioningUtils.mergeDataAndPartitionSchema`). For example, if the data schema is [a,b,c] and the partition schema is [b,d], then in V1 the schema of `HadoopFsRelation` is [a, b, c, d], while in file source V2 the schema of `FileTable` is [a, c, b, d]. Putting all the partition columns at the end of the table schema is more reasonable. Also, when there is a `select *` operation and no schema pruning, the schemas of `FileTable` and `FileScan` still match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
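The schema arithmetic `dataSchema - overlapSchema + partitionSchema` described above can be sketched in plain Python, using column-name lists in place of Spark `StructType`s. The helper name below is hypothetical, not Spark code:

```python
# Sketch of the schema computation described in SPARK-27356, with plain
# column-name lists standing in for StructTypes (illustrative helper only).

def file_table_schema(data_schema, partition_schema):
    """dataSchema - overlapSchema + partitionSchema: drop data columns that
    the partition schema also provides, then append all partition columns."""
    overlap = set(partition_schema)
    return [c for c in data_schema if c not in overlap] + partition_schema

# The example from the description: data schema [a,b,c], partition schema [b,d]
print(file_table_schema(["a", "b", "c"], ["b", "d"]))  # ['a', 'c', 'b', 'd']
```

This reproduces the [a, c, b, d] ordering the description gives for file source V2, with partition columns at the end.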
[jira] [Comment Edited] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806307#comment-16806307 ] belvey edited comment on SPARK-13510 at 4/3/19 3:07 PM: [~shenhong] , hello hongsheng, I am using spark2.0 facing the same issue, I am not sure if it's merged into spark2. it's very kind for you to post your pr. edit: I found that solution had already been added to spark2.3 and later. i am not sure if is hong shen's pr , but the solution is similar to what hong shen's said. And for spark2.3 and later we can use "spark.maxRemoteBlockSizeFetchToMem" to control the max block size allowed for shuffle fetching data that catched in memory, it's default value is (Interger.max-512) bytes. was (Author: belvey): [~shenhong] , hello hongsheng, I am using spark2.0 facing the same issue, I am not sure if it's merged into spark2. it's very kind for you to post your pr. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen >Priority: Major > Attachments: spark-13510.diff > > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. 
> {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at 
java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffling a big block (like 1G), the task will allocate > the same amount of memory, so it will easily throw "FetchFailedException: Direct buffer > memory". > If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In the MapReduce shuffle, it first judges whether the block can be cached in > memory, but Spark doesn't. > If the block is larger than what we can cache in memory, we should write it to disk. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
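For Spark 2.3 and later, the mitigation the commenter mentions is the `spark.maxRemoteBlockSizeFetchToMem` setting; blocks above the threshold are streamed to disk instead of fetched into a direct buffer. The threshold value below is illustrative (per the comment, the default is roughly Integer.MAX_VALUE - 512 bytes, so fetch-to-memory is effectively unlimited unless lowered):

```properties
# spark-defaults.conf (illustrative value): shuffle blocks larger than this
# are written to disk on fetch instead of being buffered in memory, avoiding
# the direct-buffer OOM described above.
spark.maxRemoteBlockSizeFetchToMem  200m
```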
[jira] [Created] (SPARK-27355) Make query execution more sensitive to late or lost epoch messages
Genmao Yu created SPARK-27355: - Summary: Make query execution more sensitive to late or lost epoch messages Key: SPARK-27355 URL: https://issues.apache.org/jira/browse/SPARK-27355 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Genmao Yu In SPARK-23503, we enforce sequencing of committed epochs for Continuous Execution. In case a message for epoch n is lost and epoch (n + 1) is ready for commit before epoch n is, epoch (n + 1) will wait for epoch n to be committed first. In the extreme case, we will wait for `epochBacklogQueueSize` (1 by default) epochs and then fail. There is no need to wait for such a long time before the query fails; we can make the condition more sensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
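The in-order epoch commit and backlog failure described above can be sketched as a toy Python model. The class and names below are illustrative, not the actual `ContinuousExecution` code:

```python
# Toy model (not Spark code) of the epoch sequencing in SPARK-23503: a commit
# for epoch n+1 must wait for epoch n, and the query fails once the backlog of
# blocked epochs exceeds a limit (modeling `epochBacklogQueueSize`).

class EpochTracker:
    def __init__(self, backlog_limit):
        self.backlog_limit = backlog_limit
        self.next_epoch = 0   # next epoch allowed to commit
        self.waiting = {}     # epochs ready to commit but blocked on a predecessor

    def commit(self, epoch):
        """Record that `epoch` is ready; return the epochs actually committed."""
        self.waiting[epoch] = True
        committed = []
        while self.next_epoch in self.waiting:   # commit strictly in order
            del self.waiting[self.next_epoch]
            committed.append(self.next_epoch)
            self.next_epoch += 1
        if len(self.waiting) > self.backlog_limit:
            raise RuntimeError("epoch backlog exceeded; failing the query")
        return committed

tracker = EpochTracker(backlog_limit=2)
print(tracker.commit(1))  # [] -- epoch 1 waits for the lost/late epoch 0
print(tracker.commit(0))  # [0, 1] -- epoch 0 arrives and unblocks epoch 1
```

The issue's proposal amounts to detecting a lost epoch-0 message earlier than the point where the backlog limit is finally exceeded.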
[jira] [Created] (SPARK-27354) Add a new empty hive-thriftserver module for Hive 2.3.4
Yuming Wang created SPARK-27354: --- Summary: Add a new empty hive-thriftserver module for Hive 2.3.4 Key: SPARK-27354 URL: https://issues.apache.org/jira/browse/SPARK-27354 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang When we upgrade the built-in Hive to 2.3.4, the current {{hive-thriftserver}} module is not compatible, because of Hive changes such as: # HIVE-12442 HiveServer2: Refactor/repackage HiveServer2's Thrift code so that it can be used in the tasks # HIVE-12237 Use slf4j as logging facade # HIVE-13169 HiveServer2: Support delegation token based connection when using http transport So we should add a new {{hive-thriftserver}} module for Hive 2.3.4: 1. Add a new empty module for Hive 2.3.4 named {{hive-thriftserverV2}}. 2. Ensure {{hive-thriftserver}} is only activated when testing with hadoop-2.7. 3. Ensure {{hive-thriftserverV2}} is only activated when testing with hadoop-3.2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27353) PySpark Row __repr__ bug
Ihor Bobak created SPARK-27353: -- Summary: PySpark Row __repr__ bug Key: SPARK-27353 URL: https://issues.apache.org/jira/browse/SPARK-27353 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Reporter: Ihor Bobak The Row class has this implementation of __repr__: def __repr__(self): """Printable representation of Row used in Python REPL.""" if hasattr(self, "__fields__"): return "Row(%s)" % ", ".join("%s=%r" % (k, v) for k, v in zip(self.__fields__, tuple(self))) else: return "<Row(%s)>" % ", ".join(self) The last line fails when you have a datetime.date instance in a row: TypeError Traceback (most recent call last) in 2 print(*row.values) 3 df_row = Row(*row.values) > 4 print(repr(df_row)) 5 break 6 E:\spark\spark-2.3.2-bin-without-hadoop\python\pyspark\sql\types.py in __repr__(self) 1579 for k, v in zip(self.__fields__, tuple(self))) 1580 else: -> 1581 return "<Row(%s)>" % ", ".join(self) 1582 1583 TypeError: sequence item 0: expected str instance, datetime.date found -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
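The failure and an obvious fix can be reproduced standalone: `", ".join(self)` requires every element to be a `str`, so converting elements with `str()` first avoids the TypeError. The class below is a simplified stand-in mirroring the shape of `pyspark.sql.types.Row.__repr__`, not the exact Spark patch:

```python
import datetime

# Simplified stand-in for pyspark.sql.types.Row (not the actual Spark patch).
# The buggy version did: return "<Row(%s)>" % ", ".join(self)
# which raises TypeError for any non-str element (e.g. datetime.date).
class Row(tuple):
    def __repr__(self):
        if hasattr(self, "__fields__"):
            return "Row(%s)" % ", ".join(
                "%s=%r" % (k, v) for k, v in zip(self.__fields__, tuple(self)))
        # Fix: stringify each element before joining.
        return "<Row(%s)>" % ", ".join(str(v) for v in self)

row = Row((datetime.date(2019, 4, 3), 42))
print(repr(row))  # <Row(2019-04-03, 42)>
```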
[jira] [Commented] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808588#comment-16808588 ] Wenchen Fan commented on SPARK-17592: - Shall we close it? It seems not worth making behavior changes at this point, just to be consistent with Hive. > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seems to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble). > I'm not sure there is any ISO standard for this; MySQL has the same behavior > as Hive, while PostgreSQL performs a string.toInt and throws a > NumberFormatException. > Personally I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
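The floor-vs-round difference the reporter describes can be reproduced in plain Python for the non-negative inputs shown. The `hive_cast`/`spark_cast` helpers below are hypothetical models of each engine's behavior as stated in the ticket, not engine code:

```python
import math
from decimal import Decimal, ROUND_HALF_UP

# Plain-Python models (illustrative, per the ticket's description) of the two
# behaviors for non-negative inputs:
#   Hive:  cast(s as int) ~ floor(double(s))
#   Spark: cast(s as int) ~ round-half-up(double(s))

def hive_cast(s):
    return math.floor(float(s))

def spark_cast(s):
    # Python's built-in round() rounds halves to even, so use Decimal
    # with ROUND_HALF_UP to model Spark's rounding.
    return int(Decimal(s).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

for s in ("0.4", "0.5", "0.6"):
    print(s, hive_cast(s), spark_cast(s))
# 0.4 -> Hive 0, Spark 0
# 0.5 -> Hive 0, Spark 1
# 0.6 -> Hive 0, Spark 1
```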
[jira] [Resolved] (SPARK-27344) Support the LocalDate and Instant classes in Java Bean encoders
[ https://issues.apache.org/jira/browse/SPARK-27344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27344. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24273 [https://github.com/apache/spark/pull/24273] > Support the LocalDate and Instant classes in Java Bean encoders > --- > > Key: SPARK-27344 > URL: https://issues.apache.org/jira/browse/SPARK-27344 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > - Check that Java Bean encoders support java.time.LocalDate and > java.time.Instant. Write a test for that. > - Update the comment: > https://github.com/apache/spark/pull/24249/files#diff-3e88c21c9270fef6eaf6f0e64ed81f27R152 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27344) Support the LocalDate and Instant classes in Java Bean encoders
[ https://issues.apache.org/jira/browse/SPARK-27344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27344: --- Assignee: Maxim Gekk > Support the LocalDate and Instant classes in Java Bean encoders > --- > > Key: SPARK-27344 > URL: https://issues.apache.org/jira/browse/SPARK-27344 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > - Check that Java Bean encoders support java.time.LocalDate and > java.time.Instant. Write a test for that. > - Update the comment: > https://github.com/apache/spark/pull/24249/files#diff-3e88c21c9270fef6eaf6f0e64ed81f27R152 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.
[ https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808530#comment-16808530 ] Martin Studer commented on SPARK-19860: --- We're observing the same issue with pyspark 2.3.0. It happens on an inner join of two data frames which have one single column in common (the join column). If I rename one of the columns as mentioned by [~wuchang1989] and then use a join expression the join succeeds. Isolating the problem seems difficult as it happens only in the context of a larger pipeline. > DataFrame join get conflict error if two frames has a same name column. > --- > > Key: SPARK-19860 > URL: https://issues.apache.org/jira/browse/SPARK-19860 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: wuchang >Priority: Major > > {code} > >>> print df1.collect() > [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', > in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), > Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', > in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), > Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', > in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), > Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', > in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), > Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', > in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), > Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', > in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), > Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', > in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), > Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', > in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), > Row(fdate=u'20170301', 
in_amount1=10159653)] > >>> print df2.collect() > [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', > in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), > Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', > in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), > Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', > in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), > Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', > in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), > Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', > in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), > Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', > in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), > Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', > in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), > Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', > in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), > Row(fdate=u'20170301', in_amount2=9475418)] > >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner') > 2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially > true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases. 
> Traceback (most recent call last): > File "", line 1, in > File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join > jdf = self._jdf.join(other._jdf, on._jc, how) > File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco > raise AnalysisException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.AnalysisException: u" > Failure when resolving conflicting references in Join: > 'Join Inner, (fdate#42 = fdate#42) > :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) > as int) AS in_amount1#97] > : +- Filter (inorout#44 = A) > : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, > fdate#42] > :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && > (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201))) > : +- SubqueryAlias history_transfer_v > : +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, > fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, > bankwaterid#48, waterid#49, waterstate#50, source#51] > : +- SubqueryAlias history_transfer > :+- >
[jira] [Commented] (SPARK-27335) cannot collect() from Correlation.corr
[ https://issues.apache.org/jira/browse/SPARK-27335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808442#comment-16808442 ] Natalino Busa commented on SPARK-27335: --- Unfortunately this error does not appear all the time. I have been unable to pinpoint when exactly happens it has something to do with how the SQLcontext and the SparkSession are initialized. However, I have discovered that starting SparkSession first and recreating the Singleton SQLcontext right after always works. > cannot collect() from Correlation.corr > -- > > Key: SPARK-27335 > URL: https://issues.apache.org/jira/browse/SPARK-27335 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: Natalino Busa >Priority: Major > > reproducing the bug from the example in the documentation: > > > {code:java} > import pyspark > from pyspark.ml.linalg import Vectors > from pyspark.ml.stat import Correlation > spark = pyspark.sql.SparkSession.builder.getOrCreate() > dataset = [[Vectors.dense([1, 0, 0, -2])], > [Vectors.dense([4, 5, 0, 3])], > [Vectors.dense([6, 7, 0, 8])], > [Vectors.dense([9, 0, 0, 1])]] > dataset = spark.createDataFrame(dataset, ['features']) > df = Correlation.corr(dataset, 'features', 'pearson') > df.collect() > > {code} > This produces the following stack trace: > > {code:java} > --- > AttributeErrorTraceback (most recent call last) > in () > 11 dataset = spark.createDataFrame(dataset, ['features']) > 12 df = Correlation.corr(dataset, 'features', 'pearson') > ---> 13 df.collect() > /opt/spark/python/pyspark/sql/dataframe.py in collect(self) > 530 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')] > 531 """ > --> 532 with SCCallSiteSync(self._sc) as css: > 533 sock_info = self._jdf.collectToPython() > 534 return list(_load_from_socket(sock_info, > BatchedSerializer(PickleSerializer( > /opt/spark/python/pyspark/traceback_utils.py in __enter__(self) > 70 def __enter__(self): > 71 if SCCallSiteSync._spark_stack_depth == 0: > ---> 72 
self._context._jsc.setCallSite(self._call_site) > 73 SCCallSiteSync._spark_stack_depth += 1 > 74 > AttributeError: 'NoneType' object has no attribute 'setCallSite'{code} > > > Analysis: > Somehow the dataframe properties `df.sql_ctx.sparkSession._jsparkSession` > and `spark._jsparkSession` do not match the ones available in the active > Spark session. > The following code fixes the problem (I hope this helps you narrow down > the root cause) > > {code:java} > df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession > df._sc = spark._sc > df.collect() > >>> [Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.0556, nan, 0.4005, > >>> 0.0556, 1.0, nan, 0.9136, nan, nan, 1.0, nan, 0.4005, 0.9136, nan, 1.0], > >>> False))]{code} >
[jira] [Updated] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!
[ https://issues.apache.org/jira/browse/SPARK-27352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuan Yifan updated SPARK-27352: --- Description: Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source community in China, focusing on Big Data and AI. Recently, we have been making progress on translating Spark documents. - [Source Of Document|https://github.com/apachecn/spark-doc-zh] - [Document Preview|http://spark.apachecn.org/] There are several reasons: *1. The English level of many Chinese users is not very good.* *2. Network problems, you know (China's magic network)!* *3. Online blogs are very messy.* We are very willing to do some Chinese localization for your project. If possible, please give us some authorization. Yifan Yuan from Apache CN You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] for more details was: Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source community in China, focusing on Big Data and AI. Recently, we have been making progress on translating HBase documents. - [Source Of Document|https://github.com/apachecn/spark-doc-zh] - [Document Preview|http://spark.apachecn.org/] There are several reasons: *1. The English level of many Chinese users is not very good.* *2. Network problems, you know (China's magic network)!* *3. Online blogs are very messy.* We are very willing to do some Chinese localization for your project. If possible, please give us some authorization. Yifan Yuan from Apache CN You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] for more details > Apply for translation of the Chinese version, I hope to get authorization! 
> --- > > Key: SPARK-27352 > URL: https://issues.apache.org/jira/browse/SPARK-27352 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Yuan Yifan >Priority: Minor > > Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source > community in China, focusing on Big Data and AI. > Recently, we have been making progress on translating Spark documents. > - [Source Of Document|https://github.com/apachecn/spark-doc-zh] > - [Document Preview|http://spark.apachecn.org/] > There are several reasons: > *1. The English level of many Chinese users is not very good.* > *2. Network problems, you know (China's magic network)!* > *3. Online blogs are very messy.* > We are very willing to do some Chinese localization for your project. If > possible, please give us some authorization. > Yifan Yuan from Apache CN > You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] > for more details
[jira] [Created] (SPARK-27352) Apply for translation of the Chinese version, I hope to get authorization!
Yuan Yifan created SPARK-27352: -- Summary: Apply for translation of the Chinese version, I hope to get authorization! Key: SPARK-27352 URL: https://issues.apache.org/jira/browse/SPARK-27352 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 2.4.0 Reporter: Yuan Yifan Hello everyone, we are [ApacheCN|https://www.apachecn.org/], an open-source community in China, focusing on Big Data and AI. Recently, we have been making progress on translating HBase documents. - [Source Of Document|https://github.com/apachecn/spark-doc-zh] - [Document Preview|http://spark.apachecn.org/] There are several reasons: *1. The English level of many Chinese users is not very good.* *2. Network problems, you know (China's magic network)!* *3. Online blogs are very messy.* We are very willing to do some Chinese localization for your project. If possible, please give us some authorization. Yifan Yuan from Apache CN You may contact me by mail [tsingjyuj...@163.com|mailto:tsingjyuj...@163.com] for more details
[jira] [Created] (SPARK-27351) Wrong outputRows estimation after AggregateEstimation with only null value column
peng bo created SPARK-27351: --- Summary: Wrong outputRows estimation after AggregateEstimation with only null value column Key: SPARK-27351 URL: https://issues.apache.org/jira/browse/SPARK-27351 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.1 Reporter: peng bo The upper bound on the number of group-by output rows is the product of the distinct counts of the group-by columns. However, a column containing only null values has a distinct count of 0, so the product, and thus the estimated output row count, becomes 0, which is incorrect. Ex: col1 (distinct: 2, rowCount 2) col2 (distinct: 0, rowCount 2) group by col1, col2 Actual: output rows: 0 Expected: output rows: 2 {code:java} var outputRows: BigInt = agg.groupingExpressions.foldLeft(BigInt(1))( (res, expr) => res * childStats.attributeStats(expr.asInstanceOf[Attribute]).distinctCount) {code}
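The estimation flaw above can be sketched in plain Python (the function name and shape are hypothetical, not Spark's actual API): multiplying the raw distinct counts zeroes the estimate as soon as one group-by column is all-null, whereas clamping each factor to at least 1, one possible fix, preserves the contribution of the other columns:

```python
def estimate_output_rows(distinct_counts):
    """Upper bound on group-by output rows: the product of the per-column
    distinct counts. Clamping each factor to at least 1 prevents a column
    containing only nulls (distinct count 0) from zeroing the estimate."""
    rows = 1
    for count in distinct_counts:
        rows *= max(count, 1)
    return rows

# The report's example: col1 has 2 distinct values, col2 contains only nulls.
naive = 2 * 0                          # current behavior: estimate collapses to 0
fixed = estimate_output_rows([2, 0])   # clamped estimate: 2
print(naive, fixed)  # 0 2
```

The clamp models the fact that an all-null column still contributes one group (the null group) under SQL GROUP BY semantics; whether the actual SPARK-27351 patch takes this exact approach is not stated in the report.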