[jira] [Commented] (SPARK-25572) SparkR tests failed on CRAN on Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632796#comment-16632796 ]

Apache Spark commented on SPARK-25572:
--------------------------------------

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/22589

> SparkR tests failed on CRAN on Java 10
> --------------------------------------
>
>                 Key: SPARK-25572
>                 URL: https://issues.apache.org/jira/browse/SPARK-25572
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.4.0
>            Reporter: Felix Cheung
>            Assignee: Felix Cheung
>            Priority: Major
>
> Follow-up to SPARK-24255.
> From the 2.3.2 release we can see that CRAN doesn't seem to respect the system
> requirements when running tests - we have seen cases where SparkR is run on
> Java 10, which unfortunately Spark does not start on. For 2.4, let's attempt
> skipping all tests.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25572) SparkR tests failed on CRAN on Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-25572:
---------------------------------
    Summary: SparkR tests failed on CRAN on Java 10  (was: SparkR to skip tests because Java 10)
[jira] [Assigned] (SPARK-25572) SparkR to skip tests because Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25572:
------------------------------------
    Assignee: Apache Spark  (was: Felix Cheung)
[jira] [Commented] (SPARK-25572) SparkR to skip tests because Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632795#comment-16632795 ]

Apache Spark commented on SPARK-25572:
--------------------------------------

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/22589
[jira] [Assigned] (SPARK-25572) SparkR to skip tests because Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25572:
------------------------------------
    Assignee: Felix Cheung  (was: Apache Spark)
[jira] [Created] (SPARK-25572) SparkR to skip tests because Java 10
Felix Cheung created SPARK-25572:
------------------------------------

             Summary: SparkR to skip tests because Java 10
                 Key: SPARK-25572
                 URL: https://issues.apache.org/jira/browse/SPARK-25572
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 2.4.0
            Reporter: Felix Cheung
            Assignee: Felix Cheung
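The skip decision above hinges on detecting the running Java version up front. As a rough sketch of the kind of check involved (illustrative parsing only; SparkR's actual fix is written in R and these helper names are hypothetical):

```scala
// Hypothetical helper: extract the Java major version from a
// "java.version"-style string, handling both the pre-9 scheme
// ("1.8.0_181") and the post-9 scheme ("10.0.2"), so a test harness
// can skip everything on an unsupported runtime.
def javaMajorVersion(version: String): Int = {
  val parts = version.split("\\.")
  if (parts(0) == "1") parts(1).toInt            // "1.8.0_181" -> 8
  else parts(0).takeWhile(_.isDigit).toInt       // "10.0.2"    -> 10
}

// Spark 2.x starts only on Java 8, so skip on anything newer.
def shouldSkipTests(version: String): Boolean = javaMajorVersion(version) > 8
```

In a real run the version string would come from `System.getProperty("java.version")` (or the equivalent in R); the parsing above only illustrates the two numbering schemes in play.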
[jira] [Commented] (SPARK-25571) Add withColumnsRenamed method to Dataset
[ https://issues.apache.org/jira/browse/SPARK-25571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632777#comment-16632777 ]

Chaerim Yeo commented on SPARK-25571:
-------------------------------------

I'm working on it now.

> Add withColumnsRenamed method to Dataset
> ----------------------------------------
>
>                 Key: SPARK-25571
>                 URL: https://issues.apache.org/jira/browse/SPARK-25571
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Chaerim Yeo
>            Priority: Major
>
> There are two general approaches to renaming several columns:
> * using the *withColumnRenamed* method
> * using the *select* method
> {code}
> // Using withColumnRenamed
> ds.withColumnRenamed("first_name", "firstName")
>   .withColumnRenamed("last_name", "lastName")
>   .withColumnRenamed("postal_code", "postalCode")
>
> // Using select
> ds.select(
>   $"id",
>   $"first_name" as "firstName",
>   $"last_name" as "lastName",
>   $"address",
>   $"postal_code" as "postalCode"
> )
> {code}
> However, both approaches are inefficient and redundant due to the following
> limitations:
> * withColumnRenamed: it must be called once per renamed column
> * select: all columns, including the unrenamed ones, must be passed to it
> It is necessary to implement a new method, such as *withColumnsRenamed*, which
> can rename many columns at once:
> {code}
> ds.withColumnsRenamed(
>   "first_name" -> "firstName",
>   "last_name" -> "lastName",
>   "postal_code" -> "postalCode"
> )
>
> // or
> ds.withColumnsRenamed(Map(
>   "first_name" -> "firstName",
>   "last_name" -> "lastName",
>   "postal_code" -> "postalCode"
> ))
> {code}
[jira] [Created] (SPARK-25571) Add withColumnsRenamed method to Dataset
Chaerim Yeo created SPARK-25571:
-----------------------------------

             Summary: Add withColumnsRenamed method to Dataset
                 Key: SPARK-25571
                 URL: https://issues.apache.org/jira/browse/SPARK-25571
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.2
            Reporter: Chaerim Yeo
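Until such a method exists, the proposed behavior can be approximated by folding the rename pairs through the existing single-column `withColumnRenamed`. A minimal model of that idea (a toy `ColumnSet` stands in for `Dataset` so the sketch runs without a Spark dependency; only `Dataset.withColumnRenamed` itself is a real API):

```scala
// Toy stand-in for Dataset: tracks only column names.
final case class ColumnSet(columns: Seq[String]) {
  // Mirrors Dataset.withColumnRenamed: a no-op when the column is absent.
  def withColumnRenamed(existing: String, newName: String): ColumnSet =
    ColumnSet(columns.map(c => if (c == existing) newName else c))

  // The proposed bulk rename, expressed as a fold over the single rename.
  def withColumnsRenamed(renames: (String, String)*): ColumnSet =
    renames.foldLeft(this) { case (ds, (from, to)) =>
      ds.withColumnRenamed(from, to)
    }
}

val ds = ColumnSet(Seq("id", "first_name", "last_name", "address", "postal_code"))
val renamed = ds.withColumnsRenamed(
  "first_name"  -> "firstName",
  "last_name"   -> "lastName",
  "postal_code" -> "postalCode")
```

The same `foldLeft` pattern works on a real `Dataset` today: `renames.foldLeft(ds) { case (d, (from, to)) => d.withColumnRenamed(from, to) }`, which is essentially what a built-in `withColumnsRenamed` would wrap.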
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632771#comment-16632771 ]

Apache Spark commented on SPARK-25262:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22588

> Make Spark local dir volumes configurable with Spark on Kubernetes
> ------------------------------------------------------------------
>
>                 Key: SPARK-25262
>                 URL: https://issues.apache.org/jira/browse/SPARK-25262
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Rob Vesse
>            Priority: Major
>
> As discussed during review of the design document for SPARK-24434: while
> pod templates will provide more in-depth customisation for Spark on
> Kubernetes, there are some things that cannot be modified because Spark code
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}},
> which is done by {{LocalDirsFeatureStep.scala}}. For each directory
> specified, or for a single default if there is no explicit specification, it creates
> a Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation,
> this will be backed by the node storage
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some
> compute environments this may be extremely undesirable. For example, with
> diskless compute resources the node storage will likely be a non-performant
> remote mounted disk, often with limited capacity. For such environments it
> would likely be better to set {{medium: Memory}} on the volume, per the K8S
> documentation, to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different
> volume type to back the local directories, and there is currently no way to do
> that.
> Pod templates will not really solve either of these issues, because Spark is
> always going to attempt to generate a new volume for each local directory and
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}}-backed {{emptyDir}}
> volumes
> * Modify the logic to check if there is a volume already defined with the
> name, and if so skip generating a volume definition for it
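The two proposed changes can be sketched as one small pure function: generate a volume per local dir, honour a tmpfs flag, and skip any volume name that is already defined (e.g. by a pod template). All names below are illustrative; this is not the actual `LocalDirsFeatureStep` code:

```scala
// Toy model of a K8S volume: medium "Memory" selects tmpfs for emptyDir.
final case class Volume(name: String, medium: String)

// Build volumes for spark.local.dirs, skipping names already defined
// elsewhere and optionally backing the generated ones with tmpfs.
def localDirVolumes(
    localDirs: Seq[String],
    existingVolumeNames: Set[String],
    useTmpfs: Boolean): Seq[Volume] = {
  val medium = if (useTmpfs) "Memory" else "" // "" = node-storage default
  localDirs.indices.collect {
    case i if !existingVolumeNames.contains(s"spark-local-dir-$i") =>
      Volume(s"spark-local-dir-$i", medium)
  }
}
```

The key point is the `existingVolumeNames` check: a user-supplied volume (of any type) with a matching name suppresses Spark's generated `emptyDir`, which is exactly the escape hatch the second bullet asks for.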
[jira] [Resolved] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-25570.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.1
                   2.3.3
                   2.5.0

Issue resolved by pull request 22587
[https://github.com/apache/spark/pull/22587]

> Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
> ------------------------------------------------------------
>
>                 Key: SPARK-25570
>                 URL: https://issues.apache.org/jira/browse/SPARK-25570
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.3.3, 2.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 2.5.0, 2.3.3, 2.4.1
>
> This issue aims to prevent test slowdowns in HiveExternalCatalogVersionsSuite
> by using the latest Spark 2.3.2, because the Apache mirror will eventually
> remove the old Spark 2.3.1. HiveExternalCatalogVersionsSuite will not fail,
> because SPARK-24813 implements a fallback logic, but it causes many retries in
> all builds over `branch-2.3/branch-2.4/master`, so we had better fix it.
[jira] [Assigned] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-25570:
------------------------------------
    Assignee: Dongjoon Hyun
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632729#comment-16632729 ]

Steven Rand commented on SPARK-25538:
-------------------------------------

[~kiszk] that makes sense, I'll try to do so. The issue I've been having so far is that when I run the UDF I've written to change the data (while preserving the number of duplicate rows), the resulting DataFrame doesn't reproduce the issue.

> incorrect row counts after distinct()
> -------------------------------------
>
>                 Key: SPARK-25538
>                 URL: https://issues.apache.org/jira/browse/SPARK-25538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Reproduced on a CentOS 7 VM and from source in IntelliJ
> on OS X.
>            Reporter: Steven Rand
>            Priority: Major
>              Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after
> SPARK-23713. It's possible that other operations are affected as well;
> {{distinct}} just happens to be the one that we noticed. I believe that this
> issue was introduced by SPARK-23713 because I can't reproduce it before that
> commit, and I've been able to reproduce it after that commit as well as with
> {{tags/v2.4.0-rc1}}.
> Below are example spark-shell sessions to illustrate the problem.
> Unfortunately the data used in these examples can't be uploaded to this Jira
> ticket. I'll try to create test data which also reproduces the issue, and
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
>
> scala> df.count
> res0: Long = 123
>
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
>
> scala> df.count
> res0: Long = 123
>
> scala> df.distinct.count
> res1: Long = 116
>
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
>
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}
[jira] [Resolved] (SPARK-25559) Just remove the unsupported predicates in Parquet
[ https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-25559.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.5.0

Issue resolved by pull request 22574
[https://github.com/apache/spark/pull/22574]

> Just remove the unsupported predicates in Parquet
> -------------------------------------------------
>
>                 Key: SPARK-25559
>                 URL: https://issues.apache.org/jira/browse/SPARK-25559
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Major
>             Fix For: 2.5.0
>
> Currently, in *ParquetFilters*, if one of the child predicates is not
> supported by Parquet, the entire predicate is thrown away. In fact, if the
> unsupported predicate is in the top-level *And* condition, or in a child
> reached before hitting a *Not* or *Or* condition, it's safe to just remove
> the unsupported one and report it as an unhandled filter.
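The rule described above can be illustrated on a toy predicate tree: unsupported leaves under top-level `And` conjunctions can be dropped individually, while anything under `Or` or `Not` must be fully supported or discarded whole (dropping a disjunct or a negated child would change which rows match). This models the idea only; it is not the `ParquetFilters` implementation:

```scala
sealed trait Pred
final case class Leaf(name: String, supported: Boolean) extends Pred
final case class And(left: Pred, right: Pred) extends Pred
final case class Or(left: Pred, right: Pred) extends Pred
final case class Not(child: Pred) extends Pred

def fullySupported(p: Pred): Boolean = p match {
  case Leaf(_, s) => s
  case And(l, r)  => fullySupported(l) && fullySupported(r)
  case Or(l, r)   => fullySupported(l) && fullySupported(r)
  case Not(c)     => fullySupported(c)
}

// Returns the pushable remainder of the predicate, or None if nothing survives.
def prune(p: Pred): Option[Pred] = p match {
  case l: Leaf => if (l.supported) Some(l) else None
  case And(l, r) =>
    (prune(l), prune(r)) match {
      case (Some(a), Some(b)) => Some(And(a, b))
      case (a, b)             => a.orElse(b) // keep whichever conjunct survived
    }
  // Under Or/Not, dropping a child would change query results, so the
  // whole subtree must be supported to be pushed down at all.
  case other => if (fullySupported(other)) Some(other) else None
}
```

Pruning is safe for conjuncts because `a AND b` is only ever narrowed by `a` alone; the dropped conjunct simply gets re-evaluated by Spark after the scan.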
[jira] [Commented] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632667#comment-16632667 ]

Apache Spark commented on SPARK-25570:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22587
[jira] [Assigned] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25570:
------------------------------------
    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25570:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632666#comment-16632666 ]

Apache Spark commented on SPARK-25570:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22587
[jira] [Created] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
Dongjoon Hyun created SPARK-25570:
-------------------------------------

             Summary: Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
                 Key: SPARK-25570
                 URL: https://issues.apache.org/jira/browse/SPARK-25570
             Project: Spark
          Issue Type: Bug
          Components: SQL, Tests
    Affects Versions: 2.3.3, 2.4.0
            Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-25449) Don't send zero accumulators in heartbeats
[ https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-25449.
----------------------------------
       Resolution: Fixed
         Assignee: Mukul Murthy
    Fix Version/s: 2.5.0

> Don't send zero accumulators in heartbeats
> ------------------------------------------
>
>                 Key: SPARK-25449
>                 URL: https://issues.apache.org/jira/browse/SPARK-25449
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Mukul Murthy
>            Assignee: Mukul Murthy
>            Priority: Major
>             Fix For: 2.5.0
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics
> and are generally on the order of a few KBs. However, for large jobs with
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks
> to die with heartbeat failures. We can mitigate this by not sending zero
> metrics to the driver.
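The mitigation amounts to filtering metric values before they are packed into the heartbeat message. A minimal sketch with hypothetical names (the real change lives in the executor heartbeat path, not in a standalone helper like this):

```scala
// Hypothetical per-accumulator metric carried in a heartbeat.
final case class Metric(name: String, value: Long)

// Keep only metrics that have actually moved off zero; for jobs with many
// tasks and many accumulators this is what shrinks a heartbeat from tens
// of MBs back toward a few KBs.
def heartbeatPayload(metrics: Seq[Metric]): Seq[Metric] =
  metrics.filter(_.value != 0L)
```

The driver can treat an absent metric as zero, so dropping zero entries is lossless for aggregation while cutting the serialized message size.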
[jira] [Resolved] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-25429.
-----------------------------
       Resolution: Fixed
         Assignee: Yuming Wang
    Fix Version/s: 2.5.0

> SparkListenerBus inefficient due to
> 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
> ------------------------------------------------------------------
>
>                 Key: SPARK-25429
>                 URL: https://issues.apache.org/jira/browse/SPARK-25429
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: DENG FEI
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 2.5.0
>
> {code:java}
> private def updateStageMetrics(
>     stageId: Int,
>     attemptId: Int,
>     taskId: Long,
>     accumUpdates: Seq[AccumulableInfo],
>     succeeded: Boolean): Unit = {
>   Option(stageMetrics.get(stageId)).foreach { metrics =>
>     if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) {
>       return
>     }
>
>     val oldTaskMetrics = metrics.taskMetrics.get(taskId)
>     if (oldTaskMetrics != null && oldTaskMetrics.succeeded) {
>       return
>     }
>
>     val updates = accumUpdates
>       .filter { acc => acc.update.isDefined && metrics.accumulatorIds.contains(acc.id) }
>       .sortBy(_.id)
>
>     if (updates.isEmpty) {
>       return
>     }
>
>     val ids = new Array[Long](updates.size)
>     val values = new Array[Long](updates.size)
>     updates.zipWithIndex.foreach { case (acc, idx) =>
>       ids(idx) = acc.id
>       // In a live application, accumulators have Long values, but when reading from event
>       // logs, they have String values. For now, assume all accumulators are Long and convert
>       // accordingly.
>       values(idx) = acc.update.get match {
>         case s: String => s.toLong
>         case l: Long => l
>         case o => throw new IllegalArgumentException(s"Unexpected: $o")
>       }
>     }
>
>     // TODO: storing metrics by task ID can cause metrics for the same task index to be
>     // counted multiple times, for example due to speculation or re-attempts.
>     metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, succeeded))
>   }
> }
> {code}
> Note the 'metrics.accumulatorIds.contains(acc.id)' call: if a large SQL application
> generates many accumulators, using Array#contains for each lookup is inefficient.
> In practice, the application may time out while shutting down and be killed by the
> RM in YARN mode.
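The complaint is that `Array[Long]#contains` is a linear scan, so the filter above costs O(updates × ids) per task. Converting the ids to a `Set` once per stage turns each lookup into a hash probe. A self-contained illustration of the difference (not the listener code itself):

```scala
// Minimal stand-in for the accumulator update record.
final case class AccumulableInfo(id: Long, update: Option[Long])

val accumulatorIds: Array[Long] = Array.tabulate(10000)(_.toLong)
val idSet: Set[Long] = accumulatorIds.toSet // built once, reused per update

val updates = Seq(AccumulableInfo(5L, Some(1L)), AccumulableInfo(999999L, Some(2L)))

// Linear scan per update: O(ids) for every accumulator checked.
val viaArray = updates.filter(a => a.update.isDefined && accumulatorIds.contains(a.id))
// Hash lookup per update: effectively O(1) each, same result.
val viaSet = updates.filter(a => a.update.isDefined && idSet.contains(a.id))
```

For a stage with thousands of accumulators and thousands of task updates, the one-time `toSet` conversion is cheap relative to the repeated scans it eliminates, which is the essence of the fix described above.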
[jira] [Resolved] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE
[ https://issues.apache.org/jira/browse/SPARK-25458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-25458.
-----------------------------
       Resolution: Fixed
         Assignee: Dilip Biswal
    Fix Version/s: 2.5.0

> Support FOR ALL COLUMNS in ANALYZE TABLE
> -----------------------------------------
>
>                 Key: SPARK-25458
>                 URL: https://issues.apache.org/jira/browse/SPARK-25458
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.5.0
>            Reporter: Xiao Li
>            Assignee: Dilip Biswal
>            Priority: Major
>             Fix For: 2.5.0
>
> Currently, to collect the statistics of all the columns, users need to
> specify the names of all the columns when calling the command "ANALYZE TABLE
> ... FOR COLUMNS...". This is not user friendly. Instead, we can introduce the
> following SQL command to achieve it without specifying the column names:
> {code:java}
>    ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS;
> {code}
[jira] [Created] (SPARK-25569) Failing a Spark job when an accumulator cannot be updated
Shixiong Zhu created SPARK-25569:
------------------------------------

             Summary: Failing a Spark job when an accumulator cannot be updated
                 Key: SPARK-25569
                 URL: https://issues.apache.org/jira/browse/SPARK-25569
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Shixiong Zhu

Currently, when Spark fails to merge an accumulator update from a task, it will not fail the task (see https://github.com/apache/spark/blob/b7d80349b0e367d78cab238e62c2ec353f0f12b3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1266), so an accumulator update failure may be ignored silently. Some users may want to use accumulators for business-critical purposes and would like a job to fail when an accumulator is broken. We can add a flag to always fail a Spark job when hitting an accumulator failure, or we can add a new property to an accumulator and only fail a Spark job when such an accumulator fails.
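The flag-based variant could be sketched as follows: today the exception path is swallowed; with the flag set, it would propagate and fail the job. All names here are hypothetical illustrations, not the DAGScheduler API:

```scala
// Attempt an accumulator merge. When failJobOnError is false (today's
// behavior) the exception is swallowed and returned so the caller can log
// it; when true, it propagates so the scheduler fails the job.
def mergeAccumulatorUpdate(merge: => Unit, failJobOnError: Boolean): Option[Throwable] =
  try { merge; None }
  catch {
    case e: Exception if !failJobOnError => Some(e) // log-and-ignore path
  }
```

The per-accumulator variant mentioned at the end would move the `failJobOnError` decision from a global flag onto a property of the accumulator being merged, so only accumulators explicitly marked as critical can fail the job.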
[jira] [Commented] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632568#comment-16632568 ]

Dongjoon Hyun commented on SPARK-25542:
---------------------------------------

I marked this as 2.4.1 because we are in the middle of the RC2 vote. cc [~cloud_fan]

> Flaky test: OpenHashMapSuite
> ----------------------------
>
>                 Key: SPARK-25542
>                 URL: https://issues.apache.org/jira/browse/SPARK-25542
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.5.0
>            Reporter: Dongjoon Hyun
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>             Fix For: 2.4.1
>
> - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96585/testReport/org.apache.spark.util.collection/OpenHashMapSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/]
> (Sep 25, 2018 5:52:56 PM)
> {code:java}
> org.apache.spark.util.collection.OpenHashMapSuite.(It is not a test it is a sbt.testing.SuiteSelector)
> Failing for the past 1 build (Since #96585 )
> Took 0 ms.
> Error Message
> sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
> Stacktrace
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
> 	at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:117)
> 	at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:115)
> 	at org.apache.spark.util.collection.OpenHashMap$$anonfun$1.apply$mcVI$sp(OpenHashMap.scala:159)
> 	at org.apache.spark.util.collection.OpenHashSet.rehash(OpenHashSet.scala:234)
> 	at org.apache.spark.util.collection.OpenHashSet.rehashIfNeeded(OpenHashSet.scala:171)
> 	at org.apache.spark.util.collection.OpenHashMap$mcI$sp.update$mcI$sp(OpenHashMap.scala:86)
> 	at org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17$$anonfun$apply$4.apply$mcVI$sp(OpenHashMapSuite.scala:192)
> 	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> 	at org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:191)
> 	at org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:188)
> 	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> 	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 	at org.scalatest.Transformer.apply(Transformer.scala:22)
> 	at org.scalatest.Transformer.apply(Transformer.scala:20)
> 	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> 	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
> 	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
> 	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> 	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
> 	at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
> 	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> 	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
> 	at scala.collection.immutable.List.foreach(List.scala:392)
> 	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> 	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
> 	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
> 	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
> 	at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> {code}
[jira] [Updated] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25542:
----------------------------------
    Fix Version/s:     (was: 2.4.0)
                   2.4.1
[jira] [Assigned] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-25542:
-------------------------------------
    Assignee: Liang-Chi Hsieh
[jira] [Resolved] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-25542.
-----------------------------------
    Resolution: Fixed

Resolved via https://github.com/apache/spark/pull/22569
[jira] [Updated] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25542:
----------------------------------
    Fix Version/s: 2.4.0
[jira] [Assigned] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
[ https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25568:
------------------------------------
    Assignee: Apache Spark  (was: Shixiong Zhu)

> Continue to update the remaining accumulators when failing to update one accumulator
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-25568
>                 URL: https://issues.apache.org/jira/browse/SPARK-25568
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: Shixiong Zhu
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently when failing to update an accumulator, DAGScheduler.updateAccumulators will skip the remaining accumulators. We should try to update the remaining accumulators if possible so that they can still report correct values.
[jira] [Assigned] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
[ https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25568:
------------------------------------
    Assignee: Shixiong Zhu  (was: Apache Spark)
[jira] [Commented] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
[ https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632547#comment-16632547 ]

Apache Spark commented on SPARK-25568:
--------------------------------------

User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/22586
[jira] [Created] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
Shixiong Zhu created SPARK-25568:
---------------------------------

             Summary: Continue to update the remaining accumulators when failing to update one accumulator
                 Key: SPARK-25568
                 URL: https://issues.apache.org/jira/browse/SPARK-25568
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.2, 2.4.0
            Reporter: Shixiong Zhu
            Assignee: Shixiong Zhu

Currently when failing to update an accumulator, DAGScheduler.updateAccumulators will skip the remaining accumulators. We should try to update the remaining accumulators if possible.
[jira] [Updated] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
[ https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-25568:
---------------------------------
    Description: Currently when failing to update an accumulator, DAGScheduler.updateAccumulators will skip the remaining accumulators. We should try to update the remaining accumulators if possible so that they can still report correct values.
    (was: Currently when failing to update an accumulator, DAGScheduler.updateAccumulators will skip the remaining accumulators. We should try to update the remaining accumulators if possible.)
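The intended behavior can be sketched in plain Python: wrap each individual accumulator update in its own try/except so that one failure is recorded and the loop continues, instead of aborting and skipping the rest. This only models the idea; the names are illustrative, not DAGScheduler's actual API.

```python
def update_accumulators(updates):
    """Apply each (accumulator, value) update; collect failures instead of
    stopping at the first one, so healthy accumulators still report values."""
    failures = []
    for acc, value in updates:
        try:
            acc.add(value)
        except Exception as exc:
            failures.append((acc, exc))  # would be logged, then skipped
    return failures

class Acc:
    def __init__(self):
        self.total = 0
    def add(self, v):
        self.total += v

class BrokenAcc(Acc):
    def add(self, v):
        raise ValueError("cannot update")

a, b, c = Acc(), BrokenAcc(), Acc()
failed = update_accumulators([(a, 1), (b, 2), (c, 3)])
assert (a.total, c.total) == (1, 3)  # c is still updated despite b failing
assert len(failed) == 1
```

Under the pre-fix behavior, the failure on `b` would have left `c.total` at 0 as well.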
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632512#comment-16632512 ]

Nicholas Chammas commented on SPARK-25150:
------------------------------------------

Correct, this isn't a cross join. It's just a plain inner join. In theory, whether cross joins are enabled or not should have no bearing on the result. However, what we're seeing is that without them enabled we get an incorrect error, and with them enabled we get incorrect results.

If we were actually trying a cross join (i.e. no {{on=(...)}} condition specified), I think those results (with the 4 output rows) would still be incorrect, since you'd expect NH's population to be combined with RI's stats in one of the output rows, but that's not the case. You'd also expect MA to show up in the output, too.

> The second join joins on a column in {{states}}, but that is not a DataFrame used in that join. Is that the problem?

Not sure what you mean here. Both joins join on {{states}}, which is the first DataFrame in the definition of {{analysis}}.

> Joining DataFrames derived from the same source yields confusing/incorrect results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Chammas
>            Priority: Major
>         Attachments: expected-output.txt, output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, persons.csv, states.csv, zombie-analysis.py
>
> I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should be a left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior.
[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632511#comment-16632511 ]

Apache Spark commented on SPARK-23285:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22585

> Allow spark.executor.cores to be fractional
> -------------------------------------------
>
>                 Key: SPARK-23285
>                 URL: https://issues.apache.org/jira/browse/SPARK-23285
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes, Scheduler, Spark Submit
>    Affects Versions: 2.4.0
>            Reporter: Anirudh Ramanathan
>            Assignee: Yinan Li
>            Priority: Minor
>             Fix For: 2.4.0
>
> There is a strong check for an integral number of cores per executor in [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. Given we're reusing that property in K8s, does it make sense to relax it?
>
> K8s treats CPU as a "compressible resource" and can actually assign millicpus to individual containers. Also to be noted: spark.driver.cores has no such check in place.
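The relaxation under discussion can be sketched as a small validator: accept whole numbers everywhere, and allow Kubernetes-style fractional values (including the `500m` millicpu notation K8s uses) only when the master is `k8s://`. This is a hypothetical illustration of the idea, not Spark's actual parsing code.

```python
def parse_executor_cores(value, master):
    """Parse an executor-cores setting, allowing fractional / millicpu
    values only for Kubernetes masters (illustrative sketch)."""
    is_k8s = master.startswith("k8s://")
    if value.endswith("m"):                 # millicpu notation, e.g. "500m"
        cores = int(value[:-1]) / 1000.0
    else:
        cores = float(value)
    if cores <= 0:
        raise ValueError("spark.executor.cores must be positive")
    if not is_k8s and not cores.is_integer():
        # mirrors the strict integral check in SparkSubmitArguments
        raise ValueError("fractional executor cores are only meaningful on Kubernetes")
    return cores

assert parse_executor_cores("2", "yarn") == 2.0
assert parse_executor_cores("500m", "k8s://https://host:6443") == 0.5
```

The point of the check's placement is that non-K8s cluster managers schedule whole cores, so a fractional value there is a user error rather than a valid request.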
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632491#comment-16632491 ]

Apache Spark commented on SPARK-25262:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22584

> Make Spark local dir volumes configurable with Spark on Kubernetes
> ------------------------------------------------------------------
>
>                 Key: SPARK-25262
>                 URL: https://issues.apache.org/jira/browse/SPARK-25262
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Rob Vesse
>            Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while pod templates will provide more in-depth customisation for Spark on Kubernetes, there are some things that cannot be modified because Spark code generates pod specs in very specific ways.
>
> The particular issue identified relates to the handling of {{spark.local.dirs}}, which is done by {{LocalDirsFeatureStep.scala}}. For each directory specified, or a single default if there is no explicit specification, it creates a Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation, this will be backed by the node storage (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some compute environments this may be extremely undesirable. For example, with diskless compute resources the node storage will likely be a non-performant remote mounted disk, often with limited capacity. For such environments it would likely be better to set {{medium: Memory}} on the volume, per the K8S documentation, to use a {{tmpfs}} volume instead.
>
> Another closely related issue is that users might want to use a different volume type to back the local directories, and there is no possibility to do that.
>
> Pod templates will not really solve either of these issues because Spark is always going to attempt to generate a new volume for each local directory and always going to set these as {{emptyDir}}.
>
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}}-backed {{emptyDir}} volumes
> * Modify the logic to check whether a volume is already defined with the name and, if so, skip generating a volume definition for it
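The two proposed changes can be sketched with plain dicts standing in for Kubernetes volume specs: a flag that switches generated `emptyDir` volumes to tmpfs (`medium: Memory`), and a check that skips any volume whose name is already defined (e.g. by a pod template). The config-flag name, volume naming scheme, and helper are hypothetical, not Spark's actual implementation.

```python
def build_local_dir_volumes(local_dirs, use_tmpfs, existing_volume_names):
    """Generate one emptyDir volume spec per local dir, unless a volume of
    the same name already exists; optionally back them with tmpfs."""
    volumes = []
    for i, _path in enumerate(local_dirs):
        name = "spark-local-dir-%d" % i
        if name in existing_volume_names:
            continue  # user already supplied this volume, don't overwrite it
        # medium: Memory makes Kubernetes back the emptyDir with tmpfs
        empty_dir = {"medium": "Memory"} if use_tmpfs else {}
        volumes.append({"name": name, "emptyDir": empty_dir})
    return volumes

vols = build_local_dir_volumes(
    ["/tmp/d0", "/tmp/d1"],
    use_tmpfs=True,
    existing_volume_names={"spark-local-dir-1"},
)
assert vols == [{"name": "spark-local-dir-0", "emptyDir": {"medium": "Memory"}}]
```

With this shape, a pod template that predefines `spark-local-dir-1` as, say, a hostPath volume would be left untouched, addressing the second half of the proposal.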
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632485#comment-16632485 ]

Sean Owen commented on SPARK-25150:
-----------------------------------

Hm, I am not sure I understand the example yet – help me clarify here. We have three dataframes, really; states, humans, zombies:
{code:java}
State,Total Population,Total Area
RI,120,30
MA,800,1700
NH,330,910

+-----+-----+
|State|count|
+-----+-----+
|   RI|    2|
|   NH|    1|
+-----+-----+

+-----+-----+
|State|count|
+-----+-----+
|   RI|    1|
|   MA|    1|
+-----+-----+{code}
You join all three on state:
{code:java}
analysis = (
    states
    .join(
        total_humans,
        on=(states['State'] == total_humans['State'])
    )
    .join(
        total_zombies,
        on=(states['State'] == total_zombies['State'])
    )
    .orderBy(states['State'], ascending=True)
    .select(
        states['State'],
        states['Total Population'],
        total_humans['count'].alias('Total Humans'),
        total_zombies['count'].alias('Total Zombies'),
    )
)
{code}
and you get
{code:java}
+-----+----------------+------------+-------------+
|State|Total Population|Total Humans|Total Zombies|
+-----+----------------+------------+-------------+
|   NH|             330|           1|            1|
|   NH|             330|           1|            1|
|   RI|             120|           2|            1|
|   RI|             120|           2|            1|
+-----+----------------+------------+-------------+{code}
But say you expect
{code:java}
+-----+----------------+------------+-------------+
|State|Total Population|Total Humans|Total Zombies|
+-----+----------------+------------+-------------+
|   RI|             120|           2|            1|
+-----+----------------+------------+-------------+{code}
First, this isn't a cross join, right? The message says it thinks there is no join condition and wonders if you're really trying to do a cross join, but you're not, so enabling it isn't helping. If these were cross-joins, the output would be correct, I think?

The second join joins on a column in {{states}}, but that is not a DataFrame used in that join. Is that the problem?
> Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
[ https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632483#comment-16632483 ] Karthik Manamcheri commented on SPARK-25561: [~michael] thanks for the prompt reply. This is hard to test because the problem happens only when HMS goes into fallback ORM mode. For that to happen, we need the direct SQL query to fail in HMS. There are no consistent bugs (that I know of) which can be used to test this in a deterministic fashion. I was able to run into this on Hive 1.1.0. However, as I understand it, the HMS behavior of falling back to ORM has been the same in Hive from the beginning. Not sure. > HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql > -- > > Key: SPARK-25561 > URL: https://issues.apache.org/jira/browse/SPARK-25561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Karthik Manamcheri >Priority: Major > > In HiveShim.scala, the current behavior is that if > hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter > call to succeed. If it fails, we'll throw a RuntimeException. > However, this might not always be the case. Hive's direct SQL functionality > is best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark > should handle that exception correctly if Hive falls back to ORM.
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632381#comment-16632381 ] Nicholas Chammas commented on SPARK-25150: -- ([~petertoth] - Seeing your comment edit now.) OK, so it seems the two problems I identified are accurate, but they have a common root cause. Thanks for confirming. [~srowen] - Given Peter's confirmation that the results with cross join enabled are incorrect, I believe we should mark this as a correctness issue. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. 
> I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632281#comment-16632281 ] Nicholas Chammas commented on SPARK-25150: -- I've uploaded the expected output. I realize that the reproduction I've attached to this ticket (zombie-analysis.py plus the related files), though complete and self-contained, is a bit verbose. If it's not helpful enough I will see if I can boil it down further. [~petertoth] - I suggest you take another look at the output with cross joins enabled and compare it to what (I think) is the correct expected output. If I'm understanding things correctly, there are two issues: 1) the bad error when cross join is not enabled (there should be no error), and 2) the incorrect results when cross join _is_ enabled (the results I just uploaded). Your PR doesn't appear to investigate or address the incorrect results issue, so I'm not sure if it would fix that too or if I am just mistaken about there being a second issue. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632274#comment-16632274 ] Peter Toth edited comment on SPARK-25150 at 9/28/18 7:28 PM: - [~nchammas], sorry for the late reply. There is only one issue here. Please see zombie-analysis.py: it contains 2 joins and both joins define the condition explicitly, so setting spark.sql.crossJoin.enabled=true should not have any effect. The root cause of the error you see when spark.sql.crossJoin.enabled=false (default) and the incorrect results when spark.sql.crossJoin.enabled=true is the same: the join condition is handled incorrectly. Please see my PR's description for further details: [https://github.com/apache/spark/pull/22318] was (Author: petertoth): [~nchammas], sorry for the late reply. There is only one issue here. Please see zombie-analysis.py: it contains 2 joins and both joins define the condition explicitly, so setting spark.sql.crossJoin.enabled=true should not have any effect. Simply, the SQL statement should not fail; please see my PR's description for further details: [https://github.com/apache/spark/pull/22318] > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-25150: - Attachment: expected-output.txt > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632274#comment-16632274 ] Peter Toth commented on SPARK-25150: [~nchammas], sorry for the late reply. There is only one issue here. Please see zombie-analysis.py: it contains 2 joins and both joins define the condition explicitly, so setting spark.sql.crossJoin.enabled=true should not have any effect. Simply, the SQL statement should not fail; please see my PR's description for further details: [https://github.com/apache/spark/pull/22318] > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-25150: - Description: I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error: {code:java} Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true; {code} Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers. I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning. I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled. I realize the join I've written is not "correct" in the sense that it should be left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior. was: I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error: {code:java} Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true; {code} Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers. I am not sure if I am missing something obvious, or if there is some kind of bug here. 
The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning. I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled. I realize the join I've written is not correct in the sense that it should be left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. 
> I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632252#comment-16632252 ] Nicholas Chammas commented on SPARK-25150: -- The attachments on this ticket contain a complete reproduction. The comment towards the beginning of zombie-analysis.py points to the config that, when enabled, appears to yield incorrect results. (Without the config enabled we get a confusing/incorrect error, which is a second issue.) The results with and without the config enabled are also attached here. I will add another attachment showing the expected results. I believe some folks over on the linked PR provided a simpler reproduction of part of this issue, but I haven't taken a close look at it to see if it captures the same two issues (incorrect results + confusing/incorrect error). > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. 
The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632234#comment-16632234 ] Sean Owen commented on SPARK-25150: --- What's an example of expected vs actual results here that show the bug? is it simple to summarize? > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. 
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1663#comment-1663 ] Nicholas Chammas commented on SPARK-25150: -- [~cloud_fan] / [~srowen] - Would you consider this issue (particularly the one expressed when spark.sql.crossJoin.enabled is set to true) to be a correctness bug? I think it is, but I'd like a committer to confirm and add the appropriate label if necessary. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. 
> I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Updated] (SPARK-25324) ML 2.4 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-25324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-25324: -- Fix Version/s: 2.4.0 > ML 2.4 QA: API: Java compatibility, docs > > > Key: SPARK-25324 > URL: https://issues.apache.org/jira/browse/SPARK-25324 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Blocker > Fix For: 2.4.0 > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). 
> * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building the JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so, so we can make > this task easier in the future!
[jira] [Updated] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-25320: -- Fix Version/s: 2.4.0 > ML, Graph 2.4 QA: API: Binary incompatible changes > -- > > Key: SPARK-25320 > URL: https://issues.apache.org/jira/browse/SPARK-25320 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Blocker > Fix For: 2.4.0 > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
[ https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632137#comment-16632137 ] Michael Allman edited comment on SPARK-25561 at 9/28/18 5:08 PM: - cc [~cloud_fan] [~ekhliang] Hi [~karthik.manamcheri]. Thanks for reporting this. I can't take a look right now, but I believe we have test cases that exercise this scenario. If not, it's certainly a hole in our coverage. If we do, it may be that Hive's behavior in this scenario is version-dependent, and we don't have coverage for your version of Hive. What version of Hive are you using? Thanks. was (Author: michael): cc [~cloud_fan] [~ekhliang] Hi [~karthik.manamcheri]. Thanks for reporting this. I can't take a look right now, but I believe we have test cases that exercise this scenario. If not, it's certainly a whole in our coverage. If we do, it may be that Hive's behavior in this scenario is version-dependent, and we don't have coverage for your version of Hive. What version of Hive are you using? Thanks. > HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql > -- > > Key: SPARK-25561 > URL: https://issues.apache.org/jira/browse/SPARK-25561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Karthik Manamcheri >Priority: Major > > In HiveShim.scala, the current behavior is that if > hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter > call to succeed. If it fails, we'll throw a RuntimeException. > However, this might not always be the case. Hive's direct SQL functionality > is best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark > should handle that exception correctly if Hive falls back to ORM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
[ https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632137#comment-16632137 ] Michael Allman commented on SPARK-25561: cc [~cloud_fan] [~ekhliang] Hi [~karthik.manamcheri]. Thanks for reporting this. I can't take a look right now, but I believe we have test cases that exercise this scenario. If not, it's certainly a hole in our coverage. If we do, it may be that Hive's behavior in this scenario is version-dependent, and we don't have coverage for your version of Hive. What version of Hive are you using? Thanks. > HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql > -- > > Key: SPARK-25561 > URL: https://issues.apache.org/jira/browse/SPARK-25561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Karthik Manamcheri >Priority: Major > > In HiveShim.scala, the current behavior is that if > hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter > call to succeed. If it fails, we'll throw a RuntimeException. > However, this might not always be the case. Hive's direct SQL functionality > is best-effort, meaning it will fall back to ORM if direct SQL fails. Spark > should handle that exception correctly if Hive falls back to ORM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
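The behavior the report asks for is a standard "fast path with graceful fallback" pattern: treat a direct-SQL failure as retryable rather than fatal. A minimal Python sketch of that control flow (hypothetical function names, not Spark's actual HiveShim code):

```python
def get_partitions_by_filter(filter_expr, try_direct_sql, direct_sql, orm_fallback):
    """Try the fast direct-SQL path; on failure, fall back to the slower
    ORM path instead of raising, mirroring Hive's own best-effort behavior."""
    if try_direct_sql:
        try:
            return direct_sql(filter_expr)
        except RuntimeError:
            pass  # direct SQL failed; Hive itself would retry via ORM
    return orm_fallback(filter_expr)
```

The key point is that enabling the fast path only changes which branch is tried first, never whether a result is produced.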
[jira] [Comment Edited] (SPARK-23717) Leverage docker support in Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632100#comment-16632100 ] Eric Yang edited comment on SPARK-23717 at 9/28/18 4:33 PM: It is possible to run standalone Spark in YARN docker containers without any code modification to Spark. Here is an example yarnfile that I used to run the mesosphere-generated Docker image, and it ran fine: {code} { "name": "spark", "kerberos_principal" : { "principal_name" : "spark/_h...@example.com", "keytab" : "file:///etc/security/keytabs/spark.service.keytab" }, "version": "0.1", "components" : [ { "name": "driver", "number_of_containers": 1, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } }, { "name": "executor", "number_of_containers": 2, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh spark://driver-0.spark.spark.ycluster:7077", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "dependencies": [ "driver" ], "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } } ] } {code} The reason for the 30-second sleep is to ensure RegistryDNS has refreshed and can respond to DNS queries. The sleep could be a lot shorter, like 3 seconds; I did not spend much time trying to fine-tune the DNS wait time. A further enhancement to pass in a keytab and krb5.conf could enable access to secure HDFS; that is left as an exercise for the readers of this JIRA. 
was (Author: eyang): It is possible to run standalone Spark in YARN without any code modification to spark. Here is an example yarnfile that I used to run mesosphere generated docker image and it ran fine: {code} { "name": "spark", "kerberos_principal" : { "principal_name" : "spark/_h...@example.com", "keytab" : "file:///etc/security/keytabs/spark.service.keytab" }, "version": "0.1", "components" : [ { "name": "driver", "number_of_containers": 1, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } }, { "name": "executor", "number_of_containers": 2, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh spark://driver-0.spark.spark.ycluster:7077", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "dependencies": [ "driver" ], "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } } ] } {code} The reason for 30 seconds sleep is to ensure RegistryDNS has been refreshed and updated to respond to DNS queries. The sleep could be a lot shorter like 3 seconds. I did not spend much time to try to fine tune the DNS wait time. Further enhancement to pass in keytab and krb5.conf can enable access to secure HDFS, that would be exercise for the readers of this JIRA. 
> Leverage docker support in Hadoop 3 > --- > > Key: SPARK-23717 > URL: https://issues.apache.org/jira/browse/SPARK-23717 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.4.0 >Reporter: Mridul Muralidharan >
[jira] [Commented] (SPARK-23717) Leverage docker support in Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632100#comment-16632100 ] Eric Yang commented on SPARK-23717: --- It is possible to run standalone Spark in YARN without any code modification to Spark. Here is an example yarnfile that I used to run the mesosphere-generated Docker image, and it ran fine: {code} { "name": "spark", "kerberos_principal" : { "principal_name" : "spark/_h...@example.com", "keytab" : "file:///etc/security/keytabs/spark.service.keytab" }, "version": "0.1", "components" : [ { "name": "driver", "number_of_containers": 1, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } }, { "name": "executor", "number_of_containers": 2, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh spark://driver-0.spark.spark.ycluster:7077", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "dependencies": [ "driver" ], "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } } ] } {code} The reason for the 30-second sleep is to ensure RegistryDNS has refreshed and can respond to DNS queries. The sleep could be a lot shorter, like 3 seconds; I did not spend much time trying to fine-tune the DNS wait time. A further enhancement to pass in a keytab and krb5.conf could enable access to secure HDFS; that is left as an exercise for the readers of this JIRA. 
> Leverage docker support in Hadoop 3 > --- > > Key: SPARK-23717 > URL: https://issues.apache.org/jira/browse/SPARK-23717 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.4.0 >Reporter: Mridul Muralidharan >Priority: Major > > The introduction of docker support in Apache Hadoop 3 can be leveraged by > Apache Spark for resolving multiple long-standing shortcomings - particularly > related to package isolation. > It also allows for network isolation, where applicable, allowing for more > sophisticated cluster configuration/customization. > This JIRA will track the various tasks for enhancing Spark to leverage > container support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
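The fixed `sleep 30` in the yarnfile launch commands above only exists to give RegistryDNS time to publish the service records. A more robust alternative is to poll until the master's hostname actually resolves; a minimal sketch in Python (the hostname and timeouts are illustrative, not part of the original setup):

```python
import socket
import time

def wait_for_dns(hostname, timeout=30.0, interval=1.0):
    """Poll DNS until `hostname` resolves or `timeout` elapses.
    Returns True as soon as resolution succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            socket.getaddrinfo(hostname, None)  # any record type will do
            return True
        except socket.gaierror:
            time.sleep(interval)  # not resolvable yet; retry shortly
    return False
```

A launch wrapper could call this with e.g. `driver-0.spark.spark.ycluster` before starting the worker, replacing the fixed sleep with a bounded wait that returns as soon as RegistryDNS is ready.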
[jira] [Commented] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
[ https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632061#comment-16632061 ] Yuming Wang commented on SPARK-25553: - Thanks [~srowen] > Add EmptyInterpolatedStringChecker to scalastyle-config.xml > --- > > Key: SPARK-25553 > URL: https://issues.apache.org/jira/browse/SPARK-25553 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Minor > > h4. Justification > Empty interpolated strings are harder to read and not necessary. > > More details: > http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
[ https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20937: -- Fix Version/s: (was: 2.4.1) (was: 2.5.0) 2.4.0 > Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, > DataFrames and Datasets Guide > - > > Key: SPARK-20937 > URL: https://issues.apache.org/jira/browse/SPARK-20937 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Chenxiao Mao >Priority: Trivial > Fix For: 2.4.0 > > > As a follow-up to SPARK-20297 (and SPARK-10400) in which > {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala > and Hive, Spark SQL docs for [Parquet > Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] > should have it documented. > p.s. It was asked about in [Why can't Impala read parquet files after Spark > SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow > today. > p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance > Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table > 3-10. Parquet data source options) that gives the option some wider publicity. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25431: -- Fix Version/s: (was: 2.4.1) (was: 3.0.0) 2.4.0 > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 2.4.0 > > > There are some mistakes in examples of newly added functions. Also the format > of the example results are not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21514) Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also
[ https://issues.apache.org/jira/browse/SPARK-21514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632003#comment-16632003 ] Nick Orka commented on SPARK-21514: --- S3 recently increased its request rate limits, so eventual consistency has become a huge problem for data lakes based on S3. This approach can fix the issue because this is the exact spot where Spark jobs fail. Can you change the priority of the ticket? This is a real blocker for many data pipelines. > Hive has updated with new support for S3 and InsertIntoHiveTable.scala should > update also > - > > Key: SPARK-21514 > URL: https://issues.apache.org/jira/browse/SPARK-21514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Javier Ros >Priority: Major > > Hive has been updated with new parameters to optimize the usage of S3; you > can now avoid using S3 as the stagingdir via the parameters > hive.blobstore.supported.schemes & hive.blobstore.optimizations.enabled. > The InsertIntoHiveTable.scala file should be updated with the same > improvement to match the behavior of Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
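The two Hive parameters named in the issue are normally set in hive-site.xml. A hedged sketch of what that configuration might look like (the scheme list is illustrative; check your Hive version's defaults before copying):

```xml
<!-- hive-site.xml: allow Hive to skip the S3 staging directory and
     apply blobstore optimizations when the target scheme is listed.
     Values below are illustrative, not prescriptive. -->
<property>
  <name>hive.blobstore.supported.schemes</name>
  <value>s3,s3a,s3n</value>
</property>
<property>
  <name>hive.blobstore.optimizations.enabled</name>
  <value>true</value>
</property>
```

The issue's request is that Spark's InsertIntoHiveTable path respect the same settings rather than always staging through the blobstore.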
[jira] [Updated] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25565: -- Priority: Minor (was: Major) > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
[ https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25553. --- Resolution: Won't Fix I'd say if anything, later, instead focus on removing uses of {{"...".format(...)}} and cases like {{s"..." + foo}} which should be {{s"...$foo"}} > Add EmptyInterpolatedStringChecker to scalastyle-config.xml > --- > > Key: SPARK-25553 > URL: https://issues.apache.org/jira/browse/SPARK-25553 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Minor > > h4. Justification > Empty interpolated strings are harder to read and not necessary. > > More details: > http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
[ https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17159: Assignee: (was: Apache Spark) > Improve FileInputDStream.findNewFiles list performance > -- > > Key: SPARK-17159 > URL: https://issues.apache.org/jira/browse/SPARK-17159 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0 > Environment: spark against object stores >Reporter: Steve Loughran >Priority: Minor > > {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that > calls getFileStatus() on every file, takes the output and does listStatus() > on the output. > This is going to suffer on object stores, as dir listing and getFileStatus calls > are so expensive. It's clear this is a problem, as the method has code to > detect timeouts in the window and warn of problems. > It should be possible to make this faster -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
[ https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17159: Assignee: Apache Spark > Improve FileInputDStream.findNewFiles list performance > -- > > Key: SPARK-17159 > URL: https://issues.apache.org/jira/browse/SPARK-17159 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0 > Environment: spark against object stores >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Minor > > {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that > calls getFileStatus() on every file, takes the output and does listStatus() > on the output. > This is going to suffer on object stores, as dir listing and getFileStatus calls > are so expensive. It's clear this is a problem, as the method has code to > detect timeouts in the window and warn of problems. > It should be possible to make this faster -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
[ https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-17159: --- > Improve FileInputDStream.findNewFiles list performance > -- > > Key: SPARK-17159 > URL: https://issues.apache.org/jira/browse/SPARK-17159 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0 > Environment: spark against object stores >Reporter: Steve Loughran >Priority: Minor > > {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that > calls getFileStatus() on every file, takes the output and does listStatus() > on the output. > This is going to suffer on object stores, as dir listing and getFileStatus calls > are so expensive. It's clear this is a problem, as the method has code to > detect timeouts in the window and warn of problems. > It should be possible to make this faster -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
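The cost pattern the issue describes, one metadata call per file on top of the listing itself, can be sketched outside Spark. The point is that a single directory scan yields the same information as N per-file status calls; a Python stand-in (local filesystem used as an analogy for the Hadoop FileSystem API, where the per-call cost difference is far larger on object stores):

```python
import os

def list_with_per_file_stat(path):
    """Slow pattern: list names, then issue one stat() per entry
    (analogous to globStatus with a getFileStatus-per-file filter)."""
    return {name: os.stat(os.path.join(path, name)).st_size
            for name in os.listdir(path)}

def list_with_single_scan(path):
    """Fast pattern: one scan yields name and metadata together
    (analogous to a single listStatus call)."""
    return {entry.name: entry.stat().st_size for entry in os.scandir(path)}
```

Both return identical results; on an object store, the first turns into N+1 HTTP round trips while the second stays close to one listing request.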
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631751#comment-16631751 ] Li Yuanjian commented on SPARK-10816: - Design doc: [https://docs.google.com/document/d/1zeAc7QKSO7J4-Yk06kc76kvldl-QHLCDJuu04d7k2bg/edit?usp=sharing] PR: [https://github.com/apache/spark/pull/22583] After a rough comparison with [~kabhwan]'s posted doc and PR, we share several points in design and implementation; hope we can resolve this problem together! > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf, Session > Window Support For Structure Streaming.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-10816: Attachment: Session Window Support For Structure Streaming.pdf > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf, Session > Window Support For Structure Streaming.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631733#comment-16631733 ] Apache Spark commented on SPARK-10816: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/22583 > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25567) [Spark Job History] Table listing in SQL Tab not display Sort Icon
[ https://issues.apache.org/jira/browse/SPARK-25567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631711#comment-16631711 ] shahid commented on SPARK-25567: Thanks. I will raise a PR > [Spark Job History] Table listing in SQL Tab not display Sort Icon > -- > > Key: SPARK-25567 > URL: https://issues.apache.org/jira/browse/SPARK-25567 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > 1. spark.sql.ui.retainedExecutions = 2 > 2. Run Beeline Jobs > 3. Open SQL Tab will list SQL Queries in table > 4. ID column header does not display Sort Icon, compared to other UI tabs like > Job Id in Jobs > 5. If the user clicks the column header, sorting happens. > Expected Result: > User should be provided with a Sort Icon like other UI tabs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25567) [Spark Job History] Table listing in SQL Tab not display Sort Icon
ABHISHEK KUMAR GUPTA created SPARK-25567: Summary: [Spark Job History] Table listing in SQL Tab not display Sort Icon Key: SPARK-25567 URL: https://issues.apache.org/jira/browse/SPARK-25567 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1 Reporter: ABHISHEK KUMAR GUPTA 1. spark.sql.ui.retainedExecutions = 2 2. Run Beeline Jobs 3. Open SQL Tab will list SQL Queries in table 4. ID column header does not display Sort Icon, compared to other UI tabs like Job Id in Jobs 5. If the user clicks the column header, sorting happens. Expected Result: User should be provided with a Sort Icon like other UI tabs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25566) [Spark Job History] SQL UI Page does not support Pagination
[ https://issues.apache.org/jira/browse/SPARK-25566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631703#comment-16631703 ] shahid commented on SPARK-25566: Thanks for reporting. I am working on it. > [Spark Job History] SQL UI Page does not support Pagination > --- > > Key: SPARK-25566 > URL: https://issues.apache.org/jira/browse/SPARK-25566 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > 1. configure spark.sql.ui.retainedExecutions = 5 ( In Job History > Spark-default.conf ) > 2. Execute beeline Jobs more than 2 > 3. Open the UI page from the History Server > 4. Click SQL Tab > *Actual Output:* It shows all SQL Queries in a single page. The user has to scroll > the whole page for specific SQL Queries. > *Expected:* It should show results page-wise, as other UI > tabs like Jobs and Stages do. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25566) [Spark Job History] SQL UI Page does not support Pagination
ABHISHEK KUMAR GUPTA created SPARK-25566: Summary: [Spark Job History] SQL UI Page does not support Pagination Key: SPARK-25566 URL: https://issues.apache.org/jira/browse/SPARK-25566 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.3.1 Reporter: ABHISHEK KUMAR GUPTA 1. configure spark.sql.ui.retainedExecutions = 5 ( In Job History Spark-default.conf ) 2. Execute beeline Jobs more than 2 3. Open the UI page from the History Server 4. Click SQL Tab *Actual Output:* It shows all SQL Queries in a single page. The user has to scroll the whole page for specific SQL Queries. *Expected:* It should show results page-wise, as other UI tabs like Jobs and Stages do. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631700#comment-16631700 ] Apache Spark commented on SPARK-25505: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/22582 > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25565: Assignee: Apache Spark > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631662#comment-16631662 ] Apache Spark commented on SPARK-25565: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/22581 > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25565: Assignee: (was: Apache Spark) > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631654#comment-16631654 ] Yuming Wang commented on SPARK-25565: - Thanks [~hyukjin.kwon] Please go ahead. > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23194) from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls
[ https://issues.apache.org/jira/browse/SPARK-23194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631652#comment-16631652 ] Daniel Mateus Pires commented on SPARK-23194: - Any news on this? Not being able to set the from_json mode and use the columnNameOfCorruptRecord option is pretty limiting, and the documentation of "from_json" suggests that all the spark.read.json options are available {code:java} * @param options options to control how the json is parsed. accepts the same options as the json data source. {code} > from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls > --- > > Key: SPARK-23194 > URL: https://issues.apache.org/jira/browse/SPARK-23194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Priority: Major > > from_json accepts Json parsing options such as being PERMISSIVE to parsing > errors or failing fast. It seems from the code that even though the default > option is to fail fast, we catch that exception and return nulls. > > In order to not change behavior, we should remove that try-catch block and > change the default to permissive, but allow failfast mode to indeed fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
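The two parse modes under discussion differ only in what happens on a malformed record. A plain-Python analogy of the intended semantics (this is not Spark's implementation, just the contract the report says from_json currently breaks by always swallowing the error):

```python
import json

def parse_records(lines, mode="PERMISSIVE"):
    """PERMISSIVE yields None (a null row) for each malformed record;
    FAILFAST re-raises on the first malformed record it hits."""
    out = []
    for line in lines:
        try:
            out.append(json.loads(line))
        except json.JSONDecodeError:
            if mode == "FAILFAST":
                raise  # fail fast: surface the parse error to the caller
            out.append(None)  # permissive: placeholder for the bad record
    return out
```

The bug report amounts to saying that from_json behaves like the PERMISSIVE branch regardless of which mode the caller requests.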
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631605#comment-16631605 ] Kazuaki Ishizaki commented on SPARK-25538: -- Thanks for uploading the schema. I have looked at it, but I am still not sure about the cause of this problem. I would appreciate it if you could find input data that reproduces the problem. > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it before that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. 
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631575#comment-16631575 ] Hyukjin Kwon commented on SPARK-25565: -- I am taking a look at this. I will open a PR shortly if you don't mind. > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25508) Refactor OrcReadBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25508: Assignee: Apache Spark > Refactor OrcReadBenchmark to use main method > > > Key: SPARK-25508 > URL: https://issues.apache.org/jira/browse/SPARK-25508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25508) Refactor OrcReadBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25508: Assignee: (was: Apache Spark) > Refactor OrcReadBenchmark to use main method > > > Key: SPARK-25508 > URL: https://issues.apache.org/jira/browse/SPARK-25508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25508) Refactor OrcReadBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631573#comment-16631573 ] Apache Spark commented on SPARK-25508: -- User 'yucai' has created a pull request for this issue: https://github.com/apache/spark/pull/22580 > Refactor OrcReadBenchmark to use main method > > > Key: SPARK-25508 > URL: https://issues.apache.org/jira/browse/SPARK-25508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631568#comment-16631568 ] ice bai commented on SPARK-21774: - I ran into the same problem in Spark 2.3.0. The following are some tests: ``` spark-sql> select ''>0; true Time taken: 0.078 seconds, Fetched 1 row(s) spark-sql> select ''>0; NULL Time taken: 0.065 seconds, Fetched 1 row(s) spark-sql> select '1.0'=1; true Time taken: 0.054 seconds, Fetched 1 row(s) spark-sql> select '1.2'=1; true Time taken: 0.07 seconds, Fetched 1 row(s) ``` When I set the log level to trace, I found this: === Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings === !'Project [unresolvedalias((> 0), None)] 'Project [unresolvedalias((cast( as int) > 0), None)] +- OneRowRelation +- OneRowRelation > The rule PromoteStrings cast string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: StanZhai >Priority: Critical > Labels: correctness > > Data > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result which is wrong: > {code} > ++---+ > | a| b| > ++---+ > | 0| 1| > |-0.1| 2| > ++---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) >+- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
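The results in this thread are consistent with the string operand being cast to int before the comparison. A Python analogue (not Spark's Cast implementation; it merely reproduces the observed truncate-or-null behavior) shows why '1.2' = 1 evaluates to true under an int cast, while a double cast would keep the values distinct:

```python
def cast_to_int(s):
    """Mimic a lossy CAST(string AS INT): truncate numerics, NULL otherwise."""
    try:
        return int(float(s))   # '1.2' -> 1, '-0.1' -> 0
    except ValueError:
        return None            # '' or 'abc' -> NULL

def cast_to_double(s):
    """Mimic CAST(string AS DOUBLE), which preserves the fractional part."""
    try:
        return float(s)
    except ValueError:
        return None

# With the int cast, distinct strings collapse onto the same integer:
assert cast_to_int("1.2") == 1     # so '1.2' = 1 is true
assert cast_to_int("-0.1") == 0    # so WHERE a = 0 also matches '-0.1'

# Casting to double instead keeps the values apart:
assert cast_to_double("1.2") != 1.0
assert cast_to_double("-0.1") != 0.0
```

This is why promoting the string side to a wider numeric type (rather than int) avoids the wrong matches shown in the issue description.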
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631550#comment-16631550 ] Jungtaek Lim commented on SPARK-24630: -- I think it would be better to describe actual queries (either a single query or scenarios composed of multiple queries) that structured streaming cannot express but the new proposal can, so that everyone can see the benefit of supporting this. > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
Yuming Wang created SPARK-25565: --- Summary: Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls Key: SPARK-25565 URL: https://issues.apache.org/jira/browse/SPARK-25565 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 2.5.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-25565: Component/s: (was: Block Manager) Build > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631513#comment-16631513 ] Apache Spark commented on SPARK-25429: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22579 > SparkListenerBus inefficient due to > 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure > > > Key: SPARK-25429 > URL: https://issues.apache.org/jira/browse/SPARK-25429 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: DENG FEI >Priority: Major > > {code:java} > private def updateStageMetrics( > stageId: Int, > attemptId: Int, > taskId: Long, > accumUpdates: Seq[AccumulableInfo], > succeeded: Boolean): Unit = { > Option(stageMetrics.get(stageId)).foreach { metrics => > if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) { > return > } > val oldTaskMetrics = metrics.taskMetrics.get(taskId) > if (oldTaskMetrics != null && oldTaskMetrics.succeeded) { > return > } > val updates = accumUpdates > .filter { acc => acc.update.isDefined && > metrics.accumulatorIds.contains(acc.id) } > .sortBy(_.id) > if (updates.isEmpty) { > return > } > val ids = new Array[Long](updates.size) > val values = new Array[Long](updates.size) > updates.zipWithIndex.foreach { case (acc, idx) => > ids(idx) = acc.id > // In a live application, accumulators have Long values, but when > reading from event > // logs, they have String values. For now, assume all accumulators > are Long and covert > // accordingly. > values(idx) = acc.update.get match { > case s: String => s.toLong > case l: Long => l > case o => throw new IllegalArgumentException(s"Unexpected: $o") > } > } > // TODO: storing metrics by task ID can cause metrics for the same task > index to be > // counted multiple times, for example due to speculation or > re-attempts. 
> metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, > succeeded)) > } > } > {code} > In {{metrics.accumulatorIds.contains(acc.id)}}: if a large SQL application generates > many accumulators, using Array#contains here is inefficient. > In practice, the application may time out while shutting down and get killed by the RM in YARN > mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
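The complaint in this issue is the linear scan that `Array#contains` performs for every accumulator update. A Python sketch of the same access pattern (hypothetical names; the real structure is `LiveStageMetrics#accumulatorIds` in Scala) shows that swapping the array for a hash set preserves the result while replacing O(n) membership tests with O(1) lookups:

```python
# Filtering accumulator updates against the stage's known accumulator IDs.
# With a list, each membership test scans the whole collection (O(n));
# with a set, it is a hash lookup (O(1) on average).

accumulator_ids = list(range(10_000))        # analogue of accumulatorIds: Array[Long]
accumulator_id_set = set(accumulator_ids)    # proposed replacement structure

updates = [{"id": i, "update": i * 2} for i in range(0, 10_000, 7)]

slow = [u for u in updates if u["id"] in accumulator_ids]      # O(n) per test
fast = [u for u in updates if u["id"] in accumulator_id_set]   # O(1) per test

assert slow == fast  # same result, very different cost at scale
```

With thousands of accumulators and an update batch per task, the list version does millions of comparisons where the set version does one hash probe each.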
[jira] [Commented] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631508#comment-16631508 ] Apache Spark commented on SPARK-25564: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/22578 > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25564: Assignee: Apache Spark > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631507#comment-16631507 ] Apache Spark commented on SPARK-25564: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/22578 > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25564: Assignee: (was: Apache Spark) > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25554) Avro logical types get ignored in SchemaConverters.toSqlType
[ https://issues.apache.org/jira/browse/SPARK-25554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631505#comment-16631505 ] Liang-Chi Hsieh commented on SPARK-25554: - hmm, I think Spark 2.4 should have comprehensive support for Avro logical types. {code:java} { "type" : "record", "name" : "name", "namespace" : "namespace", "doc" : "docs", "fields" : [ { "name" : "field1", "type" : [ "null", { "type" : "int", "logicalType" : "date" } ], "doc" : "doc" } ] }{code} The DataFrame schema for the above Avro file: {code} root |-- field1: date (nullable = true) {code} From your attached maven dependencies, it looks like you are using {{spark-avro}} and Spark 2.3? So I think it might be an issue in {{spark-avro}}. > Avro logical types get ignored in SchemaConverters.toSqlType > > > Key: SPARK-25554 > URL: https://issues.apache.org/jira/browse/SPARK-25554 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Below are the maven dependencies: > {code:java} > > org.apache.avro > avro > 1.8.2 > > > com.databricks > spark-avro_2.11 > 4.0.0 > > > > org.apache.spark > spark-core_2.11 > 2.3.0 > > > org.apache.spark > spark-sql_2.11 > 2.3.0 > > {code} >Reporter: Yanan Li >Priority: Major > > Having an Avro schema defined as follows: > {code:java} > { >"namespace": "com.xxx.avro", >"name": "Book", >"type": "record", >"fields": [{ > "name": "name", > "type": ["null", "string"], > "default": null > }, { > "name": "author", > "type": ["null", "string"], > "default": null > }, { > "name": "published_date", > "type": ["null", {"type": "int", "logicalType": "date"}], > "default": null > } >] > } > {code} > In the Spark schema converted from the above Avro schema, the logical type "date" gets > ignored. 
> {code:java} > StructType(StructField(name,StringType,true),StructField(author,StringType,true),StructField(published_date,IntegerType,true)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
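The conversion the reporter expected can be sketched as a converter that consults the `logicalType` attribute before falling back to the physical type (a simplified, hypothetical converter in Python; the real one is `SchemaConverters.toSqlType` in spark-avro/Spark):

```python
# Map an Avro field type to a SQL type name, honoring logicalType.
# The buggy behavior ignores "logicalType" and returns the physical type.
LOGICAL_TO_SQL = {"date": "date", "timestamp-millis": "timestamp",
                  "timestamp-micros": "timestamp", "decimal": "decimal"}
PHYSICAL_TO_SQL = {"int": "integer", "long": "long", "string": "string"}

def to_sql_type(avro_type):
    # Unions like ["null", {...}] mark nullable fields; strip the null branch.
    if isinstance(avro_type, list):
        branches = [t for t in avro_type if t != "null"]
        return to_sql_type(branches[0])
    if isinstance(avro_type, dict):
        logical = avro_type.get("logicalType")
        if logical in LOGICAL_TO_SQL:
            return LOGICAL_TO_SQL[logical]   # the step the bug skips
        return PHYSICAL_TO_SQL[avro_type["type"]]
    return PHYSICAL_TO_SQL[avro_type]

# The published_date field from the report: int + logicalType=date -> date
field = ["null", {"type": "int", "logicalType": "date"}]
print(to_sql_type(field))  # date
```

Skipping the `logicalType` lookup is what yields `IntegerType` for `published_date` in the reported StructType.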
[jira] [Assigned] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25505: --- Assignee: Maryann Xue > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25505: Fix Version/s: 2.4.0 > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25505. - Resolution: Fixed > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
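The reordering in this ticket can be reproduced with a plain-Python analogue of collecting the grouping columns (a hypothetical sketch, not Spark's code): gathering the non-pivoted columns into a sorted collection loses the input order, while a simple order-preserving filter yields the expected output.

```python
input_cols = ["a", "z", "b", "y", "c", "x", "d", "w", "earnings", "course"]
pivoted = {"course", "earnings"}  # columns consumed by the PIVOT clause

# Buggy: collecting the remaining (grouping) columns into a sorted structure
buggy_order = sorted(c for c in input_cols if c not in pivoted)

# Fixed: keep the grouping columns in their original (input) order
fixed_order = [c for c in input_cols if c not in pivoted]

print(buggy_order)  # ['a', 'b', 'c', 'd', 'w', 'x', 'y', 'z']
print(fixed_order)  # ['a', 'z', 'b', 'y', 'c', 'x', 'd', 'w']
```

The two printed lists match the "now" and "should be" orders quoted in the issue description.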
[jira] [Updated] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-25564: --- Summary: Add output bytes metrics for each Executor (was: LiveExecutor misses the OutputBytes metrics) > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25564) LiveExecutor misses the OutputBytes metrics
Lantao Jin created SPARK-25564: -- Summary: LiveExecutor misses the OutputBytes metrics Key: SPARK-25564 URL: https://issues.apache.org/jira/browse/SPARK-25564 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Reporter: Lantao Jin LiveExecutor only tracks the total input bytes, but the total output bytes for each executor are equally important. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
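The proposed change amounts to tracking an output-bytes counter alongside the existing input-bytes counter per executor. A minimal Python sketch with hypothetical field names (the real class is Spark's `LiveExecutor` in Scala):

```python
class LiveExecutorSummary:
    """Minimal analogue of a live executor's per-executor byte counters."""

    def __init__(self):
        self.total_input_bytes = 0
        self.total_output_bytes = 0   # the proposed additional metric

    def on_task_end(self, input_bytes, output_bytes):
        # Aggregate both directions of I/O as tasks complete on this executor.
        self.total_input_bytes += input_bytes
        self.total_output_bytes += output_bytes

ex = LiveExecutorSummary()
ex.on_task_end(input_bytes=1024, output_bytes=512)
ex.on_task_end(input_bytes=2048, output_bytes=256)
print(ex.total_input_bytes, ex.total_output_bytes)  # 3072 768
```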
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631398#comment-16631398 ] Li Yuanjian commented on SPARK-10816: - Many thanks to [~kabhwan] for notifying me; I have just linked SPARK-22565 as a duplicate of this. Sorry, I had only searched for "session window" before and missed this one; I will keep looking for other duplicate jiras. As discussed in SPARK-22565, we also hit this problem while migrating streaming apps running on other systems to Structured Streaming. We solved it by implementing the session window as a built-in function, and shipped an internal beta version based on Apache Spark 2.3.0 a week ago. After it ran stably in a real production environment, we started cleaning up the code and translating the docs. As discussed with Jungtaek, we would also like to join the discussion here and will post a PR and design doc today. The preview PR I'll submit contains others' patches. cc [~liulinhong] [~ivoson] [~yanlin-Lynn] [~LiangchangZ] , please watch this issue. > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org