[jira] [Commented] (SPARK-25572) SparkR tests failed on CRAN on Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632796#comment-16632796 ]

Apache Spark commented on SPARK-25572:
--------------------------------------

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/22589

> SparkR tests failed on CRAN on Java 10
> --------------------------------------
>
>                 Key: SPARK-25572
>                 URL: https://issues.apache.org/jira/browse/SPARK-25572
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.4.0
>            Reporter: Felix Cheung
>            Assignee: Felix Cheung
>            Priority: Major
>
> Follow-up to SPARK-24255.
> From the 2.3.2 release we can see that CRAN doesn't seem to respect the system
> requirements when running tests - we have seen cases where SparkR is run on
> Java 10, which unfortunately Spark does not start on. For 2.4, let's attempt
> skipping all tests.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25572) SparkR tests failed on CRAN on Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-25572:
---------------------------------
    Summary: SparkR tests failed on CRAN on Java 10  (was: SparkR to skip tests because Java 10)
[jira] [Assigned] (SPARK-25572) SparkR to skip tests because Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25572:
------------------------------------
    Assignee: Apache Spark  (was: Felix Cheung)
[jira] [Commented] (SPARK-25572) SparkR to skip tests because Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632795#comment-16632795 ]

Apache Spark commented on SPARK-25572:
--------------------------------------

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/22589
[jira] [Assigned] (SPARK-25572) SparkR to skip tests because Java 10
[ https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25572:
------------------------------------
    Assignee: Felix Cheung  (was: Apache Spark)
[jira] [Created] (SPARK-25572) SparkR to skip tests because Java 10
Felix Cheung created SPARK-25572:
------------------------------------

             Summary: SparkR to skip tests because Java 10
                 Key: SPARK-25572
                 URL: https://issues.apache.org/jira/browse/SPARK-25572
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 2.4.0
            Reporter: Felix Cheung
            Assignee: Felix Cheung
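The skip decision above hinges on detecting the running Java version up front. As a rough sketch of the kind of check involved (illustrative parsing only; SparkR's actual fix is written in R and these helper names are hypothetical):

```scala
// Hypothetical helper: extract the Java major version from a
// "java.version"-style string, handling both the pre-9 scheme
// ("1.8.0_181") and the post-9 scheme ("10.0.2"), so a test harness
// can skip everything on an unsupported runtime.
def javaMajorVersion(version: String): Int = {
  val parts = version.split("\\.")
  if (parts(0) == "1") parts(1).toInt            // "1.8.0_181" -> 8
  else parts(0).takeWhile(_.isDigit).toInt       // "10.0.2"    -> 10
}

// Spark 2.x starts only on Java 8, so skip on anything newer.
def shouldSkipTests(version: String): Boolean = javaMajorVersion(version) > 8
```

In a real run the version string would come from `System.getProperty("java.version")` (or the equivalent in R); the parsing above only illustrates the two numbering schemes in play.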
[jira] [Commented] (SPARK-25571) Add withColumnsRenamed method to Dataset
[ https://issues.apache.org/jira/browse/SPARK-25571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632777#comment-16632777 ]

Chaerim Yeo commented on SPARK-25571:
-------------------------------------

I'm working on it now.

> Add withColumnsRenamed method to Dataset
> ----------------------------------------
>
>                 Key: SPARK-25571
>                 URL: https://issues.apache.org/jira/browse/SPARK-25571
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Chaerim Yeo
>            Priority: Major
>
> There are two general approaches to renaming several columns:
> * using the *withColumnRenamed* method
> * using the *select* method
> {code}
> // Using withColumnRenamed
> ds.withColumnRenamed("first_name", "firstName")
>   .withColumnRenamed("last_name", "lastName")
>   .withColumnRenamed("postal_code", "postalCode")
>
> // Using select
> ds.select(
>   $"id",
>   $"first_name" as "firstName",
>   $"last_name" as "lastName",
>   $"address",
>   $"postal_code" as "postalCode"
> )
> {code}
> However, both approaches are inefficient and redundant due to the following
> limitations:
> * withColumnRenamed: it must be called once per renamed column
> * select: all columns, including the unrenamed ones, must be passed to it
> It is necessary to implement a new method, such as *withColumnsRenamed*, which
> can rename many columns at once:
> {code}
> ds.withColumnsRenamed(
>   "first_name" -> "firstName",
>   "last_name" -> "lastName",
>   "postal_code" -> "postalCode"
> )
>
> // or
> ds.withColumnsRenamed(Map(
>   "first_name" -> "firstName",
>   "last_name" -> "lastName",
>   "postal_code" -> "postalCode"
> ))
> {code}
[jira] [Created] (SPARK-25571) Add withColumnsRenamed method to Dataset
Chaerim Yeo created SPARK-25571:
-----------------------------------

             Summary: Add withColumnsRenamed method to Dataset
                 Key: SPARK-25571
                 URL: https://issues.apache.org/jira/browse/SPARK-25571
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.2
            Reporter: Chaerim Yeo
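Until such a method exists, the proposed behavior can be approximated by folding the rename pairs through the existing single-column `withColumnRenamed`. A minimal model of that idea (a toy `ColumnSet` stands in for `Dataset` so the sketch runs without a Spark dependency; only `Dataset.withColumnRenamed` itself is a real API):

```scala
// Toy stand-in for Dataset: tracks only column names.
final case class ColumnSet(columns: Seq[String]) {
  // Mirrors Dataset.withColumnRenamed: a no-op when the column is absent.
  def withColumnRenamed(existing: String, newName: String): ColumnSet =
    ColumnSet(columns.map(c => if (c == existing) newName else c))

  // The proposed bulk rename, expressed as a fold over the single rename.
  def withColumnsRenamed(renames: (String, String)*): ColumnSet =
    renames.foldLeft(this) { case (ds, (from, to)) =>
      ds.withColumnRenamed(from, to)
    }
}

val ds = ColumnSet(Seq("id", "first_name", "last_name", "address", "postal_code"))
val renamed = ds.withColumnsRenamed(
  "first_name"  -> "firstName",
  "last_name"   -> "lastName",
  "postal_code" -> "postalCode")
```

The same `foldLeft` pattern works on a real `Dataset` today: `renames.foldLeft(ds) { case (d, (from, to)) => d.withColumnRenamed(from, to) }`, which is essentially what a built-in `withColumnsRenamed` would wrap.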
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632771#comment-16632771 ]

Apache Spark commented on SPARK-25262:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22588

> Make Spark local dir volumes configurable with Spark on Kubernetes
> ------------------------------------------------------------------
>
>                 Key: SPARK-25262
>                 URL: https://issues.apache.org/jira/browse/SPARK-25262
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Rob Vesse
>            Priority: Major
>
> As discussed during review of the design document for SPARK-24434: while
> pod templates will provide more in-depth customisation for Spark on
> Kubernetes, there are some things that cannot be modified because Spark code
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}},
> which is done by {{LocalDirsFeatureStep.scala}}. For each directory
> specified, or for a single default if there is no explicit specification, it creates
> a Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation,
> this will be backed by the node storage
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some
> compute environments this may be extremely undesirable. For example, with
> diskless compute resources the node storage will likely be a non-performant
> remote mounted disk, often with limited capacity. For such environments it
> would likely be better to set {{medium: Memory}} on the volume, per the K8S
> documentation, to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different
> volume type to back the local directories, and there is currently no way to do
> that.
> Pod templates will not really solve either of these issues, because Spark is
> always going to attempt to generate a new volume for each local directory and
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}}-backed {{emptyDir}}
> volumes
> * Modify the logic to check if there is a volume already defined with the
> name, and if so skip generating a volume definition for it
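The two proposed changes can be sketched as one small pure function: generate a volume per local dir, honour a tmpfs flag, and skip any volume name that is already defined (e.g. by a pod template). All names below are illustrative; this is not the actual `LocalDirsFeatureStep` code:

```scala
// Toy model of a K8S volume: medium "Memory" selects tmpfs for emptyDir.
final case class Volume(name: String, medium: String)

// Build volumes for spark.local.dirs, skipping names already defined
// elsewhere and optionally backing the generated ones with tmpfs.
def localDirVolumes(
    localDirs: Seq[String],
    existingVolumeNames: Set[String],
    useTmpfs: Boolean): Seq[Volume] = {
  val medium = if (useTmpfs) "Memory" else "" // "" = node-storage default
  localDirs.indices.collect {
    case i if !existingVolumeNames.contains(s"spark-local-dir-$i") =>
      Volume(s"spark-local-dir-$i", medium)
  }
}
```

The key point is the `existingVolumeNames` check: a user-supplied volume (of any type) with a matching name suppresses Spark's generated `emptyDir`, which is exactly the escape hatch the second bullet asks for.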
[jira] [Resolved] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-25570.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.1
                   2.3.3
                   2.5.0

Issue resolved by pull request 22587
[https://github.com/apache/spark/pull/22587]

> Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
> ------------------------------------------------------------
>
>                 Key: SPARK-25570
>                 URL: https://issues.apache.org/jira/browse/SPARK-25570
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.3.3, 2.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 2.5.0, 2.3.3, 2.4.1
>
> This issue aims to prevent test slowdowns in HiveExternalCatalogVersionsSuite
> by using the latest Spark 2.3.2, because the Apache mirror will eventually
> remove the old Spark 2.3.1. HiveExternalCatalogVersionsSuite will not fail,
> because SPARK-24813 implements a fallback logic, but it causes many retries in
> all builds over `branch-2.3/branch-2.4/master`, so we had better fix it.
[jira] [Assigned] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-25570:
------------------------------------
    Assignee: Dongjoon Hyun
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632729#comment-16632729 ]

Steven Rand commented on SPARK-25538:
-------------------------------------

[~kiszk] that makes sense, I'll try to do so. The issue I've been having so far is that when I run the UDF I've written to change the data (while preserving the number of duplicate rows), the resulting DataFrame doesn't reproduce the issue.

> incorrect row counts after distinct()
> -------------------------------------
>
>                 Key: SPARK-25538
>                 URL: https://issues.apache.org/jira/browse/SPARK-25538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Reproduced on a CentOS 7 VM and from source in IntelliJ
> on OS X.
>            Reporter: Steven Rand
>            Priority: Major
>              Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after
> SPARK-23713. It's possible that other operations are affected as well;
> {{distinct}} just happens to be the one that we noticed. I believe that this
> issue was introduced by SPARK-23713 because I can't reproduce it before that
> commit, and I've been able to reproduce it after that commit as well as with
> {{tags/v2.4.0-rc1}}.
> Below are example spark-shell sessions to illustrate the problem.
> Unfortunately the data used in these examples can't be uploaded to this Jira
> ticket. I'll try to create test data which also reproduces the issue, and
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
>
> scala> df.count
> res0: Long = 123
>
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
>
> scala> df.count
> res0: Long = 123
>
> scala> df.distinct.count
> res1: Long = 116
>
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
>
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}
[jira] [Resolved] (SPARK-25559) Just remove the unsupported predicates in Parquet
[ https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-25559.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.5.0

Issue resolved by pull request 22574
[https://github.com/apache/spark/pull/22574]

> Just remove the unsupported predicates in Parquet
> -------------------------------------------------
>
>                 Key: SPARK-25559
>                 URL: https://issues.apache.org/jira/browse/SPARK-25559
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Major
>             Fix For: 2.5.0
>
> Currently, in *ParquetFilters*, if one of the child predicates is not
> supported by Parquet, the entire predicate is thrown away. In fact, if the
> unsupported predicate is in the top-level *And* condition, or in a child
> reached before hitting a *Not* or *Or* condition, it's safe to just remove
> the unsupported one and report it as an unhandled filter.
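The rule described above can be illustrated on a toy predicate tree: unsupported leaves under top-level `And` conjunctions can be dropped individually, while anything under `Or` or `Not` must be fully supported or discarded whole (dropping a disjunct or a negated child would change which rows match). This models the idea only; it is not the `ParquetFilters` implementation:

```scala
sealed trait Pred
final case class Leaf(name: String, supported: Boolean) extends Pred
final case class And(left: Pred, right: Pred) extends Pred
final case class Or(left: Pred, right: Pred) extends Pred
final case class Not(child: Pred) extends Pred

def fullySupported(p: Pred): Boolean = p match {
  case Leaf(_, s) => s
  case And(l, r)  => fullySupported(l) && fullySupported(r)
  case Or(l, r)   => fullySupported(l) && fullySupported(r)
  case Not(c)     => fullySupported(c)
}

// Returns the pushable remainder of the predicate, or None if nothing survives.
def prune(p: Pred): Option[Pred] = p match {
  case l: Leaf => if (l.supported) Some(l) else None
  case And(l, r) =>
    (prune(l), prune(r)) match {
      case (Some(a), Some(b)) => Some(And(a, b))
      case (a, b)             => a.orElse(b) // keep whichever conjunct survived
    }
  // Under Or/Not, dropping a child would change query results, so the
  // whole subtree must be supported to be pushed down at all.
  case other => if (fullySupported(other)) Some(other) else None
}
```

Pruning is safe for conjuncts because `a AND b` is only ever narrowed by `a` alone; the dropped conjunct simply gets re-evaluated by Spark after the scan.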
[jira] [Commented] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632667#comment-16632667 ]

Apache Spark commented on SPARK-25570:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22587
[jira] [Assigned] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25570:
------------------------------------
    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25570:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632666#comment-16632666 ]

Apache Spark commented on SPARK-25570:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22587
[jira] [Created] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
Dongjoon Hyun created SPARK-25570:
-------------------------------------

             Summary: Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
                 Key: SPARK-25570
                 URL: https://issues.apache.org/jira/browse/SPARK-25570
             Project: Spark
          Issue Type: Bug
          Components: SQL, Tests
    Affects Versions: 2.3.3, 2.4.0
            Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-25449) Don't send zero accumulators in heartbeats
[ https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-25449.
----------------------------------
       Resolution: Fixed
         Assignee: Mukul Murthy
    Fix Version/s: 2.5.0

> Don't send zero accumulators in heartbeats
> ------------------------------------------
>
>                 Key: SPARK-25449
>                 URL: https://issues.apache.org/jira/browse/SPARK-25449
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Mukul Murthy
>            Assignee: Mukul Murthy
>            Priority: Major
>             Fix For: 2.5.0
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics
> and are generally on the order of a few KBs. However, for large jobs with
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks
> to die with heartbeat failures. We can mitigate this by not sending zero
> metrics to the driver.
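The mitigation amounts to filtering metric values before they are packed into the heartbeat message. A minimal sketch with hypothetical names (the real change lives in the executor heartbeat path, not in a standalone helper like this):

```scala
// Hypothetical per-accumulator metric carried in a heartbeat.
final case class Metric(name: String, value: Long)

// Keep only metrics that have actually moved off zero; for jobs with many
// tasks and many accumulators this is what shrinks a heartbeat from tens
// of MBs back toward a few KBs.
def heartbeatPayload(metrics: Seq[Metric]): Seq[Metric] =
  metrics.filter(_.value != 0L)
```

The driver can treat an absent metric as zero, so dropping zero entries is lossless for aggregation while cutting the serialized message size.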
[jira] [Resolved] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-25429.
-----------------------------
       Resolution: Fixed
         Assignee: Yuming Wang
    Fix Version/s: 2.5.0

> SparkListenerBus inefficient due to
> 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
> ------------------------------------------------------------------
>
>                 Key: SPARK-25429
>                 URL: https://issues.apache.org/jira/browse/SPARK-25429
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: DENG FEI
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 2.5.0
>
> {code:java}
> private def updateStageMetrics(
>     stageId: Int,
>     attemptId: Int,
>     taskId: Long,
>     accumUpdates: Seq[AccumulableInfo],
>     succeeded: Boolean): Unit = {
>   Option(stageMetrics.get(stageId)).foreach { metrics =>
>     if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) {
>       return
>     }
>
>     val oldTaskMetrics = metrics.taskMetrics.get(taskId)
>     if (oldTaskMetrics != null && oldTaskMetrics.succeeded) {
>       return
>     }
>
>     val updates = accumUpdates
>       .filter { acc => acc.update.isDefined && metrics.accumulatorIds.contains(acc.id) }
>       .sortBy(_.id)
>
>     if (updates.isEmpty) {
>       return
>     }
>
>     val ids = new Array[Long](updates.size)
>     val values = new Array[Long](updates.size)
>     updates.zipWithIndex.foreach { case (acc, idx) =>
>       ids(idx) = acc.id
>       // In a live application, accumulators have Long values, but when reading from event
>       // logs, they have String values. For now, assume all accumulators are Long and convert
>       // accordingly.
>       values(idx) = acc.update.get match {
>         case s: String => s.toLong
>         case l: Long => l
>         case o => throw new IllegalArgumentException(s"Unexpected: $o")
>       }
>     }
>
>     // TODO: storing metrics by task ID can cause metrics for the same task index to be
>     // counted multiple times, for example due to speculation or re-attempts.
>     metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, succeeded))
>   }
> }
> {code}
> Note the 'metrics.accumulatorIds.contains(acc.id)' call: if a large SQL application
> generates many accumulators, using Array#contains for each lookup is inefficient.
> In practice, the application may time out while shutting down and be killed by the
> RM in YARN mode.
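The complaint is that `Array[Long]#contains` is a linear scan, so the filter above costs O(updates × ids) per task. Converting the ids to a `Set` once per stage turns each lookup into a hash probe. A self-contained illustration of the difference (not the listener code itself):

```scala
// Minimal stand-in for the accumulator update record.
final case class AccumulableInfo(id: Long, update: Option[Long])

val accumulatorIds: Array[Long] = Array.tabulate(10000)(_.toLong)
val idSet: Set[Long] = accumulatorIds.toSet // built once, reused per update

val updates = Seq(AccumulableInfo(5L, Some(1L)), AccumulableInfo(999999L, Some(2L)))

// Linear scan per update: O(ids) for every accumulator checked.
val viaArray = updates.filter(a => a.update.isDefined && accumulatorIds.contains(a.id))
// Hash lookup per update: effectively O(1) each, same result.
val viaSet = updates.filter(a => a.update.isDefined && idSet.contains(a.id))
```

For a stage with thousands of accumulators and thousands of task updates, the one-time `toSet` conversion is cheap relative to the repeated scans it eliminates, which is the essence of the fix described above.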
[jira] [Resolved] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE
[ https://issues.apache.org/jira/browse/SPARK-25458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-25458.
-----------------------------
       Resolution: Fixed
         Assignee: Dilip Biswal
    Fix Version/s: 2.5.0

> Support FOR ALL COLUMNS in ANALYZE TABLE
> -----------------------------------------
>
>                 Key: SPARK-25458
>                 URL: https://issues.apache.org/jira/browse/SPARK-25458
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.5.0
>            Reporter: Xiao Li
>            Assignee: Dilip Biswal
>            Priority: Major
>             Fix For: 2.5.0
>
> Currently, to collect the statistics of all the columns, users need to
> specify the names of all the columns when calling the command "ANALYZE TABLE
> ... FOR COLUMNS...". This is not user friendly. Instead, we can introduce the
> following SQL command to achieve it without specifying the column names:
> {code:java}
>    ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS;
> {code}
[jira] [Created] (SPARK-25569) Failing a Spark job when an accumulator cannot be updated
Shixiong Zhu created SPARK-25569:
------------------------------------

             Summary: Failing a Spark job when an accumulator cannot be updated
                 Key: SPARK-25569
                 URL: https://issues.apache.org/jira/browse/SPARK-25569
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Shixiong Zhu

Currently, when Spark fails to merge an accumulator update from a task, it will not fail the task (see https://github.com/apache/spark/blob/b7d80349b0e367d78cab238e62c2ec353f0f12b3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1266), so an accumulator update failure may be ignored silently. Some users may want to use accumulators for business-critical purposes and would like a job to fail when an accumulator is broken. We can add a flag to always fail a Spark job when hitting an accumulator failure, or we can add a new property to an accumulator and only fail a Spark job when such an accumulator fails.
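The flag-based variant could be sketched as follows: today the exception path is swallowed; with the flag set, it would propagate and fail the job. All names here are hypothetical illustrations, not the DAGScheduler API:

```scala
// Attempt an accumulator merge. When failJobOnError is false (today's
// behavior) the exception is swallowed and returned so the caller can log
// it; when true, it propagates so the scheduler fails the job.
def mergeAccumulatorUpdate(merge: => Unit, failJobOnError: Boolean): Option[Throwable] =
  try { merge; None }
  catch {
    case e: Exception if !failJobOnError => Some(e) // log-and-ignore path
  }
```

The per-accumulator variant mentioned at the end would move the `failJobOnError` decision from a global flag onto a property of the accumulator being merged, so only accumulators explicitly marked as critical can fail the job.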
[jira] [Commented] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632568#comment-16632568 ]

Dongjoon Hyun commented on SPARK-25542:
---------------------------------------

I marked this as 2.4.1 because we are in the middle of the RC2 vote. cc [~cloud_fan]

> Flaky test: OpenHashMapSuite
> ----------------------------
>
>                 Key: SPARK-25542
>                 URL: https://issues.apache.org/jira/browse/SPARK-25542
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.5.0
>            Reporter: Dongjoon Hyun
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>             Fix For: 2.4.1
>
> - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96585/testReport/org.apache.spark.util.collection/OpenHashMapSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/]
> (Sep 25, 2018 5:52:56 PM)
> {code:java}
> org.apache.spark.util.collection.OpenHashMapSuite.(It is not a test it is a sbt.testing.SuiteSelector)
> Failing for the past 1 build (Since #96585 )
> Took 0 ms.
> Error Message
> sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
> Stacktrace
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
> 	at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:117)
> 	at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:115)
> 	at org.apache.spark.util.collection.OpenHashMap$$anonfun$1.apply$mcVI$sp(OpenHashMap.scala:159)
> 	at org.apache.spark.util.collection.OpenHashSet.rehash(OpenHashSet.scala:234)
> 	at org.apache.spark.util.collection.OpenHashSet.rehashIfNeeded(OpenHashSet.scala:171)
> 	at org.apache.spark.util.collection.OpenHashMap$mcI$sp.update$mcI$sp(OpenHashMap.scala:86)
> 	at org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17$$anonfun$apply$4.apply$mcVI$sp(OpenHashMapSuite.scala:192)
> 	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> 	at org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:191)
> 	at org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:188)
> 	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> 	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> 	at org.scalatest.Transformer.apply(Transformer.scala:22)
> 	at org.scalatest.Transformer.apply(Transformer.scala:20)
> 	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> 	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
> 	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
> 	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> 	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
> 	at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
> 	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
> 	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> 	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
> 	at scala.collection.immutable.List.foreach(List.scala:392)
> 	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> 	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
> 	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
> 	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
> 	at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> {code}
[jira] [Updated] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25542:
----------------------------------
    Fix Version/s:     (was: 2.4.0)
                   2.4.1
[jira] [Assigned] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-25542:
-------------------------------------
    Assignee: Liang-Chi Hsieh
[jira] [Resolved] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-25542.
-----------------------------------
    Resolution: Fixed

Resolved via https://github.com/apache/spark/pull/22569
[jira] [Updated] (SPARK-25542) Flaky test: OpenHashMapSuite
[ https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25542:
----------------------------------
    Fix Version/s: 2.4.0
[jira] [Assigned] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
[ https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25568:
------------------------------------
    Assignee: Apache Spark  (was: Shixiong Zhu)

> Continue to update the remaining accumulators when failing to update one accumulator
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-25568
>                 URL: https://issues.apache.org/jira/browse/SPARK-25568
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: Shixiong Zhu
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently when failing to update an accumulator, DAGScheduler.updateAccumulators will skip the remaining accumulators. We should try to update the remaining accumulators if possible so that they can still report correct values.
[jira] [Assigned] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
[ https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25568:
------------------------------------
    Assignee: Shixiong Zhu  (was: Apache Spark)
[jira] [Commented] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
[ https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632547#comment-16632547 ]

Apache Spark commented on SPARK-25568:
--------------------------------------

User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/22586
[jira] [Created] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
Shixiong Zhu created SPARK-25568:
---------------------------------

             Summary: Continue to update the remaining accumulators when failing to update one accumulator
                 Key: SPARK-25568
                 URL: https://issues.apache.org/jira/browse/SPARK-25568
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.2, 2.4.0
            Reporter: Shixiong Zhu
            Assignee: Shixiong Zhu

Currently when failing to update an accumulator, DAGScheduler.updateAccumulators will skip the remaining accumulators. We should try to update the remaining accumulators if possible.
[jira] [Updated] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
[ https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-25568:
---------------------------------
    Description: Currently when failing to update an accumulator, DAGScheduler.updateAccumulators will skip the remaining accumulators. We should try to update the remaining accumulators if possible so that they can still report correct values.
    (was: Currently when failing to update an accumulator, DAGScheduler.updateAccumulators will skip the remaining accumulators. We should try to update the remaining accumulators if possible.)
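The intended behavior can be sketched in plain Python: wrap each individual accumulator update in its own try/except so that one failure is recorded and the loop continues, instead of aborting and skipping the rest. This only models the idea; the names are illustrative, not DAGScheduler's actual API.

```python
def update_accumulators(updates):
    """Apply each (accumulator, value) update; collect failures instead of
    stopping at the first one, so healthy accumulators still report values."""
    failures = []
    for acc, value in updates:
        try:
            acc.add(value)
        except Exception as exc:
            failures.append((acc, exc))  # would be logged, then skipped
    return failures

class Acc:
    def __init__(self):
        self.total = 0
    def add(self, v):
        self.total += v

class BrokenAcc(Acc):
    def add(self, v):
        raise ValueError("cannot update")

a, b, c = Acc(), BrokenAcc(), Acc()
failed = update_accumulators([(a, 1), (b, 2), (c, 3)])
assert (a.total, c.total) == (1, 3)  # c is still updated despite b failing
assert len(failed) == 1
```

Under the pre-fix behavior, the failure on `b` would have left `c.total` at 0 as well.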
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632512#comment-16632512 ]

Nicholas Chammas commented on SPARK-25150:
------------------------------------------

Correct, this isn't a cross join. It's just a plain inner join. In theory, whether cross joins are enabled or not should have no bearing on the result. However, what we're seeing is that without them enabled we get an incorrect error, and with them enabled we get incorrect results.

If we were actually trying a cross join (i.e. no {{on=(...)}} condition specified), I think those results (with the 4 output rows) would still be incorrect, since you'd expect NH's population to be combined with RI's stats in one of the output rows, but that's not the case. You'd also expect MA to show up in the output, too.

> The second join joins on a column in {{states}}, but that is not a DataFrame used in that join. Is that the problem?

Not sure what you mean here. Both joins join on {{states}}, which is the first DataFrame in the definition of {{analysis}}.

> Joining DataFrames derived from the same source yields confusing/incorrect results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Chammas
>            Priority: Major
>         Attachments: expected-output.txt, output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, persons.csv, states.csv, zombie-analysis.py
>
> I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should be a left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior.
[jira] [Commented] (SPARK-23285) Allow spark.executor.cores to be fractional
[ https://issues.apache.org/jira/browse/SPARK-23285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632511#comment-16632511 ]

Apache Spark commented on SPARK-23285:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22585

> Allow spark.executor.cores to be fractional
> -------------------------------------------
>
>                 Key: SPARK-23285
>                 URL: https://issues.apache.org/jira/browse/SPARK-23285
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes, Scheduler, Spark Submit
>    Affects Versions: 2.4.0
>            Reporter: Anirudh Ramanathan
>            Assignee: Yinan Li
>            Priority: Minor
>             Fix For: 2.4.0
>
> There is a strong check for an integral number of cores per executor in [SparkSubmitArguments.scala#L270-L272|https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L270-L272]. Given we're reusing that property in K8s, does it make sense to relax it?
>
> K8s treats CPU as a "compressible resource" and can actually assign millicpus to individual containers. Also to be noted: spark.driver.cores has no such check in place.
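The relaxation under discussion can be sketched as a small validator: accept whole numbers everywhere, and allow Kubernetes-style fractional values (including the `500m` millicpu notation K8s uses) only when the master is `k8s://`. This is a hypothetical illustration of the idea, not Spark's actual parsing code.

```python
def parse_executor_cores(value, master):
    """Parse an executor-cores setting, allowing fractional / millicpu
    values only for Kubernetes masters (illustrative sketch)."""
    is_k8s = master.startswith("k8s://")
    if value.endswith("m"):                 # millicpu notation, e.g. "500m"
        cores = int(value[:-1]) / 1000.0
    else:
        cores = float(value)
    if cores <= 0:
        raise ValueError("spark.executor.cores must be positive")
    if not is_k8s and not cores.is_integer():
        # mirrors the strict integral check in SparkSubmitArguments
        raise ValueError("fractional executor cores are only meaningful on Kubernetes")
    return cores

assert parse_executor_cores("2", "yarn") == 2.0
assert parse_executor_cores("500m", "k8s://https://host:6443") == 0.5
```

The point of the check's placement is that non-K8s cluster managers schedule whole cores, so a fractional value there is a user error rather than a valid request.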
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632491#comment-16632491 ]

Apache Spark commented on SPARK-25262:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22584

> Make Spark local dir volumes configurable with Spark on Kubernetes
> ------------------------------------------------------------------
>
>                 Key: SPARK-25262
>                 URL: https://issues.apache.org/jira/browse/SPARK-25262
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Rob Vesse
>            Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while pod templates will provide more in-depth customisation for Spark on Kubernetes, there are some things that cannot be modified because Spark code generates pod specs in very specific ways.
>
> The particular issue identified relates to the handling of {{spark.local.dirs}}, which is done by {{LocalDirsFeatureStep.scala}}. For each directory specified, or a single default if there is no explicit specification, it creates a Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation, this will be backed by the node storage (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some compute environments this may be extremely undesirable. For example, with diskless compute resources the node storage will likely be a non-performant remote mounted disk, often with limited capacity. For such environments it would likely be better to set {{medium: Memory}} on the volume, per the K8S documentation, to use a {{tmpfs}} volume instead.
>
> Another closely related issue is that users might want to use a different volume type to back the local directories, and there is no possibility to do that.
>
> Pod templates will not really solve either of these issues because Spark is always going to attempt to generate a new volume for each local directory and always going to set these as {{emptyDir}}.
>
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}}-backed {{emptyDir}} volumes
> * Modify the logic to check whether a volume is already defined with the name and, if so, skip generating a volume definition for it
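The two proposed changes can be sketched with plain dicts standing in for Kubernetes volume specs: a flag that switches generated `emptyDir` volumes to tmpfs (`medium: Memory`), and a check that skips any volume whose name is already defined (e.g. by a pod template). The config-flag name, volume naming scheme, and helper are hypothetical, not Spark's actual implementation.

```python
def build_local_dir_volumes(local_dirs, use_tmpfs, existing_volume_names):
    """Generate one emptyDir volume spec per local dir, unless a volume of
    the same name already exists; optionally back them with tmpfs."""
    volumes = []
    for i, _path in enumerate(local_dirs):
        name = "spark-local-dir-%d" % i
        if name in existing_volume_names:
            continue  # user already supplied this volume, don't overwrite it
        # medium: Memory makes Kubernetes back the emptyDir with tmpfs
        empty_dir = {"medium": "Memory"} if use_tmpfs else {}
        volumes.append({"name": name, "emptyDir": empty_dir})
    return volumes

vols = build_local_dir_volumes(
    ["/tmp/d0", "/tmp/d1"],
    use_tmpfs=True,
    existing_volume_names={"spark-local-dir-1"},
)
assert vols == [{"name": "spark-local-dir-0", "emptyDir": {"medium": "Memory"}}]
```

With this shape, a pod template that predefines `spark-local-dir-1` as, say, a hostPath volume would be left untouched, addressing the second half of the proposal.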
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632485#comment-16632485 ]

Sean Owen commented on SPARK-25150:
-----------------------------------

Hm, I am not sure I understand the example yet – help me clarify here. We have three dataframes, really; states, humans, zombies:
{code:java}
State,Total Population,Total Area
RI,120,30
MA,800,1700
NH,330,910

+-----+-----+
|State|count|
+-----+-----+
|   RI|    2|
|   NH|    1|
+-----+-----+

+-----+-----+
|State|count|
+-----+-----+
|   RI|    1|
|   MA|    1|
+-----+-----+{code}
You join all three on state:
{code:java}
analysis = (
    states
    .join(
        total_humans,
        on=(states['State'] == total_humans['State'])
    )
    .join(
        total_zombies,
        on=(states['State'] == total_zombies['State'])
    )
    .orderBy(states['State'], ascending=True)
    .select(
        states['State'],
        states['Total Population'],
        total_humans['count'].alias('Total Humans'),
        total_zombies['count'].alias('Total Zombies'),
    )
)
{code}
and you get
{code:java}
+-----+----------------+------------+-------------+
|State|Total Population|Total Humans|Total Zombies|
+-----+----------------+------------+-------------+
|   NH|             330|           1|            1|
|   NH|             330|           1|            1|
|   RI|             120|           2|            1|
|   RI|             120|           2|            1|
+-----+----------------+------------+-------------+{code}
But say you expect
{code:java}
+-----+----------------+------------+-------------+
|State|Total Population|Total Humans|Total Zombies|
+-----+----------------+------------+-------------+
|   RI|             120|           2|            1|
+-----+----------------+------------+-------------+{code}
First, this isn't a cross join, right? The message says it thinks there is no join condition and wonders if you're really trying to do a cross join, but you're not, so enabling it isn't helping. If these were cross-joins, the output would be correct, I think?

The second join joins on a column in {{states}}, but that is not a DataFrame used in that join. Is that the problem?
> Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
[ https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632483#comment-16632483 ] Karthik Manamcheri commented on SPARK-25561: [~michael] thanks for the prompt reply. This is hard to test because the problem happens only when HMS goes into fallback ORM mode. For that to happen, we need the direct SQL query to fail in HMS. There are no consistent bugs (that I know of) which can be used to test this in a deterministic fashion. I was able to run into this on Hive 1.1.0. However, as I understand it, the HMS behavior of falling back to ORM has been the same in Hive from the beginning. Not sure. > HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql > -- > > Key: SPARK-25561 > URL: https://issues.apache.org/jira/browse/SPARK-25561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Karthik Manamcheri >Priority: Major > > In HiveShim.scala, the current behavior is that if > hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter > call to succeed. If it fails, we'll throw a RuntimeException. > However, this might not always be the case. Hive's direct SQL functionality > is best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark > should handle that exception correctly if Hive falls back to ORM.
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632381#comment-16632381 ] Nicholas Chammas commented on SPARK-25150: -- ([~petertoth] - Seeing your comment edit now.) OK, so it seems the two problems I identified are accurate, but they have a common root cause. Thanks for confirming. [~srowen] - Given Peter's confirmation that the results with cross join enabled are incorrect, I believe we should mark this as a correctness issue. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. 
> I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632281#comment-16632281 ] Nicholas Chammas commented on SPARK-25150: -- I've uploaded the expected output. I realize that the reproduction I've attached to this ticket (zombie-analysis.py plus the related files), though complete and self-contained, is a bit verbose. If it's not helpful enough I will see if I can boil it down further. [~petertoth] - I suggest you take another look at the output with cross joins enabled and compare it to what (I think) is the correct expected output. If I'm understanding things correctly, there are two issues: 1) the bad error when cross join is not enabled (there should be no error), and 2) the incorrect results when cross join _is_ enabled (the results I just uploaded). Your PR doesn't appear to investigate or address the incorrect results issue, so I'm not sure if it would fix that too or if I am just mistaken about there being a second issue. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632274#comment-16632274 ] Peter Toth edited comment on SPARK-25150 at 9/28/18 7:28 PM: - [~nchammas], sorry for the late reply. There is only one issue here. Please see zombie-analysis.py: it contains 2 joins and both joins define the condition explicitly, so setting spark.sql.crossJoin.enabled=true should not have any effect. The root cause of the error you see when spark.sql.crossJoin.enabled=false (default) and the incorrect results when spark.sql.crossJoin.enabled=true is the same: the join condition is handled incorrectly. Please see my PR's description for further details: [https://github.com/apache/spark/pull/22318] was (Author: petertoth): [~nchammas], sorry for the late reply. There is only one issue here. Please see zombie-analysis.py: it contains 2 joins and both joins define the condition explicitly, so setting spark.sql.crossJoin.enabled=true should not have any effect. Simply, the SQL statement should not fail; please see my PR's description for further details: [https://github.com/apache/spark/pull/22318] > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-25150: - Attachment: expected-output.txt > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632274#comment-16632274 ] Peter Toth commented on SPARK-25150: [~nchammas], sorry for the late reply. There is only one issue here. Please see zombie-analysis.py: it contains 2 joins and both joins define the condition explicitly, so setting spark.sql.crossJoin.enabled=true should not have any effect. Simply, the SQL statement should not fail; please see my PR's description for further details: [https://github.com/apache/spark/pull/22318] > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-25150: - Description: I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error: {code:java} Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true; {code} Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers. I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning. I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled. I realize the join I've written is not "correct" in the sense that it should be left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior. was: I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error: {code:java} Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true; {code} Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers. I am not sure if I am missing something obvious, or if there is some kind of bug here. 
The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning. I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled. I realize the join I've written is not correct in the sense that it should be left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. 
> I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632252#comment-16632252 ] Nicholas Chammas commented on SPARK-25150: -- The attachments on this ticket contain a complete reproduction. The comment towards the beginning of zombie-analysis.py points to the config that, when enabled, appears to yield incorrect results. (Without the config enabled we get a confusing/incorrect error, which is a second issue.) The results with and without the config enabled are also attached here. I will add another attachment showing the expected results. I believe some folks over on the linked PR provided a simpler reproduction of part of this issue, but I haven't taken a close look at it to see if it captures the same two issues (incorrect results + confusing/incorrect error). > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. 
The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632234#comment-16632234 ] Sean Owen commented on SPARK-25150: --- What's an example of expected vs actual results here that show the bug? is it simple to summarize? > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. 
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1663#comment-1663 ] Nicholas Chammas commented on SPARK-25150: -- [~cloud_fan] / [~srowen] - Would you consider this issue (particularly the one expressed when spark.sql.crossJoin.enabled is set to true) to be a correctness bug? I think it is, but I'd like a committer to confirm and add the appropriate label if necessary. > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. 
> I realize the join I've written is not correct in the sense that it should be > left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Updated] (SPARK-25324) ML 2.4 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-25324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-25324: -- Fix Version/s: 2.4.0 > ML 2.4 QA: API: Java compatibility, docs > > > Key: SPARK-25324 > URL: https://issues.apache.org/jira/browse/SPARK-25324 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Blocker > Fix For: 2.4.0 > > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). 
> * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building the JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so, so we can make > this task easier in the future!
[jira] [Updated] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-25320: -- Fix Version/s: 2.4.0 > ML, Graph 2.4 QA: API: Binary incompatible changes > -- > > Key: SPARK-25320 > URL: https://issues.apache.org/jira/browse/SPARK-25320 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Blocker > Fix For: 2.4.0 > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
[ https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632137#comment-16632137 ] Michael Allman edited comment on SPARK-25561 at 9/28/18 5:08 PM: - cc [~cloud_fan] [~ekhliang] Hi [~karthik.manamcheri]. Thanks for reporting this. I can't take a look right now, but I believe we have test cases that exercise this scenario. If not, it's certainly a hole in our coverage. If we do, it may be that Hive's behavior in this scenario is version-dependent, and we don't have coverage for your version of Hive. What version of Hive are you using? Thanks. was (Author: michael): cc [~cloud_fan] [~ekhliang] Hi [~karthik.manamcheri]. Thanks for reporting this. I can't take a look right now, but I believe we have test cases that exercise this scenario. If not, it's certainly a whole in our coverage. If we do, it may be that Hive's behavior in this scenario is version-dependent, and we don't have coverage for your version of Hive. What version of Hive are you using? Thanks. > HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql > -- > > Key: SPARK-25561 > URL: https://issues.apache.org/jira/browse/SPARK-25561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Karthik Manamcheri >Priority: Major > > In HiveShim.scala, the current behavior is that if > hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter > call to succeed. If it fails, we'll throw a RuntimeException. > However, this might not always be the case. Hive's direct SQL functionality > is best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark > should handle that exception correctly if Hive falls back to ORM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
[ https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632137#comment-16632137 ] Michael Allman commented on SPARK-25561: cc [~cloud_fan] [~ekhliang] Hi [~karthik.manamcheri]. Thanks for reporting this. I can't take a look right now, but I believe we have test cases that exercise this scenario. If not, it's certainly a hole in our coverage. If we do, it may be that Hive's behavior in this scenario is version-dependent, and we don't have coverage for your version of Hive. What version of Hive are you using? Thanks. > HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql > -- > > Key: SPARK-25561 > URL: https://issues.apache.org/jira/browse/SPARK-25561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Karthik Manamcheri >Priority: Major > > In HiveShim.scala, the current behavior is that if > hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter > call to succeed. If it fails, we'll throw a RuntimeException. > However, this might not always be the case. Hive's direct SQL functionality > is best-effort, meaning it will fall back to ORM if direct SQL fails. Spark > should handle that exception correctly if Hive falls back to ORM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
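The behavior the report asks for is a standard "fast path with graceful fallback" pattern: treat a direct-SQL failure as retryable rather than fatal. A minimal Python sketch of that control flow (hypothetical function names, not Spark's actual HiveShim code):

```python
def get_partitions_by_filter(filter_expr, try_direct_sql, direct_sql, orm_fallback):
    """Try the fast direct-SQL path; on failure, fall back to the slower
    ORM path instead of raising, mirroring Hive's own best-effort behavior."""
    if try_direct_sql:
        try:
            return direct_sql(filter_expr)
        except RuntimeError:
            pass  # direct SQL failed; Hive itself would retry via ORM
    return orm_fallback(filter_expr)
```

The key point is that enabling the fast path only changes which branch is tried first, never whether a result is produced.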
[jira] [Comment Edited] (SPARK-23717) Leverage docker support in Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632100#comment-16632100 ] Eric Yang edited comment on SPARK-23717 at 9/28/18 4:33 PM: It is possible to run standalone Spark in YARN docker containers without any code modification to Spark. Here is an example yarnfile that I used to run the mesosphere-generated Docker image, and it ran fine: {code} { "name": "spark", "kerberos_principal" : { "principal_name" : "spark/_h...@example.com", "keytab" : "file:///etc/security/keytabs/spark.service.keytab" }, "version": "0.1", "components" : [ { "name": "driver", "number_of_containers": 1, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } }, { "name": "executor", "number_of_containers": 2, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh spark://driver-0.spark.spark.ycluster:7077", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "dependencies": [ "driver" ], "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } } ] } {code} The reason for the 30-second sleep is to ensure RegistryDNS has refreshed and can respond to DNS queries. The sleep could be a lot shorter, like 3 seconds; I did not spend much time trying to fine-tune the DNS wait time. A further enhancement to pass in a keytab and krb5.conf could enable access to secure HDFS; that is left as an exercise for the readers of this JIRA. 
was (Author: eyang): It is possible to run standalone Spark in YARN without any code modification to spark. Here is an example yarnfile that I used to run mesosphere generated docker image and it ran fine: {code} { "name": "spark", "kerberos_principal" : { "principal_name" : "spark/_h...@example.com", "keytab" : "file:///etc/security/keytabs/spark.service.keytab" }, "version": "0.1", "components" : [ { "name": "driver", "number_of_containers": 1, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } }, { "name": "executor", "number_of_containers": 2, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh spark://driver-0.spark.spark.ycluster:7077", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "dependencies": [ "driver" ], "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } } ] } {code} The reason for 30 seconds sleep is to ensure RegistryDNS has been refreshed and updated to respond to DNS queries. The sleep could be a lot shorter like 3 seconds. I did not spend much time to try to fine tune the DNS wait time. Further enhancement to pass in keytab and krb5.conf can enable access to secure HDFS, that would be exercise for the readers of this JIRA. 
> Leverage docker support in Hadoop 3 > --- > > Key: SPARK-23717 > URL: https://issues.apache.org/jira/browse/SPARK-23717 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.4.0 >Reporter: Mridul Muralidharan >
[jira] [Commented] (SPARK-23717) Leverage docker support in Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632100#comment-16632100 ] Eric Yang commented on SPARK-23717: --- It is possible to run standalone Spark in YARN without any code modification to Spark. Here is an example yarnfile that I used to run the mesosphere-generated Docker image, and it ran fine: {code} { "name": "spark", "kerberos_principal" : { "principal_name" : "spark/_h...@example.com", "keytab" : "file:///etc/security/keytabs/spark.service.keytab" }, "version": "0.1", "components" : [ { "name": "driver", "number_of_containers": 1, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-master.sh", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } }, { "name": "executor", "number_of_containers": 2, "artifact": { "id": "mesosphere/spark:latest", "type": "DOCKER" }, "launch_command": "bash,-c,sleep 30 && ./sbin/start-slave.sh spark://driver-0.spark.spark.ycluster:7077", "resource": { "cpus": 1, "memory": "256" }, "run_privileged_container": true, "dependencies": [ "driver" ], "configuration": { "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true", "SPARK_NO_DAEMONIZE":"true", "JAVA_HOME":"/usr/lib/jvm/jre1.8.0_131" }, "properties": { "docker.network": "host" } } } ] } {code} The reason for the 30-second sleep is to ensure RegistryDNS has refreshed and can respond to DNS queries. The sleep could be a lot shorter, like 3 seconds; I did not spend much time trying to fine-tune the DNS wait time. A further enhancement to pass in a keytab and krb5.conf could enable access to secure HDFS; that is left as an exercise for the readers of this JIRA. 
> Leverage docker support in Hadoop 3 > --- > > Key: SPARK-23717 > URL: https://issues.apache.org/jira/browse/SPARK-23717 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.4.0 >Reporter: Mridul Muralidharan >Priority: Major > > The introduction of docker support in Apache Hadoop 3 can be leveraged by > Apache Spark for resolving multiple long-standing shortcomings - particularly > related to package isolation. > It also allows for network isolation, where applicable, allowing for more > sophisticated cluster configuration/customization. > This JIRA will track the various tasks for enhancing Spark to leverage > container support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
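The fixed `sleep 30` in the yarnfile launch commands above only exists to give RegistryDNS time to publish the service records. A more robust alternative is to poll until the master's hostname actually resolves; a minimal sketch in Python (the hostname and timeouts are illustrative, not part of the original setup):

```python
import socket
import time

def wait_for_dns(hostname, timeout=30.0, interval=1.0):
    """Poll DNS until `hostname` resolves or `timeout` elapses.
    Returns True as soon as resolution succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            socket.getaddrinfo(hostname, None)  # any record type will do
            return True
        except socket.gaierror:
            time.sleep(interval)  # not resolvable yet; retry shortly
    return False
```

A launch wrapper could call this with e.g. `driver-0.spark.spark.ycluster` before starting the worker, replacing the fixed sleep with a bounded wait that returns as soon as RegistryDNS is ready.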
[jira] [Commented] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
[ https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632061#comment-16632061 ] Yuming Wang commented on SPARK-25553: - Thanks [~srowen] > Add EmptyInterpolatedStringChecker to scalastyle-config.xml > --- > > Key: SPARK-25553 > URL: https://issues.apache.org/jira/browse/SPARK-25553 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Minor > > h4. Justification > Empty interpolated strings are harder to read and not necessary. > > More details: > http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide
[ https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20937: -- Fix Version/s: (was: 2.4.1) (was: 2.5.0) 2.4.0 > Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, > DataFrames and Datasets Guide > - > > Key: SPARK-20937 > URL: https://issues.apache.org/jira/browse/SPARK-20937 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Chenxiao Mao >Priority: Trivial > Fix For: 2.4.0 > > > As a follow-up to SPARK-20297 (and SPARK-10400) in which > {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala > and Hive, Spark SQL docs for [Parquet > Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration] > should have it documented. > p.s. It was asked about in [Why can't Impala read parquet files after Spark > SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow > today. > p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance > Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table > 3-10. Parquet data source options) that gives the option some wider publicity. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25431) Fix function examples and unify the format of the example results.
[ https://issues.apache.org/jira/browse/SPARK-25431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25431: -- Fix Version/s: (was: 2.4.1) (was: 3.0.0) 2.4.0 > Fix function examples and unify the format of the example results. > -- > > Key: SPARK-25431 > URL: https://issues.apache.org/jira/browse/SPARK-25431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Minor > Fix For: 2.4.0 > > > There are some mistakes in examples of newly added functions. Also the format > of the example results are not unified. We should fix and unify them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21514) Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also
[ https://issues.apache.org/jira/browse/SPARK-21514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632003#comment-16632003 ] Nick Orka commented on SPARK-21514: --- S3 recently increased its request rate limits, so eventual consistency has become a huge problem for data lakes based on S3. This approach can fix the issue because this is the exact spot where Spark jobs fail. Can you change the priority of the ticket? This is a real blocker for many data pipelines. > Hive has updated with new support for S3 and InsertIntoHiveTable.scala should > update also > - > > Key: SPARK-21514 > URL: https://issues.apache.org/jira/browse/SPARK-21514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Javier Ros >Priority: Major > > Hive has been updated with new parameters to optimize the usage of S3; you > can now avoid using S3 as the stagingdir via the parameters > hive.blobstore.supported.schemes & hive.blobstore.optimizations.enabled. > The InsertIntoHiveTable.scala file should be updated with the same > improvement to match the behavior of Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
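The two Hive parameters named in the issue are normally set in hive-site.xml. A hedged sketch of what that configuration might look like (the scheme list is illustrative; check your Hive version's defaults before copying):

```xml
<!-- hive-site.xml: allow Hive to skip the S3 staging directory and
     apply blobstore optimizations when the target scheme is listed.
     Values below are illustrative, not prescriptive. -->
<property>
  <name>hive.blobstore.supported.schemes</name>
  <value>s3,s3a,s3n</value>
</property>
<property>
  <name>hive.blobstore.optimizations.enabled</name>
  <value>true</value>
</property>
```

The issue's request is that Spark's InsertIntoHiveTable path respect the same settings rather than always staging through the blobstore.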
[jira] [Updated] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25565: -- Priority: Minor (was: Major) > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25553) Add EmptyInterpolatedStringChecker to scalastyle-config.xml
[ https://issues.apache.org/jira/browse/SPARK-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25553. --- Resolution: Won't Fix I'd say if anything, later, instead focus on removing uses of {{"...".format(...)}} and cases like {{s"..." + foo}} which should be {{s"...$foo"}} > Add EmptyInterpolatedStringChecker to scalastyle-config.xml > --- > > Key: SPARK-25553 > URL: https://issues.apache.org/jira/browse/SPARK-25553 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Minor > > h4. Justification > Empty interpolated strings are harder to read and not necessary. > > More details: > http://www.scalastyle.org/rules-dev.html#org_scalastyle_scalariform_EmptyInterpolatedStringChecker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
[ https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17159: Assignee: (was: Apache Spark) > Improve FileInputDStream.findNewFiles list performance > -- > > Key: SPARK-17159 > URL: https://issues.apache.org/jira/browse/SPARK-17159 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0 > Environment: spark against object stores >Reporter: Steve Loughran >Priority: Minor > > {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that > calls getFileStatus() on every file, takes the output and does listStatus() > on the output. > This is going to suffer on object stores, as dir listing and getFileStatus calls > are so expensive. It's clear this is a problem, as the method has code to > detect timeouts in the window and warn of problems. > It should be possible to make this faster -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
[ https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17159: Assignee: Apache Spark > Improve FileInputDStream.findNewFiles list performance > -- > > Key: SPARK-17159 > URL: https://issues.apache.org/jira/browse/SPARK-17159 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0 > Environment: spark against object stores >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Minor > > {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that > calls getFileStatus() on every file, takes the output and does listStatus() > on the output. > This is going to suffer on object stores, as dir listing and getFileStatus calls > are so expensive. It's clear this is a problem, as the method has code to > detect timeouts in the window and warn of problems. > It should be possible to make this faster -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
[ https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-17159: --- > Improve FileInputDStream.findNewFiles list performance > -- > > Key: SPARK-17159 > URL: https://issues.apache.org/jira/browse/SPARK-17159 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0 > Environment: spark against object stores >Reporter: Steve Loughran >Priority: Minor > > {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that > calls getFileStatus() on every file, takes the output and does listStatus() > on the output. > This is going to suffer on object stores, as dir listing and getFileStatus calls > are so expensive. It's clear this is a problem, as the method has code to > detect timeouts in the window and warn of problems. > It should be possible to make this faster -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
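The cost pattern the issue describes, one metadata call per file on top of the listing itself, can be sketched outside Spark. The point is that a single directory scan yields the same information as N per-file status calls; a Python stand-in (local filesystem used as an analogy for the Hadoop FileSystem API, where the per-call cost difference is far larger on object stores):

```python
import os

def list_with_per_file_stat(path):
    """Slow pattern: list names, then issue one stat() per entry
    (analogous to globStatus with a getFileStatus-per-file filter)."""
    return {name: os.stat(os.path.join(path, name)).st_size
            for name in os.listdir(path)}

def list_with_single_scan(path):
    """Fast pattern: one scan yields name and metadata together
    (analogous to a single listStatus call)."""
    return {entry.name: entry.stat().st_size for entry in os.scandir(path)}
```

Both return identical results; on an object store, the first turns into N+1 HTTP round trips while the second stays close to one listing request.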
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631751#comment-16631751 ] Li Yuanjian commented on SPARK-10816: - Design doc: [https://docs.google.com/document/d/1zeAc7QKSO7J4-Yk06kc76kvldl-QHLCDJuu04d7k2bg/edit?usp=sharing] PR: [https://github.com/apache/spark/pull/22583] After a rough comparison with [~kabhwan]'s posted doc and PR, we share several points in design and implementation; hope we can resolve this problem together! > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf, Session > Window Support For Structure Streaming.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Yuanjian updated SPARK-10816: Attachment: Session Window Support For Structure Streaming.pdf > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf, Session > Window Support For Structure Streaming.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631733#comment-16631733 ] Apache Spark commented on SPARK-10816: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/22583 > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25567) [Spark Job History] Table listing in SQL Tab not display Sort Icon
[ https://issues.apache.org/jira/browse/SPARK-25567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631711#comment-16631711 ] shahid commented on SPARK-25567: Thanks. I will raise a PR > [Spark Job History] Table listing in SQL Tab not display Sort Icon > -- > > Key: SPARK-25567 > URL: https://issues.apache.org/jira/browse/SPARK-25567 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > 1. spark.sql.ui.retainedExecutions = 2 > 2. Run Beeline Jobs > 3. Open SQL Tab will list SQL Queries in table > 4. ID column header does not display Sort Icon, compared to other UI tabs like > Job Id in Jobs > 5. If the user clicks the column header, sorting happens. > Expected Result: > User should be provided with a Sort Icon like other UI tabs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25567) [Spark Job History] Table listing in SQL Tab not display Sort Icon
ABHISHEK KUMAR GUPTA created SPARK-25567: Summary: [Spark Job History] Table listing in SQL Tab not display Sort Icon Key: SPARK-25567 URL: https://issues.apache.org/jira/browse/SPARK-25567 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.1 Reporter: ABHISHEK KUMAR GUPTA 1. spark.sql.ui.retainedExecutions = 2 2. Run Beeline Jobs 3. Open SQL Tab will list SQL Queries in table 4. ID column header does not display Sort Icon, compared to other UI tabs like Job Id in Jobs 5. If the user clicks the column header, sorting happens. Expected Result: User should be provided with a Sort Icon like other UI tabs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25566) [Spark Job History] SQL UI Page does not support Pagination
[ https://issues.apache.org/jira/browse/SPARK-25566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631703#comment-16631703 ] shahid commented on SPARK-25566: Thanks for reporting. I am working on it. > [Spark Job History] SQL UI Page does not support Pagination > --- > > Key: SPARK-25566 > URL: https://issues.apache.org/jira/browse/SPARK-25566 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.1 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > 1. configure spark.sql.ui.retainedExecutions = 5 ( In Job History > Spark-default.conf ) > 2. Execute beeline Jobs more than 2 > 3. Open the UI page from the History Server > 4. Click SQL Tab > *Actual Output:* It shows all SQL Queries in a single page. The user has to scroll > the whole page for specific SQL Queries. > *Expected:* It should show results page-wise, as other UI > tabs like Jobs and Stages do. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25566) [Spark Job History] SQL UI Page does not support Pagination
ABHISHEK KUMAR GUPTA created SPARK-25566: Summary: [Spark Job History] SQL UI Page does not support Pagination Key: SPARK-25566 URL: https://issues.apache.org/jira/browse/SPARK-25566 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.3.1 Reporter: ABHISHEK KUMAR GUPTA 1. configure spark.sql.ui.retainedExecutions = 5 ( In Job History Spark-default.conf ) 2. Execute beeline Jobs more than 2 3. Open the UI page from the History Server 4. Click SQL Tab *Actual Output:* It shows all SQL Queries in a single page. The user has to scroll the whole page for specific SQL Queries. *Expected:* It should show results page-wise, as other UI tabs like Jobs and Stages do. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631700#comment-16631700 ] Apache Spark commented on SPARK-25505: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/22582 > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25565: Assignee: Apache Spark > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631662#comment-16631662 ] Apache Spark commented on SPARK-25565: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/22581 > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25565: Assignee: (was: Apache Spark) > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631654#comment-16631654 ] Yuming Wang commented on SPARK-25565: - Thanks [~hyukjin.kwon] Please go ahead. > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23194) from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls
[ https://issues.apache.org/jira/browse/SPARK-23194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631652#comment-16631652 ] Daniel Mateus Pires commented on SPARK-23194: - Any news on this? Not being able to set the from_json mode and use the columnNameOfCorruptRecord option is pretty limiting, and the documentation of "from_json" suggests that all the spark.read.json options are available {code:java} * @param options options to control how the json is parsed. accepts the same options as the json data source. {code} > from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls > --- > > Key: SPARK-23194 > URL: https://issues.apache.org/jira/browse/SPARK-23194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Priority: Major > > from_json accepts Json parsing options such as being PERMISSIVE to parsing > errors or failing fast. It seems from the code that even though the default > option is to fail fast, we catch that exception and return nulls. > > In order to not change behavior, we should remove that try-catch block and > change the default to permissive, but allow failfast mode to indeed fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
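The two parse modes under discussion differ only in what happens on a malformed record. A plain-Python analogy of the intended semantics (this is not Spark's implementation, just the contract the report says from_json currently breaks by always swallowing the error):

```python
import json

def parse_records(lines, mode="PERMISSIVE"):
    """PERMISSIVE yields None (a null row) for each malformed record;
    FAILFAST re-raises on the first malformed record it hits."""
    out = []
    for line in lines:
        try:
            out.append(json.loads(line))
        except json.JSONDecodeError:
            if mode == "FAILFAST":
                raise  # fail fast: surface the parse error to the caller
            out.append(None)  # permissive: placeholder for the bad record
    return out
```

The bug report amounts to saying that from_json behaves like the PERMISSIVE branch regardless of which mode the caller requests.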
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631605#comment-16631605 ] Kazuaki Ishizaki commented on SPARK-25538: -- Thanks for uploading the schema. I have looked at it, but I am still not sure about the cause of this problem. I would appreciate it if you could find input data that reproduces the problem. > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it before that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. 
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631575#comment-16631575 ] Hyukjin Kwon commented on SPARK-25565: -- I am taking a look at this. I will open a PR shortly if you don't mind. > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25508) Refactor OrcReadBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25508: Assignee: Apache Spark > Refactor OrcReadBenchmark to use main method > > > Key: SPARK-25508 > URL: https://issues.apache.org/jira/browse/SPARK-25508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25508) Refactor OrcReadBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25508: Assignee: (was: Apache Spark) > Refactor OrcReadBenchmark to use main method > > > Key: SPARK-25508 > URL: https://issues.apache.org/jira/browse/SPARK-25508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25508) Refactor OrcReadBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631573#comment-16631573 ] Apache Spark commented on SPARK-25508: -- User 'yucai' has created a pull request for this issue: https://github.com/apache/spark/pull/22580 > Refactor OrcReadBenchmark to use main method > > > Key: SPARK-25508 > URL: https://issues.apache.org/jira/browse/SPARK-25508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631568#comment-16631568 ] ice bai commented on SPARK-21774: - I ran into the same problem in Spark 2.3.0. The following are some tests: ``` spark-sql> select ''>0; true Time taken: 0.078 seconds, Fetched 1 row(s) spark-sql> select ''>0; NULL Time taken: 0.065 seconds, Fetched 1 row(s) spark-sql> select '1.0'=1; true Time taken: 0.054 seconds, Fetched 1 row(s) spark-sql> select '1.2'=1; true Time taken: 0.07 seconds, Fetched 1 row(s) ``` When I set the log level to trace, I found this: === Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings === !'Project [unresolvedalias((> 0), None)] 'Project [unresolvedalias((cast( as int) > 0), None)] +- OneRowRelation +- OneRowRelation > The rule PromoteStrings cast string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: StanZhai >Priority: Critical > Labels: correctness > > Data > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result which is wrong: > {code} > ++---+ > | a| b| > ++---+ > | 0| 1| > |-0.1| 2| > ++---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) >+- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
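The results in this thread are consistent with the string operand being cast to int before the comparison. A Python analogue (not Spark's Cast implementation; it merely reproduces the observed truncate-or-null behavior) shows why '1.2' = 1 evaluates to true under an int cast, while a double cast would keep the values distinct:

```python
def cast_to_int(s):
    """Mimic a lossy CAST(string AS INT): truncate numerics, NULL otherwise."""
    try:
        return int(float(s))   # '1.2' -> 1, '-0.1' -> 0
    except ValueError:
        return None            # '' or 'abc' -> NULL

def cast_to_double(s):
    """Mimic CAST(string AS DOUBLE), which preserves the fractional part."""
    try:
        return float(s)
    except ValueError:
        return None

# With the int cast, distinct strings collapse onto the same integer:
assert cast_to_int("1.2") == 1     # so '1.2' = 1 is true
assert cast_to_int("-0.1") == 0    # so WHERE a = 0 also matches '-0.1'

# Casting to double instead keeps the values apart:
assert cast_to_double("1.2") != 1.0
assert cast_to_double("-0.1") != 0.0
```

This is why promoting the string side to a wider numeric type (rather than int) avoids the wrong matches shown in the issue description.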
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631550#comment-16631550 ] Jungtaek Lim commented on SPARK-24630: -- I think it would be better to describe actual queries (either a single query or scenarios composed of multiple queries) that structured streaming cannot express but the new proposal can, so that everyone can see the benefit of supporting this. > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > Structured Streaming. > To support SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
Yuming Wang created SPARK-25565: --- Summary: Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls Key: SPARK-25565 URL: https://issues.apache.org/jira/browse/SPARK-25565 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 2.5.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25565) Add scala style checker to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-25565: Component/s: (was: Block Manager) Build > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631513#comment-16631513 ] Apache Spark commented on SPARK-25429: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22579 > SparkListenerBus inefficient due to > 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure > > > Key: SPARK-25429 > URL: https://issues.apache.org/jira/browse/SPARK-25429 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: DENG FEI >Priority: Major > > {code:java} > private def updateStageMetrics( > stageId: Int, > attemptId: Int, > taskId: Long, > accumUpdates: Seq[AccumulableInfo], > succeeded: Boolean): Unit = { > Option(stageMetrics.get(stageId)).foreach { metrics => > if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) { > return > } > val oldTaskMetrics = metrics.taskMetrics.get(taskId) > if (oldTaskMetrics != null && oldTaskMetrics.succeeded) { > return > } > val updates = accumUpdates > .filter { acc => acc.update.isDefined && > metrics.accumulatorIds.contains(acc.id) } > .sortBy(_.id) > if (updates.isEmpty) { > return > } > val ids = new Array[Long](updates.size) > val values = new Array[Long](updates.size) > updates.zipWithIndex.foreach { case (acc, idx) => > ids(idx) = acc.id > // In a live application, accumulators have Long values, but when > reading from event > // logs, they have String values. For now, assume all accumulators > are Long and covert > // accordingly. > values(idx) = acc.update.get match { > case s: String => s.toLong > case l: Long => l > case o => throw new IllegalArgumentException(s"Unexpected: $o") > } > } > // TODO: storing metrics by task ID can cause metrics for the same task > index to be > // counted multiple times, for example due to speculation or > re-attempts. 
> metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, > succeeded)) > } > } > {code} > In {{metrics.accumulatorIds.contains(acc.id)}}: if a large SQL application generates > many accumulators, using Array#contains here is inefficient. > In practice, the application may time out while shutting down and get killed by the RM in YARN > mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
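The complaint in this issue is the linear scan that `Array#contains` performs for every accumulator update. A Python sketch of the same access pattern (hypothetical names; the real structure is `LiveStageMetrics#accumulatorIds` in Scala) shows that swapping the array for a hash set preserves the result while replacing O(n) membership tests with O(1) lookups:

```python
# Filtering accumulator updates against the stage's known accumulator IDs.
# With a list, each membership test scans the whole collection (O(n));
# with a set, it is a hash lookup (O(1) on average).

accumulator_ids = list(range(10_000))        # analogue of accumulatorIds: Array[Long]
accumulator_id_set = set(accumulator_ids)    # proposed replacement structure

updates = [{"id": i, "update": i * 2} for i in range(0, 10_000, 7)]

slow = [u for u in updates if u["id"] in accumulator_ids]      # O(n) per test
fast = [u for u in updates if u["id"] in accumulator_id_set]   # O(1) per test

assert slow == fast  # same result, very different cost at scale
```

With thousands of accumulators and an update batch per task, the list version does millions of comparisons where the set version does one hash probe each.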
[jira] [Commented] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631508#comment-16631508 ] Apache Spark commented on SPARK-25564: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/22578 > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25564: Assignee: Apache Spark > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631507#comment-16631507 ] Apache Spark commented on SPARK-25564: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/22578 > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25564: Assignee: (was: Apache Spark) > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25554) Avro logical types get ignored in SchemaConverters.toSqlType
[ https://issues.apache.org/jira/browse/SPARK-25554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631505#comment-16631505 ] Liang-Chi Hsieh commented on SPARK-25554: - hmm, I think Spark 2.4 should have comprehensive support for Avro logical types. {code:java} { "type" : "record", "name" : "name", "namespace" : "namespace", "doc" : "docs", "fields" : [ { "name" : "field1", "type" : [ "null", { "type" : "int", "logicalType" : "date" } ], "doc" : "doc" } ] }{code} The DataFrame schema for the above Avro file: {code} root |-- field1: date (nullable = true) {code} From your attached maven dependencies, it looks like you are using {{spark-avro}} and Spark 2.3? So I think it might be an issue in {{spark-avro}}. > Avro logical types get ignored in SchemaConverters.toSqlType > > > Key: SPARK-25554 > URL: https://issues.apache.org/jira/browse/SPARK-25554 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: Below are the maven dependencies: > {code:java} > > org.apache.avro > avro > 1.8.2 > > > com.databricks > spark-avro_2.11 > 4.0.0 > > > > org.apache.spark > spark-core_2.11 > 2.3.0 > > > org.apache.spark > spark-sql_2.11 > 2.3.0 > > {code} >Reporter: Yanan Li >Priority: Major > > Having an Avro schema defined as follows: > {code:java} > { >"namespace": "com.xxx.avro", >"name": "Book", >"type": "record", >"fields": [{ > "name": "name", > "type": ["null", "string"], > "default": null > }, { > "name": "author", > "type": ["null", "string"], > "default": null > }, { > "name": "published_date", > "type": ["null", {"type": "int", "logicalType": "date"}], > "default": null > } >] > } > {code} > In the Spark schema converted from the above Avro schema, the logical type "date" gets > ignored. 
> {code:java} > StructType(StructField(name,StringType,true),StructField(author,StringType,true),StructField(published_date,IntegerType,true)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
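The conversion the reporter expected can be sketched as a converter that consults the `logicalType` attribute before falling back to the physical type (a simplified, hypothetical converter in Python; the real one is `SchemaConverters.toSqlType` in spark-avro/Spark):

```python
# Map an Avro field type to a SQL type name, honoring logicalType.
# The buggy behavior ignores "logicalType" and returns the physical type.
LOGICAL_TO_SQL = {"date": "date", "timestamp-millis": "timestamp",
                  "timestamp-micros": "timestamp", "decimal": "decimal"}
PHYSICAL_TO_SQL = {"int": "integer", "long": "long", "string": "string"}

def to_sql_type(avro_type):
    # Unions like ["null", {...}] mark nullable fields; strip the null branch.
    if isinstance(avro_type, list):
        branches = [t for t in avro_type if t != "null"]
        return to_sql_type(branches[0])
    if isinstance(avro_type, dict):
        logical = avro_type.get("logicalType")
        if logical in LOGICAL_TO_SQL:
            return LOGICAL_TO_SQL[logical]   # the step the bug skips
        return PHYSICAL_TO_SQL[avro_type["type"]]
    return PHYSICAL_TO_SQL[avro_type]

# The published_date field from the report: int + logicalType=date -> date
field = ["null", {"type": "int", "logicalType": "date"}]
print(to_sql_type(field))  # date
```

Skipping the `logicalType` lookup is what yields `IntegerType` for `published_date` in the reported StructType.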
[jira] [Assigned] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25505: --- Assignee: Maryann Xue > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25505: Fix Version/s: 2.4.0 > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25505. - Resolution: Fixed > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
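The reordering in this ticket can be reproduced with a plain-Python analogue of collecting the grouping columns (a hypothetical sketch, not Spark's code): gathering the non-pivoted columns into a sorted collection loses the input order, while a simple order-preserving filter yields the expected output.

```python
input_cols = ["a", "z", "b", "y", "c", "x", "d", "w", "earnings", "course"]
pivoted = {"course", "earnings"}  # columns consumed by the PIVOT clause

# Buggy: collecting the remaining (grouping) columns into a sorted structure
buggy_order = sorted(c for c in input_cols if c not in pivoted)

# Fixed: keep the grouping columns in their original (input) order
fixed_order = [c for c in input_cols if c not in pivoted]

print(buggy_order)  # ['a', 'b', 'c', 'd', 'w', 'x', 'y', 'z']
print(fixed_order)  # ['a', 'z', 'b', 'y', 'c', 'x', 'd', 'w']
```

The two printed lists match the "now" and "should be" orders quoted in the issue description.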
[jira] [Updated] (SPARK-25564) Add output bytes metrics for each Executor
[ https://issues.apache.org/jira/browse/SPARK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-25564: --- Summary: Add output bytes metrics for each Executor (was: LiveExecutor misses the OutputBytes metrics) > Add output bytes metrics for each Executor > -- > > Key: SPARK-25564 > URL: https://issues.apache.org/jira/browse/SPARK-25564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Lantao Jin >Priority: Minor > > LiveExecutor only statistics the total input bytes. And total output bytes > for each executor also has the equal importance like input. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25564) LiveExecutor misses the OutputBytes metrics
Lantao Jin created SPARK-25564: -- Summary: LiveExecutor misses the OutputBytes metrics Key: SPARK-25564 URL: https://issues.apache.org/jira/browse/SPARK-25564 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Reporter: Lantao Jin LiveExecutor only tracks the total input bytes, but the total output bytes for each executor are equally important. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
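The proposed change amounts to tracking an output-bytes counter alongside the existing input-bytes counter per executor. A minimal Python sketch with hypothetical field names (the real class is Spark's `LiveExecutor` in Scala):

```python
class LiveExecutorSummary:
    """Minimal analogue of a live executor's per-executor byte counters."""

    def __init__(self):
        self.total_input_bytes = 0
        self.total_output_bytes = 0   # the proposed additional metric

    def on_task_end(self, input_bytes, output_bytes):
        # Aggregate both directions of I/O as tasks complete on this executor.
        self.total_input_bytes += input_bytes
        self.total_output_bytes += output_bytes

ex = LiveExecutorSummary()
ex.on_task_end(input_bytes=1024, output_bytes=512)
ex.on_task_end(input_bytes=2048, output_bytes=256)
print(ex.total_input_bytes, ex.total_output_bytes)  # 3072 768
```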
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631398#comment-16631398 ] Li Yuanjian commented on SPARK-10816: - Many thanks to [~kabhwan] for notifying me; I have just linked SPARK-22565 as a duplicate of this. Sorry, I had only searched for "session window" before and missed this one; I will keep looking for other duplicate jiras. As discussed in SPARK-22565, we also hit this problem while migrating streaming apps running on other systems to Structured Streaming. We solved it by implementing the session window as a built-in function, and shipped an internal beta version based on Apache Spark 2.3.0 a week ago. After it ran stably in a real production environment, we started cleaning up the code and translating the docs. As discussed with Jungtaek, we would also like to join the discussion here and will post a PR and design doc today. The preview PR I'll submit contains others' patches. cc [~liulinhong] [~ivoson] [~yanlin-Lynn] [~LiangchangZ] , please watch this issue. > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org