[jira] [Resolved] (SPARK-25133) Documentation: AVRO data source guide

2018-08-22 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25133.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22121
[https://github.com/apache/spark/pull/22121]

> Documentation: AVRO data source guide
> 
>
> Key: SPARK-25133
> URL: https://issues.apache.org/jira/browse/SPARK-25133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25133) Documentation: AVRO data source guide

2018-08-22 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25133:
---

Assignee: Gengliang Wang

> Documentation: AVRO data source guide
> 
>
> Key: SPARK-25133
> URL: https://issues.apache.org/jira/browse/SPARK-25133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark 2.x does not support reading data from a Hive 2.x metastore

2018-08-22 Thread Leo Gallucci (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589721#comment-16589721
 ] 

Leo Gallucci commented on SPARK-18112:
--

Same issue here. It only gets resolved if I remove spark-hive_2.11-2.3.1.jar, but 
then pyspark and sparklyr stop working.

> Spark 2.x does not support reading data from a Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have been 
> available for a long time as well, but Spark still only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug 
> fixes and performance improvements, it is better, and urgent, to upgrade Spark to 
> support Hive 2.x.
> Loading data from a Hive 2.x metastore fails with:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)
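A minimal sketch of the mechanism later used for newer metastores (the documented 
spark.sql.hive.metastore.version / spark.sql.hive.metastore.jars configs); the 
version string and jar path below are hypothetical and environment-specific:
{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: point Spark at a Hive 2.x metastore via configuration.
val spark = SparkSession.builder()
  .appName("hive2-metastore-example")
  .config("spark.sql.hive.metastore.version", "2.1.1")
  .config("spark.sql.hive.metastore.jars", "/opt/hive-2.1.1/lib/*") // hypothetical path
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}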



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25121) Support multi-part column name for hint resolution

2018-08-22 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589662#comment-16589662
 ] 

Takeshi Yamamuro commented on SPARK-25121:
--

Has anybody started working on this? If not, I could take it.

> Support multi-part column name for hint resolution
> --
>
> Key: SPARK-25121
> URL: https://issues.apache.org/jira/browse/SPARK-25121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> After supporting multi-part names in 
> https://github.com/apache/spark/pull/17185, we also need to consider how to 
> resolve multi-part column names in broadcast hints. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25197) Read from java.nio.Path

2018-08-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25197.
--
Resolution: Won't Fix

I would simply convert the path to a string; we wouldn't necessarily want to add 
many overloaded APIs.
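A minimal sketch of that workaround, assuming an active {{spark}} session and a 
hypothetical local path:
{code:java}
import java.nio.file.{Path, Paths}

val somePath: Path = Paths.get("/tmp/data/events.parquet")  // hypothetical location
// Convert the java.nio.file.Path to a String before handing it to DataFrameReader
val df = spark.read.parquet(somePath.toString)
{code}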

> Read from java.nio.Path
> ---
>
> Key: SPARK-25197
> URL: https://issues.apache.org/jira/browse/SPARK-25197
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Nick Sutcliffe
>Priority: Trivial
>
> It should be possible to provide a java.nio.Path to the read methods on 
> DataFrameReader, as well as String. For example:
> {code:java}
> val somePath:java.nio.Path = ???
> spark.read.parquet(somePath)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy

2018-08-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25176:
-
Priority: Major  (was: Critical)

> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Mikhail Pryakhin
>Priority: Major
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which transitively 
> depends on com.esotericsoftware:kryo-shaded:3.0.3 via the
> com.twitter:chill_2.11:0.8.0 dependency. This exact version of the kryo 
> serializer contains an issue [1,2] which results in ClassCastExceptions being 
> thrown when serialising a parameterised type hierarchy.
> This issue has been fixed in kryo version 4.0.0 [3]. It would be great to have 
> this update in Spark as well. Could you please upgrade the com.twitter:chill_2.11 
> dependency from 0.8.0 to 0.9.2?
> You can find a simple test reproducing the issue at [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance
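For illustration only, a one-line sbt sketch of the requested bump, assuming an sbt 
build; the coordinates simply restate the versions named in the report:
{code:java}
// build.sbt -- force the newer chill, which the report says brings in kryo 4.x
dependencyOverrides += "com.twitter" %% "chill" % "0.9.2"
{code}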



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy

2018-08-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589649#comment-16589649
 ] 

Hyukjin Kwon commented on SPARK-25176:
--

(Please avoid setting Critical+ priority; it is usually reserved for committers.)

> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Mikhail Pryakhin
>Priority: Major
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which transitively 
> depends on com.esotericsoftware:kryo-shaded:3.0.3 via the
> com.twitter:chill_2.11:0.8.0 dependency. This exact version of the kryo 
> serializer contains an issue [1,2] which results in ClassCastExceptions being 
> thrown when serialising a parameterised type hierarchy.
> This issue has been fixed in kryo version 4.0.0 [3]. It would be great to have 
> this update in Spark as well. Could you please upgrade the com.twitter:chill_2.11 
> dependency from 0.8.0 to 0.9.2?
> You can find a simple test reproducing the issue at [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25132) Case-insensitive field resolution when reading from Parquet

2018-08-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25132:
-
Summary: Case-insensitive field resolution when reading from Parquet  (was: 
Case-insensitive field resolution when reading from Parquet/ORC)

> Case-insensitive field resolution when reading from Parquet
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema differ in letter case, regardless of whether spark.sql.caseSensitive is 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  
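A short sketch restating the report in the Scala API: per the description above, the 
NULLs appear with the flag in either state (table names as in the repro):
{code:java}
spark.conf.set("spark.sql.caseSensitive", "false")
spark.table("t2").show()   // all NULLs, per the report

spark.conf.set("spark.sql.caseSensitive", "true")
spark.table("t2").show()   // still all NULLs, per the report
{code}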



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25132:
-
Summary: Case-insensitive field resolution when reading from Parquet/ORC  
(was: Case-insensitive field resolution when reading from Parquet)

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema differ in letter case, regardless of whether spark.sql.caseSensitive is 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25132) Case-insensitive field resolution when reading from Parquet

2018-08-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25132:
-
Summary: Case-insensitive field resolution when reading from Parquet  (was: 
Case-insensitive field resolution when reading from Parquet/ORC)

> Case-insensitive field resolution when reading from Parquet
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema differ in letter case, regardless of whether spark.sql.caseSensitive is 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25192) Remove SupportsPushdownCatalystFilter

2018-08-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25192.
--
Resolution: Duplicate

> Remove SupportsPushdownCatalystFilter
> -
>
> Key: SPARK-25192
> URL: https://issues.apache.org/jira/browse/SPARK-25192
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> This interface was created for the file source (not yet migrated) to implement 
> Hive partition pruning. It turns out that we can also use the public 
> {{Filter}} API for Hive partition pruning, so we don't need 
> SupportsPushdownCatalystFilter at all; it exposed the internal Expression 
> class. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25205) typo in spark.network.crypto.keyFactoryIteration

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25205:


Assignee: (was: Apache Spark)

> typo in spark.network.crypto.keyFactoryIteration
> 
>
> Key: SPARK-25205
> URL: https://issues.apache.org/jira/browse/SPARK-25205
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Imran Rashid
>Priority: Trivial
>
> I happened to notice this typo: "spark.networy.crypto.keyFactoryIteration". 
> Probably nobody ever uses this conf, but it should still be fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25205) typo in spark.network.crypto.keyFactoryIteration

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589636#comment-16589636
 ] 

Apache Spark commented on SPARK-25205:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/22195

> typo in spark.network.crypto.keyFactoryIteration
> 
>
> Key: SPARK-25205
> URL: https://issues.apache.org/jira/browse/SPARK-25205
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Imran Rashid
>Priority: Trivial
>
> I happened to notice this typo: "spark.networy.crypto.keyFactoryIteration". 
> Probably nobody ever uses this conf, but it should still be fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25205) typo in spark.network.crypto.keyFactoryIteration

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25205:


Assignee: Apache Spark

> typo in spark.network.crypto.keyFactoryIteration
> 
>
> Key: SPARK-25205
> URL: https://issues.apache.org/jira/browse/SPARK-25205
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Trivial
>
> I happened to notice this typo: "spark.networy.crypto.keyFactoryIteration". 
> Probably nobody ever uses this conf, but it should still be fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25205) typo in spark.network.crypto.keyFactoryIteration

2018-08-22 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-25205:


 Summary: typo in spark.network.crypto.keyFactoryIteration
 Key: SPARK-25205
 URL: https://issues.apache.org/jira/browse/SPARK-25205
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Imran Rashid


I happened to notice this typo: "spark.networy.crypto.keyFactoryIteration". 
Probably nobody ever uses this conf, but it should still be fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)

2018-08-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25167.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22161
[https://github.com/apache/spark/pull/22161]

> Minor fixes for R sql tests (tests that fail in development environment)
> 
>
> Key: SPARK-25167
> URL: https://issues.apache.org/jira/browse/SPARK-25167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 2.4.0
>
>
> A few SQL tests for R are failing in the development environment (Mac). 
> *  The catalog API tests assume that catalog artifacts named "foo" do not exist. 
> Names such as foo and bar are common and developers use them frequently. 
> *  One test assumes that there is only one database in the system. I had more 
> than one and it caused the test to fail. I have changed that check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)

2018-08-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25167:


Assignee: Dilip Biswal

> Minor fixes for R sql tests (tests that fail in development environment)
> 
>
> Key: SPARK-25167
> URL: https://issues.apache.org/jira/browse/SPARK-25167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
>
> A few SQL tests for R are failing in the development environment (Mac). 
> *  The catalog API tests assume that catalog artifacts named "foo" do not exist. 
> Names such as foo and bar are common and developers use them frequently. 
> *  One test assumes that there is only one database in the system. I had more 
> than one and it caused the test to fail. I have changed that check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25196) Analyze column statistics in cached query

2018-08-22 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589604#comment-16589604
 ] 

Takeshi Yamamuro edited comment on SPARK-25196 at 8/23/18 2:28 AM:
---

yea, sure. I'll make a pr and do more discussion there after branch-2.4 cut. 
thanks.


was (Author: maropu):
yea, sure. So, I'll make a pr after branch-2.4 cut. thanks.

> Analyze column statistics in cached query
> -
>
> Key: SPARK-25196
> URL: https://issues.apache.org/jira/browse/SPARK-25196
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In common use cases, users read catalog table data, join/aggregate it, and then 
> cache the result for later reuse. Since column statistics can only be analyzed 
> for catalog tables via ANALYZE commands, the optimizer depends on non-existent 
> or inaccurate column statistics for cached data. So, I think it'd be nice if 
> Spark could analyze cached data and hold temporary column statistics for 
> InMemoryRelation.
> For example, we might be able to add a new API (e.g., 
> analyzeColumnCacheQuery) to do so in CacheManager;
>  POC: 
> [https://github.com/apache/spark/compare/master...maropu:AnalyzeCacheQuery]
> {code:java}
> scala> sql("SET spark.sql.cbo.enabled=true")
> scala> sql("SET spark.sql.statistics.histogram.enabled=true")
> scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS 
> c2").write.saveAsTable("t")
> scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2")
> scala> val cacheManager = spark.sharedState.cacheManager
> scala> def printColumnStats(data: org.apache.spark.sql.DataFrame) = {
>  |   data.queryExecution.optimizedPlan.stats.attributeStats.foreach {
>  | case (k, v) => println(s"[$k]: $v")
>  |   }
>  | }
> scala> def df() = spark.table("t").groupBy("c0").agg(count("c1").as("v1"), 
> sum("c2").as("v2"))
> // Prints column statistics in catalog table `t`
> scala> printColumnStats(spark.table("t"))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> [c1#7074]: 
> ColumnStat(Some(997),Some(5.958619423369615E-4),Some(0.9988009488973438),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@4ef69c53)))
> [c2#7075]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@7cbaf548)))
> // Prints column statistics on query result `df`
> scala> printColumnStats(df())
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> // Prints column statistics on cached data of `df`
> scala> printColumnStats(df().cache)
> 
> // A new API described above
> scala> cacheManager.analyzeColumnCacheQuery(df(), "v1" :: "v2" :: Nil)
>   
>   
> // Then, prints again
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> scala> cacheManager.analyzeColumnCacheQuery(df(), "c0" :: Nil)
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@626bcfc8)))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25196) Analyze column statistics in cached query

2018-08-22 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589604#comment-16589604
 ] 

Takeshi Yamamuro commented on SPARK-25196:
--

yea, sure. So, I'll make a pr after branch-2.4 cut. thanks.

> Analyze column statistics in cached query
> -
>
> Key: SPARK-25196
> URL: https://issues.apache.org/jira/browse/SPARK-25196
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In common use cases, users read catalog table data, join/aggregate it, and then 
> cache the result for later reuse. Since column statistics can only be analyzed 
> for catalog tables via ANALYZE commands, the optimizer depends on non-existent 
> or inaccurate column statistics for cached data. So, I think it'd be nice if 
> Spark could analyze cached data and hold temporary column statistics for 
> InMemoryRelation.
> For example, we might be able to add a new API (e.g., 
> analyzeColumnCacheQuery) to do so in CacheManager;
>  POC: 
> [https://github.com/apache/spark/compare/master...maropu:AnalyzeCacheQuery]
> {code:java}
> scala> sql("SET spark.sql.cbo.enabled=true")
> scala> sql("SET spark.sql.statistics.histogram.enabled=true")
> scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS 
> c2").write.saveAsTable("t")
> scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2")
> scala> val cacheManager = spark.sharedState.cacheManager
> scala> def printColumnStats(data: org.apache.spark.sql.DataFrame) = {
>  |   data.queryExecution.optimizedPlan.stats.attributeStats.foreach {
>  | case (k, v) => println(s"[$k]: $v")
>  |   }
>  | }
> scala> def df() = spark.table("t").groupBy("c0").agg(count("c1").as("v1"), 
> sum("c2").as("v2"))
> // Prints column statistics in catalog table `t`
> scala> printColumnStats(spark.table("t"))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> [c1#7074]: 
> ColumnStat(Some(997),Some(5.958619423369615E-4),Some(0.9988009488973438),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@4ef69c53)))
> [c2#7075]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@7cbaf548)))
> // Prints column statistics on query result `df`
> scala> printColumnStats(df())
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> // Prints column statistics on cached data of `df`
> scala> printColumnStats(df().cache)
> 
> // A new API described above
> scala> cacheManager.analyzeColumnCacheQuery(df(), "v1" :: "v2" :: Nil)
>   
>   
> // Then, prints again
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> scala> cacheManager.analyzeColumnCacheQuery(df(), "c0" :: Nil)
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@626bcfc8)))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25202) SQL Function Split Should Respect Limit Argument

2018-08-22 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589601#comment-16589601
 ] 

Liang-Chi Hsieh commented on SPARK-25202:
-

Let me see if I have time to do this later today.

> SQL Function Split Should Respect Limit Argument
> 
>
> Key: SPARK-25202
> URL: https://issues.apache.org/jira/browse/SPARK-25202
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Parker Hegstrom
>Priority: Minor
>
> Adds support for a {{limit}} argument in the SQL split function.
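The proposal mirrors the JVM/Presto behaviour; a short sketch of the reference 
semantics (the exact Spark signature is not settled in this ticket):
{code:java}
// java.lang.String#split(regex, limit): keep at most `limit` parts,
// leaving the rest of the string unsplit in the last element.
"a,b,c,d".split(",", 2)   // Array("a", "b,c,d")
"a,b,c,d".split(",")      // Array("a", "b", "c", "d") -- no limit, all parts
{code}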



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25202) SQL Function Split Should Respect Limit Argument

2018-08-22 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589596#comment-16589596
 ] 

Wenchen Fan commented on SPARK-25202:
-

SGTM

> SQL Function Split Should Respect Limit Argument
> 
>
> Key: SPARK-25202
> URL: https://issues.apache.org/jira/browse/SPARK-25202
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Parker Hegstrom
>Priority: Minor
>
> Adds support for a {{limit}} argument in the SQL split function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23932) High-order function: zip_with(array&lt;T&gt;, array&lt;U&gt;, function&lt;T, U, R&gt;) → array&lt;R&gt;

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589588#comment-16589588
 ] 

Apache Spark commented on SPARK-23932:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22194

> High-order function: zip_with(array&lt;T&gt;, array&lt;U&gt;, function&lt;T, U, R&gt;) → array&lt;R&gt;
> ---
>
> Key: SPARK-23932
> URL: https://issues.apache.org/jira/browse/SPARK-23932
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the two given arrays, element-wise, into a single array using 
> function. Both arrays must be the same length.
> {noformat}
> SELECT zip_with(ARRAY[1, 3, 5], ARRAY['a', 'b', 'c'], (x, y) -> (y, x)); -- 
> [ROW('a', 1), ROW('b', 3), ROW('c', 5)]
> SELECT zip_with(ARRAY[1, 2], ARRAY[3, 4], (x, y) -> x + y); -- [4, 6]
> SELECT zip_with(ARRAY['a', 'b', 'c'], ARRAY['d', 'e', 'f'], (x, y) -> 
> concat(x, y)); -- ['ad', 'be', 'cf']
> {noformat}
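Assuming the Spark implementation follows the Presto semantics quoted above (the 
function is targeted at 2.4 per this ticket), usage would look roughly like:
{code:java}
spark.sql("SELECT zip_with(array(1, 2), array(3, 4), (x, y) -> x + y)").show(false)
// expected, following the Presto example above: [4, 6]
{code}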



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25202) SQL Function Split Should Respect Limit Argument

2018-08-22 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589558#comment-16589558
 ] 

Liang-Chi Hsieh commented on SPARK-25202:
-

I saw Presto has this support. Is it worth adding this? cc [~cloud_fan]

> SQL Function Split Should Respect Limit Argument
> 
>
> Key: SPARK-25202
> URL: https://issues.apache.org/jira/browse/SPARK-25202
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Parker Hegstrom
>Priority: Minor
>
> Adds support for a {{limit}} argument in the SQL split function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25178) Directly ship the StructType objects of the keySchema / valueSchema for xxxHashMapGenerator

2018-08-22 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25178:
-
Summary: Directly ship the StructType objects of the keySchema / 
valueSchema for xxxHashMapGenerator  (was: Use dummy name for 
xxxHashMapGenerator key/value schema field)

> Directly ship the StructType objects of the keySchema / valueSchema for 
> xxxHashMapGenerator
> ---
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25198) org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not supported.

2018-08-22 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589547#comment-16589547
 ] 

Liang-Chi Hsieh commented on SPARK-25198:
-

I think the {{customSchema}} here refers to Spark's Catalyst SQL data types, not 
Postgres data types. For a Postgres json column, a string type column should be 
mapped to it. Does that not work?
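A sketch of that suggestion applied to the reporter's snippet below: declare the 
column with a Catalyst type (STRING) rather than the Postgres type name. The 
trimmed column list is a hypothetical subset:
{code:java}
// Catalyst types only in customSchema; "json" is a Postgres type, not a Spark one.
val columnTypes = "id INT, parameters STRING, title STRING"  // trimmed, hypothetical
myDataframe.write.format("jdbc")
  .option("url", "jdbc:postgresql://db/sequencing")
  .option("customSchema", columnTypes)
  .option("dbtable", "test")
  .option("user", "postgres")
  .option("password", "changeme")
  .save()
{code}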

> org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not 
> supported.
> 
>
> Key: SPARK-25198
> URL: https://issues.apache.org/jira/browse/SPARK-25198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: Ubuntu 18.04, Spark 2.3.1, 
> org.postgresql:postgresql:42.2.4
>Reporter: antonkulaga
>Priority: Major
>
> Whenever I try to save a dataframe that has a column containing a JSON string to 
> the latest Postgres, I get 
> org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not 
> supported. As Postgres supports JSON well and I use the latest postgresql 
> client, I expect this to work. Here is an example of the code that crashes:
> {code:java}
> val columnTypes = """id integer, parameters json, title text, gsm text, gse text, 
> organism text, characteristics text, molecule text, model text, description text, 
> treatment_protocol text, extract_protocol text, source_name text, data_processing 
> text, submission_date text, last_update_date text, status text, type text, 
> contact text, gpl text"""
> myDataframe.write.format("jdbc")
>   .option("url", "jdbc:postgresql://db/sequencing")
>   .option("customSchema", columnTypes)
>   .option("dbtable", "test")
>   .option("user", "postgres")
>   .option("password", "changeme")
>   .save()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25127) DataSourceV2: Remove SupportsPushDownCatalystFilters

2018-08-22 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25127.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22185
[https://github.com/apache/spark/pull/22185]

> DataSourceV2: Remove SupportsPushDownCatalystFilters
> 
>
> Key: SPARK-25127
> URL: https://issues.apache.org/jira/browse/SPARK-25127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
>Reporter: Ryan Blue
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
>
> Discussion about adding TableCatalog on the dev list focused around whether 
> Expression should be used in the public DataSourceV2 API, with 
> SupportsPushDownCatalystFilters as an example of where it is already exposed. 
> The early consensus is that Expression should not be exposed in the public 
> API.
> From [~rxin]:
> bq. I completely disagree with using Expression in critical public APIs that 
> we expect a lot of developers to use . . . If we are depending on Expressions 
> on the more common APIs in dsv2 already, we should revisit that.
> The main use of this API is to pass Expression to FileFormat classes that 
> used Expression instead of Filter. External sources also use it for more 
> complex push-down, like {{to_date(ts) = '2018-05-13'}}, but those uses can be 
> done with Analyzer rules or when translating to Filters.
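For context, a small sketch of the stable {{org.apache.spark.sql.sources.Filter}} 
objects the consensus prefers over exposing catalyst Expressions; the column names 
here are made up:
{code:java}
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan, IsNotNull}

// Public, data-source-facing filters: plain case classes, no catalyst internals.
val pushed: Array[Filter] = Array(
  IsNotNull("ts"),
  EqualTo("event_date", "2018-05-13"),
  GreaterThan("id", 100L)
)
{code}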



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25186) Stabilize Data Source V2 API

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25186:


Assignee: Apache Spark

> Stabilize Data Source V2 API 
> -
>
> Key: SPARK-25186
> URL: https://issues.apache.org/jira/browse/SPARK-25186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25186) Stabilize Data Source V2 API

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25186:


Assignee: (was: Apache Spark)

> Stabilize Data Source V2 API 
> -
>
> Key: SPARK-25186
> URL: https://issues.apache.org/jira/browse/SPARK-25186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25186) Stabilize Data Source V2 API

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589489#comment-16589489
 ] 

Apache Spark commented on SPARK-25186:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/22193

> Stabilize Data Source V2 API 
> -
>
> Key: SPARK-25186
> URL: https://issues.apache.org/jira/browse/SPARK-25186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589483#comment-16589483
 ] 

Apache Spark commented on SPARK-24918:
--

User 'NiharS' has created a pull request for this issue:
https://github.com/apache/spark/pull/22192

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation. It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, so 
> you don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your Spark app 
> itself to run a special task to "install" the plugin, which is often tough in 
> production cases where those maintaining regularly running applications might 
> not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a debugging 
> context to understand memory use, just by re-running an application with extra 
> command line arguments (as opposed to rebuilding Spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
> Does it just get created when the executor starts, and that's it? Or does it 
> get more specific events, like task start, task end, etc.? Would we ever add 
> more events? It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but it should still be avoided. We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver, as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, they are still good enough).
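Purely to make the discussion concrete, one possible shape for such a plugin; the 
names and lifecycle hooks below are hypothetical, not a merged API:
{code:java}
// Hypothetical base trait with no-op defaults, as floated in the description.
trait ExecutorPlugin {
  def init(): Unit = {}        // called once when the executor starts
  def shutdown(): Unit = {}    // called when the executor shuts down
}

// Hypothetical instrumentation plugin enabled purely via configuration.
class MemorySamplerPlugin extends ExecutorPlugin {
  override def init(): Unit = {
    // e.g. start a daemon thread that periodically samples JVM memory usage
  }
}
{code}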



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24785) Making sure REPL prints Spark UI info and then Welcome message

2018-08-22 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24785:
---

Assignee: DB Tsai

> Making sure REPL prints Spark UI info and then Welcome message
> --
>
> Key: SPARK-24785
> URL: https://issues.apache.org/jira/browse/SPARK-24785
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> After [SPARK-24418], the welcome message is printed first, and then the scala 
> prompt is shown before the Spark UI info is printed, as shown below.
> {code:java}
>  apache-spark git:(scala-2.11.12) ✗ ./bin/spark-shell 
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
>       /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> Spark context Web UI available at http://192.168.1.169:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1528180279528).
> Spark session available as 'spark'.
> scala> 
> {code}
> Although it's a minor issue, visually it doesn't look as nice as the existing 
> behavior. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24785) Making sure REPL prints Spark UI info and then Welcome message

2018-08-22 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24785.
-
Resolution: Fixed

> Making sure REPL prints Spark UI info and then Welcome message
> --
>
> Key: SPARK-24785
> URL: https://issues.apache.org/jira/browse/SPARK-24785
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> After [SPARK-24418], the welcome message is printed first, and then the scala 
> prompt is shown before the Spark UI info is printed, as shown below.
> {code:java}
>  apache-spark git:(scala-2.11.12) ✗ ./bin/spark-shell 
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
>       /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> Spark context Web UI available at http://192.168.1.169:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1528180279528).
> Spark session available as 'spark'.
> scala> 
> {code}
> Although it's a minor issue, visually it doesn't look as nice as the existing 
> behavior. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25204) rate source test is flaky

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589404#comment-16589404
 ] 

Apache Spark commented on SPARK-25204:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/22191

> rate source test is flaky
> -
>
> Key: SPARK-25204
> URL: https://issues.apache.org/jira/browse/SPARK-25204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Jose Torres
>Priority: Minor
>
> We try to restart a manually clocked rate stream in a test. This is 
> inherently race-prone, because the stream will go backwards in time (and 
> throw an assertion failure) if the clock can't be incremented before it tries 
> to schedule the first batch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25204) rate source test is flaky

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25204:


Assignee: (was: Apache Spark)

> rate source test is flaky
> -
>
> Key: SPARK-25204
> URL: https://issues.apache.org/jira/browse/SPARK-25204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Jose Torres
>Priority: Minor
>
> We try to restart a manually clocked rate stream in a test. This is 
> inherently race-prone, because the stream will go backwards in time (and 
> throw an assertion failure) if the clock can't be incremented before it tries 
> to schedule the first batch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25204) rate source test is flaky

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25204:


Assignee: Apache Spark

> rate source test is flaky
> -
>
> Key: SPARK-25204
> URL: https://issues.apache.org/jira/browse/SPARK-25204
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Minor
>
> We try to restart a manually clocked rate stream in a test. This is 
> inherently race-prone, because the stream will go backwards in time (and 
> throw an assertion failure) if the clock can't be incremented before it tries 
> to schedule the first batch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25204) rate source test is flaky

2018-08-22 Thread Jose Torres (JIRA)
Jose Torres created SPARK-25204:
---

 Summary: rate source test is flaky
 Key: SPARK-25204
 URL: https://issues.apache.org/jira/browse/SPARK-25204
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Jose Torres


We try to restart a manually clocked rate stream in a test. This is inherently 
race-prone, because the stream will go backwards in time (and throw an 
assertion failure) if the clock can't be incremented before it tries to 
schedule the first batch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25188) Add WriteConfig

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589367#comment-16589367
 ] 

Apache Spark commented on SPARK-25188:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/22190

> Add WriteConfig
> ---
>
> Key: SPARK-25188
> URL: https://issues.apache.org/jira/browse/SPARK-25188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current write API is not flexible enough to implement more complex write 
> operations like `replaceWhere`. We can follow the read API and add a 
> `WriteConfig` to make it more flexible.
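Entirely hypothetical, just to make the idea concrete (the real shape is whatever 
the linked work settles on): a write-side config object, analogous to the read 
side, able to carry something like a replaceWhere condition:
{code:java}
// Hypothetical sketch -- not the actual DataSourceV2 API.
trait WriteConfig {
  def options: Map[String, String]
}

case class ReplaceWhereWriteConfig(
    condition: String,                            // e.g. "date = '2018-08-22'"
    options: Map[String, String] = Map.empty) extends WriteConfig
{code}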



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25188) Add WriteConfig

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25188:


Assignee: (was: Apache Spark)

> Add WriteConfig
> ---
>
> Key: SPARK-25188
> URL: https://issues.apache.org/jira/browse/SPARK-25188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current write API is not flexible enough to implement more complex write 
> operations like `replaceWhere`. We can follow the read API and add a 
> `WriteConfig` to make it more flexible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25188) Add WriteConfig

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25188:


Assignee: Apache Spark

> Add WriteConfig
> ---
>
> Key: SPARK-25188
> URL: https://issues.apache.org/jira/browse/SPARK-25188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>
> The current write API is not flexible enough to implement more complex write 
> operations like `replaceWhere`. We can follow the read API and add a 
> `WriteConfig` to make it more flexible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25163) Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression

2018-08-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25163.
--
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.4.0

> Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with 
> compression
> --
>
> Key: SPARK-25163
> URL: https://issues.apache.org/jira/browse/SPARK-25163
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> I have seen it fail multiple times on Jenkins:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4813/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling_with_compression/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25203) spark sql, union all does not propagate child partitioning (when possible)

2018-08-22 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589356#comment-16589356
 ] 

Eyal Farago commented on SPARK-25203:
-

It seems I was wrong regarding the resulting distribution.

I modified the query a bit:
{code:java}
select *, spark_partition_id() as P  from (select * from t1DU distribute by c1)
-- !query 15 schema
struct&lt;c1:int,c2:string,P:int&gt;
-- !query 15 output
1   a   43
2   a   174
2   b   174
3   b   51
{code}

It's now obvious that the records are properly clustered. (I guess the original 
query computed the partition id BEFORE the repartitioning.)

> spark sql, union all does not propagate child partitioning (when possible)
> --
>
> Key: SPARK-25203
> URL: https://issues.apache.org/jira/browse/SPARK-25203
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Eyal Farago
>Priority: Major
>
> In spark-sql, union all does not propagate partitioning when all child plans 
> have the same partitioning. This causes the introduction of unnecessary 
> Exchange nodes when a parent operator requires a distribution satisfied by this 
> partitioning.
>  
> {code:java}
> CREATE OR REPLACE TEMPORARY VIEW t1 AS VALUES (1, 'a'), (2, 'b') tbl(c1, c2);
> CREATE OR REPLACE TEMPORARY VIEW t1D1 AS select c1, c2 from t1 distribute by 
> c1;
> CREATE OR REPLACE TEMPORARY VIEW t1D2 AS select c1 + 1 as c11, c2 from t1 
> distribute by c11;
> create or REPLACE TEMPORARY VIEW t1DU as
> select * from t1D1
> UNION ALL
> select * from t1D2;
> EXPLAIN select * from t1DU distribute by c1;
> == Physical Plan ==
> Exchange hashpartitioning(c1#x, 200)
> +- Union
>:- Exchange hashpartitioning(c1#x, 200)
>:  +- LocalTableScan [c1#x, c2#x]
>+- Exchange hashpartitioning(c11#x, 200)
>   +- LocalTableScan [c11#x, c2#x]
> {code}
> the Exchange introduced in the last query is unnecessary since the unioned 
> data is already partitioned by column _c1_, in fact the equivalent RDD 
> operation identifies this scenario and introduces a PartitionerAwareUnionRDD 
> which maintains children's shared partitioner.
> I suggest modifying org.apache.spark.sql.execution.UnionExec by 
> overriding _outputPartitioning_ in a way that identifies common partitioning 
> among child plans and uses that (falling back to the default implementation 
> otherwise).
> Furthermore, it seems the current implementation does not properly cluster data:
> {code:java}
> select *, spark_partition_id() as P  from t1DU distribute by c1
> -- !query 15 schema
> struct
> -- !query 15 output
> 1 a   43
> 2 a   374
> 2 b   174
> 3 b   251
> {code}
> notice _c1=2_ in partitions 174 and 374.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25119) stages in wrong order within job page DAG chart

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589352#comment-16589352
 ] 

Apache Spark commented on SPARK-25119:
--

User 'yunjzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22177

> stages in wrong order within job page DAG chart
> ---
>
> Key: SPARK-25119
> URL: https://issues.apache.org/jira/browse/SPARK-25119
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Yunjian Zhang
>Priority: Minor
> Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png
>
>
> Multiple stages for the same job are shown in the wrong order in the DAG 
> Visualization of the job page.
> e.g.
> stage27   stage19 stage20 stage24 stage21



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25119) stages in wrong order within job page DAG chart

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25119:


Assignee: (was: Apache Spark)

> stages in wrong order within job page DAG chart
> ---
>
> Key: SPARK-25119
> URL: https://issues.apache.org/jira/browse/SPARK-25119
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Yunjian Zhang
>Priority: Minor
> Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png
>
>
> Multiple stages for the same job are shown in the wrong order in the DAG 
> Visualization of the job page.
> e.g.
> stage27   stage19 stage20 stage24 stage21



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25119) stages in wrong order within job page DAG chart

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25119:


Assignee: Apache Spark

> stages in wrong order within job page DAG chart
> ---
>
> Key: SPARK-25119
> URL: https://issues.apache.org/jira/browse/SPARK-25119
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Yunjian Zhang
>Assignee: Apache Spark
>Priority: Minor
> Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png
>
>
> Multiple stages for the same job are shown in the wrong order in the DAG 
> Visualization of the job page.
> e.g.
> stage27   stage19 stage20 stage24 stage21



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25195) Extending from_json function

2018-08-22 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589335#comment-16589335
 ] 

Maxim Gekk commented on SPARK-25195:


> Problem number 1: The from_json function accepts as a schema only StructType 
> or ArrayType(StructType), but not an ArrayType of primitives.

This was fixed recently: https://github.com/apache/spark/pull/21439
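For reference, a small Scala sketch of what that change enables (this assumes Spark 2.4+ with the PR above; the column name is made up):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, StringType}

val spark = SparkSession.builder().appName("from_json-array").master("local[*]").getOrCreate()
import spark.implicits._

// A string column holding a JSON array of primitives.
val df = Seq("""["string1", "string2", null]""").toDF("data")

// With the fix above, an ArrayType of a primitive element type is accepted as the schema.
df.select(from_json($"data", ArrayType(StringType)).as("parsed")).show(false)
{code}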

> Extending from_json function
> 
>
> Key: SPARK-25195
> URL: https://issues.apache.org/jira/browse/SPARK-25195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
>   Dear Spark and PySpark maintainers,
>   I hope, that opening a JIRA issue is the correct way to request an 
> improvement. If it's not, please forgive me and kindly instruct me on how to 
> do it instead.
>   At our company, we are currently rewriting a lot of old MapReduce code with 
> SPARK, and the following use-case is quite frequent: Some string-valued 
> dataframe columns are JSON-arrays, and we want to parse them into array-typed 
> columns.
>   Problem number 1: The from_json function accepts as a schema only 
> StructType or ArrayType(StructType), but not an ArrayType of primitives. 
> Submitting the schema in a string form like 
> {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat}
>  does not work either, the error message says, among other things, 
> {noformat}data type mismatch: Input schema array must be a struct or 
> an array of structs.{noformat}
>   Problem number 2: Sometimes, in our JSON arrays we have elements of 
> different types. For example, we might have some JSON array like 
> {noformat}["string_value", 0, true, null]{noformat} which is JSON-valid with 
> schema 
> {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat}
>  (and, for instance the Python json.loads function has no problem parsing 
> this), but such a schema is not recognized, at all. The error message gets 
> quite unreadable after the words {noformat}ParseException: u'\nmismatched 
> input{noformat}
>   Here is some simple Python code to reproduce the problems (using pyspark 
> 2.3.1 and pandas 0.23.4):
>   {noformat}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> from pyspark.sql.types import StringType, ArrayType
> spark = SparkSession.builder.appName('test').getOrCreate()
> data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", 
> false, null]', '["string3", true, "another_string3"]']}
> pdf = pd.DataFrame.from_dict(data)
> df = spark.createDataFrame(pdf)
> df.show()
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> ArrayType(StringType()))) # Does not work, because not a struct or an array 
> of structs
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> 
> '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))
>  # Does not work at all
>   {noformat}
>   For now, we have to use a UDF function, which calls python's json.loads, 
> but this is, for obvious reasons, suboptimal. If you could extend the 
> functionality of the Spark from_json function in the next release, this would 
> be really helpful. Thank you in advance!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25203) spark sql, union all does not propagate child partitioning (when possible)

2018-08-22 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589332#comment-16589332
 ] 

Eyal Farago commented on SPARK-25203:
-

CC: [~hvanhovell], [~cloud_fan]

> spark sql, union all does not propagate child partitioning (when possible)
> --
>
> Key: SPARK-25203
> URL: https://issues.apache.org/jira/browse/SPARK-25203
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Eyal Farago
>Priority: Major
>
> In spark-sql, union all does not propagate partitioning when all child plans 
> have the same partitioning. This causes the introduction of unnecessary 
> Exchange nodes when a parent operator requires a distribution satisfied by this 
> partitioning.
>  
> {code:java}
> CREATE OR REPLACE TEMPORARY VIEW t1 AS VALUES (1, 'a'), (2, 'b') tbl(c1, c2);
> CREATE OR REPLACE TEMPORARY VIEW t1D1 AS select c1, c2 from t1 distribute by 
> c1;
> CREATE OR REPLACE TEMPORARY VIEW t1D2 AS select c1 + 1 as c11, c2 from t1 
> distribute by c11;
> create or REPLACE TEMPORARY VIEW t1DU as
> select * from t1D1
> UNION ALL
> select * from t1D2;
> EXPLAIN select * from t1DU distribute by c1;
> == Physical Plan ==
> Exchange hashpartitioning(c1#x, 200)
> +- Union
>:- Exchange hashpartitioning(c1#x, 200)
>:  +- LocalTableScan [c1#x, c2#x]
>+- Exchange hashpartitioning(c11#x, 200)
>   +- LocalTableScan [c11#x, c2#x]
> {code}
> the Exchange introduced in the last query is unnecessary since the unioned 
> data is already partitioned by column _c1_, in fact the equivalent RDD 
> operation identifies this scenario and introduces a PartitionerAwareUnionRDD 
> which maintains children's shared partitioner.
> I suggest modifying org.apache.spark.sql.execution.UnionExec by 
> overriding _outputPartitioning_ in a way that identifies common partitioning 
> among child plans and uses that (falling back to the default implementation 
> otherwise).
> Furthermore, it seems the current implementation does not properly cluster data:
> {code:java}
> select *, spark_partition_id() as P  from t1DU distribute by c1
> -- !query 15 schema
> struct
> -- !query 15 output
> 1 a   43
> 2 a   374
> 2 b   174
> 3 b   251
> {code}
> notice _c1=2_ in partitions 174 and 374.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25203) spark sql, union all does not propagate child partitioning (when possible)

2018-08-22 Thread Eyal Farago (JIRA)
Eyal Farago created SPARK-25203:
---

 Summary: spark sql, union all does not propagate child 
partitioning (when possible)
 Key: SPARK-25203
 URL: https://issues.apache.org/jira/browse/SPARK-25203
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 2.3.0, 2.2.0, 2.4.0
Reporter: Eyal Farago


In spark-sql, union all does not propagate partitioning when all child plans 
have the same partitioning. This causes the introduction of unnecessary Exchange 
nodes when a parent operator requires a distribution satisfied by this 
partitioning.

 
{code:java}
CREATE OR REPLACE TEMPORARY VIEW t1 AS VALUES (1, 'a'), (2, 'b') tbl(c1, c2);
CREATE OR REPLACE TEMPORARY VIEW t1D1 AS select c1, c2 from t1 distribute by c1;
CREATE OR REPLACE TEMPORARY VIEW t1D2 AS select c1 + 1 as c11, c2 from t1 
distribute by c11;

create or REPLACE TEMPORARY VIEW t1DU as
select * from t1D1
UNION ALL
select * from t1D2;

EXPLAIN select * from t1DU distribute by c1;

== Physical Plan ==
Exchange hashpartitioning(c1#x, 200)
+- Union
   :- Exchange hashpartitioning(c1#x, 200)
   :  +- LocalTableScan [c1#x, c2#x]
   +- Exchange hashpartitioning(c11#x, 200)
  +- LocalTableScan [c11#x, c2#x]
{code}

the Exchange introduced in the last query is unnecessary since the unioned data 
is already partitioned by column _c1_, in fact the equivalent RDD operation 
identifies this scenario and introduces a PartitionerAwareUnionRDD which 
maintains children's shared partitioner.

I suggest modifying org.apache.spark.sql.execution.UnionExec by overriding 
_outputPartitioning_ in a way that identifies common partitioning among child 
plans and uses that (falling back to the default implementation otherwise), as 
sketched below.
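A rough Scala sketch of the decision only (simplified; a real patch in UnionExec would also need to remap each child's output attributes onto the union's output):

{code:scala}
import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, UnknownPartitioning}

// If every child reports the same partitioning, expose it; otherwise report
// unknown, which is effectively the current behaviour.
def commonPartitioning(childPartitionings: Seq[Partitioning], numPartitions: Int): Partitioning =
  childPartitionings.distinct match {
    case Seq(single) => single
    case _           => UnknownPartitioning(numPartitions)
  }
{code}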

Furthermore, it seems the current implementation does not properly cluster data:

{code:java}
select *, spark_partition_id() as P  from t1DU distribute by c1
-- !query 15 schema
struct
-- !query 15 output
1   a   43
2   a   374
2   b   174
3   b   251
{code}

notice _c1=2_ in partitions 174 and 374.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25202) SQL Function Split Should Respect Limit Argument

2018-08-22 Thread Parker Hegstrom (JIRA)
Parker Hegstrom created SPARK-25202:
---

 Summary: SQL Function Split Should Respect Limit Argument
 Key: SPARK-25202
 URL: https://issues.apache.org/jira/browse/SPARK-25202
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Parker Hegstrom


Adds support for the setting {{limit}} in the sql split function
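Presumably the limit argument would follow java.lang.String.split(regex, limit) semantics. A hedged stop-gap sketch using a UDF until the built-in function supports it (names are made up):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}

val spark = SparkSession.builder().appName("split-with-limit").master("local[*]").getOrCreate()
import spark.implicits._

// Mirrors java.lang.String.split: "a,b,c".split(",", 2) == Array("a", "b,c")
val splitWithLimit = udf((s: String, pattern: String, limit: Int) => s.split(pattern, limit))

val df = Seq("a,b,c").toDF("value")
df.select(splitWithLimit(col("value"), lit(","), lit(2)).as("parts")).show(false)
{code}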



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25126) avoid creating OrcFile.Reader for all orc files

2018-08-22 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589326#comment-16589326
 ] 

Steve Loughran commented on SPARK-25126:


+ [~dongjoon]

> avoid creating OrcFile.Reader for all orc files
> ---
>
> Key: SPARK-25126
> URL: https://issues.apache.org/jira/browse/SPARK-25126
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Rao Fu
>Priority: Minor
>
> We have a spark job that starts by reading orc files under an S3 directory 
> and we noticed the job consumes a lot of memory when both the number of orc 
> files and the size of the file are large. The memory bloat went away with the 
> following workaround.
> 1) create a DataSet from a single orc file.
> Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);
> 2) when creating DataSet from all files under the directory, use the 
> schema from the previous DataSet.
> Dataset<Row> rows = 
> spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);
> I believe the issue is due to the fact in order to infer the schema a 
> FileReader is created for each orc file under the directory although only the 
> first one is used. The FileReader creation loads the metadata of the orc file 
> and the memory consumption is very high when there are many files under the 
> directory.
> The issue exists in both 2.0 and HEAD.
> In 2.0, OrcFileOperator.readSchema is used.
> [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95]
> In HEAD, OrcUtils.readSchema is used.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82
>  
>  
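The workaround from the description, collected into one Scala snippet (the paths are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-schema-workaround").getOrCreate()

// Placeholders for a single file and the full directory.
val oneFile = "s3://bucket/table/part-00000.orc"
val path    = "s3://bucket/table/"

// Infer the schema from one file only, then reuse it for the directory so a
// reader is not created for every ORC file just to infer the schema.
val schemaFromOneFile = spark.read.format("orc").load(oneFile).schema
val rows = spark.read.schema(schemaFromOneFile).format("orc").load(path)
{code}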



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-22 Thread Matt Sicker (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589325#comment-16589325
 ] 

Matt Sicker commented on SPARK-6305:


It could be that nobody is swapping it out for JUL, since JUL is terrible. :)

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-22 Thread Chris Martin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589311#comment-16589311
 ] 

Chris Martin commented on SPARK-6305:
-

I don't think that's such a big deal so long as Spark can have a hard 
dependency on log4j directly (as opposed to being slf4j with log4j as the 
default backend). At the moment I'm assuming the former, and it's relatively 
easy (although not as easy as with log4j 1.x) to adjust the log level as 
needed. If one needs to consider other logging backends (e.g. JUL) then all 
this gets more complicated, as the log level manipulation needs to work for 
each backend we support; and if we need to consider log4j 1.x as yet another 
backend then my approach is not going to work!

For what it's worth, this is one of the reasons why I'm a bit confused by 
logging.scala. There's some code (and a comment) in there which implies that 
users should be able to swap out for JUL if they want. I can't see how this 
will ever work though, as all the code for switching log levels assumes 
log4j loggers!

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-22 Thread Matt Sicker (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589307#comment-16589307
 ] 

Matt Sicker commented on SPARK-6305:


Right, both slf4j-api and log4j-api (the API half of Log4j2) provide only ways 
to use logging, not configure it. Any use cases you find that aren't supported 
by the Configurator class in log4j-core can always be added.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-22 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589303#comment-16589303
 ] 

Sean Owen commented on SPARK-6305:
--

Ah, Java 9 is a good point. That may force the issue.

Yes, you can accommodate some log4j 1.x uses on the classpath by slipping in the 
log4j 1 to SLF4J shim as well. 

There's no way to control log levels in SLF4J, right? That is always delegated 
to the underlying provider? I think a bit of that is unavoidable, as some stuff 
in Spark wants to control logging. Maybe it can be stripped down or 
centralized, sure.

Stuff in external/ would be treated the same way. It's meant to run with Spark. 
Some of those modules like Flume might go away in Spark 3.

I think it will have to wait for Spark 3, but it can target Spark 3.

 

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-22 Thread Chris Martin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589293#comment-16589293
 ] 

Chris Martin commented on SPARK-6305:
-

Thanks [~srowen] and [~ste...@apache.org] for the feedback.

So far my strategy has been to exclude log4j 1.x (log4j:log4j) and the log4j 
1.x slf4j binding (org.slf4j:slf4j-log4j12) from transitive dependencies. In 
their place I'm adding the log4j-1.2-api bridge, which should provide the 
log4j 1.x classes they expect with the output redirected to log4j2. 
Hopefully this should avoid the stacktrace issue that Steve mentions, but that 
would depend on whether any of the dependencies are doing anything funky. 

The only problems I foresee with this are:

1) There's a bunch of stuff going on in logging.scala to do with lifecycle 
management and potential use of JUL, and I'm genuinely unsure what it's 
trying to achieve. I might have to ask on the developer mailing list to find 
out what's going on here, but if anyone here understands then do let me know. 
From what I've seen there's no need to shade any of this, but it's perfectly 
possible I might be missing something.

2) I'm less familiar with the projects in external/ and I'm not entirely sure 
under what environments they should run. I'm going to leave these until the 
end, when hopefully I'll understand this a bit more!

3) As has been mentioned, if and when we decide to move to log4j2, everyone's 
existing properties files will need to change (and from what I've read on the 
log4j JIRA, they will never have perfect backwards compatibility). For now I'm 
just seeing if we can make Spark use log4j2.

 

thanks,

 

Chris

 

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection

2018-08-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589288#comment-16589288
 ] 

Wes McKinney commented on SPARK-21375:
--

It seems there might be some requirements that need to be propagated upstream to 
Arrow. If so, please create a follow-on JIRA, thanks!

> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---
>
> Key: SPARK-21375
> URL: https://issues.apache.org/jira/browse/SPARK-21375
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.3.0
>
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using 
> ArrowConverters.  These are common types for data analysis used in both Spark 
> and Pandas and should be supported.
> There is a discrepancy with the way that PySpark and Arrow store timestamps, 
> without timezone specified, internally.  PySpark takes a UTC timestamp that 
> is adjusted to local time and Arrow is in UTC time.  Hopefully there is a 
> clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days
> * *Timestamp* stored as microseconds 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25162) Kubernetes 'in-cluster' client mode and value of spark.driver.host

2018-08-22 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589286#comment-16589286
 ] 

Yinan Li commented on SPARK-25162:
--

> Where the driver is running _outside-cluster client_ mode,  would you 
>recommend a default behavior of deriving the IP address of the host on which 
>the driver  is running (provided that IP address is routable from inside the 
>cluster) and giving the user the option to override and supply a FQDN or 
>routable IP address for the driver?

The philosophy behind the client mode in the Kubernetes deployment mode is to 
not be opinionated on how users set up network connectivity from the executors 
to the driver. So it's really up to the users to decide what's the best way to 
provide such connectivity. Please check out 
https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md#client-mode.

> Kubernetes 'in-cluster' client mode and value of spark.driver.host
> --
>
> Key: SPARK-25162
> URL: https://issues.apache.org/jira/browse/SPARK-25162
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: A java program, deployed to kubernetes, that establishes 
> a Spark Context in client mode. 
> Not using spark-submit.
> Kubernetes 1.10
> AWS EKS
>  
>  
>Reporter: James Carter
>Priority: Minor
>
> When creating Kubernetes scheduler 'in-cluster' using client mode, the value 
> for spark.driver.host can be derived from the IP address of the driver pod.
> I observed that the value of _spark.driver.host_ defaulted to the value of 
> _spark.kubernetes.driver.pod.name_, which is not a valid hostname.  This 
> caused the executors to fail to establish a connection back to the driver.
> As a work around, in my configuration I pass the driver's pod name _and_ the 
> driver's ip address to ensure that executors can establish a connection with 
> the driver.
> _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: 
> metadata.name
> _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp
> e.g.
> Deployment:
> {noformat}
> env:
> - name: DRIVER_POD_NAME
>   valueFrom:
> fieldRef:
>   fieldPath: metadata.name
> - name: DRIVER_POD_IP
>   valueFrom:
> fieldRef:
>   fieldPath: status.podIP
> {noformat}
>  
> Application Properties:
> {noformat}
> config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME}
> config[spark.driver.host]: ${DRIVER_POD_IP}
> {noformat}
>  
> BasicExecutorFeatureStep.scala:
> {code:java}
> private val driverUrl = RpcEndpointAddress(
> kubernetesConf.get("spark.driver.host"),
> kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT),
> CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
> {code}
>  
> Ideally only _spark.kubernetes.driver.pod.name_ would need be provided in 
> this deployment scenario.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25194) Kubernetes - Define cpu and memory limit to init container

2018-08-22 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589283#comment-16589283
 ] 

Yinan Li commented on SPARK-25194:
--

The upcoming Spark 2.4 gets rid of the init-container and switches to running 
{{spark-submit}} in client mode in the driver to download remote dependencies. 
Given that 2.4 is coming soon, I would suggest waiting for it and using it 
instead. 

> Kubernetes - Define cpu and memory limit to init container
> --
>
> Key: SPARK-25194
> URL: https://issues.apache.org/jira/browse/SPARK-25194
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Daniel Majano
>Priority: Major
>  Labels: features
>
> Hi,
>  
> Recently I have started to work with Spark under Kubernetes. We have resource 
> quotas on all our Kubernetes clusters, so to deploy anything you need to 
> define container cpu and memory limits.
>  
> For the driver and executors this is fine, because the spark-submit properties 
> let you define these limits. But today, for one of my projects, I needed to 
> load an external dependency. I defined the dependency with --jars and an https 
> link, so the init container is added; there is no way to define limits for it, 
> and the submission failed because the pod with driver + init container could 
> not be started.
>  
>  
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"

2018-08-22 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-25184.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22182
[https://github.com/apache/spark/pull/22182]

> Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
> ---
>
> Key: SPARK-25184
> URL: https://issues.apache.org/jira/browse/SPARK-25184
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code}
> Assert on query failed: Check total state rows = List(1), updated state rows 
> = List(2): Array() did not equal List(1) incorrect total rows, recent 
> progresses:
> {
>   "id" : "3598002e-0120-4937-8a36-226e0af992b6",
>   "runId" : "e7efe911-72fb-48aa-ba35-775057eabe55",
>   "name" : null,
>   "timestamp" : "1970-01-01T00:00:12.000Z",
>   "batchId" : 3,
>   "numInputRows" : 0,
>   "durationMs" : {
> "getEndOffset" : 0,
> "setOffsetRange" : 0,
> "triggerExecution" : 0
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : "MemoryStream[value#474622]",
> "startOffset" : 2,
> "endOffset" : 2,
> "numInputRows" : 0
>   } ],
>   "sink" : {
> "description" : "MemorySink"
>   }
> }
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:55)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:33)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$executeAction$1$11.apply$mcZ$sp(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:428)
>   
> org.apache.spark.sql.streaming.StreamTest$class.executeAction$1(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:775)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:762)
> == Progress ==
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: a
>AdvanceManualClock(1000)
>CheckNewAnswer: [a,1]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1000)
>CheckNewAnswer: [b,1]
>AssertOnQuery(, Check total state rows = List(2), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1)
>CheckNewAnswer: [a,-1],[b,2]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>StopStream
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: c
>AdvanceManualClock(11000)
>CheckNewAnswer: [b,-1],[c,1]
> => AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>AdvanceManualClock(12000)
>AssertOnQuery(, )
>AssertOnQuery(, name)
>CheckNewAnswer: [c,-1]
>AssertOnQuery(, Check total state rows = List(0), updated state 
> rows = List(0))
> == Stream ==
> Output Mode: Update
> Stream state: {MemoryStream[value#474622]: 3}
> Thread state: alive
> Thread stack trace: java.lang.Object.wait(Native Method)
> org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:61)
> org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:65)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:166)
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:293)
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203)
> == Sink ==
> 0: [a,1]
> 1: [b,1]
> 2: [a,-1] [b,2]
> 3: [b,-1] [c,1]
> == Plan ==
> == Parsed Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) 
> AS 

[jira] [Assigned] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"

2018-08-22 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-25184:
-

Assignee: Tathagata Das

> Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
> ---
>
> Key: SPARK-25184
> URL: https://issues.apache.org/jira/browse/SPARK-25184
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code}
> Assert on query failed: Check total state rows = List(1), updated state rows 
> = List(2): Array() did not equal List(1) incorrect total rows, recent 
> progresses:
> {
>   "id" : "3598002e-0120-4937-8a36-226e0af992b6",
>   "runId" : "e7efe911-72fb-48aa-ba35-775057eabe55",
>   "name" : null,
>   "timestamp" : "1970-01-01T00:00:12.000Z",
>   "batchId" : 3,
>   "numInputRows" : 0,
>   "durationMs" : {
> "getEndOffset" : 0,
> "setOffsetRange" : 0,
> "triggerExecution" : 0
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : "MemoryStream[value#474622]",
> "startOffset" : 2,
> "endOffset" : 2,
> "numInputRows" : 0
>   } ],
>   "sink" : {
> "description" : "MemorySink"
>   }
> }
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:55)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:33)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$executeAction$1$11.apply$mcZ$sp(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:428)
>   
> org.apache.spark.sql.streaming.StreamTest$class.executeAction$1(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:775)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:762)
> == Progress ==
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: a
>AdvanceManualClock(1000)
>CheckNewAnswer: [a,1]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1000)
>CheckNewAnswer: [b,1]
>AssertOnQuery(, Check total state rows = List(2), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1)
>CheckNewAnswer: [a,-1],[b,2]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>StopStream
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: c
>AdvanceManualClock(11000)
>CheckNewAnswer: [b,-1],[c,1]
> => AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>AdvanceManualClock(12000)
>AssertOnQuery(, )
>AssertOnQuery(, name)
>CheckNewAnswer: [c,-1]
>AssertOnQuery(, Check total state rows = List(0), updated state 
> rows = List(0))
> == Stream ==
> Output Mode: Update
> Stream state: {MemoryStream[value#474622]: 3}
> Thread state: alive
> Thread stack trace: java.lang.Object.wait(Native Method)
> org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:61)
> org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:65)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:166)
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:293)
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203)
> == Sink ==
> 0: [a,1]
> 1: [b,1]
> 2: [a,-1] [b,2]
> 3: [b,-1] [c,1]
> == Plan ==
> == Parsed Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) 
> AS _1#474630, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
> StringType, fromString, 

[jira] [Created] (SPARK-25201) Synchronization performed on AtomicReference in LevelDB class

2018-08-22 Thread Ted Yu (JIRA)
Ted Yu created SPARK-25201:
--

 Summary: Synchronization performed on AtomicReference in LevelDB 
class
 Key: SPARK-25201
 URL: https://issues.apache.org/jira/browse/SPARK-25201
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.1
Reporter: Ted Yu


Here is related code:
{code}
  void closeIterator(LevelDBIterator<?> it) throws IOException {
synchronized (this._db) {
{code}
There is more than one occurrence of such usage.
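For illustration only (this is not Spark's code), the pattern being flagged looks roughly like the sketch below: the monitor is the AtomicReference wrapper itself, which works as long as every caller locks the same reference, but it is easy to misread and analysis tools tend to flag it; a dedicated lock object makes the intent explicit.

{code:scala}
import java.util.concurrent.atomic.AtomicReference

class DbHolder[T <: AnyRef](initial: T) {
  private val _db = new AtomicReference[T](initial)
  private val lock = new Object

  // Pattern being flagged: the lock is the AtomicReference object, not the
  // value it currently holds.
  def withDbSuspect[R](f: T => R): R = _db.synchronized { f(_db.get()) }

  // Clearer alternative: a dedicated monitor whose purpose is obvious.
  def withDb[R](f: T => R): R = lock.synchronized { f(_db.get()) }
}
{code}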



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-22 Thread Matt Sicker (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589255#comment-16589255
 ] 

Matt Sicker commented on SPARK-6305:


Unless you're willing to patch Log4j 1.x and maintain a fork, it doesn't work 
in Java 9 and beyond. That will certainly cause more problems than the 
alternative.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25199) InferSchema "all Strings" if one of many CSVs is empty

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25199:


Assignee: Apache Spark

> InferSchema "all Strings" if one of many CSVs is empty
> --
>
> Key: SPARK-25199
> URL: https://issues.apache.org/jira/browse/SPARK-25199
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
> Environment: I discovered this on AWS Glue, which uses Spark 2.2.1
>Reporter: Neil McGuigan
>Assignee: Apache Spark
>Priority: Minor
>  Labels: newbie
>
> Spark can load multiple CSV files in one read:
> df = spark.read.format("csv").option("header", "true").option("inferSchema", 
> "true").load("/*.csv")
> However, if one of these files is empty (though it has a header), Spark will 
> set all column types to "String"
> Spark should skip a file for inference if it contains no (non-header) rows
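A possible workaround until inference skips header-only files: pass the schema explicitly, so an empty file cannot pull every column down to string (the field names and types below are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("csv-explicit-schema").getOrCreate()

val schema = StructType(Seq(
  StructField("id", IntegerType),    // placeholder fields
  StructField("name", StringType)
))

// Supplying the schema bypasses inference entirely.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .schema(schema)
  .load("/*.csv")
{code}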



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25199) InferSchema "all Strings" if one of many CSVs is empty

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25199:


Assignee: (was: Apache Spark)

> InferSchema "all Strings" if one of many CSVs is empty
> --
>
> Key: SPARK-25199
> URL: https://issues.apache.org/jira/browse/SPARK-25199
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
> Environment: I discovered this on AWS Glue, which uses Spark 2.2.1
>Reporter: Neil McGuigan
>Priority: Minor
>  Labels: newbie
>
> Spark can load multiple CSV files in one read:
> df = spark.read.format("csv").option("header", "true").option("inferSchema", 
> "true").load("/*.csv")
> However, if one of these files is empty (though it has a header), Spark will 
> set all column types to "String"
> Spark should skip a file for inference if it contains no (non-header) rows



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25199) InferSchema "all Strings" if one of many CSVs is empty

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589252#comment-16589252
 ] 

Apache Spark commented on SPARK-25199:
--

User 'yunjzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22177

> InferSchema "all Strings" if one of many CSVs is empty
> --
>
> Key: SPARK-25199
> URL: https://issues.apache.org/jira/browse/SPARK-25199
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
> Environment: I discovered this on AWS Glue, which uses Spark 2.2.1
>Reporter: Neil McGuigan
>Priority: Minor
>  Labels: newbie
>
> Spark can load multiple CSV files in one read:
> df = spark.read.format("csv").option("header", "true").option("inferSchema", 
> "true").load("/*.csv")
> However, if one of these files is empty (though it has a header), Spark will 
> set all column types to "String"
> Spark should skip a file for inference if it contains no (non-header) rows



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-22 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589249#comment-16589249
 ] 

Marcelo Vanzin commented on SPARK-24918:


Unless he gives you push access to his repo, that's really the only option you 
have.

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation. It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how its versioned. 
>  Does it just get created when the executor starts, and thats it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, its still good enough).
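To make the shape of the proposal concrete, one possible minimal form (hypothetical, not the interface under review): an object instantiated once per executor, with lifecycle hooks that default to no-ops so new events could be added later without breaking existing plugins.

{code:scala}
// Hypothetical sketch only.
trait ExecutorPluginSketch {
  def init(): Unit = {}        // called once when the executor starts
  def shutdown(): Unit = {}    // called when the executor shuts down
}

// Example: a plugin that could be enabled purely via command-line/config,
// without touching the application code.
class MemoryMonitorPlugin extends ExecutorPluginSketch {
  override def init(): Unit = println("starting memory monitoring thread")
  override def shutdown(): Unit = println("stopping memory monitoring thread")
}
{code}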



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-22 Thread Nihar Sheth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589248#comment-16589248
 ] 

Nihar Sheth commented on SPARK-24918:
-

[~irashid] has asked me to add testing to his PR. I'm not sure what the 
standard procedure is, can I just open another PR with his changes and the 
tests?

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation. It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how its versioned. 
>  Does it just get created when the executor starts, and thats it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, its still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25200) Allow setting HADOOP_CONF_DIR as a spark property

2018-08-22 Thread Adam Balogh (JIRA)
Adam Balogh created SPARK-25200:
---

 Summary: Allow setting HADOOP_CONF_DIR as a spark property
 Key: SPARK-25200
 URL: https://issues.apache.org/jira/browse/SPARK-25200
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Adam Balogh


When submitting applications to Yarn in cluster mode, using the 
InProcessLauncher, spark finds the cluster's configuration files based on the 
HADOOP_CONF_DIR environment variable. This makes it impossible to submit 
to more than one Yarn cluster concurrently using the InProcessLauncher.

I think we should make it possible to define HADOOP_CONF_DIR as a spark 
property, so it can be different for each spark submission.
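To illustrate the request (the property name below is hypothetical; nothing like it exists today): the environment variable is process-wide, whereas a per-launch property could differ for each submission.

{code:scala}
import org.apache.spark.launcher.InProcessLauncher

// Today the YARN client reads the cluster configuration from the process-wide
// HADOOP_CONF_DIR environment variable. A per-launch property would let two
// concurrent submissions point at different clusters, e.g.:
val handle = new InProcessLauncher()
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setAppResource("hdfs:///apps/app.jar")
  .setConf("spark.yarn.hadoopConfDir", "/etc/cluster-a/hadoop/conf") // hypothetical key
  .startApplication()
{code}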



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25164:


Assignee: Apache Spark

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Minor
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
> List<String[]> paths = this.getPaths(0);
> List<ColumnDescriptor> columns = new 
> ArrayList<ColumnDescriptor>(paths.size());
> for (String[] path : paths) {
>   // TODO: optimize this  
>   
>   PrimitiveType primitiveType = getType(path).asPrimitiveType();
>   columns.add(new ColumnDescriptor(
>   path,
>   primitiveType,
>   getMaxRepetitionLevel(path),
>   getMaxDefinitionLevel(path)));
> }
> return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for each 
> column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25164:


Assignee: (was: Apache Spark)

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
> List<String[]> paths = this.getPaths(0);
> List<ColumnDescriptor> columns = new 
> ArrayList<ColumnDescriptor>(paths.size());
> for (String[] path : paths) {
>   // TODO: optimize this  
>   
>   PrimitiveType primitiveType = getType(path).asPrimitiveType();
>   columns.add(new ColumnDescriptor(
>   path,
>   primitiveType,
>   getMaxRepetitionLevel(path),
>   getMaxDefinitionLevel(path)));
> }
> return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for each 
> column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589228#comment-16589228
 ] 

Apache Spark commented on SPARK-25164:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/22188

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>           path,
>           primitiveType,
>           getMaxRepetitionLevel(path),
>           getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of time by not rebuilding the column list for each 
> column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589220#comment-16589220
 ] 

Apache Spark commented on SPARK-25178:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22187

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were not being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25178:


Assignee: (was: Apache Spark)

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were not being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25178:


Assignee: Apache Spark

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were not being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25199) InferSchema "all Strings" if one of many CSVs is empty

2018-08-22 Thread Neil McGuigan (JIRA)
Neil McGuigan created SPARK-25199:
-

 Summary: InferSchema "all Strings" if one of many CSVs is empty
 Key: SPARK-25199
 URL: https://issues.apache.org/jira/browse/SPARK-25199
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.2.1
 Environment: I discovered this on AWS Glue, which uses Spark 2.2.1
Reporter: Neil McGuigan


Spark can load multiple CSV files in one read:

df = spark.read.format("csv").option("header", "true").option("inferSchema", 
"true").load("/*.csv")

However, if one of these files is empty (though it has a header), Spark will 
set all column types to "String"

Spark should skip a file for inference if it contains no (non-header) rows
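
For reference, a self-contained Scala version of the reported setup is sketched below; the directory path and app name are illustrative, and the directory is assumed to contain several CSVs where at least one file holds only a header row.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("infer-schema-empty-csv")  // illustrative name
  .master("local[*]")
  .getOrCreate()

// /data/csv/ is a hypothetical directory with one populated CSV and one
// header-only CSV sharing the same header.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/csv/*.csv")

// Reported behavior: the header-only file makes inference fall back to
// string for every column.
df.printSchema()
{code}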



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24442) Add configuration parameter to adjust the numbers of records and the characters per row before truncation when a user runs.show()

2018-08-22 Thread Andrew K Long (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew K Long updated SPARK-24442:
--
Summary: Add configuration parameter to adjust the numbers of records and 
the characters per row before truncation when a user runs.show()  (was: Add 
configuration parameter to adjust the numbers of records and the charters per 
row before truncation when a user runs.show())

> Add configuration parameter to adjust the numbers of records and the 
> characters per row before truncation when a user runs.show()
> -
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.
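
For context, a short Scala sketch of the behavior under discussion follows; the {{spark.conf.set}} lines show hypothetical usage of the proposed parameters, which do not exist in Spark today.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("show-defaults-sketch")
  .master("local[*]")
  .getOrCreate()

// A frame with a column wide enough to be truncated by the default settings.
val df = spark.range(100).selectExpr("id", "repeat('x', 60) AS wide_col")

df.show()                      // hard-coded defaults: 20 rows, cells truncated to 20 characters
df.show(50, truncate = false)  // existing per-call workaround: more rows, no truncation

// Hypothetical usage of the parameters proposed in this ticket:
spark.conf.set("spark.show.default.number.of.rows", "50")
spark.conf.set("spark.show.default.truncate.characters.per.column", "100")
{code}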



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing

2018-08-22 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589177#comment-16589177
 ] 

Bryan Cutler commented on SPARK-25147:
--

Works for me on linux with:
Python 3.6.6
pyarrow 0.10.0
pandas 0.23.4
numpy 1.14.3

Maybe only on MacOS?

> GroupedData.apply pandas_udf crashing
> -
>
> Key: SPARK-25147
> URL: https://issues.apache.org/jira/browse/SPARK-25147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: OS: Mac OS 10.13.6
> Python: 2.7.15, 3.6.6
> PyArrow: 0.10.0
> Pandas: 0.23.4
> Numpy: 1.15.0
>Reporter: Mike Sukmanowsky
>Priority: Major
>
> Running the following example taken straight from the docs results in 
> {{org.apache.spark.SparkException: Python worker exited unexpectedly 
> (crashed)}} for reasons that aren't clear from any logs I can see:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> spark = (
> SparkSession
> .builder
> .appName("pandas_udf")
> .getOrCreate()
> )
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v")
> )
> @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
> v = pdf.v
> return pdf.assign(v=(v - v.mean()) / v.std())
> (
> df
> .groupby("id")
> .apply(normalize)
> .show()
> )
> {code}
>  See output.log for 
> [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28].
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25181) Block Manager master and slave thread pools are unbounded

2018-08-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25181.
--
   Resolution: Fixed
 Assignee: Mukul Murthy
Fix Version/s: 2.4.0

> Block Manager master and slave thread pools are unbounded
> -
>
> Key: SPARK-25181
> URL: https://issues.apache.org/jira/browse/SPARK-25181
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have 
> thread pools with unbounded numbers of threads. In certain cases, this can 
> lead to driver OOM errors. We should add an upper bound on the number of 
> threads in these thread pools; this should not break any existing behavior 
> because they still have queues of size Integer.MAX_VALUE.
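
For illustration, a hedged Scala sketch of the kind of bound described above, written with plain java.util.concurrent since Spark's internal thread-pool helpers are not public; the cap of 100 threads is an arbitrary example, not the value from the actual patch.

{code:scala}
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Cap the pool at 100 threads while keeping the work queue unbounded
// (LinkedBlockingQueue defaults to Integer.MAX_VALUE capacity), so callers
// are never rejected but the thread count can no longer grow without limit.
val maxThreads = 100
val askThreadPool = new ThreadPoolExecutor(
  maxThreads, maxThreads,             // core == max: at most 100 threads
  60L, TimeUnit.SECONDS,              // idle threads are reclaimed after 60s
  new LinkedBlockingQueue[Runnable])
askThreadPool.allowCoreThreadTimeOut(true)
{code}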



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-22 Thread Alexander (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589151#comment-16589151
 ] 

Alexander commented on SPARK-7768:
--

Spark Maintainers, [~viirya], [~jodersky], what do you think?

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25183) Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; race conditions can arise

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25183:


Assignee: (was: Apache Spark)

> Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; 
> race conditions can arise
> --
>
> Key: SPARK-25183
> URL: https://issues.apache.org/jira/browse/SPARK-25183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Spark's HiveServer2 registers a shutdown hook with the JVM 
> {{Runtime.addShutdownHook()}} which can happen in parallel with the 
> ShutdownHookManager sequence of spark & Hadoop, which operate the shutdowns 
> in an ordered sequence.
> This has some risks
> * FS shutdown before rename of logs completes, SPARK-6933
> * Delays of rename on object stores may block FS close operation, which, on 
> clusters with timeouts hooks (HADOOP-12950) of FileSystem.closeAll() can 
> force a kill of that shutdown hook and other problems.
> General outcome: logs aren't present.
> Proposed fix (a minimal sketch follows below):
> * register hook with {{org.apache.spark.util.ShutdownHookManager}}
> * HADOOP-15679 to make shutdown wait time configurable, so O(data) renames 
> don't trigger timeouts.
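
A minimal sketch of the first proposed bullet, assuming the hook body only needs to stop HiveServer2; the priority value is illustrative and {{ShutdownHookManager}} is Spark-internal, so this shows the shape rather than a drop-in change.

{code:scala}
import org.apache.spark.util.ShutdownHookManager

// Register through Spark's ordered ShutdownHookManager instead of
// Runtime.addShutdownHook, so the server stops (and event logs are renamed)
// before the Hadoop FileSystem shutdown hooks run.
val hookRef = ShutdownHookManager.addShutdownHook(50) { () =>
  // hiveServer2.stop() would go here (hypothetical handle)
  ()
}
{code}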



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25183) Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; race conditions can arise

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589149#comment-16589149
 ] 

Apache Spark commented on SPARK-25183:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/22186

> Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; 
> race conditions can arise
> --
>
> Key: SPARK-25183
> URL: https://issues.apache.org/jira/browse/SPARK-25183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Spark's HiveServer2 registers a shutdown hook with the JVM 
> {{Runtime.addShutdownHook()}} which can happen in parallel with the 
> ShutdownHookManager sequence of spark & Hadoop, which operate the shutdowns 
> in an ordered sequence.
> This has some risks
> * FS shutdown before rename of logs completes, SPARK-6933
> * Delays of rename on object stores may block FS close operation, which, on 
> clusters with timeouts hooks (HADOOP-12950) of FileSystem.closeAll() can 
> force a kill of that shutdown hook and other problems.
> General outcome: logs aren't present.
> Proposed fix: 
> * register hook with {{org.apache.spark.util.ShutdownHookManager}}
> * HADOOP-15679 to make shutdown wait time configurable, so O(data) renames 
> don't trigger timeouts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25183) Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; race conditions can arise

2018-08-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25183:


Assignee: Apache Spark

> Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; 
> race conditions can arise
> --
>
> Key: SPARK-25183
> URL: https://issues.apache.org/jira/browse/SPARK-25183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Minor
>
> Spark's HiveServer2 registers a shutdown hook with the JVM 
> {{Runtime.addShutdownHook()}} which can happen in parallel with the 
> ShutdownHookManager sequence of spark & Hadoop, which operate the shutdowns 
> in an ordered sequence.
> This has some risks
> * FS shutdown before rename of logs completes, SPARK-6933
> * Delays of rename on object stores may block FS close operation, which, on 
> clusters with timeouts hooks (HADOOP-12950) of FileSystem.closeAll() can 
> force a kill of that shutdown hook and other problems.
> General outcome: logs aren't present.
> Proposed fix: 
> * register hook with {{org.apache.spark.util.ShutdownHookManager}}
> * HADOOP-15679 to make shutdown wait time configurable, so O(data) renames 
> don't trigger timeouts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-22 Thread Alexander (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589146#comment-16589146
 ] 

Alexander commented on SPARK-7768:
--

Thanks [~eje] and [~metasim] for the awesome examples! I haven't even 
considered writing custom struct-types for my UDTs yet but this is completely 
doable! This UDTRegistration API is not friendly enough to be used by Spark 
beginners (having to use InternalRow!) but there's absolutely no reason it 
should still be private! TDigest and Tile are a bit complex for introductory 
Scala users to understand so maybe it makes sense to put together a simpler 
example. I made an SO post talking about very basic usage of UDTRegistration 
here: 
https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-dataset/51957666#51957666

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7768) Make user-defined type (UDT) API public

2018-08-22 Thread Alexander (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589146#comment-16589146
 ] 

Alexander edited comment on SPARK-7768 at 8/22/18 5:24 PM:
---

Thanks [~eje] and [~metasim] for the awesome examples! I haven't even 
considered writing custom struct-types for my UDTs yet but this is completely 
doable! This UDTRegistration API is not friendly enough to be used by Spark 
beginners (having to use InternalRow!) but there's absolutely no reason it 
should still be private! TDigest and Tile are a bit complex for introductory 
Scala users to understand so maybe it makes sense to put together a simpler 
example. I made an SO post talking about very basic usage of UDTRegistration 
here: 

[https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-dataset/51957666#51957666]


was (Author: alexanderioffe):
Thanks [~eje] and [~metasim] for the awesome examples! I haven't even 
considered writing custom struct-types for my UDTs yet but this is completely 
doable! This UDTRegistration API is not friendly enough to be used by Spark 
beginners (having to use InternalRow!) but there's absolutely no reason it 
should still be private! TDigest and Tile are a bit complex for introductory 
Scala users to understand so maybe it makes sense to put together a simpler 
example. I made an SO post talking about very basic usage of UDTRegistration 
here: 
https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-dataset/51957666#51957666

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23698) Spark code contains numerous undefined names in Python 3

2018-08-22 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589143#comment-16589143
 ] 

Bryan Cutler commented on SPARK-23698:
--

Followup resolved by pull request 20838
https://github.com/apache/spark/pull/20838


> Spark code contains numerous undefined names in Python 3
> 
>
> Key: SPARK-23698
> URL: https://issues.apache.org/jira/browse/SPARK-23698
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: cclauss
>Assignee: cclauss
>Priority: Minor
> Fix For: 2.4.0
>
>
> flake8 testing of https://github.com/apache/spark on Python 3.6.3
> $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source 
> --statistics*
> ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
> result = raw_input("\n%s (y/n): " % prompt)
>  ^
> ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
> primary_author = raw_input(
>  ^
> ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
> pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
>^
> ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
> jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
>   ^
> ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
> fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
> default_fix_versions)
>^
> ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
> raw_assignee = raw_input(
>^
> ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
> pr_num = raw_input("Which pull request would you like to merge? (e.g. 
> 34): ")
>  ^
> ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
> result = raw_input("Would you like to use the modified title? (y/n): 
> ")
>  ^
> ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
> while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
>   ^
> ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
> response = raw_input("%s [y/n]: " % msg)
>^
> ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
> author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
>  ^
> ./python/setup.py:37:11: F821 undefined name '__version__'
> VERSION = __version__
>   ^
> ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
> dispatch[buffer] = save_buffer
>  ^
> ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
> dispatch[file] = save_file
>  ^
> ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> ^
> ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
> intlike = (int, long)
> ^
> ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
> return self._sc._jvm.Time(long(timestamp * 1000))
>   ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 
> undefined name 'xrange'
> for i in xrange(50):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 
> undefined name 'xrange'
> for j in xrange(5):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 
> undefined name 'xrange'
> for k in xrange(20022):
>  ^
> 20    F821 undefined name 'raw_input'
> 20



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-22 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25105.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22100
[https://github.com/apache/spark/pull/22100]

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: kevin yu
>Priority: Trivial
> Fix For: 2.4.0
>
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "", line 1, in 
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-22 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-25105:


Assignee: kevin yu

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: kevin yu
>Priority: Trivial
> Fix For: 2.4.0
>
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "", line 1, in 
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25198) org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not supported.

2018-08-22 Thread antonkulaga (JIRA)
antonkulaga created SPARK-25198:
---

 Summary: org.apache.spark.sql.catalyst.parser.ParseException: 
DataType json is not supported.
 Key: SPARK-25198
 URL: https://issues.apache.org/jira/browse/SPARK-25198
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
 Environment: Ubuntu 18.04, Spark 2.3.1, 
org.postgresql:postgresql:42.2.4
Reporter: antonkulaga


Whenever I try to save a dataframe that has a column containing a JSON string 
to the latest Postgres, I get 
org.apache.spark.sql.catalyst.parser.ParseException: DataType json is not 
supported. As Postgres supports JSON well and I use the latest postgresql 
client, I expect this to work. Here is an example of the code that crashes:

val columnTypes = """id integer, parameters json, title text, gsm text, gse 
text, organism text, characteristics text, molecule text, model text, 
description text, treatment_protocol text, extract_protocol text, source_name 
text,data_processing text, submission_date text,last_update_date text, status 
text, type text, contact text, gpl text"""

myDataframe.write.format("jdbc").option("url", 
"jdbc:postgresql://db/sequencing").option("customSchema", 
columnTypes).option("dbtable", "test").option("user", 
"postgres").option("password", "changeme").save()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25188) Add WriteConfig

2018-08-22 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589088#comment-16589088
 ] 

Ryan Blue commented on SPARK-25188:
---

One update to that proposal: {{BatchOverwriteSupport}} should be split into two 
interfaces: one for dynamic overwrite and one for overwrite using filter 
expressions. That supports the two use cases separately, since some sources 
won't support dynamic partition overwrite.
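
To make the shape concrete, a hypothetical Scala sketch of that split is below; the trait and method names are illustrative, not the proposed API.

{code:scala}
import org.apache.spark.sql.sources.Filter

trait WriteConfig  // stands in for the per-write configuration object

// Filter-based overwrite, e.g. replacing the rows where day = '2018-08-15'.
trait SupportsOverwriteByFilter {
  def newOverwrite(writeOptions: java.util.Map[String, String], filters: Array[Filter]): WriteConfig
}

// Dynamic-partition overwrite: replace whatever partitions the write touches.
trait SupportsDynamicPartitionOverwrite {
  def newDynamicOverwrite(writeOptions: java.util.Map[String, String]): WriteConfig
}
{code}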

> Add WriteConfig
> ---
>
> Key: SPARK-25188
> URL: https://issues.apache.org/jira/browse/SPARK-25188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current write API is not flexible enough to implement more complex write 
> operations like `replaceWhere`. We can follow the read API and add a 
> `WriteConfig` to make it more flexible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25188) Add WriteConfig

2018-08-22 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589086#comment-16589086
 ] 

Ryan Blue commented on SPARK-25188:
---

Here's the original proposal for adding a write config:

The read side has a {{ScanConfig}}, but the write side doesn't have an 
equivalent object that tracks a particular write. I think if we introduce one, 
the API would be more similar between the read and write sides, and we would 
have a better API for overwrite operations. I propose adding a {{WriteConfig}} 
object and passing it like this:

{code:lang=java}
interface BatchWriteSupport {
  WriteConfig newWriteConfig(writeOptions: Map[String, String])

  DataWriterFactory createWriterFactory(WriteConfig)

  void commit(WriteConfig, WriterCommitMessage[])
}
{code}

That allows us to pass options for the write that affect how the WriterFactory 
operates. For example, in Iceberg I could request using Orc as the underlying 
format instead of Parquet. (I also suggested an addition like this for the read 
side.)

The other benefit of adding {{WriteConfig}} is that it provides a clean way of 
adding the ReplaceData operations. The ones I'm currently working on are 
ReplaceDynamicPartitions and ReplaceData. The first one removes any data in 
partitions that are being written to, and the second one replaces data based on 
a filter: e.g. {{df.writeTo(t).overwrite($"day" == "2018-08-15")}}. The 
information about replacement could be carried by {{WriteConfig}} to {{commit}} 
and would be created with a support interface:

{code:lang=java}
interface BatchOverwriteSupport extends BatchWriteSupport {
  WriteConfig newOverwrite(writeOptions, filters: Filter[])

  WriteConfig newDynamicOverwrite(writeOptions)
}
{code}

This is much cleaner than staging a delete and then running a write to complete 
the operation. All of the information about what to overwrite is just passed to 
the commit operation that can handle it at once. This is much better for 
dynamic partition replacement because the partitions to be replaced aren't even 
known by Spark before the write.

Last, this adds a place for write life-cycle operations that matches the 
ScanConfig read life-cycle. This could be used to perform operations like 
getting a write lock on a Hive table if someone wanted to support Hive's 
locking mechanism in the future.

> Add WriteConfig
> ---
>
> Key: SPARK-25188
> URL: https://issues.apache.org/jira/browse/SPARK-25188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current write API is not flexible enough to implement more complex write 
> operations like `replaceWhere`. We can follow the read API and add a 
> `WriteConfig` to make it more flexible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25127) DataSourceV2: Remove SupportsPushDownCatalystFilters

2018-08-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589084#comment-16589084
 ] 

Apache Spark commented on SPARK-25127:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22185

> DataSourceV2: Remove SupportsPushDownCatalystFilters
> 
>
> Key: SPARK-25127
> URL: https://issues.apache.org/jira/browse/SPARK-25127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.4.0
>Reporter: Ryan Blue
>Assignee: Reynold Xin
>Priority: Major
>
> Discussion about adding TableCatalog on the dev list focused around whether 
> Expression should be used in the public DataSourceV2 API, with 
> SupportsPushDownCatalystFilters as an example of where it is already exposed. 
> The early consensus is that Expression should not be exposed in the public 
> API.
> From [~rxin]:
> bq. I completely disagree with using Expression in critical public APIs that 
> we expect a lot of developers to use . . . If we are depending on Expressions 
> on the more common APIs in dsv2 already, we should revisit that.
> The main use of this API is to pass Expression to FileFormat classes that 
> used Expression instead of Filter. External sources also use it for more 
> complex push-down, like {{to_date(ts) = '2018-05-13'}}, but those uses can be 
> done with Analyzer rules or when translating to Filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25190) Better operator pushdown API

2018-08-22 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589070#comment-16589070
 ] 

Ryan Blue commented on SPARK-25190:
---

The main problem I have with the current pushdown API is that Spark gets 
information back from pushdown before all pushdown is finished. I like the idea 
to have an immutable ScanConfig that is the result of pushdown operations, so 
it is clear that pushdown calls are finished before getting a reader factory or 
asking for statistics.

Unlike those methods that accept a ScanConfig, using {{SupportsPushDownXYZ}} 
for pushdown means that Spark is getting the results of pushdown operations 
back before all pushdown is complete. That means that the same confusion over 
pushdown order still exists, although the problem is fixed for some operations. 
I think that all feedback from pushdown should come from the ScanConfig.
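
A hypothetical Scala sketch of that shape (names are illustrative, not the final DataSourceV2 API): all pushdown happens on the builder, and Spark reads the results only from the immutable config it produces.

{code:scala}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

trait ScanConfig {                      // immutable result of all pushdown
  def readSchema(): StructType          // schema after column pruning
  def pushedFilters(): Array[Filter]    // filters the source accepted
}

trait ScanConfigBuilder {
  def pushFilters(filters: Array[Filter]): ScanConfigBuilder
  def pruneColumns(requiredSchema: StructType): ScanConfigBuilder
  def build(): ScanConfig               // called once pushdown is finished
}
{code}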

> Better operator pushdown API
> 
>
> Key: SPARK-25190
> URL: https://issues.apache.org/jira/browse/SPARK-25190
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current operator pushdown API is a little hard to use. It defines several 
> {{SupportsPushdownXYZ}} interfaces and asks the implementation to be mutable 
> to store the pushdown result. We should design a builder-like API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25187) Revisit the life cycle of ReadSupport instances.

2018-08-22 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589060#comment-16589060
 ] 

Ryan Blue edited comment on SPARK-25187 at 8/22/18 3:58 PM:


The need for {{newScanConfigBuilder}} to take key-value options doesn't require 
a change to the life-cycle of {{ReadSupport}} instances. There are options that 
are related to scan configuration, and not to source configuration. If data 
sources are free to reuse {{ReadSupport}} instances, then scan options must be 
passed to configure the scan.

HBase provides a good example of the difference. HBase table options would 
include where the data lives, like the HBase host to connect to. HBase scan 
options would include the MVCC timestamp to request for a scan. A HBase 
ReadSupport can be reused, which means that the MVCC timestamp used should be 
one passed to the scan, not the one passed when creating the {{ReadSupport}}.

I understand that this is a little confusing because right now both sets of 
options are mixed together when using 
{{spark.read.format("fmt").option(...).load()}}. The only way to set both types 
of options is to pass them to the {{DataFrameReader}}. That makes it appear 
that there is only one set of options for a source. But, consider sources that 
are stored in the session catalog. Those sources are stored with 
source/table configuration, the {{OPTIONS}} passed in when creating the table. 
When reading these tables, we can also pass options to the {{DataFrameReader}}, 
which need to be passed when creating a scan of those sources.


was (Author: rdblue):
The need for {{newScanConfigBuilder}} to take key-value options doesn't require 
a change to the life-cycle of {{ReadSupport}} instances. There are options that 
are related to scan configuration, and not to source configuration. If data 
sources are free to reuse {{ReadSupport}} instances, then scan options must be 
passed to configure the scan.

HBase provides a good example of the difference. HBase table options would 
include where the data lives, like the HBase host to connect to. HBase scan 
options would include the MVCC timestamp to request for a scan. A HBase 
ReadSupport can be reused, which means that the MVCC timestamp used should be 
one passed to the scan, not the one passed when creating the {{ReadSupport}}.

I understand that this is a little confusing because right now both sets of 
options are mixed together when using 
{{spark.read.format("fmt").option(...).load()}}. The only way to set these 
options is to pass them to the {{DataFrameReader}}. That makes it appear that 
there is only one set of options for a source. But, consider sources that are 
stored in the the session catalog. Those sources are stored with source/table 
configuration, the {{OPTIONS}} passed in when creating the table. When reading 
these tables, we can also pass options to the {{DataFrameReader}}, which need 
to be passed when creating a scan of those sources.

> Revisit the life cycle of ReadSupport instances.
> 
>
> Key: SPARK-25187
> URL: https://issues.apache.org/jira/browse/SPARK-25187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently the life cycle is bound to the batch/stream query. This fits 
> streaming very well but may not be perfect for batch sources. We can also 
> consider letting {{ReadSupport.newScanConfigBuilder}} take 
> {{DataSourceOptions}} as a parameter, if we decide to change the life cycle.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25187) Revisit the life cycle of ReadSupport instances.

2018-08-22 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589060#comment-16589060
 ] 

Ryan Blue edited comment on SPARK-25187 at 8/22/18 3:57 PM:


The need for {{newScanConfigBuilder}} to take key-value options doesn't require 
a change to the life-cycle of {{ReadSupport}} instances. There are options that 
are related to scan configuration, and not to source configuration. If data 
sources are free to reuse {{ReadSupport}} instances, then scan options must be 
passed to configure the scan.

HBase provides a good example of the difference. HBase table options would 
include where the data lives, like the HBase host to connect to. HBase scan 
options would include the MVCC timestamp to request for a scan. A HBase 
ReadSupport can be reused, which means that the MVCC timestamp used should be 
one passed to the scan, not the one passed when creating the {{ReadSupport}}.

I understand that this is a little confusing because right now both sets of 
options are mixed together when using 
{{spark.read.format("fmt").option(...).load()}}. The only way to set these 
options is to pass them to the {{DataFrameReader}}. That makes it appear that 
there is only one set of options for a source. But, consider sources that are 
stored in the session catalog. Those sources are stored with source/table 
configuration, the {{OPTIONS}} passed in when creating the table. When reading 
these tables, we can also pass options to the {{DataFrameReader}}, which need 
to be passed when creating a scan of those sources.


was (Author: rdblue):
The need for {{newScanConfigBuilder}} to take key-value options doesn't require 
a change to the life-cycle of {{ReadSupport}} instances. There are options that 
are related to scan configuration, and not to source configuration. If data 
sources are free to reuse {{ReadSupport}} instances, then scan options must be 
passed to configure the scan.

HBase provides a good example of the difference. HBase table options would 
include where the data lives, like the HBase host to connect to. HBase scan 
options would include the MVCC timestamp to request for a scan. A HBase 
ReadSupport can be reused, which means that the MVCC timestamp used should be 
one passed to the scan, not the one passed when creating the {{ReadSupport}}.

I understand that this is a little confusing because right now both sets of 
options are mixed together. The only way to set these options is to pass them 
to the {{DataFrameReader}}. That makes it appear that there is only one set of 
options for a source. But, consider sources that are stored in the session 
catalog. Those sources are stored with source/table configuration, the 
{{OPTIONS}} passed in when creating the table. When reading these tables, we 
can also pass options to the {{DataFrameReader}}, which need to be passed when 
creating a scan of those sources.

> Revisit the life cycle of ReadSupport instances.
> 
>
> Key: SPARK-25187
> URL: https://issues.apache.org/jira/browse/SPARK-25187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently the life cycle is bound to the batch/stream query. This fits 
> streaming very well but may not be perfect for batch sources. We can also 
> consider letting {{ReadSupport.newScanConfigBuilder}} take 
> {{DataSourceOptions}} as a parameter, if we decide to change the life cycle.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25187) Revisit the life cycle of ReadSupport instances.

2018-08-22 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589060#comment-16589060
 ] 

Ryan Blue commented on SPARK-25187:
---

The need for {{newScanConfigBuilder}} to take key-value options doesn't require 
a change to the life-cycle of {{ReadSupport}} instances. There are options that 
are related to scan configuration, and not to source configuration. If data 
sources are free to reuse {{ReadSupport}} instances, then scan options must be 
passed to configure the scan.

HBase provides a good example of the difference. HBase table options would 
include where the data lives, like the HBase host to connect to. HBase scan 
options would include the MVCC timestamp to request for a scan. A HBase 
ReadSupport can be reused, which means that the MVCC timestamp used should be 
one passed to the scan, not the one passed when creating the {{ReadSupport}}.

I understand that this is a little confusing because right now both sets of 
options are mixed together. The only way to set these options is to pass them 
to the {{DataFrameReader}}. That makes it appear that there is only one set of 
options for a source. But, consider sources that are stored in the session 
catalog. Those sources are stored with source/table configuration, the 
{{OPTIONS}} passed in when creating the table. When reading these tables, we 
can also pass options to the {{DataFrameReader}}, which need to be passed when 
creating a scan of those sources.
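
As a purely illustrative Scala sketch of that distinction (the trait names are made up): table-level options construct the ReadSupport once, while per-scan options arrive with each scan.

{code:scala}
import org.apache.spark.sql.sources.v2.DataSourceOptions

trait ScanConfigBuilder  // stands in for the immutable scan description

trait ReusableReadSupport {
  // The instance was constructed from table-level options (e.g. the HBase
  // host); per-scan options (e.g. an MVCC timestamp) are passed here for
  // every scan so the instance itself can be safely reused.
  def newScanConfigBuilder(scanOptions: DataSourceOptions): ScanConfigBuilder
}
{code}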

> Revisit the life cycle of ReadSupport instances.
> 
>
> Key: SPARK-25187
> URL: https://issues.apache.org/jira/browse/SPARK-25187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently the life cycle is bound to the batch/stream query. This fits 
> streaming very well but may not be perfect for batch sources. We can also 
> consider letting {{ReadSupport.newScanConfigBuilder}} take 
> {{DataSourceOptions}} as a parameter, if we decide to change the life cycle.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25196) Analyze column statistics in cached query

2018-08-22 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589050#comment-16589050
 ] 

Reynold Xin commented on SPARK-25196:
-

Can we rework the interface so the two are not separate code path?

 

> Analyze column statistics in cached query
> -
>
> Key: SPARK-25196
> URL: https://issues.apache.org/jira/browse/SPARK-25196
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In common use cases, users read catalog table data, join/aggregate them, and 
> then cache the result for later reuse. Since we are only allowed to 
> analyze column statistics in catalog tables via ANALYZE commands, the 
> optimization depends on non-existing or inaccurate column statistics of 
> cached data. So, I think it'd be nice if Spark could analyze cached data and 
> hold temporary column statistics for InMemoryRelation.
> For example, we might be able to add a new API (e.g., 
> analyzeColumnCacheQuery) to do so in CacheManager;
>  POC: 
> [https://github.com/apache/spark/compare/master...maropu:AnalyzeCacheQuery]
> {code:java}
> scala> sql("SET spark.sql.cbo.enabled=true")
> scala> sql("SET spark.sql.statistics.histogram.enabled=true")
> scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS 
> c2").write.saveAsTable("t")
> scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2")
> scala> val cacheManager = spark.sharedState.cacheManager
> scala> def printColumnStats(data: org.apache.spark.sql.DataFrame) = {
>  |   data.queryExecution.optimizedPlan.stats.attributeStats.foreach {
>  | case (k, v) => println(s"[$k]: $v")
>  |   }
>  | }
> scala> def df() = spark.table("t").groupBy("c0").agg(count("c1").as("v1"), 
> sum("c2").as("v2"))
> // Prints column statistics in catalog table `t`
> scala> printColumnStats(spark.table("t"))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> [c1#7074]: 
> ColumnStat(Some(997),Some(5.958619423369615E-4),Some(0.9988009488973438),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@4ef69c53)))
> [c2#7075]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@7cbaf548)))
> // Prints column statistics on query result `df`
> scala> printColumnStats(df())
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> // Prints column statistics on cached data of `df`
> scala> printColumnStats(df().cache)
> 
> // A new API described above
> scala> cacheManager.analyzeColumnCacheQuery(df(), "v1" :: "v2" :: Nil)
>   
>   
> // Then, prints again
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> scala> cacheManager.analyzeColumnCacheQuery(df(), "c0" :: Nil)
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@626bcfc8)))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing

2018-08-22 Thread Mike Sukmanowsky (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589027#comment-16589027
 ] 

Mike Sukmanowsky commented on SPARK-25147:
--

[~hyukjin.kwon] should I take any other action here? I wasn't sure how else to 
debug the issue unless I dig into Spark internals.

> GroupedData.apply pandas_udf crashing
> -
>
> Key: SPARK-25147
> URL: https://issues.apache.org/jira/browse/SPARK-25147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: OS: Mac OS 10.13.6
> Python: 2.7.15, 3.6.6
> PyArrow: 0.10.0
> Pandas: 0.23.4
> Numpy: 1.15.0
>Reporter: Mike Sukmanowsky
>Priority: Major
>
> Running the following example taken straight from the docs results in 
> {{org.apache.spark.SparkException: Python worker exited unexpectedly 
> (crashed)}} for reasons that aren't clear from any logs I can see:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> spark = (
> SparkSession
> .builder
> .appName("pandas_udf")
> .getOrCreate()
> )
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v")
> )
> @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
> v = pdf.v
> return pdf.assign(v=(v - v.mean()) / v.std())
> (
> df
> .groupby("id")
> .apply(normalize)
> .show()
> )
> {code}
>  See output.log for 
> [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28].
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-22 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589018#comment-16589018
 ] 

Xiao Li commented on SPARK-23874:
-

[~bryanc] Could you list some examples that can affect our Spark users?

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-22 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589015#comment-16589015
 ] 

Xiao Li commented on SPARK-25178:
-

Thanks!

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes changing the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In the earlier discussion on SPARK-18952's PR [1], it was already suggested 
> that the field names are not actually used, so it is not worth capturing the 
> name strings as references here. Josh suggested merging the original fix 
> as-is due to backport / cherry-pick concerns. Now that we're coming up on a 
> new release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25196) Analyze column statistics in cached query

2018-08-22 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589013#comment-16589013
 ] 

Xiao Li commented on SPARK-25196:
-

This sounds reasonable to me. 

> Analyze column statistics in cached query
> -
>
> Key: SPARK-25196
> URL: https://issues.apache.org/jira/browse/SPARK-25196
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In common use cases, users read catalog table data, join/aggregate it, and 
> then cache the result for later reuse. Since column statistics can only be 
> analyzed for catalog tables via ANALYZE commands, the optimizer has to rely 
> on missing or inaccurate column statistics for the cached data. So, I think 
> it'd be nice if Spark could analyze cached data and hold temporary column 
> statistics for InMemoryRelation.
> For example, we might be able to add a new API (e.g., 
> analyzeColumnCacheQuery) to do so in CacheManager;
>  POC: 
> [https://github.com/apache/spark/compare/master...maropu:AnalyzeCacheQuery]
> {code:java}
> scala> sql("SET spark.sql.cbo.enabled=true")
> scala> sql("SET spark.sql.statistics.histogram.enabled=true")
> scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS 
> c2").write.saveAsTable("t")
> scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2")
> scala> val cacheManager = spark.sharedState.cacheManager
> scala> def printColumnStats(data: org.apache.spark.sql.DataFrame) = {
>  |   data.queryExecution.optimizedPlan.stats.attributeStats.foreach {
>  | case (k, v) => println(s"[$k]: $v")
>  |   }
>  | }
> scala> def df() = spark.table("t").groupBy("c0").agg(count("c1").as("v1"), 
> sum("c2").as("v2"))
> // Prints column statistics in catalog table `t`
> scala> printColumnStats(spark.table("t"))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> [c1#7074]: 
> ColumnStat(Some(997),Some(5.958619423369615E-4),Some(0.9988009488973438),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@4ef69c53)))
> [c2#7075]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@7cbaf548)))
> // Prints column statistics on query result `df`
> scala> printColumnStats(df())
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> // Prints column statistics on cached data of `df`
> scala> printColumnStats(df().cache)
> 
> // A new API described above
> scala> cacheManager.analyzeColumnCacheQuery(df(), "v1" :: "v2" :: Nil)
>   
>   
> // Then, prints again
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> scala> cacheManager.analyzeColumnCacheQuery(df(), "c0" :: Nil)
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@626bcfc8)))
> {code}
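
For what it's worth, the same gap is visible from PySpark. The sketch below uses 
only existing 2.3.x APIs; the final comment marks the step that the proposed 
analyzeColumnCacheQuery would fill (that API exists only in the POC branch above, 
not in any released Spark):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cached_stats_gap").getOrCreate()
spark.sql("SET spark.sql.cbo.enabled=true")

# Catalog table: ANALYZE ... FOR COLUMNS works here, so the optimizer gets
# per-column statistics for `t`.
(spark.range(1000)
    .selectExpr("id % 33 AS c0", "rand() AS c1")
    .write.mode("overwrite").saveAsTable("t"))
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1")

# Derived, cached result: this InMemoryRelation is what later queries actually
# read, but no ANALYZE-style command targets it, so plans built on top of `agg`
# see no column statistics for `v1`.
agg = spark.table("t").groupBy("c0").agg(F.count("c1").alias("v1")).cache()
agg.count()  # materialize the cache

# <-- this is where the proposed CacheManager.analyzeColumnCacheQuery would run
{code}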



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25196) Analyze column statistics in cached query

2018-08-22 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589014#comment-16589014
 ] 

Xiao Li commented on SPARK-25196:
-

cc [~rxin]

> Analyze column statistics in cached query
> -
>
> Key: SPARK-25196
> URL: https://issues.apache.org/jira/browse/SPARK-25196
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In common use cases, users read catalog table data, join/aggregate it, and 
> then cache the result for later reuse. Since column statistics can only be 
> analyzed for catalog tables via ANALYZE commands, the optimizer has to rely 
> on missing or inaccurate column statistics for the cached data. So, I think 
> it'd be nice if Spark could analyze cached data and hold temporary column 
> statistics for InMemoryRelation.
> For example, we might be able to add a new API (e.g., 
> analyzeColumnCacheQuery) to do so in CacheManager;
>  POC: 
> [https://github.com/apache/spark/compare/master...maropu:AnalyzeCacheQuery]
> {code:java}
> scala> sql("SET spark.sql.cbo.enabled=true")
> scala> sql("SET spark.sql.statistics.histogram.enabled=true")
> scala> spark.range(1000).selectExpr("id % 33 AS c0", "rand() AS c1", "0 AS 
> c2").write.saveAsTable("t")
> scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c0, c1, c2")
> scala> val cacheManager = spark.sharedState.cacheManager
> scala> def printColumnStats(data: org.apache.spark.sql.DataFrame) = {
>  |   data.queryExecution.optimizedPlan.stats.attributeStats.foreach {
>  | case (k, v) => println(s"[$k]: $v")
>  |   }
>  | }
> scala> def df() = spark.table("t").groupBy("c0").agg(count("c1").as("v1"), 
> sum("c2").as("v2"))
> // Prints column statistics in catalog table `t`
> scala> printColumnStats(spark.table("t"))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> [c1#7074]: 
> ColumnStat(Some(997),Some(5.958619423369615E-4),Some(0.9988009488973438),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@4ef69c53)))
> [c2#7075]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(4),Some(4),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@7cbaf548)))
> // Prints column statistics on query result `df`
> scala> printColumnStats(df())
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(3.937007874015748,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@209c0be5)))
> // Prints column statistics on cached data of `df`
> scala> printColumnStats(df().cache)
> 
> // A new API described above
> scala> cacheManager.analyzeColumnCacheQuery(df(), "v1" :: "v2" :: Nil)
>   
>   
> // Then, prints again
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> scala> cacheManager.analyzeColumnCacheQuery(df(), "c0" :: Nil)
> scala> printColumnStats(df())
> [v1#7101L]: 
> ColumnStat(Some(2),Some(30),Some(31),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@e2ff893)))
> [v2#7103L]: 
> ColumnStat(Some(1),Some(0),Some(0),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@1498a4d)))
> [c0#7073L]: 
> ColumnStat(Some(33),Some(0),Some(32),Some(0),Some(8),Some(8),Some(Histogram(0.12992125984251968,[Lorg.apache.spark.sql.catalyst.plans.logical.HistogramBin;@626bcfc8)))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


