[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593200#comment-16593200
 ] 

Chenxiao Mao commented on SPARK-25175:
--

Also, here is a similar investigation I did for Parquet tables, just for your 
information: https://github.com/apache/spark/pull/22184/files#r212405373

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found that ORC files have similar issues. Since Spark has 
> two OrcFileFormat implementations, we should add support for both.
> * Since SPARK-2883, Spark has supported ORC inside the sql/hive module with a 
> Hive dependency. This Hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution; however, it cannot 
> handle duplicate fields (a sketch follows below).
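
To illustrate the two cases above, here is a minimal spark-shell sketch (the path and table name orc_dup are illustrative assumptions, not from the report) of the problematic layout: an ORC file whose schema contains fields that differ only by letter case, read through a table declared with a single lower-case column.
{code:java}
// Write an ORC file with fields "c" and "C" (case-sensitive analysis enabled,
// mirroring the investigation below, so the duplicate-by-case columns are allowed).
spark.conf.set("spark.sql.caseSensitive", true)
spark.range(5).selectExpr("id AS c", "id * 2 AS C")
  .write.format("orc").save("/tmp/orc_dup_case")

// Reading via the native (sql/core) OrcFileFormat exercises case-insensitive
// resolution in the presence of duplicate fields; the hive (sql/hive)
// OrcFileFormat exercises the path with no case-insensitive resolution at all.
spark.conf.set("spark.sql.caseSensitive", false)
sql("CREATE TABLE orc_dup (c LONG) USING orc LOCATION '/tmp/orc_dup_case'")
sql("SELECT c FROM orc_dup").show()
{code}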






[jira] [Reopened] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao reopened SPARK-25175:
--

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.






[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593194#comment-16593194
 ] 

Chenxiao Mao commented on SPARK-25175:
--

[~dongjoon] [~yucai] Here is a brief summary. We can see that:
 * The data source tables always return a,B,c, no matter whether 
spark.sql.caseSensitive is set to true or false, and no matter whether the 
metastore table schema is in lower case or upper case. Since the ORC file schema 
is (a,B,c,C):
 ** Would it be better to return null in scenarios 2 and 10?
 ** Would it be better to return C in scenario 12?
 ** Would it be better to fail due to ambiguity in scenarios 15, 18, 21, and 24, 
rather than always returning the lower-case one? (See the sketch after this list.)
 * The hive serde tables always throw IndexOutOfBoundsException at runtime.
 * Since, in case-sensitive mode, analysis should fail when the column name in 
the query and in the metastore schema differ in letter case, all of the 
AnalysisExceptions meet our expectation.
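
A minimal, illustrative sketch of the ambiguity (assuming the orc_data_source_* tables created in the investigation comment below; the behavior shown is what that table records):
{code:java}
// With case-insensitive analysis, both queries resolve to the lower-case ORC
// field "c", even though the file also contains "C" (scenarios 15 and 24).
spark.conf.set("spark.sql.caseSensitive", false)
sql("SELECT c FROM orc_data_source_lower").show()
sql("SELECT C FROM orc_data_source_upper").show()
{code}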

Stack trace of the IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
 

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports 

[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593185#comment-16593185
 ] 

Chenxiao Mao commented on SPARK-25175:
--

A thorough investigation of ORC tables:
{code}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct<a:bigint,B:bigint,c:bigint,C:bigint>

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column (select via data source table)||orc column (select via hive serde table)||
|1|true|a, b, c|a|a |IndexOutOfBoundsException |
|2| | |b|B |IndexOutOfBoundsException |
|3| | |c|c |IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|
|8| | |b|AnalysisException |AnalysisException |
|9| | |c|AnalysisException |AnalysisException |
|10| | |A|a |IndexOutOfBoundsException |
|11| | |B|B |IndexOutOfBoundsException |
|12| | |C|c |IndexOutOfBoundsException |
|13|false|a, b, c|a|a |IndexOutOfBoundsException |
|14| | |b|B |IndexOutOfBoundsException |
|15| | |c|c |IndexOutOfBoundsException |
|16| | |A|a |IndexOutOfBoundsException |
|17| | |B|B |IndexOutOfBoundsException |
|18| | |C|c |IndexOutOfBoundsException |
|19| |A, B, C|a|a |IndexOutOfBoundsException |
|20| | |b|B |IndexOutOfBoundsException |
|21| | |c|c |IndexOutOfBoundsException |
|22| | |A|a |IndexOutOfBoundsException |
|23| | |B|B |IndexOutOfBoundsException |
|24| | |C|c |IndexOutOfBoundsException |
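
A minimal sketch (assuming the tables and session from the code block above, with spark.sql.hive.convertMetastoreOrc=false) of one failing hive-serde row, e.g. scenario 13:
{code:java}
spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
spark.conf.set("spark.sql.caseSensitive", false)
// Reading through the hive serde table fails at runtime, as recorded above:
// java.lang.IndexOutOfBoundsException: toIndex = 4
sql("SELECT a FROM orc_hive_serde_lower").show()
{code}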

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.






[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, the query below returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After a deep dive, we found two issues, both related to the Hive metastore schema 
and the Parquet schema using different letter cases.

1. The wrong column is pushed down.

Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet 
is case sensitive; the file actually has {color:#ff}id{color}).
So no records are returned.

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
schema to do the pushdown, which fixes this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even when spark.sql.caseSensitive is set to false.

SPARK-25132 has already addressed this issue.
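
A minimal spark-shell sketch (the extra table name t2 is an illustrative assumption) of how to confirm the mismatch and get correct results today by matching the file's letter case:
{code:java}
// The schema inferred from the files shows the physical, lower-case name "id",
// unlike the metastore table "t" above, which declares "ID".
spark.read.parquet("/tmp/data").printSchema()

// Workaround until both fixes are in place: declare the table with the same
// letter case as the Parquet files, so both the pushed-down filter and the
// column resolution refer to a field that actually exists.
sql("CREATE TABLE t2 (id LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t2 where id > 0").show()
{code}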

 

The biggest difference is that, in Spark 2.1, the user gets an exception for the 
same query:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
So they will notice the issue and fix the query.

But in Spark 2.3, the user gets the wrong results silently.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

 

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter cases 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

 

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 addressed this issue already.

 

The biggest difference is, in Spark 2.1, user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
So they will know the issue and fix the query.

But in Spark 2.3, user will get the wrong results sliently.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter 
> cases between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in 

[jira] [Commented] (SPARK-25248) Audit barrier APIs for Spark 2.4

2018-08-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593176#comment-16593176
 ] 

Apache Spark commented on SPARK-25248:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/22240

> Audit barrier APIs for Spark 2.4
> 
>
> Key: SPARK-25248
> URL: https://issues.apache.org/jira/browse/SPARK-25248
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> Make a pass over APIs added for barrier execution mode.






[jira] [Created] (SPARK-25248) Audit barrier APIs for Spark 2.4

2018-08-26 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25248:
-

 Summary: Audit barrier APIs for Spark 2.4
 Key: SPARK-25248
 URL: https://issues.apache.org/jira/browse/SPARK-25248
 Project: Spark
  Issue Type: Story
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Make a pass over APIs added for barrier execution mode.






[jira] [Created] (SPARK-25247) Make RDDBarrier configurable

2018-08-26 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25247:
-

 Summary: Make RDDBarrier configurable
 Key: SPARK-25247
 URL: https://issues.apache.org/jira/browse/SPARK-25247
 Project: Spark
  Issue Type: Story
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


Currently we only offer one method under `RDDBarrier`. Users might want better 
control over a barrier stage, e.g., timeout behavior, failure recovery, etc. This 
Jira is to discuss what options we should provide under RDDBarrier.

 

Note: users can use multiple RDDBarriers in a single barrier stage, so we also 
need to discuss how to merge the options.
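
For context, a minimal sketch (assuming the Spark 2.4 barrier API as it currently stands; values are illustrative) of the single entry point offered today, which any new options would configure:
{code:java}
import org.apache.spark.BarrierTaskContext

// rdd.barrier() returns an RDDBarrier, which currently exposes only mapPartitions.
val rdd = sc.parallelize(1 to 100, 4)
val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()        // all tasks of the barrier stage synchronize here
  iter.map(_ * 2)
}
doubled.count()
{code}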






[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593156#comment-16593156
 ] 

Dongjoon Hyun commented on SPARK-25175:
---

Thanks, [~yucai]. I'm highly interested in this case. I'll wait for his 
reopening. :)

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.






[jira] [Resolved] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25175.
---
Resolution: Cannot Reproduce

I followed the same direction given by SPARK-25132, but I cannot reproduce this 
in Spark 2.3.1.

{code}
scala> spark.version
res8: String = 2.3.1

scala> 
spark.range(5).toDF.write.mode("overwrite").format("orc").saveAsTable("t3")

scala> sql("create table t4 (`ID` BIGINT) USING orc LOCATION 
'/Users/dongjoon/spark-release/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t3'")

scala> sql("select * from t3").show
+---+
| id|
+---+
|  2|
|  3|
|  4|
|  1|
|  0|
+---+

scala> sql("select * from t4").show
+---+
| ID|
+---+
|  2|
|  3|
|  4|
|  1|
|  0|
+---+
{code}

Please reopen this with a reproducible example. Thanks, [~seancxmao].

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.






[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593152#comment-16593152
 ] 

yucai commented on SPARK-25175:
---

I pinged [~seancxmao] offline; he will give more details.

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.






[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: wrong records are returned when Hive metastore schema and parquet 
schema are in different letter cases  (was: data issue when Hive metastore 
schema and parquet schema are in different letter cases)

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter 
> cases between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even spark.sql.caseSensitive 
> set to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is, in Spark 2.1, user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know the issue and fix the query.
> But in Spark 2.3, user will get the wrong results sliently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter cases 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

 

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 addressed this issue already.

 

The biggest difference is, in Spark 2.1, user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
So they will know the issue and fix the query.

But in Spark 2.3, user will get the wrong results sliently.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter case 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

 

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 solved this issue.

 

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue when Hive metastore schema and parquet schema are in different 
> letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter 
> cases between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even spark.sql.caseSensitive 
> set to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest 

[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter case 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

 

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 solved this issue.

 

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter case 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 solved this issue.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue when Hive metastore schema and parquet schema are in different 
> letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter case 
> between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
>  
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even spark.sql.caseSensitive 
> set to false.
> SPARK-25132 solved this issue.
>  
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column 

[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: data issue when Hive metastore schema and parquet schema are in 
different letter cases  (was: data issue when Hive metastore schema and parquet 
schema have different letter case)

> data issue when Hive metastore schema and parquet schema are in different 
> letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter case 
> between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even spark.sql.caseSensitive 
> set to false.
> SPARK-25132 solved this issue.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema have different letter case

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: data issue when Hive metastore schema and parquet schema have 
different letter case  (was: data issue when )

> data issue when Hive metastore schema and parquet schema have different 
> letter case
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter case 
> between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even spark.sql.caseSensitive 
> set to false.
> SPARK-25132 solved this issue.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> -SPARK-25132-'s backport has been track in its jira.
> Use this Jira to track the backport of SPARK-24716, 
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema have different letter case

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter case 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 solved this issue.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter case 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 solved this issue.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

 

-SPARK-25132-'s backport has been track in its jira.

Use this Jira to track the backport of SPARK-24716, 

 

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue when Hive metastore schema and parquet schema have different 
> letter case
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter case 
> between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. 

[jira] [Updated] (SPARK-25206) data issue when

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: data issue when   (was: data issue because wrong column is 
pushdown for parquet)

> data issue when 
> 
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter case 
> between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even spark.sql.caseSensitive 
> set to false.
> SPARK-25132 solved this issue.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> -SPARK-25132-'s backport has been track in its jira.
> Use this Jira to track the backport of SPARK-24716, 
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter case 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, even spark.sql.caseSensitive set to false.

SPARK-25132 solved this issue.

 

To make the above query work, we need both SPARK-25132 and -SPARK-24716.-

 

-SPARK-25132-'s backport has been track in its jira.

Use this Jira to track the backport of SPARK-24716, 

 

[~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter case 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. 

 

 

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After deep dive, it has two issues, both are related to different letter case 
> between Hive metastore schema and parquet schema.
> 1. Wrong column is pushdown.
> Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} 
> actually).
> So no records are returned.
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get the wrong results sliently.
> Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
> to do the pushdown, perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even spark.sql.caseSensitive 
> set to false.
> SPARK-25132 solved this issue.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> -SPARK-25132-'s backport has been track in 

[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

After deep dive, it has two issues, both are related to different letter case 
between Hive metastore schema and parquet schema.

1. Wrong column is pushdown.

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

2. 

 

 

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

It has two issues.

1. Wrong column 

Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) 
into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet 
is case sensitive, it has {color:#ff}id{color} actually).
So no records are returned.

In Spark 2.1, the user will get Exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they will get the wrong results sliently.

 

Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema 
to do the pushdown, perfect for this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to letter-case differences 
> between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, even when spark.sql.caseSensitive is set 
> to false. SPARK-25132 solved this issue.
>  
>  
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*

It has two issues.

1. The wrong column is pushed down.

Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) down 
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet 
is case sensitive; the file actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user gets an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they silently get wrong results.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
schema to do the pushdown, which is a perfect fix for this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*
Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) down 
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet 
is case sensitive; the file actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user gets an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they silently get wrong results.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
schema to do the pushdown, which is a perfect fix for this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> It has two issues.
> 1. The wrong column is pushed down.
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593126#comment-16593126
 ] 

yucai edited comment on SPARK-25206 at 8/27/18 2:27 AM:


[~dongjoon], because of the root cause below:
{quote}Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down into Parquet, 
but ID does not exist in /tmp/data (Parquet is case sensitive; the file actually 
has id).
{quote}
I changed the title to emphasize that the wrong column is pushed down: "id" should 
be pushed down instead of "ID".

Feel free to let me know if you have any concerns.

This issue exists only in 2.3; master is different.


was (Author: yucai):
[~dongjoon], because of the root cause below:
{quote}Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down into Parquet, 
but ID does not exist in /tmp/data (Parquet is case sensitive; the file actually 
has id).
{quote}
I changed the title to emphasize that the wrong column is pushed down: "id" should 
be pushed down instead of "ID".

Feel free to let me know if you have any concerns.

> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25207) Case-insensitve field resolution for filter pushdown when reading Parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593128#comment-16593128
 ] 

Dongjoon Hyun commented on SPARK-25207:
---

[~yucai]. My bad. Please ignore that. It was based on the old one.

With the latest master branch, I found that the issue is a more general 
regression. Please see [the above 
comment|https://issues.apache.org/jira/browse/SPARK-25207?focusedCommentId=16593108=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16593108]
 and [Github 
comment|https://github.com/apache/spark/pull/22197#issuecomment-416085556] and 
update both GitHub PR and Apache JIRA description as you want.

> Case-insensitve field resolution for filter pushdown when reading Parquet
> -
>
> Key: SPARK-25207
> URL: https://issues.apache.org/jira/browse/SPARK-25207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>  Labels: Parquet
> Attachments: image.png
>
>
> Currently, filter pushdown does not work if the Parquet schema and the Hive 
> metastore schema are in different letter cases, even when spark.sql.caseSensitive 
> is false.
> For example:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show{code}
> -No filter will be pushed down.-
> {code}
> scala> sql("select * from t where id > 0").explain   // Filters are pushed 
> with `ID`
> == Physical Plan ==
> *(1) Project [ID#90L]
> +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
>+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], 
> PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: 
> struct
> scala> sql("select * from t").show// Parquet returns NULL for `ID` 
> because it has `id`.
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
> +---+
> | ID|
> +---+
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593126#comment-16593126
 ] 

yucai commented on SPARK-25206:
---

[~dongjoon], because of the root cause below:
{quote}Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down into Parquet, 
but ID does not exist in /tmp/data (Parquet is case sensitive; the file actually 
has id).
{quote}
I changed the title to emphasize that the wrong column is pushed down: "id" should 
be pushed down instead of "ID".

Feel free to let me know if you have any concerns.

> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Description: 
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

{code}
 

*Root Cause*
Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) down 
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet 
is case sensitive; the file actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user gets an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they silently get wrong results.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
schema to do the pushdown, which is a perfect fix for this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

  was:
In current Spark 2.3.1, below query returns wrong data silently.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
scala> sql("select * from t").show
++
|  ID|
++
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
++
scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
scala> sql("set spark.sql.parquet.filterPushdown").show
++-+
| key|value|
++-+
|spark.sql.parquet...| true|
++-+
scala> sql("set spark.sql.parquet.filterPushdown=false").show
++-+
| key|value|
++-+
|spark.sql.parquet...|false|
++-+
scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
{code}
 

*Root Cause*
Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) down 
into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet 
is case sensitive; the file actually has {color:#ff}id{color}).
So no records are returned.

In Spark 2.1, the user gets an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
But in Spark 2.3, they silently get wrong results.

 

Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
schema to do the pushdown, which is a perfect fix for this issue.

[~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values

2018-08-26 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-25221:

Target Version/s:   (was: 2.3.2, 2.4.0)

> [DEPLOY] Consistent trailing whitespace treatment of conf values
> 
>
> Key: SPARK-25221
> URL: https://issues.apache.org/jira/browse/SPARK-25221
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: Gera Shegalov
>Priority: Major
>
> When you use a custom line delimiter 
> {{spark.hadoop.textinputformat.record.delimiter}} that has a leading or a 
> trailing whitespace character, it only works when specified via 
> {{--conf}}. Our pipeline consists of highly customized, generated jobs. 
> Storing all the config in a properties file is not only better for 
> readability but even necessary to avoid dealing with {{ARGS_MAX}} on 
> different OSes. Spark should uniformly avoid trimming conf values in both 
> cases.
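
One way to observe the asymmetry described above from the shell (a sketch, not part 
of the original report; it assumes the property was supplied at launch via {{--conf}} 
or {{spark-defaults.conf}}, and relies on {{spark.hadoop.*}} entries being copied 
into the Hadoop configuration without the prefix):
{code:java}
// Print the effective delimiter; its length reveals whether the trailing
// whitespace survived configuration loading. The key may be absent if it
// was never set, hence the Option wrapper.
val delim = Option(spark.sparkContext.hadoopConfiguration.get("textinputformat.record.delimiter"))
delim.foreach(d => println(s"delimiter=[$d] length=${d.length}"))
{code}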



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values

2018-08-26 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593124#comment-16593124
 ] 

Saisai Shao commented on SPARK-25221:
-

I'm going to remove the target version, I don't think it is a critical/blocker 
issue, committers will set the proper fix version when merged.

> [DEPLOY] Consistent trailing whitespace treatment of conf values
> 
>
> Key: SPARK-25221
> URL: https://issues.apache.org/jira/browse/SPARK-25221
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: Gera Shegalov
>Priority: Major
>
> When you use a custom line delimiter 
> {{spark.hadoop.textinputformat.record.delimiter}} that has a leading or a 
> trailing whitespace character, it only works when specified via 
> {{--conf}}. Our pipeline consists of highly customized, generated jobs. 
> Storing all the config in a properties file is not only better for 
> readability but even necessary to avoid dealing with {{ARGS_MAX}} on 
> different OSes. Spark should uniformly avoid trimming conf values in both 
> cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet

2018-08-26 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-25206:
--
Summary: data issue because wrong column is pushdown for parquet  (was: 
Wrong data may be returned for Parquet)

> data issue because wrong column is pushdown for parquet
> ---
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593117#comment-16593117
 ] 

Hyukjin Kwon commented on SPARK-25206:
--

[~yucai], mind fixing the JIRA title?

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25207) Case-insensitve field resolution for filter pushdown when reading Parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593110#comment-16593110
 ] 

yucai commented on SPARK-25207:
---

[~dongjoon], sorry if I confused you.

 

This bug was created for the master branch, because master already has SPARK-25132 
and -SPARK-24716-.

So it does not actually have the issue below.
{code:java}
scala> sql("select * from t").show// Parquet returns NULL for `ID` because 
it has `id`.
++
|  ID|
++
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
++

scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
+---+
| ID|
+---+
+---+
{code}

> Case-insensitve field resolution for filter pushdown when reading Parquet
> -
>
> Key: SPARK-25207
> URL: https://issues.apache.org/jira/browse/SPARK-25207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>  Labels: Parquet
> Attachments: image.png
>
>
> Currently, filter pushdown does not work if the Parquet schema and the Hive 
> metastore schema are in different letter cases, even when spark.sql.caseSensitive 
> is false.
> For example:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show{code}
> -No filter will be pushed down.-
> {code}
> scala> sql("select * from t where id > 0").explain   // Filters are pushed 
> with `ID`
> == Physical Plan ==
> *(1) Project [ID#90L]
> +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
>+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], 
> PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: 
> struct
> scala> sql("select * from t").show// Parquet returns NULL for `ID` 
> because it has `id`.
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
> +---+
> | ID|
> +---+
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25207) Case-insensitve field resolution for filter pushdown when reading Parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593108#comment-16593108
 ] 

Dongjoon Hyun commented on SPARK-25207:
---

According to the PR, this seems to be a new regression introduced in Spark 2.4. 
It's not specific to the schema-mismatch case. For example, in the following 
schema-matched case, the input size is less than or equal to 8.0 MB in Spark 2.3.1, 
but master now seems to show the following.

{code}
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 
1024)
spark.range(1, 40 * 1024 * 1024, 1, 
1).sortWithinPartitions("id").write.mode("overwrite").parquet("/tmp/t")
sql("CREATE TABLE t (id LONG) USING parquet LOCATION '/tmp/t'")
// It should be less than or equal to 8 MB.
sql("select * from t where id < 100L").show()  
// It's already less than or equal to 8 MB
sql("select * from t where id < 100L").write.mode("overwrite").csv("/tmp/id")
{code}

!image.png! 
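
As a side note, a quick way to confirm whether the predicate is pushed at all in such 
experiments (a sketch, not from the original report) is to look at the PushedFilters 
entry in the physical plan:
{code:java}
sql("select * from t where id < 100L").explain()
// Expect a FileScan node whose PushedFilters contains something like
// [IsNotNull(id), LessThan(id,100)]; if it is empty, Parquet cannot prune
// any row groups for this predicate.
{code}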

> Case-insensitve field resolution for filter pushdown when reading Parquet
> -
>
> Key: SPARK-25207
> URL: https://issues.apache.org/jira/browse/SPARK-25207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>  Labels: Parquet
> Attachments: image.png
>
>
> Currently, filter pushdown does not work if the Parquet schema and the Hive 
> metastore schema are in different letter cases, even when spark.sql.caseSensitive 
> is false.
> For example:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show{code}
> -No filter will be pushed down.-
> {code}
> scala> sql("select * from t where id > 0").explain   // Filters are pushed 
> with `ID`
> == Physical Plan ==
> *(1) Project [ID#90L]
> +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
>+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], 
> PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: 
> struct
> scala> sql("select * from t").show// Parquet returns NULL for `ID` 
> because it has `id`.
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
> +---+
> | ID|
> +---+
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25207) Case-insensitve field resolution for filter pushdown when reading Parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25207:
--
Attachment: image.png

> Case-insensitve field resolution for filter pushdown when reading Parquet
> -
>
> Key: SPARK-25207
> URL: https://issues.apache.org/jira/browse/SPARK-25207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>  Labels: Parquet
> Attachments: image.png
>
>
> Currently, filter pushdown does not work if the Parquet schema and the Hive 
> metastore schema are in different letter cases, even when spark.sql.caseSensitive 
> is false.
> For example:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show{code}
> -No filter will be pushed down.-
> {code}
> scala> sql("select * from t where id > 0").explain   // Filters are pushed 
> with `ID`
> == Physical Plan ==
> *(1) Project [ID#90L]
> +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
>+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], 
> PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: 
> struct
> scala> sql("select * from t").show// Parquet returns NULL for `ID` 
> because it has `id`.
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
> +---+
> | ID|
> +---+
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593102#comment-16593102
 ] 

yucai commented on SPARK-25206:
---

I am OK with "known correctness bug in 2.3" way, just raise some concern in my 
previous post.

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593100#comment-16593100
 ] 

yucai commented on SPARK-25206:
---

[~smilegator], sure, I will add tests.

 

If we don't backport SPARK-25132 and SPARK-24716, users will hit the issue below.
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
{code}
 

The biggest difference is that, in Spark 2.1, they get an exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
schema!{code}
So they will notice the issue and fix the query.

But in Spark 2.3, they silently get wrong results, which might go unnoticed.

 

Could it be risky for the user?

 

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25236) Investigate using a logging library inside of PySpark on the workers instead of print

2018-08-26 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593098#comment-16593098
 ] 

holdenk commented on SPARK-25236:
-

Probably. The only thing would be wanting to pass the log level config from the 
driver to the executors, but that could be a V2 feature.

> Investigate using a logging library inside of PySpark on the workers instead 
> of print
> -
>
> Key: SPARK-25236
> URL: https://issues.apache.org/jira/browse/SPARK-25236
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> We don't have a logging library to use on the workers, which means that it's 
> difficult for folks to tune the log level on the workers. On the driver 
> processes we _could_ just call the JVM logging, but on the workers that won't 
> work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25236) Investigate using a logging library inside of PySpark on the workers instead of print

2018-08-26 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593097#comment-16593097
 ] 

Liang-Chi Hsieh commented on SPARK-25236:
-

hmm, maybe a dumb question: can't we use {{logging}} to do that? 

> Investigate using a logging library inside of PySpark on the workers instead 
> of print
> -
>
> Key: SPARK-25236
> URL: https://issues.apache.org/jira/browse/SPARK-25236
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> We don't have a logging library to use on the workers, which means that it's 
> difficult for folks to tune the log level on the workers. On the driver 
> processes we _could_ just call the JVM logging, but on the workers that won't 
> work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593096#comment-16593096
 ] 

Wenchen Fan commented on SPARK-25206:
-

I'm fine with marking it as a known correctness bug in Spark 2.2 and 2.3. Shall we 
put it in the release notes of Spark 2.3.2? cc [~jerryshao]

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; the file actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-08-26 Thread Michail Giannakopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593085#comment-16593085
 ] 

Michail Giannakopoulos edited comment on SPARK-24826 at 8/27/18 12:53 AM:
--

[~dongjoon] I will try to repro and let you know...


was (Author: miccagiann):
[~dongjoon] I will and let you know...

> Self-Join not working in Apache Spark 2.2.2
> ---
>
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.2
>Reporter: Michail Giannakopoulos
>Priority: Major
> Attachments: 
> part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a parquet file with many 
> columns fails during the planning phase with the following stack-trace:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
>  Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
> funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
> emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
> verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
> desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, 
> member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, 
> int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, 
> emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, 
> issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, 
> title#22, zip_code#23, ... 92 more fields]
>  +- Filter isnotnull(_row_id#0L)
>  +- FileScan parquet 
> [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more 
> fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
> struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at 

[jira] [Commented] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-08-26 Thread Michail Giannakopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593085#comment-16593085
 ] 

Michail Giannakopoulos commented on SPARK-24826:


[~dongjoon] I will and let you know...

> Self-Join not working in Apache Spark 2.2.2
> ---
>
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.2
>Reporter: Michail Giannakopoulos
>Priority: Major
> Attachments: 
> part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a parquet file with many 
> columns fails during the planning phase with the following stack-trace:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
>  Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
> funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
> emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
> verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
> desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, 
> member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, 
> int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, 
> emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, 
> issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, 
> title#22, zip_code#23, ... 92 more fields]
>  +- Filter isnotnull(_row_id#0L)
>  +- FileScan parquet 
> [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more 
> fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
> struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73)
>  at 
> 

[jira] [Commented] (SPARK-19355) Use map output statistices to improve global limit's parallelism

2018-08-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593080#comment-16593080
 ] 

Apache Spark commented on SPARK-19355:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22239

> Use map output statistices to improve global limit's parallelism
> 
>
> Key: SPARK-19355
> URL: https://issues.apache.org/jira/browse/SPARK-19355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> A logical Limit is actually performed by two physical operations, LocalLimit 
> and GlobalLimit.
> Most of the time, before GlobalLimit, we perform a shuffle exchange to move 
> the data into a single partition. When the limit is very big, we shuffle a 
> lot of data to a single partition and significantly reduce parallelism, on 
> top of the cost of the shuffle itself.
> This change tries to perform GlobalLimit without shuffling data to a single 
> partition. Instead, we perform the map stage of the shuffle and collect 
> statistics on the number of rows in each partition. The shuffled data is 
> actually all retrieved locally rather than from remote executors.
> Once we know the number of output rows in each partition, we take only the 
> required number of rows from the locally shuffled data.
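
For illustration, here is a simplified sketch of the row-selection step described 
above (an assumed simplification, not the actual GlobalLimit operator): given the 
per-partition row counts collected from the map output statistics, decide how many 
rows to take from each partition so that the total equals the limit.
{code:java}
// rowsPerPartition comes from the map output statistics; limit is the global limit.
def rowsToTakePerPartition(rowsPerPartition: Array[Long], limit: Long): Array[Long] = {
  var remaining = limit
  rowsPerPartition.map { rows =>
    val take = math.min(rows, remaining)
    remaining -= take
    take
  }
}

// Example: 3 partitions with 3, 5 and 4 rows and limit 7 => take 3, 4 and 0 rows.
// rowsToTakePerPartition(Array(3L, 5L, 4L), 7L) == Array(3L, 4L, 0L)
{code}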



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25207) Case-insensitve field resolution for filter pushdown when reading Parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25207:
--
Description: 
Currently, filter pushdown does not work if the Parquet schema and the Hive metastore 
schema are in different letter cases, even when spark.sql.caseSensitive is false.

For example:
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show{code}
-No filter will be pushed down.-

{code}
scala> sql("select * from t where id > 0").explain   // Filters are pushed with 
`ID`
== Physical Plan ==
*(1) Project [ID#90L]
+- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
   +- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], 
PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: struct

scala> sql("select * from t").show// Parquet returns NULL for `ID` because 
it has `id`.
++
|  ID|
++
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
++

scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
+---+
| ID|
+---+
+---+
{code}
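
While the fix is pending, a user-side workaround consistent with the behavior above 
(a sketch, assuming the Parquet files were written with the lower-case column {{id}}) 
is to declare the table column in the same letter case as the file schema, so the 
pushed filter references a column that actually exists:
{code:java}
spark.range(10).write.mode("overwrite").parquet("/tmp/data")
sql("DROP TABLE IF EXISTS t")
sql("CREATE TABLE t (id LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show()   // filter is pushed as `id`, rows 1..9 are returned
{code}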

  was:
Currently, filter pushdown does not work if the Parquet schema and the Hive metastore 
schema are in different letter cases, even when spark.sql.caseSensitive is false.

For example:
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").show{code}
No filter will be pushed down.


> Case-insensitve field resolution for filter pushdown when reading Parquet
> -
>
> Key: SPARK-25207
> URL: https://issues.apache.org/jira/browse/SPARK-25207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>  Labels: Parquet
>
> Currently, filter pushdown does not work if the Parquet schema and the Hive 
> metastore schema are in different letter cases, even when 
> spark.sql.caseSensitive is false.
> Like the case below:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show{code}
> -No filter will be pushed down.-
> {code}
> scala> sql("select * from t where id > 0").explain   // Filters are pushed 
> with `ID`
> == Physical Plan ==
> *(1) Project [ID#90L]
> +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
>+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], 
> PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: 
> struct<ID:bigint>
> scala> sql("select * from t").show// Parquet returns NULL for `ID` 
> because it has `id`.
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
> +---+
> | ID|
> +---+
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24766) CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column stats in parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24766:
--
Labels: Parquet  (was: )

> CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
> stats in parquet
> 
>
> Key: SPARK-24766
> URL: https://issues.apache.org/jira/browse/SPARK-24766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: Parquet
>
> How to reproduce:
> {code:java}
> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/spark/parquet/dir' STORED AS parquet 
> select cast(1 as decimal) as decimal1;
> {code}
> {code:java}
> create table test_parquet stored as parquet as select cast(1 as decimal) as 
> decimal1;
> {code}
> {noformat}
> $ java -jar ./parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar meta  
> file:/tmp/spark/parquet/dir/part-0-cb96a617-4759-4b21-a222-2153ca0e8951-c000
> file:        
> file:/tmp/spark/parquet/dir/part-0-cb96a617-4759-4b21-a222-2153ca0e8951-c000
> creator:     parquet-mr version 1.6.0 (build 
> 6aa21f8776625b5fa6b18059cfebe7549f2e00cb)
> file schema: hive_schema
> 
> decimal1:    OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1: RC:1 TS:46 OFFSET:4
> 
> decimal1:     FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:4 SZ:48/46/0.96 VC:1 
> ENC:BIT_PACKED,PLAIN,RLE ST:[no stats for this column]
> {noformat}
> This is because Spark still uses com.twitter.parquet-hadoop-bundle 1.6.0 
> (parquet-mr 1.6.0) to write these files.
> Maybe we should refactor {{CreateHiveTableAsSelectCommand}} and 
> {{InsertIntoHiveDirCommand}}, or [upgrade the built-in 
> Hive|https://issues.apache.org/jira/browse/SPARK-23710].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-08-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593067#comment-16593067
 ] 

Dongjoon Hyun commented on SPARK-24826:
---

Hi, [~miccagiann]. Could you try that in Apache Spark 2.3.1?

> Self-Join not working in Apache Spark 2.2.2
> ---
>
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.2
>Reporter: Michail Giannakopoulos
>Priority: Major
> Attachments: 
> part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a parquet file with many 
> columns fails during the planning phase with the following stack-trace:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
>  Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
> funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
> emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
> verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
> desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, 
> member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, 
> int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, 
> emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, 
> issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, 
> title#22, zip_code#23, ... 92 more fields]
>  +- Filter isnotnull(_row_id#0L)
>  +- FileScan parquet 
> [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more 
> fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
> struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73)
>  at 
> 
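For reference, the query being planned above is roughly of the following shape
(an illustrative reconstruction from the plan, not the reporter's exact code;
the path is a placeholder):
{code:scala}
import org.apache.spark.sql.functions.col

// Illustrative reconstruction of the reported query shape: a self-join on
// _row_id over a wide Parquet-backed table (the real table has ~116 columns).
val df = spark.read.parquet("/path/to/alpha.parquet")  // path is illustrative

val joined = df.as("l").join(df.as("r"), col("l._row_id") === col("r._row_id"))
joined.explain()  // the reported failure happens while planning the shuffle exchange
{code}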

[jira] [Updated] (SPARK-25132) Case-insensitive field resolution when reading from Parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25132:
--
Labels: Parquet  (was: )

> Case-insensitive field resolution when reading from Parquet
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
>  Labels: Parquet
> Fix For: 2.4.0
>
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of whether 
> spark.sql.caseSensitive is set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  
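One workaround on affected versions (an illustrative sketch, not part of the
report): read the files directly so that the schema comes from the Parquet
footer itself, or declare the table with the same letter case the files were
written with.
{code:scala}
// Illustrative workaround for affected versions: bypass the metastore schema
// and let the Parquet footer provide the column names and their casing.
val df = spark.read.parquet("hdfs://localhost/user/hive/warehouse/t1")
df.show()  // prints 0..4, because the column resolves as `id`, exactly as written
{code}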



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25135) insert datasource table may all null when select from view on parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25135:
--
Labels: Parquet correctness  (was: correctness)

> insert datasource table may all null when select from view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: Parquet, correctness
>
> This happens on parquet.
> How to reproduce in parquet.
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
> as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
> location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > 
> -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}
> FYI, the following is orc.
> {code}
> scala> val path = "/tmp/spark/orc"
> scala> val cnt = 30
> scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as 
> bigint) as col2").write.mode("overwrite").orc(path)
> scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc 
> location '$path'")
> scala> spark.sql("create view view1 as select col1, col2 from table1 where 
> col1 > -20")
> scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
> scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> scala> spark.table("table2").show
> +++
> |COL1|COL2|
> +++
> |  15|  15|
> |  16|  16|
> |  17|  17|
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25207:
--
Labels: Parquet  (was: )

> Case-insensitive field resolution for filter pushdown when reading Parquet
> -
>
> Key: SPARK-25207
> URL: https://issues.apache.org/jira/browse/SPARK-25207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Priority: Major
>  Labels: Parquet
>
> Currently, filter pushdown does not work if the Parquet schema and the Hive 
> metastore schema are in different letter cases, even when 
> spark.sql.caseSensitive is false.
> Like the case below:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show{code}
> No filter will be pushed down.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25206:
--
Labels: Parquet correctness  (was: correctness)

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but 
> ID does not exist in /tmp/data (Parquet is case sensitive; the file actually 
> has id).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they get the wrong results silently.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fit for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
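To make the pushed predicate concrete, this is roughly what it looks like when
built directly against Parquet's public filter API (a sketch for illustration
only, mirroring the FilterApi.gt(intColumn("ID"), 0: Integer) call quoted above):
{code:scala}
// Sketch: the pushed predicate, built with Parquet's public filter API.
// Parquet matches column names case-sensitively, so a predicate on "ID"
// never matches a file whose schema only contains "id", and no records
// are returned.
import org.apache.parquet.filter2.predicate.FilterApi

val pushed = FilterApi.gt(FilterApi.intColumn("ID"), java.lang.Integer.valueOf(0))
{code}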



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25135) insert datasource table may all null when select from view on parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593066#comment-16593066
 ] 

Dongjoon Hyun commented on SPARK-25135:
---

[~yumwang]. Could you update your PR according to this JIRA title? We need to 
be specific.

> insert datasource table may all null when select from view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: correctness
>
> This happens on parquet.
> How to reproduce in parquet.
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
> as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
> location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > 
> -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}
> FYI, the following is orc.
> {code}
> scala> val path = "/tmp/spark/orc"
> scala> val cnt = 30
> scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as 
> bigint) as col2").write.mode("overwrite").orc(path)
> scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc 
> location '$path'")
> scala> spark.sql("create view view1 as select col1, col2 from table1 where 
> col1 > -20")
> scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
> scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> scala> spark.table("table2").show
> +++
> |COL1|COL2|
> +++
> |  15|  15|
> |  16|  16|
> |  17|  17|
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25135) insert datasource table may all null when select from view on parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25135:
--
Description: 
This happens on parquet.

How to reproduce in parquet.
{code:scala}
val path = "/tmp/spark/parquet"
val cnt = 30
spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
as col2").write.mode("overwrite").parquet(path)
spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
location '$path'")
spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
spark.table("table2").show
{code}

FYI, the following is orc.
{code}
scala> val path = "/tmp/spark/orc"
scala> val cnt = 30
scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as 
bigint) as col2").write.mode("overwrite").orc(path)
scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc 
location '$path'")
scala> spark.sql("create view view1 as select col1, col2 from table1 where col1 
> -20")
scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
scala> spark.table("table2").show
+++
|COL1|COL2|
+++
|  15|  15|
|  16|  16|
|  17|  17|
...
{code}

  was:
How to reproduce:
{code:scala}
val path = "/tmp/spark/parquet"
val cnt = 30
spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
as col2").write.mode("overwrite").parquet(path)
spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
location '$path'")
spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
spark.table("table2").show
{code}

This happens on parquet.

{code}
scala> val path = "/tmp/spark/orc"
scala> val cnt = 30
scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as 
bigint) as col2").write.mode("overwrite").orc(path)
scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc 
location '$path'")
scala> spark.sql("create view view1 as select col1, col2 from table1 where col1 
> -20")
scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
scala> spark.table("table2").show
+++
|COL1|COL2|
+++
|  15|  15|
|  16|  16|
|  17|  17|
...
{code}


> insert datasource table may all null when select from view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: correctness
>
> This happens on parquet.
> How to reproduce in parquet.
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
> as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
> location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > 
> -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}
> FYI, the following is orc.
> {code}
> scala> val path = "/tmp/spark/orc"
> scala> val cnt = 30
> scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as 
> bigint) as col2").write.mode("overwrite").orc(path)
> scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc 
> location '$path'")
> scala> spark.sql("create view view1 as select col1, col2 from table1 where 
> col1 > -20")
> scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
> scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> scala> spark.table("table2").show
> +++
> |COL1|COL2|
> +++
> |  15|  15|
> |  16|  16|
> |  17|  17|
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25135) insert datasource table may all null when select from view on parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25135:
--
Description: 
How to reproduce:
{code:scala}
val path = "/tmp/spark/parquet"
val cnt = 30
spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
as col2").write.mode("overwrite").parquet(path)
spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
location '$path'")
spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
spark.table("table2").show
{code}

This happens on parquet.

{code}
scala> val path = "/tmp/spark/orc"
scala> val cnt = 30
scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as 
bigint) as col2").write.mode("overwrite").orc(path)
scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc 
location '$path'")
scala> spark.sql("create view view1 as select col1, col2 from table1 where col1 
> -20")
scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
scala> spark.table("table2").show
+++
|COL1|COL2|
+++
|  15|  15|
|  16|  16|
|  17|  17|
...
{code}

  was:
How to reproduce:
{code:scala}
val path = "/tmp/spark/parquet"
val cnt = 30
spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
as col2").write.mode("overwrite").parquet(path)
spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
location '$path'")
spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
spark.table("table2").show
{code}


> insert datasource table may all null when select from view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: correctness
>
> How to reproduce:
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
> as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
> location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > 
> -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}
> This happens on parquet.
> {code}
> scala> val path = "/tmp/spark/orc"
> scala> val cnt = 30
> scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as 
> bigint) as col2").write.mode("overwrite").orc(path)
> scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc 
> location '$path'")
> scala> spark.sql("create view view1 as select col1, col2 from table1 where 
> col1 > -20")
> scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
> scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> scala> spark.table("table2").show
> +++
> |COL1|COL2|
> +++
> |  15|  15|
> |  16|  16|
> |  17|  17|
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25135) insert datasource table may all null when select from view on parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25135:
--
Summary: insert datasource table may all null when select from view on 
parquet  (was: insert datasource table may all null when select from view)

> insert datasource table may all null when select from view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: correctness
>
> How to reproduce:
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
> as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
> location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > 
> -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25091) Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor memory

2018-08-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593061#comment-16593061
 ] 

Dongjoon Hyun commented on SPARK-25091:
---

Hi, [~Chao Fang]. Could you remove `Spark Thrift Server: ` from the title if 
you see it in `pyspark` shell as you reported?
bq. Similar behavior when using pyspark df.unpersist().

> Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor 
> memory
> 
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
>
> UNCACHE TABLE and CLEAR CACHE does not clean up executor memory.
> In the Spark UI, the Storage tab shows the cached table removed, but the 
> Executors tab shows that the executors continue to hold the RDD and the memory 
> is not cleared. This results in a huge waste of executor memory. As we keep 
> calling CACHE TABLE, we run into issues where newly cached tables are spilled 
> to disk instead of reclaiming that memory. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().
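A quick way to double-check this from the Scala shell (a sketch, assuming the
`test.test_cache` table from the steps above exists; `getRDDStorageInfo` is a
developer API):
{code:scala}
// Sketch of a check for leftover cached blocks after UNCACHE TABLE.
spark.catalog.cacheTable("test.test_cache")
spark.table("test.test_cache").count()            // materialize the cache
spark.catalog.uncacheTable("test.test_cache")

// If the uncache reached the executors, nothing should still be cached:
spark.sparkContext.getRDDStorageInfo
  .filter(_.numCachedPartitions > 0)
  .foreach(info => println(s"${info.name}: ${info.memSize} bytes still cached"))
{code}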



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593041#comment-16593041
 ] 

Xiao Li edited comment on SPARK-25206 at 8/26/18 10:45 PM:
---

Currently, we do not have good test coverage for the case where the physical 
schema and the logical schema use different letter cases. Any new change could 
therefore introduce behavior changes or bugs, so the first step is to add the 
tests. [~yucai] Could you help with this effort?

Merging the Parquet filter refactoring would somewhat break our backport rule. 
Maybe we do not need to claim we support this scenario before Spark 2.4?


was (Author: smilegator):
Previously, we did not have good test coverage for the case where the physical 
schema and the logical schema use different letter cases. Any new change could 
therefore introduce behavior changes or bugs, so the first step is to add the 
tests. [~yucai] Could you help with this effort?

Merging the Parquet filter refactoring would somewhat break our backport rule. 
Maybe we do not need to claim we support this scenario before Spark 2.4?

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but 
> ID does not exist in /tmp/data (Parquet is case sensitive; the file actually 
> has id).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they get the wrong results silently.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fit for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593041#comment-16593041
 ] 

Xiao Li commented on SPARK-25206:
-

Previously, we did not have good test coverage for the case where the physical 
schema and the logical schema use different letter cases. Any new change could 
therefore introduce behavior changes or bugs, so the first step is to add the 
tests. [~yucai] Could you help with this effort?

Merging the Parquet filter refactoring would somewhat break our backport rule. 
Maybe we do not need to claim we support this scenario before Spark 2.4?

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but 
> ID does not exist in /tmp/data (Parquet is case sensitive; the file actually 
> has id).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they get the wrong results silently.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fit for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.

2018-08-26 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593040#comment-16593040
 ] 

shahid commented on SPARK-25246:


I am working on it :)

> When the spark.eventLog.compress is enabled, the Application is not showing 
> in the History server UI ('incomplete application' page), initially.
> 
>
> Key: SPARK-25246
> URL: https://issues.apache.org/jira/browse/SPARK-25246
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: shahid
>Priority: Major
>
> 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true" 
> 2) hdfs dfs -ls /spark-logs 
> {code:java}
> -rwxrwx---   1 root supergroup  *0* 2018-08-27 03:26 
> /spark-logs/application_1535313809919_0005.lz4.inprogress
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.

2018-08-26 Thread shahid (JIRA)
shahid created SPARK-25246:
--

 Summary: When the spark.eventLog.compress is enabled, the 
Application is not showing in the History server UI ('incomplete application' 
page), initially.
 Key: SPARK-25246
 URL: https://issues.apache.org/jira/browse/SPARK-25246
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: shahid


1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true" 

2) hdfs dfs -ls /spark-logs 

{code:java}
-rwxrwx---   1 root supergroup  *0* 2018-08-27 03:26 
/spark-logs/application_1535313809919_0005.lz4.inprogress
{code}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25245) Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming

2018-08-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25245:


Assignee: Apache Spark

> Explain regarding limiting modification on "spark.sql.shuffle.partitions" for 
> structured streaming
> --
>
> Key: SPARK-25245
> URL: https://issues.apache.org/jira/browse/SPARK-25245
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> A couple of users have wondered why "spark.sql.shuffle.partitions" stays 
> unchanged when they change the config value after running the query. Some of 
> them have even submitted patches treating this behavior as a bug. But it 
> follows from how the state is partitioned, and the behavior is intentional.
> It seems worth explaining this in the guide doc so that no more users are 
> surprised by it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25245) Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming

2018-08-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593034#comment-16593034
 ] 

Apache Spark commented on SPARK-25245:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/22238

> Explain regarding limiting modification on "spark.sql.shuffle.partitions" for 
> structured streaming
> --
>
> Key: SPARK-25245
> URL: https://issues.apache.org/jira/browse/SPARK-25245
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> A couple of users have wondered why "spark.sql.shuffle.partitions" stays 
> unchanged when they change the config value after running the query. Some of 
> them have even submitted patches treating this behavior as a bug. But it 
> follows from how the state is partitioned, and the behavior is intentional.
> It seems worth explaining this in the guide doc so that no more users are 
> surprised by it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25245) Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming

2018-08-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25245:


Assignee: (was: Apache Spark)

> Explain regarding limiting modification on "spark.sql.shuffle.partitions" for 
> structured streaming
> --
>
> Key: SPARK-25245
> URL: https://issues.apache.org/jira/browse/SPARK-25245
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> A couple of users have wondered why "spark.sql.shuffle.partitions" stays 
> unchanged when they change the config value after running the query. Some of 
> them have even submitted patches treating this behavior as a bug. But it 
> follows from how the state is partitioned, and the behavior is intentional.
> It seems worth explaining this in the guide doc so that no more users are 
> surprised by it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25245) Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming

2018-08-26 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-25245:


 Summary: Explain regarding limiting modification on 
"spark.sql.shuffle.partitions" for structured streaming
 Key: SPARK-25245
 URL: https://issues.apache.org/jira/browse/SPARK-25245
 Project: Spark
  Issue Type: Documentation
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jungtaek Lim


A couple of users have wondered why "spark.sql.shuffle.partitions" stays 
unchanged when they change the config value after running the query. Some of 
them have even submitted patches treating this behavior as a bug. But it follows 
from how the state is partitioned, and the behavior is intentional.

It seems worth explaining this in the guide doc so that no more users are 
surprised by it.
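A minimal sketch of what the guide could recommend (illustrative code; the rate
source query and checkpoint path are made up): choose the value before the query
is started for the first time, because the state partitioning is fixed when the
query and its checkpoint are first created.
{code:scala}
import spark.implicits._

// Sketch: spark.sql.shuffle.partitions must be chosen before the first start
// of a stateful query; afterwards the state is already partitioned that way.
spark.conf.set("spark.sql.shuffle.partitions", "8")

val query = spark.readStream
  .format("rate").load()
  .groupBy($"value" % 10)
  .count()
  .writeStream
  .format("console")
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/checkpoints/rate-agg")  // illustrative path
  .start()

// Changing spark.sql.shuffle.partitions later (or before restarting from the
// same checkpoint) does not change this query's state partitioning.
{code}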



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593024#comment-16593024
 ] 

Dongjoon Hyun commented on SPARK-25206:
---

Hi, [~yucai], [~cloud_fan], [~smilegator], [~hyukjin.kwon].

In Spark 2.4, we are still trying to fix long-standing Parquet case-sensitivity 
issues (Spark 2.1.x raises exceptions, and Spark 2.2.x behaves the same as Spark 
2.3.x).
Unfortunately, this effort is incomplete and unstable even in Spark 2.4, because 
we have an unmerged patch (SPARK-25207) and there may be more, as yet unknown, 
patches in the future.
Given that, we had better consider any backporting to `branch-2.3` only after 
Spark 2.4 becomes stable. We may land them together, not one by one.
What do you think about this? Are the current three Spark-2.4-only Parquet 
patches (SPARK-25132, SPARK-24716, SPARK-25207) considered a complete set of 
patches for this?

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet, but 
> ID does not exist in /tmp/data (Parquet is case sensitive; the file actually 
> has id).
> So no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they get the wrong results silently.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fit for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593013#comment-16593013
 ] 

Dongjoon Hyun commented on SPARK-25175:
---

[~seancxmao]. If there is no example, we cannot help you. In that case, we 
usually close this as `Cannot Reproduce`.

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2018-08-26 Thread Anton Daitche (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Daitche updated SPARK-25244:
--
Description: 
The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
 However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause of this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would try to come up with 
a patch. 

 

 

  was:
The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
 However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause of this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would be happy to 
contribute a patch. 

 

 


> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-25244
> URL: https://issues.apache.org/jira/browse/SPARK-25244
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Anton Daitche
>Priority: Major
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause of this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class don't take the setting 
> `spark.sql.session.timeZone` into account and use the system timezone instead.
> If the maintainers agree that this should be fixed, I would try to come up 
> with a patch.

[jira] [Created] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2018-08-26 Thread Anton Daitche (JIRA)
Anton Daitche created SPARK-25244:
-

 Summary: [Python] Setting `spark.sql.session.timeZone` only 
partially respected
 Key: SPARK-25244
 URL: https://issues.apache.org/jira/browse/SPARK-25244
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Anton Daitche


The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
 However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause of this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would be happy to 
contribute a patch. 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2018-08-26 Thread Anton Daitche (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Daitche updated SPARK-25244:
--
Description: 
The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
 However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause of this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would be happy to 
contribute a patch. 

 

 

  was:
The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|[http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].]
 However, when timestamps are converted directly to Python `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked with the following code snippet:
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
For me this prints the following (the exact result depends on your system's 
timezone; mine is Europe/Berlin):
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause of this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class do not take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would be happy to 
contribute a patch. 

 

 


> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-25244
> URL: https://issues.apache.org/jira/browse/SPARK-25244
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Anton Daitche
>Priority: Major
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked with the following code snippet:
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> For me this prints the following (the exact result depends on your system's 
> timezone; mine is Europe/Berlin):
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause of this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class do not take the setting 
> `spark.sql.session.timeZone` into account and use the system timezone instead.
> If the maintainers agree that this should be fixed, I would 

[jira] [Assigned] (SPARK-25243) Use FailureSafeParser in from_json

2018-08-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25243:


Assignee: Apache Spark

> Use FailureSafeParser in from_json
> --
>
> Key: SPARK-25243
> URL: https://issues.apache.org/jira/browse/SPARK-25243
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The 
> [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28]
>  is used when parsing JSON and CSV files and datasets of strings. It supports the 
> [PERMISSIVE, DROPMALFORMED and 
> FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44]
>  modes. This ticket aims to make the from_json function consistent with regular 
> parsing via FailureSafeParser and to support the above modes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25243) Use FailureSafeParser in from_json

2018-08-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592940#comment-16592940
 ] 

Apache Spark commented on SPARK-25243:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22237

> Use FailureSafeParser in from_json
> --
>
> Key: SPARK-25243
> URL: https://issues.apache.org/jira/browse/SPARK-25243
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The 
> [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28]
>  is used when parsing JSON and CSV files and datasets of strings. It supports the 
> [PERMISSIVE, DROPMALFORMED and 
> FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44]
>  modes. This ticket aims to make the from_json function consistent with regular 
> parsing via FailureSafeParser and to support the above modes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25243) Use FailureSafeParser in from_json

2018-08-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25243:


Assignee: (was: Apache Spark)

> Use FailureSafeParser in from_json
> --
>
> Key: SPARK-25243
> URL: https://issues.apache.org/jira/browse/SPARK-25243
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> The 
> [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28]
>  is used when parsing JSON and CSV files and datasets of strings. It supports the 
> [PERMISSIVE, DROPMALFORMED and 
> FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44]
>  modes. This ticket aims to make the from_json function consistent with regular 
> parsing via FailureSafeParser and to support the above modes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25243) Use FailureSafeParser in from_json

2018-08-26 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25243:
--

 Summary: Use FailureSafeParser in from_json
 Key: SPARK-25243
 URL: https://issues.apache.org/jira/browse/SPARK-25243
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


The 
[FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28]
 is used when parsing JSON and CSV files and datasets of strings. It supports the 
[PERMISSIVE, DROPMALFORMED and 
FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44]
 modes. This ticket aims to make the from_json function consistent with regular 
parsing via FailureSafeParser and to support the above modes.
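
A rough sketch of how this could look from PySpark once implemented, assuming 
the mode would be passed through from_json's existing options argument (the 
mode handling itself is not available yet):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.master("local[1]").getOrCreate()

schema = StructType([StructField("a", IntegerType())])
df = spark.createDataFrame([('{"a": 1}',), ('{"a": broken}',)], ["json"])

# Hypothetical behaviour: malformed records become null rows under PERMISSIVE
# and fail the query under FAILFAST, mirroring the JSON file reader.
df.select(from_json("json", schema, {"mode": "PERMISSIVE"}).alias("parsed")).show()
{code}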



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23707) Don't need shuffle exchange with single partition for 'spark.range'

2018-08-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23707.
--
Resolution: Cannot Reproduce

> Don't need shuffle exchange with single partition for 'spark.range' 
> 
>
> Key: SPARK-23707
> URL: https://issues.apache.org/jira/browse/SPARK-23707
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xianyang Liu
>Priority: Major
>
> Just like #20726, there is no need for an 'Exchange' when `spark.range` produces 
> only one partition.
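
For reference, a way to inspect whether an Exchange shows up for a 
single-partition range (a sketch assuming a local session; the exact plan 
depends on the Spark version and the downstream operators):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# A global sort normally requires a range-partitioning Exchange; with a single
# input partition that shuffle is the one this ticket considers unnecessary.
spark.range(0, 100, 1, numPartitions=1).orderBy("id").explain()
{code}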



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25013) JDBC urls with jdbc:mariadb don't work as expected

2018-08-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25013.
--
Resolution: Won't Fix

I wouldn't add this to Spark for now unless there's a strong request from the 
community.

> JDBC urls with jdbc:mariadb don't work as expected
> --
>
> Key: SPARK-25013
> URL: https://issues.apache.org/jira/browse/SPARK-25013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dieter Vekeman
>Priority: Minor
>
> When using the MariaDB JDBC driver, the JDBC connection url should be  
> {code:java}
> jdbc:mariadb://localhost:3306/DB?user=someuser&password=somepassword
> {code}
> https://mariadb.com/kb/en/library/about-mariadb-connector-j/
> However this does not work well in Spark (see below)
> *Workaround*
> The MariaDB driver also supports the {{jdbc:mysql:}} prefix, which does work.
> The problem seems to have been described and identified in
> https://jira.mariadb.org/browse/CONJ-421:
> everything works with Spark when the connection string uses {{"jdbc:mysql:..."}}, 
> but not with {{"jdbc:mariadb:..."}}, because the MySQL dialect is then not used.
> When the dialect is not used, the default identifier quote is {{"}}, not {{`}}.
> So an internal query generated by Spark such as {{SELECT `i`,`ip` FROM tmp}} 
> will then be executed as {{SELECT "i","ip" FROM tmp}} with the dataType 
> previously retrieved, causing the exception.
> The author of the comment says
> {quote}I'll make a pull request to spark so "jdbc:mariadb:" connection string 
> can be handle{quote}
> Did the pull request get lost or should a new one be made?
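
For anyone hitting this, a sketch of the workaround mentioned above (host, 
database, table and credentials are placeholders; the MariaDB driver jar must 
be on the classpath):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Keep the MariaDB driver class but use the mysql URL prefix, so Spark picks
# the MySQL dialect and quotes identifiers with backticks.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/DB")
      .option("driver", "org.mariadb.jdbc.Driver")
      .option("dbtable", "tmp")
      .option("user", "someuser")
      .option("password", "somepassword")
      .load())
df.show()
{code}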



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10697) Lift Calculation in Association Rule mining

2018-08-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10697:


Assignee: Apache Spark

> Lift Calculation in Association Rule mining
> ---
>
> Key: SPARK-10697
> URL: https://issues.apache.org/jira/browse/SPARK-10697
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yashwanth Kumar
>Assignee: Apache Spark
>Priority: Minor
>
> Lift should be calculated for association rule mining in 
> AssociationRules.scala under FPM.
> Lift is a measure of the performance of an association rule.
> Adding lift will help compare model efficiency.
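
For context, the lift of a rule X -> Y is its confidence divided by the support 
of Y. A minimal sketch of the computation (plain Python; absolute itemset 
frequencies and the transaction count are assumed to be known):
{code:python}
# lift(X -> Y) = confidence(X -> Y) / support(Y)
def lift(freq_union, freq_antecedent, freq_consequent, num_transactions):
    confidence = freq_union / float(freq_antecedent)                 # P(Y | X)
    support_consequent = freq_consequent / float(num_transactions)   # P(Y)
    return confidence / support_consequent

# X and Y together in 20 of 100 baskets, X in 40, Y in 25 -> (20/40) / (25/100) = 2.0
print(lift(20, 40, 25, 100))  # 2.0
{code}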



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10697) Lift Calculation in Association Rule mining

2018-08-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592899#comment-16592899
 ] 

Apache Spark commented on SPARK-10697:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22236

> Lift Calculation in Association Rule mining
> ---
>
> Key: SPARK-10697
> URL: https://issues.apache.org/jira/browse/SPARK-10697
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yashwanth Kumar
>Priority: Minor
>
> Lift should be calculated for association rule mining in 
> AssociationRules.scala under FPM.
> Lift is a measure of the performance of an association rule.
> Adding lift will help compare model efficiency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10697) Lift Calculation in Association Rule mining

2018-08-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10697:


Assignee: (was: Apache Spark)

> Lift Calculation in Association Rule mining
> ---
>
> Key: SPARK-10697
> URL: https://issues.apache.org/jira/browse/SPARK-10697
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yashwanth Kumar
>Priority: Minor
>
> Lift should be calculated for association rule mining in 
> AssociationRules.scala under FPM.
> Lift is a measure of the performance of an association rule.
> Adding lift will help compare model efficiency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23792) Documentation improvements for datetime functions

2018-08-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23792.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20901
[https://github.com/apache/spark/pull/20901]

> Documentation improvements for datetime functions
> -
>
> Key: SPARK-23792
> URL: https://issues.apache.org/jira/browse/SPARK-23792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: A Bradbury
>Assignee: A Bradbury
>Priority: Minor
> Fix For: 2.4.0
>
>
> Added details about the supported column input types, the column return type, 
> behaviour on invalid input, supporting examples and clarifications to the 
> datetime functions in `org.apache.spark.sql.functions` for Java/Scala. 
> These changes stemmed from confusion over behaviour of the `date_add` method. 
> On first use I thought it would add the specified days to the input 
> timestamp, but it also truncated (cast) the input timestamp to a date, 
> losing the time part (see the snippet below). 
> Some examples:
>  * Noted that the week definition for `dayofweek` method starts on a Sunday
>  * Corrected documentation for methods such as `last_day` that only listed 
> one type of input i.e. "date column" changed to "date, timestamp or string"
>  * Renamed the parameters of the `months_between` method to match those of 
> the `datediff` method and to indicate which parameter is expected to be 
> before the other chronologically
>  * `from_unixtime` documentation referenced the "given format" when there was 
> no format parameter
>  * Documentation for `to_timestamp` methods detailed that a unix timestamp in 
> seconds would be returned (implying 1521926327) when they would actually 
> return the input cast to a timestamp type 
> Some observations:
>  * The first day of the week by the `dayofweek` method is a Sunday, but by 
> the `weekofyear` method it is a Monday
> The `datediff` method returns an integer value, even with timestamp input, 
> whereas the `months_between` method returns a double, which seems inconsistent
>  
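
As referenced above, a small PySpark illustration of the `date_add` truncation 
(the column name and session setup are assumptions):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

# date_add casts its input to a date, so the time part is dropped
df = (spark.createDataFrame([("2018-03-26 14:30:00",)], ["ts"])
      .withColumn("ts", F.col("ts").cast("timestamp")))
df.select(F.date_add("ts", 1).alias("next_day")).show()
# +----------+
# |  next_day|
# +----------+
# |2018-03-27|
# +----------+
{code}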



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23792) Documentation improvements for datetime functions

2018-08-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23792:
-

Assignee: A Bradbury

> Documentation improvements for datetime functions
> -
>
> Key: SPARK-23792
> URL: https://issues.apache.org/jira/browse/SPARK-23792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: A Bradbury
>Assignee: A Bradbury
>Priority: Minor
> Fix For: 2.4.0
>
>
> Added details about the supported column input types, the column return type, 
> behaviour on invalid input, supporting examples and clarifications to the 
> datetime functions in `org.apache.spark.sql.functions` for Java/Scala. 
> These changes stemmed from confusion over behaviour of the `date_add` method. 
> On first use I thought it would add the specified days to the input 
> timestamp, but it also truncated (cast) the input timestamp to a date, 
> losing the time part. 
> Some examples:
>  * Noted that the week definition for `dayofweek` method starts on a Sunday
>  * Corrected documentation for methods such as `last_day` that only listed 
> one type of input i.e. "date column" changed to "date, timestamp or string"
>  * Renamed the parameters of the `months_between` method to match those of 
> the `datediff` method and to indicate which parameter is expected to be 
> before the other chronologically
>  * `from_unixtime` documentation referenced the "given format" when there was 
> no format parameter
>  * Documentation for `to_timestamp` methods detailed that a unix timestamp in 
> seconds would be returned (implying 1521926327) when they would actually 
> return the input cast to a timestamp type 
> Some observations:
>  * The first day of the week by the `dayofweek` method is a Sunday, but by 
> the `weekofyear` method it is a Monday
> The `datediff` method returns an integer value, even with timestamp input, 
> whereas the `months_between` method returns a double, which seems inconsistent 
> (see the snippet below)
>  
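
The return-type difference mentioned in the last observation can be checked 
with a quick sketch (local session and column names assumed):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = (spark.createDataFrame([("2018-01-01 12:00:00", "2018-03-01 00:00:00")], ["a", "b"])
      .selectExpr("cast(a as timestamp) a", "cast(b as timestamp) b"))

# datediff(...) comes back as an integer column, months_between(...) as a double
df.select(F.datediff("b", "a"), F.months_between("b", "a")).printSchema()
{code}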



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25080) NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)

2018-08-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25080.
--
Resolution: Cannot Reproduce

> NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)
> --
>
> Key: SPARK-25080
> URL: https://issues.apache.org/jira/browse/SPARK-25080
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.1
> Environment: AWS EMR
>Reporter: Andrew K Long
>Priority: Minor
>
> NPE while reading a Hive table.
>  
> ```
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1190 in stage 392.0 failed 4 times, most recent failure: Lost task 
> 1190.3 in stage 392.0 (TID 122055, ip-172-31-32-196.ec2.internal, executor 
> 487): java.lang.NullPointerException
> at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:110)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:217)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:294)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:265)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>  
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1753)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1741)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1740)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1740)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:871)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1974)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1923)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1912)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:682)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
> ... 67 more
> Caused by: java.lang.NullPointerException
> at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:110)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
> at 
> 

[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet

2018-08-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592893#comment-16592893
 ] 

Hyukjin Kwon commented on SPARK-25206:
--

Please fix the JIRA title to describe the problem more precisely, rather than 
just "wrong results", since this one is a blocker and should be clarified.

> Wrong data may be returned for Parquet
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, the query below silently returns wrong data.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t").show
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> scala> sql("set spark.sql.parquet.filterPushdown").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...| true|
> ++-+
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> ++-+
> | key|value|
> ++-+
> |spark.sql.parquet...|false|
> ++-+
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has 
> {color:#ff}id{color}).
> So no records are returned.
> In Spark 2.1, the user will get an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will silently get wrong results.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is a perfect fix for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
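
For comparison, a minimal sketch of the case mismatch (the table name is a 
placeholder, and it assumes /tmp/data was written with the lower-case column 
{{id}} as in the report): declaring the table schema with the same case avoids 
both the all-null column and the silently empty filter result.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

spark.range(10).write.mode("overwrite").parquet("/tmp/data")
spark.sql("DROP TABLE IF EXISTS t_lower")
# Same column case as in the Parquet files, so the pushed-down filter matches
spark.sql("CREATE TABLE t_lower (id LONG) USING parquet LOCATION '/tmp/data'")
spark.sql("SELECT * FROM t_lower WHERE id > 0").show()  # rows 1..9, as expected
{code}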



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25135) insert datasource table may all null when select from view

2018-08-26 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592890#comment-16592890
 ] 

Yuming Wang commented on SPARK-25135:
-

Another serious case: 
{code:scala}
  withTempDir { dir =>
    val path = dir.getCanonicalPath
    val cnt = 30
    val table1Path = s"$path/table1"
    val table3Path = s"$path/table3"
    spark.range(cnt)
      .selectExpr("cast(id as bigint) as col1", "cast(id % 3 as bigint) as col2")
      .write.mode(SaveMode.Overwrite).parquet(table1Path)
    withTable("table1", "table3") {
      spark.sql(
        s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$table1Path/'")
      spark.sql("CREATE TABLE table3(COL1 bigint, COL2 bigint) using parquet " +
        "PARTITIONED BY (COL2) " +
        s"CLUSTERED BY (COL1) INTO 2 BUCKETS location '$table3Path/'")

      withView("view1") {
        spark.sql("CREATE VIEW view1 as select col1, col2 from table1 where col1 > -20")
        spark.sql("INSERT OVERWRITE TABLE table3 select COL1, COL2 from view1 CLUSTER BY COL1")
        spark.table("table3").show
      }
    }
  }
{code}

Exception:
{noformat}
None.get
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$4$$anonfun$5.apply(FileFormatWriter.scala:126)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$4$$anonfun$5.apply(FileFormatWriter.scala:126)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$4.apply(FileFormatWriter.scala:126)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$4.apply(FileFormatWriter.scala:125)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:125)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:151)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:101)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:117)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:186)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:186)
at org.apache.spark.sql.Dataset$$anonfun$51.apply(Dataset.scala:3243)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3242)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:71)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
{noformat}


> insert datasource table may all null when select from view
> --
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: correctness
>
> How to reproduce:
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23698) Spark code contains numerous undefined names in Python 3

2018-08-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592848#comment-16592848
 ] 

Apache Spark commented on SPARK-23698:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22235

> Spark code contains numerous undefined names in Python 3
> 
>
> Key: SPARK-23698
> URL: https://issues.apache.org/jira/browse/SPARK-23698
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: cclauss
>Assignee: cclauss
>Priority: Minor
> Fix For: 2.4.0
>
>
> flake8 testing of https://github.com/apache/spark on Python 3.6.3
> $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source 
> --statistics*
> ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
> result = raw_input("\n%s (y/n): " % prompt)
>  ^
> ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
> primary_author = raw_input(
>  ^
> ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
> pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
>^
> ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
> jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
>   ^
> ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
> fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % 
> default_fix_versions)
>^
> ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
> raw_assignee = raw_input(
>^
> ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
> pr_num = raw_input("Which pull request would you like to merge? (e.g. 
> 34): ")
>  ^
> ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
> result = raw_input("Would you like to use the modified title? (y/n): 
> ")
>  ^
> ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
> while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
>   ^
> ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
> response = raw_input("%s [y/n]: " % msg)
>^
> ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
> author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
>  ^
> ./python/setup.py:37:11: F821 undefined name '__version__'
> VERSION = __version__
>   ^
> ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
> dispatch[buffer] = save_buffer
>  ^
> ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
> dispatch[file] = save_file
>  ^
> ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> ^
> ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
> intlike = (int, long)
> ^
> ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
> return self._sc._jvm.Time(long(timestamp * 1000))
>   ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 
> undefined name 'xrange'
> for i in xrange(50):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 
> undefined name 'xrange'
> for j in xrange(5):
>  ^
> ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 
> undefined name 'xrange'
> for k in xrange(20022):
>  ^
> 20    F821 undefined name 'raw_input'
> 20
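
A hypothetical forward-compatibility shim for the Python-2-only names flagged 
above (whether such aliasing is the right fix for each affected file is a 
separate question):
{code:python}
import sys

# On Python 3, alias the missing Python 2 builtins to their replacements so the
# names flagged by flake8 are defined in both interpreters.
if sys.version_info[0] >= 3:
    raw_input = input   # Python 3 renamed raw_input() to input()
    unicode = str       # Python 3 str is already unicode
    xrange = range      # Python 3 range is lazy, like Python 2 xrange
    long = int          # Python 3 int has arbitrary precision
{code}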



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org