[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in char type

2021-11-01 Thread frankli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437127#comment-17437127
 ] 

frankli commented on SPARK-37051:
-

I know this SQL can work, but this behavior differs from MySQL and 
PostgreSQL.
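The difference can be shown without Spark at all: ANSI-style char(n) comparison in MySQL/PostgreSQL ignores trailing blanks, while a byte-wise equality on the padded value does not. A minimal sketch in plain Python (illustrative only, not Spark or MySQL code):

```python
# char(50) stores the value right-padded with spaces to length 50.
stored = "Music".ljust(50)

# Byte-wise equality, as a pushed-down EqualTo filter would evaluate it,
# fails because of the trailing blanks:
print(stored == "Music")  # False

# MySQL/PostgreSQL compare char values ignoring trailing blanks, which is
# equivalent to stripping them before comparing:
print(stored.rstrip(" ") == "Music".rstrip(" "))  # True
```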

> The filter operator gets wrong results in char type
> ---
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on the TPC-DS data, the filter operator 
> returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type. 
> Data is inserted by Hive, and queried by Spark.
> I guess that the char(50) type retains redundant blanks after the actual 
> word.
> This affects the boolean value of "x.equals(y)" and produces wrong 
> results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
> i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
> i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
> i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music         
> )]
> ReadSchema: struct
> (2) Filter
> Input [22]: (the same 22 attributes as the Output above)
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music         ))
> (3) CollectLimit
> Input [22]: (the same 22 attributes as the Output above)
> Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-26 Thread frankli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434201#comment-17434201
 ] 

frankli edited comment on SPARK-37051 at 10/26/21, 8:48 AM:


This scenario also occurs on Parquet. [~dongjoon]

Spark 3.1 does padding on both the writer and the reader side.

So, Spark 3.1 cannot read Hive data written without padding, while Spark 2.4 works well.
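A hedged model of that mismatch in plain Python (not Spark internals; the padding flags are hypothetical): the equality filter only matches when the writer and the reader agree about padding.

```python
CHAR_LEN = 50

def stored_value(writer_pads: bool) -> str:
    # Value as it sits in the file; per the report, Hive wrote it unpadded.
    return "Music".ljust(CHAR_LEN) if writer_pads else "Music"

def filter_literal(reader_pads: bool) -> str:
    # Literal as the reading engine prepares it for the equality filter.
    return "Music".ljust(CHAR_LEN) if reader_pads else "Music"

# Spark 2.4 reading unpadded Hive data: neither side pads -> rows match.
print(stored_value(False) == filter_literal(False))  # True

# Spark 3.1 reading the same data: the reader pads the literal -> no match,
# so the filter returns an empty row set.
print(stored_value(False) == filter_literal(True))   # False
```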







[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-26 Thread frankli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434201#comment-17434201
 ] 

frankli commented on SPARK-37051:
-

This scenario also occurs on Parquet.

Spark 3.1 does padding on both the writer and the reader side.

So, Spark 3.1 cannot read Hive data written without padding, while Spark 2.4 works well.







[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Description: (revision adds "Data is inserted by hive, and queried by Spark." to the reproduction steps; the full description is quoted at the top of this thread)

[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430450#comment-17430450
 ] 

frankli commented on SPARK-37051:
-

It seems to be affected by the right padding introduced by [SPARK-34192][SQL]:

[https://github.com/apache/spark/commit/d1177b52304217f4cb86506fd1887ec98879ed16]

[~yaoqiang]







[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Affects Version/s: 3.3.0







[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Affects Version/s: 3.2.1

> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on  the TPCDS data, the filter operator 
> returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type.
> I guest that the char(50) type will remains redundant blanks after the actual 
> word.
> It will affect the boolean value of  "x.equals(Y)", and results in wrong 
> results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced by a few steps.
> >>> desc t2_orc;
> +---++--+--+
> | col_name | data_type | comment |
> +---++--+--+
> | a | string       | NULL |
> | b | char(50)  | NULL |
> | c | int            | NULL |
> +---++--+–+
> >>> select * from t2_orc where a='a';
> ++++--+
> | a | b | c |
> ++++--+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> ++++–+
> >>> select * from t2_orc where b='b';
> ++++--+
> | a | b | c |
> ++++--+
> ++++--+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music         
> )]
>  ReadSchema: 
> struct
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music         ))+
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  
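The mismatch reported above can be sketched in plain Python (an illustrative model only, not Spark internals; the names `store_as_char` and `CHAR_LEN` are hypothetical): a CHAR(50) value read back with its blank padding intact fails an exact equality check against the unpadded literal, which is consistent with the pushed filter `EqualTo(i_category,Music         )` matching no rows.

```python
# Illustrative sketch of CHAR(n) blank-padding vs. exact string equality.
# Not Spark code; names here are hypothetical.
CHAR_LEN = 50

def store_as_char(value: str, length: int = CHAR_LEN) -> str:
    """CHAR(n) semantics: the stored value is blank-padded to n characters."""
    return value.ljust(length)

stored = store_as_char("Music")   # "Music" followed by 45 trailing blanks
literal = "Music"

print(stored == literal)           # False: the padding breaks exact equality
print(stored.rstrip() == literal)  # True: trimming the padding restores the match
```

This is why the varchar column behaves correctly: varchar values are stored without padding, so the exact comparison succeeds.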



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Priority: Critical  (was: Major)



[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:


[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Summary: The filter operator gets wrong results in ORC's char type  (was: 
The filter operator gets wrong results in ORC char/varchar types)



[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:


[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430297#comment-17430297
 ] 

frankli commented on SPARK-37051:
-

[~dongjoon] Can I trouble you to take a look? Thanks a lot.
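For context on the expected behavior: the SQL standard (followed by MySQL and PostgreSQL) makes trailing blanks insignificant when comparing CHAR values, by padding the shorter operand before comparing. A minimal sketch of that pad-then-compare rule, assuming the padding behavior described in this issue (`char_equals` is a hypothetical helper, not a Spark API):

```python
# Sketch of SQL-standard CHAR comparison semantics (pad-then-compare),
# which the reporter expects Spark to follow. Hypothetical helper, not Spark code.
def char_equals(a: str, b: str) -> bool:
    """Blank-pad the shorter operand so trailing blanks are insignificant."""
    width = max(len(a), len(b))
    return a.ljust(width) == b.ljust(width)

stored = "Music".ljust(50)          # CHAR(50) value as read back with padding

print(char_equals(stored, "Music"))  # True under pad-then-compare semantics
print(stored == "Music")             # False under raw string equality
```

Under these semantics, `i_category = 'Music'` would match the padded ORC value without the user rewriting the literal.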




[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:


[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:


[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Description: 
When I try the following sample SQL on the TPC-DS data, the filter operator 
returns an empty row set (shown in the Web UI).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is of type char(50).

I guess that the char(50) type retains redundant trailing blanks after the 
actual word.

This affects the boolean value of "x.equals(y)" and produces wrong results.

By the way, Spark's tests should add more cases for the ORC format.

 

 

== Physical Plan ==
CollectLimit (3)
+- Filter (2)
 +- Scan orc tpcds_bin_partitioned_orc_2.item (1)


(1) Scan orc tpcds_bin_partitioned_orc_2.item
Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Batched: false
Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music          )]
ReadSchema: 
struct

(2) Filter
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Condition : (isnotnull(i_category#12) AND (i_category#12 = Music           ))

(3) CollectLimit
Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21]
Arguments: 100

 

  was:
When I try the following sample SQL on the TPC-DS data, the filter operator 
returns an empty row set (shown in the Web UI).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is of type char(50).

I guess that the char(50) type retains redundant trailing blanks after the 
actual word.

This affects the boolean value of "x.equals(y)" and produces wrong results.

By the way, Spark's tests should add more cases for the ORC format.

!image-2021-10-19-11-01-55-597.png|width=1085,height=499!

 


> The filter operator gets wrong results in ORC char/varchar types
> 
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Major
>
> When I try the following sample SQL on the TPC-DS data, the filter operator 
> returns an empty row set (shown in the Web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of type char(50).
> I guess that the char(50) type retains redundant trailing blanks after the 
> actual word.
> This affects the boolean value of "x.equals(y)" and produces wrong results.
> By the way, Spark's tests should add more cases for the ORC format.
>  
>  
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music          )]
> ReadSchema: 
> struct
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music           
> ))
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, 

[jira] [Created] (SPARK-37051) The filter operator gets wrong results in ORC char/varchar types

2021-10-18 Thread frankli (Jira)
frankli created SPARK-37051:
---

 Summary: The filter operator gets wrong results in ORC 
char/varchar types
 Key: SPARK-37051
 URL: https://issues.apache.org/jira/browse/SPARK-37051
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
 Environment: Spark 3.1.2

Scala 2.12 / Java 1.8
Reporter: frankli


When I try the following sample SQL on the TPC-DS data, the filter operator 
returns an empty row set (shown in the Web UI).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is of type char(50).

I guess that the char(50) type retains redundant trailing blanks after the 
actual word.

This affects the boolean value of "x.equals(y)" and produces wrong results.

By the way, Spark's tests should add more cases for the ORC format.

!image-2021-10-19-11-01-55-597.png|width=1085,height=499!

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org