[jira] [Comment Edited] (SPARK-37051) The filter operator gets wrong results in ORC's char type

frankli (Jira) Tue, 26 Oct 2021 01:49:08 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434201#comment-17434201
 ]


frankli edited comment on SPARK-37051 at 10/26/21, 8:48 AM:
------------------------------------------------------------

This scenario also occur on Parquet. [~dongjoon]

Spark3.1 will do padding for both writer and reader side.

So, Spark 3.1 cannot read Hive data without padding, while Spark 2.4 works well.


was (Author: frankli):
This scenario also occur on Parquet.

Spark3.1 will do padding for both writer and reader side.

So, Spark 3.1 cannot read Hive data without padding, while Spark 2.4 works well.

> The filter operator gets wrong results in ORC's char type
> ---------------------------------------------------------
>
>                 Key: SPARK-37051
>                 URL: https://issues.apache.org/jira/browse/SPARK-37051
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2, 3.2.1, 3.3.0
>         Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>            Reporter: frankli
>            Priority: Critical
>
> When I try the following sample SQL on  the TPCDS data, the filter operator 
> returns an empty row set (shown in web ui).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is char(50) type. 
> Data is inserted by hive, and queried by Spark.
> I guest that the char(50) type will remains redundant blanks after the actual 
> word.
> It will affect the boolean value of  "x.equals(Y)", and results in wrong 
> results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced by a few steps.
> >>> desc t2_orc;
>  +------------+-----------++------------+
> |col_name|data_type|comment|
> +------------+-----------++------------+
> |a|string      |NULL|
> |b|char(50)  |NULL|
> |c|int            |NULL|
> +------------+-----------++----------–+
> >>> select * from t2_orc where a='a';
>  +-----+---++------+
> |a|b|c|
> +-----+---++------+
> |a|b|1|
> |a|b|2|
> |a|b|3|
> |a|b|4|
> |a|b|5|
> +-----+---++----–+
> >>> select * from t2_orc where b='b';
>  +-----+---++------+
> |a|b|c|
> +-----+---++------+
>  +-----+---++------+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music         
> )]++++
>  ReadSchema: 
> struct<i_item_sk:bigint,i_item_id:string,i_rec_start_date:date,i_rec_end_date:date,i_item_desc:string,i_current_price:decimal(7,2),i_wholesale_cost:decimal(7,2),i_brand_id:int,i_brand:string,i_class_id:int,i_class:string,i_category_id:int,i_category:string,i_manufact_id:int,i_manufact:string,i_size:string,i_formulation:string,i_color:string,i_units:string,i_container:string,i_manager_id:int,i_product_name:string>
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music         ))+
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-37051) The filter operator gets wrong results in ORC's char type

Reply via email to