Christopher created SPARK-38454:
-----------------------------------

             Summary: Partition Data Type Prevents Filtering Sporadically
                 Key: SPARK-38454
                 URL: https://issues.apache.org/jira/browse/SPARK-38454
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.2.0
            Reporter: Christopher


A pipeline (an Airflow DAG) that had been running successfully in +production+ 
for 72+ hours started failing with the same error on two different queries, the 
only difference between them being the table. We believe the root cause of the 
error is 


{quote}Caused by: MetaException(message:Filtering is supported only on 
partition keys of type string){quote}
 

We've seen this error resolve itself on task retry attempts, but the latest 
occurrence was not resolved on retry, and all subsequent Airflow DAG runs 
failed. The queries that trigger this error are 
{quote}select * from db.cleansed_layer_table  where 
(`dataset`='20220305185000_4d' AND `date_partition`=CAST('2022-03-05' as DATE));


select * from db.raw_layer_table  where (`date_partition`=CAST('2022-03-05' as 
DATE) AND `dataset`='20220305185000_4d')
{quote}
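Our reading of the error is that the Hive metastore can only evaluate pushed-down partition filters against STRING partition keys, so the predicate on the DATE-typed date_partition cannot be pushed and the metastore raises the MetaException. A minimal plain-Python sketch of that restriction (the function name and logic here are ours for illustration, not Spark's or Hive's actual code):

```python
# Hypothetical sketch of the metastore restriction behind the MetaException:
# pushed-down partition filters are only supported on STRING partition keys.

def can_push_partition_filter(key_type: str) -> bool:
    """Mirror the metastore rule: only string partition keys are filterable."""
    return key_type.lower() == "string"

# The two partition predicates from the failing queries above.
filters = [
    ("dataset", "string", "dataset = '20220305185000_4d'"),
    ("date_partition", "date", "date_partition = CAST('2022-03-05' AS DATE)"),
]

for name, key_type, expr in filters:
    if can_push_partition_filter(key_type):
        print(f"pushed to metastore: {expr}")
    else:
        print(f"not pushable ({name} is {key_type}): metastore raises "
              "MetaException(message:Filtering is supported only on "
              "partition keys of type string)")
```

This also matches the fix we observed: once date_partition is a STRING, both predicates are pushable.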
 

The date_partition field was a DATE type when this error started occurring. The 
task writes to and queries the raw layer before the cleansed layer is written 
and queried.

 

The first task failure was caused by the cleansed-layer query, and the 
subsequent ones all failed on the raw-layer query. The inconsistent behavior of 
the pipeline is our highest concern; this pipeline had 35 successful DAG runs 
in Airflow before the failures began.

 

The error message suggests a workaround:
{quote}{{You can set the Spark configuration setting 
spark.sql.hive.manageFilesourcePartitions to false to work around this problem}}
{quote}
but this resulted in too large a performance hit to keep. 
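For reference, the suggested workaround would be applied as a Spark configuration entry, e.g. in spark-defaults.conf or via --conf on spark-submit:

```
# Stop managing file-source partition metadata in the Hive metastore;
# partitions are then discovered by listing files, which is what made
# this setting too slow for our workload.
spark.sql.hive.manageFilesourcePartitions=false
```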

 

We've changed the field to a STRING in our +development+ environment and have 
had 78 consecutive successful task runs. We've paused that test for now in 
favor of filtering only on dataset, which we just started running.

 

Is our assessment that we will experience higher reliability by changing the 
data type of date_partition to STRING reasonable?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
