[ 
https://issues.apache.org/jira/browse/SPARK-47034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817123#comment-17817123
 ] 

Bruce Robbins commented on SPARK-47034:
---------------------------------------

I wonder if this is SPARK-45592 (and, relatedly, SPARK-45282), which existed as 
a bug in 3.5.0 but is fixed on master and branch-3.5.

> join between cached temp tables result in missing entries
> ---------------------------------------------------------
>
>                 Key: SPARK-47034
>                 URL: https://issues.apache.org/jira/browse/SPARK-47034
>             Project: Spark
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 3.5.0
>            Reporter: shurik mermelshtein
>            Priority: Major
>
> we create several temp tables (views) by loading several delta tables and 
> joining between them. 
> those views are used for calculation of different metrics. each metric 
> requires different views to be used. some of the more popular views are 
> cached for better performance. 
> we have noticed that once we upgraded from spark 3.4.2  to spark 3.5.0 some 
> of the join started to fail.
> we can reproduce a case were we have 2 data frames (views) (this is not the 
> real names  / values we use. this is just for the example)
>  # users with the column user_id, campaign_id, user_name.
> we make sure it has a single entry
> '111111', '22222', 'Jhon Doe'
>  # actions with the column user_id, campaign_id, action_id, action count
> we make sure it has a single entry
> '111111', '22222', 'clicks', 5
>  
>  # users view can be filtered for user_id = '111111' or/and campaign_id = 
> '22222' and it will find the existing single row
>  # actions view can be filtered for user_id = '111111' or/and campaign_id = 
> '22222' and it will find the existing single row
>  # users and actions can be inner join by user_id *OR* campaign_id and the 
> join will be successful. 
>  # users and actions can *not* be inner join by user_id *AND* campaign_id. 
> The join results in no entry.
>  # if we write both of the views to S3 and read them back to new data frames, 
> suddenly the join is working.
>  # if we disable AQE the join is working
>  # running checkpoint on the views does not make join #4 work



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to