Ajay Monga created SPARK-24177:
----------------------------------

             Summary: Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))
                 Key: SPARK-24177
                 URL: https://issues.apache.org/jira/browse/SPARK-24177
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.0
         Environment: Production
            Reporter: Ajay Monga
Spark SQL is returning inconsistent results for a JOIN query: across runs it returns different rows, and the column computed by a simple multiplication returns different values. The query is of the form:

SELECT second_table.date_value,
       SUM(XXX * second_table.shift_value)
FROM (
    SELECT date_value,
           SUM(value) AS XXX
    FROM first_table
    WHERE date IN ('2018-01-01', '2018-01-02')
    GROUP BY date_value
) intermediate
LEFT OUTER JOIN second_table
    ON second_table.date_value = (<logic to shift the 'date_value' from the first table, e.g. if it is a Saturday or Sunday then use Monday, otherwise the next valid working date>)
    AND second_table.date_value IN ('2018-01-02', '2018-01-03')
GROUP BY second_table.date_value

The suspicion is that the query above is executed as two sub-queries, one for first_table and one for second_table, before the join. Each result is then distributed across partitions, seemingly grouped by the join column 'date_value'. The date-shift logic in the join condition then fails to match rows in some cases where it should, primarily for the date_values at the edges of the partitions across the Spark cluster. The outcome therefore depends on how the data (the RDDs) of the individual sub-queries is partitioned in the first place.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
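For reference, a minimal sketch of the kind of weekend date-shift logic the join condition describes. The function name and the choice of Python are illustrative assumptions, not taken from the actual query; in the real query this logic would live in the ON clause (e.g. as a CASE expression or UDF). The point is that the shift itself is deterministic, so any run-to-run variation would have to come from how the two sides of the join are partitioned, not from this mapping:

```python
# Illustrative sketch (not the reporter's actual code): shift a weekend
# date to the following Monday, leave weekdays unchanged.
from datetime import date, timedelta

def next_working_day(d: date) -> date:
    """Map Saturday/Sunday to the next Monday; pass weekdays through."""
    # date.weekday(): Monday == 0 ... Saturday == 5, Sunday == 6
    if d.weekday() >= 5:
        return d + timedelta(days=7 - d.weekday())
    return d

# 2018-01-06 was a Saturday, 2018-01-07 a Sunday; both shift to Monday 2018-01-08.
print(next_working_day(date(2018, 1, 6)))  # 2018-01-08
print(next_working_day(date(2018, 1, 2)))  # 2018-01-02 (a Tuesday, unchanged)
```

Because the function is a pure mapping of its input, equal date_values on both sides of the join should always match; if they do not, the partitioning/shuffle behaviour described above is the more likely culprit.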