Hi,

You’re correct - that is not a valid rewrite.
Both tables have to be shuffled across with no reductions, because of the OR clause.

Cheers,
Gopal

On 5/11/15, 10:43 AM, "Eugene Koifman" <ekoif...@hortonworks.com> wrote:

>This isn’t a valid rewrite.
>if a(x,y) has 1 row (1,2) and b(x,z) has 1 row (1,1) then the 1st query
>will produce 1 row
>but the 2nd query with subselects will not.
>
>On 5/11/15, 10:13 AM, "Gopal Vijayaraghavan" <gop...@apache.org> wrote:
>
>>Hi,
>>
>>> I change the sql where condition to (where t.update_time >=
>>>'2015-05-04') , the sql can return result for a while. Because
>>>t.update_time
>>> >= '2015-05-04' can filter many row when table scan. But why change
>>>where condition to
>>> (where t.update_time >= '2015-05-04' or length(t8.end_user_id)>0) ,the
>>>sql run forever as follows:
>>
>>
>>The OR clause is probably causing the problems.
>>
>>We're probably not pushing down the OR clauses down to the original table
>>scans.
>>
>>This is most likely a hive PPD miss where you do something like
>>
>>select a.*,b.* from a,b where a.x = b.x and (a.y = 1 or b.z = 1);
>>
>>where it doesn't get planned as
>>
>>select a1.*, b1.* from (select a.* from a where a.y=1) a1, (select b.*
>>from b where b.z = 1) b1 where a1.x = b1.x;
>>
>>instead gets planned as a full-scan JOIN, then a filter.
>>
>>Can you spend some time and try to rewrite down your case to something
>>like the above queries?
>>
>>If that works, then file a JIRA.
>>
>>Cheers,
>>Gopal
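
The counterexample from the thread can be checked mechanically. The following is a minimal sketch that runs both queries against the one-row tables a(x,y) = (1,2) and b(x,z) = (1,1). It uses Python's built-in sqlite3 module instead of Hive, purely as an assumption to keep the example self-contained. The join semantics being compared are plain SQL, but nothing here says anything about how Hive plans either query.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, y INTEGER);
    CREATE TABLE b (x INTEGER, z INTEGER);
    INSERT INTO a VALUES (1, 2);  -- the single row of a from the thread
    INSERT INTO b VALUES (1, 1);  -- the single row of b from the thread
""")

# 1st query: join, then filter on (a.y = 1 OR b.z = 1).
original = conn.execute(
    "SELECT a.*, b.* FROM a, b "
    "WHERE a.x = b.x AND (a.y = 1 OR b.z = 1)"
).fetchall()

# 2nd query: each side of the OR pushed into its own subselect.
# This keeps only rows where a.y = 1 AND b.z = 1, which is stricter than the OR.
rewritten = conn.execute(
    "SELECT a1.*, b1.* "
    "FROM (SELECT a.* FROM a WHERE a.y = 1) a1, "
    "     (SELECT b.* FROM b WHERE b.z = 1) b1 "
    "WHERE a1.x = b1.x"
).fetchall()

print("original: ", original)
print("rewritten:", rewritten)
```

The first query returns the joined row because b.z = 1 satisfies the OR, while the second returns nothing because the a-side subselect has already discarded (1,2). Pushing each branch of an OR down to its own table turns the condition into an AND, so the two queries are not equivalent.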