[GitHub] [incubator-hudi] bhasudha commented on issue #689: [WIP] [HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries

GitBox Mon, 06 Jan 2020 14:49:03 -0800

bhasudha commented on issue #689: [WIP] [HUDI-25] Optimize 
HoodieInputFormat.listStatus for faster Hive Incremental queries
URL: https://github.com/apache/incubator-hudi/pull/689#issuecomment-571349958
 
 
   I was able to run tests using spark.sql() against some of the production 
tables (both MOR and COW types). I used  --conf 
spark.sql.hive.convertMetastoreParquet=false so Hive serDe is used instead. 
Below is a flavor of queries that I tested. The results match between pre-fix 
and post-fix hudi-spark-bundle jars. 
   
   Snapshot queries
   =============
   simple count:
   spark.sql("select count(*) from tableA where datestr = '2019-12-10'").show()
   
   non-hudi hudi datasets join:
   spark.sql("select m.col1 as colA, t.col2 as colB from table1 m left join 
table2 as t on t._row_key = m.col1 and t.datestr >= '2016-01-01' join table3 c 
on m.col4 = c.col5 where c.col6 = 'XYZ'").show()
   
   non-hudi non-hudi datasets join:
   spark.sql("select o.col1, count(distinct e.col2) from tableA o join tableB e 
on o.id = e.id where to_date(e.col1) >= date_sub(current_date, 10) or 
to_date(e.col3) >= date_sub(current_date, 10) group by 1 order by 2 
desc").show()
   
   hudi hudi datasets join:
   spark.sql("select t.id, count(t.load) as total_count FROM tableT t LEFT JOIN 
tableO o on t.id = o.id AND o.datestr > '2019-12-28' AND NOT o.isactive WHERE 
t.datestr > '2019-12-28' AND NOT t.isactive group by 1 order by 1,2 
desc").show()
   
   group by, order and rank:
   spark.sql("select * from ( select *, rank() over ( partition by rg order by 
total_items desc ) as row_number from ( select rg, usr, count(*) as total_items 
from tableA where date(datestr) >= date('2019-10-11') and date(datestr) < 
date('2019-10-16') and event = 'complete' and  SUBSTRING_INDEX(rg, '.',1) = 
'adhoc' group by 1,2 order by 1, count(*) desc ) ) where row_number <= 
1").show()
   
   Incremental queries
   ==============
   spark.sql("select name, count(*) from  tableA where event_status = 
'complete'  and `_hoodie_commit_time` > '20200101235440' group by 1").show()


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bhasudha commented on issue #689: [WIP] [HUDI-25] Optimize HoodieInputFormat.listStatus for faster Hive Incremental queries

Reply via email to