Hi Team,

Basically, we have all of our data in Hive tables, and until now we have been processing it with Hive on MapReduce. Now that HiveContext can run Hive queries on Spark, we are taking these complex Hive scripts and running them through a hivecontext.sql(sc.textFile(hivescript)) kind of approach, i.e. running the existing Hive queries on Spark without writing any Scala yet. Even just moving the Hive queries onto Spark like this is already showing a big difference in runtime compared to running them on MapReduce.
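
Concretely, the way we drive a script today is roughly this (only a minimal sketch under assumptions: the HDFS path is hypothetical, and since hc.sql takes one statement at a time, we split the file naively on ';'):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RunHiveScript {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RunHiveScript"))
    val hc = new HiveContext(sc)

    // Read the whole script (hypothetical path), then split it into
    // individual statements; a naive split on ';' is enough for our scripts.
    val script = sc.textFile("hdfs:///path/to/our_script.hql").collect().mkString("\n")
    script.split(";").map(_.trim).filter(_.nonEmpty).foreach { stmt =>
      hc.sql(stmt)
    }
  }
}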

So, since we already have the Hive scripts, should we just let hc.sql run these complex scripts, given that hc.sql is able to do it?

Or is that not best practice? Even though Spark can do it, is it still better to load each of those Hive tables into Spark, build RDDs/DataFrames, and write Scala code to get the same functionality we currently have in Hive?
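
If we did rewrite in Scala, I assume it would look something like this (a rough sketch with made-up table and column names — table1, table4, id, status, amount — and hc being the HiveContext from above):

// Load the Hive tables as DataFrames (names are placeholders).
hc.sql("USE db")
val t1 = hc.table("table1")
val t4 = hc.table("table4")

// Reproduce the join/filter logic the Hive script expresses in SQL.
val result = t1
  .join(t4, t1("id") === t4("id"))
  .filter(t4("status") === "ACTIVE")
  .select(t1("id"), t4("amount"))

// Write back into the destination Hive table.
result.write.mode("overwrite").saveAsTable("desttable")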

It is becoming difficult for us to decide whether to leave it to hc.sql to run the complex scripts as well, or whether we have to code it in Scala. Will the effort of that manual rewrite be worth it in terms of performance?

An example of one of our scripts:

use db;
create temporary function tempfunction1 as 'com.fgh.jkl.TestFunction';

create table desttable (...);  -- the destination table in Hive
insert overwrite table desttable
select (big complex transformations, using the Hive UDF)
from table1, table2, table3
join table4 on (some complex condition)
join table7 on (another complex condition)
where (complex filtering);
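
If we stay on the hc.sql route, I assume the same script could be issued statement by statement like this (sketch only; the columns id, amount, status are invented just to keep the statements complete, the real SELECT is much bigger):

hc.sql("USE db")
hc.sql("CREATE TEMPORARY FUNCTION tempfunction1 AS 'com.fgh.jkl.TestFunction'")

// Placeholder columns and conditions; the real transformations and filters go here.
hc.sql("""
  INSERT OVERWRITE TABLE desttable
  SELECT tempfunction1(t1.amount) AS transformed_amount
  FROM table1 t1
  JOIN table4 t4 ON t1.id = t4.id
  JOIN table7 t7 ON t4.id = t7.id
  WHERE t7.status = 'ACTIVE'
""")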

So please help: what would be the best approach, and why should I not just hand the entire script to HiveContext to build its own RDDs and run on Spark, given that it is able to do so?

Because all the examples I see online only show something like hc.sql("select * from table1") and nothing more complex than that.
