Thank you very much, Gopal. I got it, and I will study the slides you shared carefully.

Best regards,
--LLBian

At 2016-01-19 14:16:27, "Gopal Vijayaraghavan" <gop...@apache.org> wrote:
>
>
>>Thank you so much for your quick response. Yes, the option is used only
>>for hive-on-tez. I want to know its source and its principle.
>
>in.am=true is the better option as it computes the splits after a job has
>been submitted.
>
>Imagine you have 3 tables in your query - with in.am=false, all the splits
>have to be generated before the 1st task is spun up.
>
>with in.am=true, the 1st task can spin up when at least one of the tables
>has already generated splits. GetSplits() is not blocking across all
>tables - only within 1 table.
>
>In some cases, you can wait for the 1st task to even finish executing
>before starting the split-gen for the 2nd task, producing ~1000x speedups.
>
>For example,
>
>insert into bigtable partition(dt)
>select ... from small left outer join bigtable on
>date(small.ts) = bigtable.dt and small.txnid = bigtable.txnid
>where bigtable.txnid is null
>;
>
>With in.am = true + tez DPP, the split-gen is dynamic and will not
>generate splits for 100% of big-table (assuming small table is just today).
>
>>Maybe this resource
>>“http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/29” is very
>>useful,
>
>It has diagrams, but here's an original .pptx
>
>http://people.apache.org/~gopalv/W-235p-Pandey.pptx
>
>MD5 (W-235p-Pandey.pptx) = fd3d5c7eb6360f9654bdfbfb20031ba4
>
>
>Cheers,
>Gopal
>
>
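
To make the timing argument in the quoted explanation concrete, here is a rough toy model in Python. It is not Hive or Tez code. The function names, the table names, and the split-generation times are all made up for illustration, and it assumes that with in.am=false the splits are generated one table after another, while with in.am=true each table's split generation proceeds independently. It ignores DPP, which could shrink or skip the big table's split generation entirely.

```python
# Toy timing model only; not Hive or Tez code. All numbers are hypothetical.

def first_task_start_client_side(split_gen_seconds):
    """in.am=false: every table's splits exist before the 1st task starts.

    Assumption: splits are generated one table after another, so the
    wait is the sum of the per-table times.
    """
    return sum(split_gen_seconds.values())


def first_task_start_in_am(split_gen_seconds):
    """in.am=true: a task can start once any one table has its splits.

    Assumption: per-table split generation runs independently, so the
    wait is the fastest table's time.
    """
    return min(split_gen_seconds.values())


# Hypothetical split-generation times in seconds for a 3-table query.
split_gen_seconds = {"small": 2, "medium": 30, "bigtable": 600}

print(first_task_start_client_side(split_gen_seconds))  # 632
print(first_task_start_in_am(split_gen_seconds))        # 2
```

In this sketch the first task waits 632 seconds in the first case and 2 seconds in the second. The real gap depends on how split generation is actually scheduled, which this model does not try to capture.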