You can patch HIVE-3603 into your deployment so that you can make use of scan.setCacheBlocks(false).
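In case it helps, here is a rough sketch of what the Hive session might look like once the patch is applied. The two property names are an assumption on my part, based on how I recall the patch wiring scan settings through the job configuration, so please verify them against the HIVE-3603 patch you actually apply. The value 1000 is simply the one from your MR code.

```
-- Assumption: HIVE-3603 has been patched into the Hive 0.9 build.
-- Assumption: the patch reads these two properties (names unverified).
set hbase.scan.cache=1000;        -- intended equivalent of scan.setCaching(1000)
set hbase.scan.cacheblock=false;  -- intended equivalent of scan.setCacheBlocks(false)
select count(*) from table;
```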
Cheers

On Mon, Feb 10, 2014 at 10:56 AM, java8964 <[email protected]> wrote:

> Hi,
>
> I know this has been asked before. I googled around this topic and tried
> to understand as much as possible, but I got different answers in
> different places. So I would like to describe what I have faced and ask
> if someone can help me again on this topic.
>
> I created one table with one column family with 20+ columns in the hive.
> It is populated with around 150M records from a 20G csv file. What I want
> to check is how fast a full scan of the Hbase table can run in an MR job.
>
> It is running in a 10-node hadoop cluster (Hadoop 1.1.1 + Hbase 0.94.3 +
> Hive 0.9): 8 of them are Data + Task nodes, one is NN and Hbase master,
> and another one is running the 2nd NN. 4 of the 8 data nodes also run
> Hbase region servers.
>
> I use the following code example to get the row count from an MR job:
> http://hbase.apache.org/book/mapreduce.example.html
>
> At first, the mapper tasks ran very slowly, as I commented out the
> following 2 lines on purpose:
>
>   scan.setCaching(1000);       // 1 is the default in Scan, which will be bad for MapReduce jobs
>   scan.setCacheBlocks(false);  // don't set to true for MR jobs
>
> Then I added the above 2 lines back and the job ran almost 10X faster
> than the first run. That's good; it proved to me that the above 2 lines
> are important for an Hbase full scan.
>
> Now the question comes to Hive.
>
> I already created the table in Hive linking to the Hbase table, then I
> started my hive session like this:
>
>   hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-0.9.0.jar,$HIVE_HOME/lib/hbase-0.94.3.jar,$HIVE_HOME/lib/zookeeper-3.4.5.jar,$HIVE_HOME/lib/guava-r09.jar -hiveconf hbase.master=Hbase_master:port
>
> If I run this query "select count(*) from table", I can see the mappers'
> performance is very bad, almost as bad as my 1st run above.
>
> I searched this mailing list, and it looks like there is a setting in the
> Hive session to change the scan caching size, same as the 1st line of the
> above code, from here:
>
> http://mail-archives.apache.org/mod_mbox/hbase-user/201110.mbox/%3CCAGpTDNfn11jZAJ2mfboEqkfudXaU9HGsY4b=2x1spwf4qmu...@mail.gmail.com%3E
>
> So I added the following setting in my hive session:
>
>   set hbase.client.scanner.caching=1000;
>
> To my surprise, after this setting in the hive session, the new MR job
> generated from the Hive query is still very slow, same as before the
> setting.
>
> Here is what I found so far:
>
> 1) In my own MR code, both before and after I added the 2 lines of code,
> I saw this setting in the job.xml of the MR job:
> hbase.client.scanner.caching=1. So this setting is the same in both
> runs, but the performance improved greatly after the code change.
>
> 2) In the hive run, I saw the setting "hbase.client.scanner.caching"
> changed from 1 to 1000 in job.xml, which is what I set in the hive
> session, but the performance did not change much. So the setting was
> changed, but it didn't help the performance as I expected.
>
> My questions are the following:
>
> 1) Is there any way in Hive (0.9) to do the same as the 1st line of the
> code change? From google and the hbase document, it looks like the above
> configuration is the one, but it didn't help me.
>
> 2) Even assuming the above setting is correct, why do we have this Hive
> Jira to fix the Hbase scan cache, marked as fixed ONLY in Hive 0.12? The
> Jira ticket is here: https://issues.apache.org/jira/browse/HIVE-3603
>
> 3) Is there any hive setting that can do the same as the 2nd line of the
> code change above? If so, what is it? I googled around and cannot find
> one.
>
> Thanks
>
> Yong
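
A back-of-the-envelope note on why the scan.setCaching(1000) line in the quoted MR job matters so much: scanner caching is the number of rows fetched per next() round trip, so the number of round trips for a full scan is roughly the row count divided by the caching value. The sketch below only does that arithmetic for the 150M rows mentioned above. The class and method names are made up for illustration, and it ignores scanner open/close calls per region and any result size limits, so treat the numbers as an order-of-magnitude estimate only.

```java
public class ScanRpcEstimate {

    // Rough count of scanner next() round trips needed to read `rows` rows
    // when each round trip returns up to `caching` rows (ceiling division).
    static long nextCalls(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 150000000L; // ~150M records, as stated in the quoted message

        // Default caching of 1: about one round trip per row.
        System.out.println("caching=1    -> " + nextCalls(rows, 1) + " round trips");

        // Caching of 1000: about one round trip per thousand rows.
        System.out.println("caching=1000 -> " + nextCalls(rows, 1000) + " round trips");
    }
}
```

The observed speedup (about 10X) is much smaller than the 1000X drop in round trips, which is expected: once the per-row RPC overhead is gone, disk reads and the mappers' own work dominate the run time.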
