In 0.94, the RegionServer UI has a metrics table. In it you can see blockCacheHitCount, blockCacheMissCount, etc. There may be a difference in these values between your scan() and get() runs.
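
One way to compare the two cases is to note both counters before and after each test and look at the hit ratio over that interval. Here is a minimal sketch in plain Java, assuming the counter values are copied by hand from the UI. The class and method names are made up for illustration and are not part of HBase, and the numbers in main() are invented.

```java
public class BlockCacheRatio {

    /**
     * Hit ratio over one test run, computed from blockCacheHitCount and
     * blockCacheMissCount values noted before and after the run.
     * Returns 0.0 when the cache was not touched during the interval.
     */
    public static double hitRatio(long hitsBefore, long missesBefore,
                                  long hitsAfter, long missesAfter) {
        long hits = hitsAfter - hitsBefore;
        long misses = missesAfter - missesBefore;
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        // Invented counter values, for illustration only.
        System.out.printf("gets: %.3f%n", hitRatio(1000, 500, 1200, 1300));
        System.out.printf("scan: %.3f%n", hitRatio(1200, 1300, 11200, 1310));
    }
}
```

A much lower ratio for the gets than for the scan would suggest that the random reads are mostly missing the block cache.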

Regards
Ram

> -----Original Message-----
> From: Jean-Marc Spaggiari [mailto:[email protected]]
> Sent: Thursday, June 28, 2012 4:44 PM
> To: [email protected]
> Subject: Re: Scan vs Put vs Get
>
> Wow. First, thanks a lot all for jumping into this.
>
> Let me try to reply to everyone in a single post.
>
> > How many Gets do you batch together in one call?
> I tried with multiple different values from 10 to 3000 with similar
> results.
> Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
> Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
> Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
> Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)
>
> > Is this equal to the Scan#setCaching() value that you are using?
> The scan call is done after the get test, so I can't set the cache for
> the scan before I do the gets. Also, I tried to run them separately
> (one time only the put, one time only the get, etc.), so I did not
> find a way to set up the cache for the get.
>
> > If both are the same, you can be sure that the number of network
> > calls is almost the same.
> Here are the results for 10 000 gets and 10 000 scan.next(). Each time
> I access the result to be sure they are sent to the client.
> (gets) Time to read 10000 lines : 36620.0 mseconds (273 lines/seconds)
> (scan) Time to read 10000 lines : 119.0 mseconds (84034 lines/seconds)
>
> > [Block caching is enabled?]
> Good question. I don't know :( Is it enabled by default? How can I
> verify or activate it?
>
> > Also have you tried using Bloom filters?
> Not yet. They are on page 381 of Lars' book and I'm only on page 168 ;)
>
> > What's the hbase version you're using?
> I manually installed 0.94.0. I can try with another version.
>
> > Is it repeatable?
> Yes. I tried many, many times by adding some options, closing some
> processes on the server side, removing one datanode, adding one, etc.
> I can see some small variations, but still in the same range.
> I was able to move from 200 rows/second to 300 rows/second, but that's
> not really a significant improvement. Also, here are the results for 7
> iterations of the same code.
>
> Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
> Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
> Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
> Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
> Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
> Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
> Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)
>
> > If the locations are wrong (region moved) you will have a retry loop
> I have one dead region server. It's a server I brought down a few days
> ago because it was too slow, but it's still on the hbase web
> interface. However, if I look at the table, there is no table region
> hosted on this server. Hadoop was also removed from it, so it's
> saying one dead node.
>
> > Do you have anything in the logs?
> Nothing special. Only some "Block cache LRU eviction" entries.
>
> > Could you share as well the code
> Everything is at the end of this post.
>
> > You can also check the cache hit and cache miss statistics that
> > appear on the UI?
> Can you please tell me how I can find that? I was not able to find
> that on the web UI. Where should I look?
>
> > In your random scan how many Regions are scanned
> I only have 5 region servers and 12 table regions, so I guess all the
> servers are called.
>
> So here is the code for the gets. I removed the KeyOnlyFilter because
> it's not improving the results.
>
> JM
>
> http://pastebin.com/K75nFiQk (for syntax highlighting)
>
> HTable table = new HTable(config, "test3");
>
> for (int iteration = 0; iteration < 10; iteration++)
> {
>     final int linesToRead = 1000;
>     System.out.println(new java.util.Date() + " Processing iteration "
>             + iteration + "... ");
>     Vector<Get> gets = new Vector<Get>(linesToRead);
>
>     for (long l = 0; l < linesToRead; l++)
>     {
>         byte[] array1 = new byte[24];
>         for (int i = 0; i < array1.length; i++)
>             array1[i] = (byte)Math.floor(Math.random() * 256);
>         Get g = new Get(array1);
>         gets.addElement(g);
>
>         processed++;
>     }
>     Object[] results = new Object[gets.size()];
>
>     long timeBefore = System.currentTimeMillis();
>     table.batch(gets, results);
>     long timeAfter = System.currentTimeMillis();
>
>     float duration = timeAfter - timeBefore;
>     System.out.println("Time to read " + gets.size() + " lines : "
>             + duration + " mseconds ("
>             + Math.round(((float)linesToRead / (duration / 1000)))
>             + " lines/seconds)");
>
>     for (int i = 0; i < results.length; i++)
>     {
>         if (results[i] instanceof KeyValue)
>             if (!((KeyValue)results[i]).isEmptyColumn())
>                 // co BatchExample-9-Dump Print all results.
>                 System.out.println("Result[" + i + "]: " + results[i]);
>     }
> }
>
> 2012/6/28, Ramkrishna.S.Vasudevan <[email protected]>:
> > Hi
> >
> > You can also check the cache hit and cache miss statistics that
> > appear on the UI?
> >
> > In your random scan, how many Regions are scanned? With gets it may
> > be many, due to randomness.
> >
> > Regards
> > Ram
> >
> >> -----Original Message-----
> >> From: N Keywal [mailto:[email protected]]
> >> Sent: Thursday, June 28, 2012 2:00 PM
> >> To: [email protected]
> >> Subject: Re: Scan vs Put vs Get
> >>
> >> Hi Jean-Marc,
> >>
> >> Interesting.... :-)
> >>
> >> Adding to Anoop's questions:
> >>
> >> What's the hbase version you're using?
> >>
> >> Is it repeatable? I mean, if you try the same "gets" twice with the
> >> same client, do you have the same results? I'm asking because the
> >> client caches the locations.
> >>
> >> If the locations are wrong (region moved) you will have a retry
> >> loop, and it includes a sleep. Do you have anything in the logs?
> >>
> >> Could you share as well the code you're using to get the ~100 ms
> >> time?
> >>
> >> Cheers,
> >>
> >> N.
> >>
> >> On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John <[email protected]>
> >> wrote:
> >> > Hi
> >> > How many Gets do you batch together in one call? Is this equal to
> >> > the Scan#setCaching() value that you are using?
> >> > If both are the same, you can be sure that the number of network
> >> > calls is almost the same.
> >> >
> >> > Also, you are giving random keys in the Gets. The scan will always
> >> > be sequential. It seems your get scenario does very random reads,
> >> > resulting in too many reads of HFile blocks from HDFS. [Block
> >> > caching is enabled?]
> >> >
> >> > Also have you tried using Bloom filters? ROW blooms might improve
> >> > your get performance.
> >> >
> >> > -Anoop-
> >> > ________________________________________
> >> > From: Jean-Marc Spaggiari [[email protected]]
> >> > Sent: Thursday, June 28, 2012 5:04 AM
> >> > To: user
> >> > Subject: Scan vs Put vs Get
> >> >
> >> > Hi,
> >> >
> >> > I have a small piece of code, for testing, which is putting 1M
> >> > lines in an existing table, getting 3000 lines and scanning 10000.
> >> >
> >> > The table is one family, one column.
> >> >
> >> > Everything is done randomly. Put with random key (24 bytes), fixed
> >> > family and fixed column names with random content (24 bytes).
> >> >
> >> > Get (batch) is done with random keys and scan with
> >> > RandomRowFilter.
> >> >
> >> > And here are the results.
> >> > Time to insert 1000000 lines: 43 seconds (23255 lines/seconds)
> >> > That's correct for my needs based on the poor performance of the
> >> > servers in the cluster. I'm fine with the results.
> >> >
> >> > Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
> >> > This is way too low. I don't understand why. So I tried the random
> >> > scan because I'm not able to figure out the issue.
> >> >
> >> > Time to read 10000 lines: 108.0 mseconds (92593 lines/seconds)
> >> > This is impressive! I added that after I failed with the get. I
> >> > moved from 262 lines per second to almost 100K lines/second!!!
> >> > It's awesome!
> >> >
> >> > However, I'm still wondering what's wrong with my gets.
> >> >
> >> > The code is very simple. I'm using Get objects that I'm executing
> >> > in a batch. I tried to add a filter but it's not helping. Here is
> >> > an extract of the code.
> >> >
> >> > for (long l = 0; l < linesToRead; l++)
> >> > {
> >> >     byte[] array1 = new byte[24];
> >> >     for (int i = 0; i < array1.length; i++)
> >> >         array1[i] = (byte)Math.floor(Math.random() * 256);
> >> >     Get g = new Get(array1);
> >> >     gets.addElement(g);
> >> > }
> >> > Object[] results = new Object[gets.size()];
> >> > System.out.println(new java.util.Date() + " \"gets\" created.");
> >> > long timeBefore = System.currentTimeMillis();
> >> > table.batch(gets, results);
> >> > long timeAfter = System.currentTimeMillis();
> >> >
> >> > float duration = timeAfter - timeBefore;
> >> > System.out.println("Time to read " + gets.size() + " lines : "
> >> >         + duration + " mseconds ("
> >> >         + Math.round(((float)linesToRead / (duration / 1000)))
> >> >         + " lines/seconds)");
> >> >
> >> > What's wrong with it? I can't add setBatch, nor can I add
> >> > setCaching, because it's not a scan. I tried with different
> >> > numbers of gets but it's almost always the same speed. Am I using
> >> > it the wrong way? Does anyone have any advice to improve that?
> >> >
> >> > Thanks,
> >> >
> >> > JM
