RE: Scan vs Put vs Get

Ramkrishna.S.Vasudevan Thu, 28 Jun 2012 04:59:59 -0700

In 0.94, the region server (RS) web UI has a metrics table. There you can see
blockCacheHitCount, blockCacheMissCount, etc. Maybe these counters move
differently when you do scan() and when you do get().
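
To compare the two access patterns, note both counters before and after each
test and look at the hit ratio of the difference. A minimal sketch in plain
Java, assuming you copy the numbers from the UI by hand (the class name, the
method name and the sample values are made up for illustration; they are not
part of HBase):

```java
// Hypothetical helper: turns the two counters from the region server
// metrics table into a hit ratio. The sample numbers are made up.
final class BlockCacheStats {

    /** Returns hits / (hits + misses), or 0.0 when no block was requested. */
    static double hitRatio(long hitCount, long missCount) {
        long total = hitCount + missCount;
        return total == 0 ? 0.0 : (double) hitCount / total;
    }

    public static void main(String[] args) {
        // Diff the counters around one test run so that earlier traffic
        // on the region server does not skew the ratio.
        long hitsBefore = 1000, missesBefore = 200;   // made-up values
        long hitsAfter = 1300, missesAfter = 900;     // made-up values

        double ratio = hitRatio(hitsAfter - hitsBefore, missesAfter - missesBefore);
        System.out.printf("block cache hit ratio for this run: %.2f%n", ratio);
    }
}
```

A low ratio for the get() run next to a high one for the scan() run would
suggest that the random gets are mostly missing the block cache.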


Regards
Ram



> -----Original Message-----
> From: Jean-Marc Spaggiari [mailto:[email protected]]
> Sent: Thursday, June 28, 2012 4:44 PM
> To: [email protected]
> Subject: Re: Scan vs Put vs Get
> 
> Wow. First, thanks a lot all for jumping into this.
> 
> Let me try to reply to everyone in a single post.
> 
> > How many Gets you batch together in one call
> I tried with multiple different values from 10 to 3000 with similar
> results.
> Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
> Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
> Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
> Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)
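
Dividing each time by the number of rows makes the pattern easier to see: from
100 rows upward the cost stays at roughly 4.5 to 4.9 ms per Get, whatever the
batch size. A small sketch that redoes the arithmetic from the figures above
(the class and method names are only illustrative):

```java
// Recomputes throughput and per-Get latency from the timings quoted above.
final class GetBatchTimings {

    /** Rows per second for a batch of `rows` Gets that took `millis` ms. */
    static double rowsPerSecond(int rows, double millis) {
        return rows / (millis / 1000.0);
    }

    /** Average wall-clock milliseconds spent per Get in the batch. */
    static double millisPerRow(int rows, double millis) {
        return millis / rows;
    }

    public static void main(String[] args) {
        int[] rows = {10, 100, 1000, 3000};
        double[] millis = {181.0, 484.0, 4739.0, 13582.0};
        for (int i = 0; i < rows.length; i++) {
            System.out.printf("%5d rows: %6.1f rows/s, %5.2f ms per row%n",
                    rows[i],
                    rowsPerSecond(rows[i], millis[i]),
                    millisPerRow(rows[i], millis[i]));
        }
    }
}
```

A near-constant per-row cost would point at work done for each Get rather than
at the number of network calls, but that is only one reading of these numbers.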
> 
> > Is this equal to the Scan#setCaching () that u are using?
> The scan call is done after the get test, so I can't set the cache for
> the scan before I do the gets. Also, I tried to run them separately (one
> time only the put, one time only the get, etc.), so I did not find a
> way to set up the cache for the get.
> 
> > If both are same u can be sure that the number of NW calls is
> > coming almost same.
> Here are the results for 10 000 gets and 10 000 scan.next(). Each time
> I access the result to be sure they are sent to the client.
> (gets) Time to read 10000 lines : 36620.0 mseconds (273 lines/seconds)
> (scan) Time to read 10000 lines : 119.0 mseconds (84034 lines/seconds)
> 
> >[Block caching is enabled?]
> Good question. I don't know :( Is it enabled by default? How can I
> verify or activate it?
> 
> > Also have you tried using Bloom filters?
> Not yet. They are on page 381 on Lars' book and I'm only on page 168 ;)
> 
> 
> > What's the hbase version you're using?
> I manually installed 0.94.0. I can try with another version.
> 
> > Is it repeatable?
> Yes. I tried many, many times, adding some options, closing some
> processes on the server side, removing one datanode, adding one, etc. I
> can see some small variations, but still in the same range. I was able
> to move from 200 rows/second to 300 rows/second, but that's not
> really a significant improvement. Also, here are the results for 7
> iterations of the same code.
> 
> Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
> Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
> Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
> Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
> Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
> Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
> Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)
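
Pooling the seven runs gives a single figure to compare against later runs. A
short sketch using the timings above (the names are illustrative only):

```java
// Pools several equally sized runs into one throughput figure.
final class RunSummary {

    /** Total rows divided by total elapsed time, in rows per second. */
    static double pooledRowsPerSecond(int rowsPerRun, double[] millisPerRun) {
        double totalMillis = 0.0;
        for (double millis : millisPerRun) {
            totalMillis += millis;
        }
        return (rowsPerRun * (double) millisPerRun.length) / (totalMillis / 1000.0);
    }

    public static void main(String[] args) {
        // Elapsed times of the seven 1000-line iterations listed above.
        double[] millis = {4171.0, 3439.0, 3953.0, 3801.0, 3680.0, 3493.0, 4549.0};
        System.out.printf("pooled rate: %.1f lines/second%n",
                pooledRowsPerSecond(1000, millis));
    }
}
```

For these seven runs the pooled rate is about 258 lines/second, with single
runs between 220 and 291.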
> 
> >If the locations are wrong (region moved) you will have a retry loop
> I have one dead region server. It's a server I brought down a few days ago
> because it was too slow, but it's still listed on the HBase web interface.
> However, if I look at the table, there is no table region hosted on
> this server. Hadoop was also removed from it, so it's reporting one dead
> node.
> 
> >Do you have anything in the logs?
> Nothing special. Only some "Block cache LRU eviction" entries.
> 
> > Could you share as well the code
> Everything is at the end of this post.
> 
> > You can also check the cache hit and cache miss statistics that
> > appear on the UI?
> Can you please tell me how I can find that? I was not able to find
> that on the web UI. Where should I look?
> 
> > In your random scan how many Regions are scanned
> I only have 5 region servers and 12 table regions. So I guess all the
> servers are called.
> 
> 
> So here is the code for the gets. I removed the KeyOnlyFilter because
> it's not improving the results.
> 
> JM
> 
> 
> 
> 
> http://pastebin.com/K75nFiQk (for syntax highlighting)
> 
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import java.util.Vector;
> 
> // Extract from the test method; "config" and "processed" are defined elsewhere.
> HTable table = new HTable(config, "test3");
> 
> for (int iteration = 0; iteration < 10; iteration++)
> {
>       final int linesToRead = 1000;
>       System.out.println(new java.util.Date() + " Processing iteration " + iteration + "... ");
>       Vector<Get> gets = new Vector<Get>(linesToRead);
> 
>       // Build one Get per random 24-byte row key.
>       for (long l = 0; l < linesToRead; l++)
>       {
>               byte[] array1 = new byte[24];
>               for (int i = 0; i < array1.length; i++)
>                       array1[i] = (byte)Math.floor(Math.random() * 256);
>               Get g = new Get(array1);
>               gets.addElement(g);
> 
>               processed++;
>       }
>       Object[] results = new Object[gets.size()];
> 
>       // Time only the batch call itself.
>       long timeBefore = System.currentTimeMillis();
>       table.batch(gets, results);
>       long timeAfter = System.currentTimeMillis();
> 
>       float duration = timeAfter - timeBefore;
>       System.out.println("Time to read " + gets.size() + " lines : " +
>               duration + " mseconds (" +
>               Math.round(((float)linesToRead / (duration / 1000))) + " lines/seconds)");
> 
>       // Print all non-empty results. A batched Get yields a Result, not a KeyValue.
>       for (int i = 0; i < results.length; i++)
>       {
>               if (results[i] instanceof Result)
>                       if (!((Result)results[i]).isEmpty())
>                               System.out.println("Result[" + i + "]: " + results[i]);
>       }
> }
> 
> 
> 2012/6/28, Ramkrishna.S.Vasudevan <[email protected]>:
> > Hi
> >
> > You can also check the cache hit and cache miss statistics that
> > appear on the UI?
> >
> > In your random scan how many Regions are scanned whereas in gets may
> > be many due to randomness.
> >
> > Regards
> > Ram
> >
> >> -----Original Message-----
> >> From: N Keywal [mailto:[email protected]]
> >> Sent: Thursday, June 28, 2012 2:00 PM
> >> To: [email protected]
> >> Subject: Re: Scan vs Put vs Get
> >>
> >> Hi Jean-Marc,
> >>
> >> Interesting.... :-)
> >>
> >> Added to Anoop questions:
> >>
> >> What's the hbase version you're using?
> >>
> >> Is it repeatable, I mean if you try twice the same "gets" with the
> >> same client do you have the same results? I'm asking because the
> >> client caches the locations.
> >>
> >> If the locations are wrong (region moved) you will have a retry
> >> loop, and it includes a sleep. Do you have anything in the logs?
> >>
> >> Could you share as well the code you're using to get the ~100 ms
> >> time?
> >>
> >> Cheers,
> >>
> >> N.
> >>
> >> On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John <[email protected]>
> >> wrote:
> >> > Hi
> >> >     How many Gets do you batch together in one call? Is this equal to
> >> > the Scan#setCaching() value that you are using?
> >> > If both are the same, you can be sure that the number of network calls
> >> > is almost the same.
> >> >
> >> > Also, you are giving random keys in the Gets. The scan will always
> >> > be sequential. It seems that in your get scenario the reads are very
> >> > random, resulting in too many reads of HFile blocks from HDFS. [Block
> >> > caching is enabled?]
> >> >
> >> > Also, have you tried using Bloom filters? ROW blooms might improve
> >> > your get performance.
> >> >
> >> > -Anoop-
> >> > ________________________________________
> >> > From: Jean-Marc Spaggiari [[email protected]]
> >> > Sent: Thursday, June 28, 2012 5:04 AM
> >> > To: user
> >> > Subject: Scan vs Put vs Get
> >> >
> >> > Hi,
> >> >
> >> > I have a small piece of code, for testing, which is putting 1M lines
> >> > in an existing table, getting 3000 lines and scanning 10000.
> >> >
> >> > The table is one family, one column.
> >> >
> >> > Everything is done randomly. Put with Random key (24 bytes), fixed
> >> > family and fixed column names with random content (24 bytes).
> >> >
> >> > Get (batch) is done with random keys and scan with RandomRowFilter.
> >> >
> >> > And here are the results.
> >> > Time to insert 1000000 lines: 43 seconds (23255 lines/seconds)
> >> > That's fine for my needs given the poor performance of the
> >> > servers in the cluster. I'm fine with the results.
> >> >
> >> > Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
> >> > This is way too low. I don't understand why. So I tried the random
> >> > scan because I'm not able to figure out the issue.
> >> >
> >> > Time to read 10000 lines: 108.0 mseconds (92593 lines/seconds)
> >> > This is impressive! I added that after I failed with the get. I
> >> > moved from 262 lines per second to almost 100K lines/second!!! It's
> >> > awesome!
> >> >
> >> > However, I'm still wondering what's wrong with my gets.
> >> >
> >> > The code is very simple. I'm using Get objects that I'm executing in
> >> > a batch. I tried to add a filter but it's not helping. Here is an
> >> > extract of the code.
> >> >
> >> >       for (long l = 0; l < linesToRead; l++)
> >> >       {
> >> >               byte[] array1 = new byte[24];
> >> >               for (int i = 0; i < array1.length; i++)
> >> >                       array1[i] = (byte)Math.floor(Math.random() * 256);
> >> >               Get g = new Get (array1);
> >> >               gets.addElement(g);
> >> >       }
> >> >       Object[] results = new Object[gets.size()];
> >> >       System.out.println(new java.util.Date () + " \"gets\" created.");
> >> >       long timeBefore = System.currentTimeMillis();
> >> >       table.batch(gets, results);
> >> >       long timeAfter = System.currentTimeMillis();
> >> >
> >> >       float duration = timeAfter - timeBefore;
> >> >       System.out.println ("Time to read " + gets.size() + " lines : "
> >> >               + duration + " mseconds (" + Math.round(((float)linesToRead /
> >> >               (duration / 1000))) + " lines/seconds)");
> >> >
> >> > What's wrong with it? I can't add setBatch, nor can I add
> >> > setCaching, because it's not a scan. I tried with different numbers
> >> > of gets but it's almost always the same speed. Am I using it the
> >> > wrong way? Does anyone have any advice to improve that?
> >> >
> >> > Thanks,
> >> >
> >> > JM
> >
> >
