Back to BulkDeleteEndpoint: I got it to work, but why are the scanner.next()
calls executing on the priority handler queue?

Varun

On Sat, Feb 9, 2013 at 8:46 AM, lars hofhansl <[email protected]> wrote:

> The answer is "probably" :)
> It's disabled in 0.96 by default. Check out HBASE-7008 (
> https://issues.apache.org/jira/browse/HBASE-7008) and the discussion
> there.
>
> Also check out the discussion in HBASE-5943 and HADOOP-8069 (
> https://issues.apache.org/jira/browse/HADOOP-8069)
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Jean-Marc Spaggiari <[email protected]>
> To: [email protected]
> Sent: Saturday, February 9, 2013 5:02 AM
> Subject: Re: Get on a row with multiple columns
>
> Lars, should we always consider disabling Nagle? What's the downside?
>
> JM
>
> 2013/2/9, Varun Sharma <[email protected]>:
> > Yeah, I meant true...
> >
> > On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <[email protected]> wrote:
> >
> >> Should be set to true. If tcpnodelay is set to true, Nagle's is disabled.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >>  From: Varun Sharma <[email protected]>
> >> To: [email protected]; lars hofhansl <[email protected]>
> >> Sent: Saturday, February 9, 2013 12:11 AM
> >> Subject: Re: Get on a row with multiple columns
> >>
> >>
> >> Okay I did my research - these need to be set to false. I agree.
> >>
> >>
> >> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <[email protected]>
> >> wrote:
> >>
> >> I have ipc.client.tcpnodelay and ipc.server.tcpnodelay set to false, and the
> >> HBase one - hbase.ipc.client.tcpnodelay - set to true. Do these induce
> >> network latency?
> >> >
> >> >
> >> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <[email protected]>
> wrote:
> >> >
> >> >Sorry.. I meant set these two config parameters to true (not false as I
> >> state below).
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>----- Original Message -----
> >> >>From: lars hofhansl <[email protected]>
> >> >>To: "[email protected]" <[email protected]>
> >> >>Cc:
> >> >>Sent: Friday, February 8, 2013 11:41 PM
> >> >>Subject: Re: Get on a row with multiple columns
> >> >>
> >> >>Only somewhat related. Seeing the magic 40ms random read time there. Did
> >> >>you disable Nagle's?
> >> >>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in
> >> hbase-site.xml).
> >> >>
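Per Lars's correction earlier in the thread, disabling Nagle's algorithm means setting the tcpnodelay parameters to true (not false). A minimal hbase-site.xml sketch, using the parameter names as they appear in this thread:

```xml
<!-- Disable Nagle's algorithm: tcpnodelay=true sends small packets
     immediately instead of buffering them. Parameter names as used in
     this thread: hbase.ipc.client.tcpnodelay is the HBase client-side
     knob, ipc.server.tcpnodelay the Hadoop IPC server-side one. -->
<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
```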
> >> >>________________________________
> >> >>From: Varun Sharma <[email protected]>
> >> >>To: [email protected]; lars hofhansl <[email protected]>
> >> >>Sent: Friday, February 8, 2013 10:45 PM
> >> >>Subject: Re: Get on a row with multiple columns
> >> >>
> >> >>The use case is like your Twitter feed: tweets from people you follow.
> >> >>When someone unfollows, you need to delete a bunch of their tweets from
> >> >>the following feed. So it's frequent, and we are essentially running into
> >> >>some extreme corner cases like the one above. We need high write
> >> >>throughput for this, since when someone tweets, we need to fan out the
> >> >>tweet to all the followers. We need the ability to do fast deletes
> >> >>(unfollow) and fast adds (follow) and also to do fast random gets when a
> >> >>real user loads the feed. I doubt we will be able to play much with the
> >> >>schema here, since we need to support a bunch of use cases.
> >> >>
> >> >>@lars: It does not take 30 seconds to place 300 delete markers. It takes
> >> >>30 seconds to first find which of those 300 pins are in the set of
> >> >>columns present - this invokes 300 gets - and then place the appropriate
> >> >>delete markers. Note that we can have tens of thousands of columns in a
> >> >>single row, so a single get is not cheap.
> >> >>
> >> >>If we were to just place delete markers, that would be very fast. But
> >> >>when we started doing that, our random read performance suffered because
> >> >>of too many delete markers. The 90th percentile on random reads shot up
> >> >>from 40 milliseconds to 150 milliseconds, which is not acceptable for our
> >> >>use case.
> >> >>
> >> >>Thanks
> >> >>Varun
> >> >>
> >> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <[email protected]>
> >> >> wrote:
> >> >>
> >> >>> Can you organize your columns and then delete by column family?
> >> >>>
> >> >>> deleteColumn without specifying a TS is expensive, since HBase first
> >> >>> has to figure out what the latest TS is.
> >> >>>
> >> >>> Should be better in 0.94.1 or later, since deletes are batched like
> >> >>> Puts (they still need to retrieve the latest version, though).
> >> >>>
> >> >>> In 0.94.3 or later you can also use the BulkDeleteEndpoint, which
> >> >>> basically lets you specify a scan condition and then places a specific
> >> >>> delete marker for all KVs encountered.
> >> >>>
> >> >>>
> >> >>> If you wanted to get really fancy, you could hook up a coprocessor to
> >> >>> the compaction process and simply filter all KVs you no longer want
> >> >>> (without ever placing any delete markers).
> >> >>>
> >> >>>
> >> >>> Are you saying it takes 15 seconds to place 300 version delete
> >> >>> markers?!
> >> >>>
> >> >>>
> >> >>> -- Lars
> >> >>>
> >> >>>
> >> >>>
> >> >>> ________________________________
> >> >>>  From: Varun Sharma <[email protected]>
> >> >>> To: [email protected]
> >> >>> Sent: Friday, February 8, 2013 10:05 PM
> >> >>> Subject: Re: Get on a row with multiple columns
> >> >>>
> >> >>> We are given a set of 300 columns to delete. I tested two cases:
> >> >>>
> >> >>> 1) deleteColumns() - with the 's'
> >> >>>
> >> >>> This function simply adds delete markers for 300 columns; in our case,
> >> >>> typically only a fraction of these columns - around 10 - are actually
> >> >>> present. After starting to use deleteColumns, we started seeing a drop
> >> >>> in cluster-wide random read performance - the 90th percentile latency
> >> >>> worsened, and so did the 99th - probably because of having to traverse
> >> >>> delete markers. I attribute this to the profusion of delete markers in
> >> >>> the cluster. Major compactions slowed down by almost 50 percent,
> >> >>> probably because of having to clean out significantly more delete
> >> >>> markers.
> >> >>>
> >> >>> 2) deleteColumn()
> >> >>>
> >> >>> Ended up with intolerable 15-second calls, which clogged all the
> >> >>> handlers, making the cluster pretty much unresponsive.
> >> >>>
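As a toy illustration of the read-side cost described above (plain Java modeling the idea, not HBase code): until major compaction, delete markers and the cells they mask sit in the same sorted store, so a scan must examine every entry even though few cells are ultimately visible.

```java
import java.util.*;

public class DeleteMarkerSketch {
    // Toy model of an HBase store scan: cells and column delete markers
    // live in the same sorted file until major compaction removes them.
    record Cell(String qualifier, boolean deleteMarker) {}

    // Returns {visibleCells, entriesExamined}: every marker and every
    // masked cell still has to be examined on the way to the live data.
    static int[] scanRow(List<Cell> sorted) {
        int visible = 0, examined = 0;
        String deleted = null;                 // qualifier covered by a marker
        for (Cell c : sorted) {
            examined++;
            if (c.deleteMarker()) {
                deleted = c.qualifier();       // marker sorts before the cells it covers
            } else if (!c.qualifier().equals(deleted)) {
                visible++;
            }
        }
        return new int[] {visible, examined};
    }

    public static void main(String[] args) {
        List<Cell> store = List.of(
            new Cell("a", true),  new Cell("a", false),   // deleted
            new Cell("b", false),
            new Cell("c", true),  new Cell("c", false));  // deleted
        int[] r = scanRow(store);
        System.out.println(r[0] + " visible, " + r[1] + " examined");
        // prints: 1 visible, 5 examined
    }
}
```

The gap between "visible" and "examined" grows with every marker placed, which matches the percentile regression reported above.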
> >> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <[email protected]> wrote:
> >> >>>
> >> >>> > For the 300 column deletes, can you show us how the Delete(s) are
> >> >>> > constructed ?
> >> >>> >
> >> >>> > Do you use this method ?
> >> >>> >
> >> >>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> >> >>> >
> >> >>> > Thanks
> >> >>> >
> >> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <[email protected]>
> >> >>> > wrote:
> >> >>> >
> >> >>> > > So a Get call with multiple columns on a single row should be much
> >> >>> > > faster than independent Get(s) on each of those columns for that
> >> >>> > > row. I am basically seeing severely poor performance (~15 seconds)
> >> >>> > > for certain deleteColumn() calls, and I see that there is a
> >> >>> > > prepareDeleteTimestamps() function in HRegion.java which first
> >> >>> > > tries to locate the column by doing individual gets on each column
> >> >>> > > you want to delete (I am doing 300-column deletes). Now, I think
> >> >>> > > this should ideally be one get call with the batch of 300 columns,
> >> >>> > > so that one scan can retrieve the columns, and the columns that are
> >> >>> > > found are indeed deleted.
> >> >>> > >
> >> >>> > > Before I try this fix, I wanted to get an opinion on whether
> >> >>> > > batching the get() will make a difference, and it seems from your
> >> >>> > > answer that it should.
> >> >>> > >
> >> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <[email protected]>
> >> >>> > > wrote:
> >> >>> > >
> >> >>> > > > Everything is stored as a KeyValue in HBase.
> >> >>> > > > The Key part of a KeyValue contains the row key, column family,
> >> >>> > > > column name, and timestamp, in that order.
> >> >>> > > > Each column family has its own store and store files.
> >> >>> > > >
> >> >>> > > > So, in a nutshell, a get is executed by starting a scan at the
> >> >>> > > > row key (which is a prefix of the key) in each store (CF) and
> >> >>> > > > then scanning forward in each store until the next row key is
> >> >>> > > > reached. (In reality it is a bit more complicated due to multiple
> >> >>> > > > versions, skipping columns, etc.)
> >> >>> > > >
> >> >>> > > >
> >> >>> > > > -- Lars
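Lars's description of a get as one forward scan over sorted KeyValues can be sketched in plain Java (a stand-in model, not the HBase implementation): because both the stored cells and the requested qualifiers are sorted, a single forward pass finds every requested column without ever seeking backwards.

```java
import java.util.*;

public class SortedScanSketch {
    // Stand-in for the sorted KVs of one row/CF: qualifier -> value,
    // kept in lexicographic order by TreeMap, as in an HBase store.
    static List<String> multiGet(TreeMap<String, String> rowCells,
                                 SortedSet<String> wanted) {
        List<String> results = new ArrayList<>();
        Iterator<String> targets = wanted.iterator();
        if (!targets.hasNext()) return results;
        String target = targets.next();
        for (Map.Entry<String, String> cell : rowCells.entrySet()) { // one forward scan
            int cmp = cell.getKey().compareTo(target);
            while (cmp > 0 && targets.hasNext()) { // requested column absent; advance target
                target = targets.next();
                cmp = cell.getKey().compareTo(target);
            }
            if (cmp == 0) {                        // found the next requested column
                results.add(cell.getValue());
                if (!targets.hasNext()) break;     // nothing left to look for
                target = targets.next();
            }
        }
        return results;
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("1", "a"); row.put("2", "b"); row.put("5", "c");
        System.out.println(multiGet(row, new TreeSet<>(Arrays.asList("1", "3", "5"))));
        // prints: [a, c]  -- "3" is simply skipped as the scan moves forward
    }
}
```

This is why one Get with many addColumn() calls beats many single-column Gets: the per-row scan setup happens once instead of once per column.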
> >> >>> > > > ________________________________
> >> >>> > > > From: Varun Sharma <[email protected]>
> >> >>> > > > To: [email protected]
> >> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
> >> >>> > > > Subject: Re: Get on a row with multiple columns
> >> >>> > > >
> >> >>> > > > Sorry, I was a little unclear with my question.
> >> >>> > > >
> >> >>> > > > Lets say you have
> >> >>> > > >
> >> >>> > > > Get get = new Get(row)
> >> >>> > > > get.addColumn("1");
> >> >>> > > > get.addColumn("2");
> >> >>> > > > .
> >> >>> > > > .
> >> >>> > > > .
> >> >>> > > >
> >> >>> > > > When HBase internally executes the batch get, it will seek to
> >> >>> > > > column "1"; now, since data is lexicographically sorted, it does
> >> >>> > > > not need to seek from the beginning to get to "2" - it can
> >> >>> > > > continue seeking forward, since column "2" will always be after
> >> >>> > > > column "1". I want to know whether this is how a multicolumn get
> >> >>> > > > on a row works or not.
> >> >>> > > >
> >> >>> > > > Thanks
> >> >>> > > > Varun
> >> >>> > > >
> >> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <[email protected]>
> >> >>> > > > wrote:
> >> >>> > > >
> >> >>> > > > > Like Ishan said, a get gives an instance of the Result class.
> >> >>> > > > > All utility methods that you can use are:
> >> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> >> >>> > > > >  byte[] value()
> >> >>> > > > >  byte[] getRow()
> >> >>> > > > >  int size()
> >> >>> > > > >  boolean isEmpty()
> >> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> >> >>> > > > >  List<KeyValue> list()
> >> >>> > > > >
> >> >>> > > > >
> >> >>> > > > >
> >> >>> > > > >
> >> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> >> >>> > > > >
> >> >>> > > > >> Based on what I read in Lars' book, a get will return a
> >> >>> > > > >> Result, which is internally a KeyValue[]. This KeyValue[] is
> >> >>> > > > >> sorted by the key, and you access this array using the raw or
> >> >>> > > > >> list methods on the Result object.
> >> >>> > > > >>
> >> >>> > > > >>
> >> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <[email protected]>
> >> >>> > > > >> wrote:
> >> >>> > > > >>
> >> >>> > > > >>  +user
> >> >>> > > > >>>
> >> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <[email protected]>
> >> >>> > > > >>> wrote:
> >> >>> > > > >>>
> >> >>> > > > >>>  Hi,
> >> >>> > > > >>>>
> >> >>> > > > >>>> When I do a Get on a row with multiple column qualifiers,
> >> >>> > > > >>>> do we sort the column qualifiers and make use of the sorted
> >> >>> > > > >>>> order when we get the results?
> >> >>> > > > >>>
> >> >>> > > > >>>> Thanks
> >> >>> > > > >>>> Varun
> >> >>> > > > >>>>
> >> >>> > > > >>>>
> >> >>> > > > >>
> >> >>> > > > >>
> >> >>> > > > > --
> >> >>> > > > > Marcos Ortiz Valmaseda,
> >> >>> > > > > Product Manager && Data Scientist at UCI
> >> >>> > > > > Blog: http://marcosluis2186.posterous.com
> >> >>> > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
> >> >>> > > > >
> >> >>> > > >
> >> >>> > >
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >
> >>
> >
>
