It is no problem; the difference is subtle yet important.

There is some notion that we will be collecting FAQ/docs from these
threads, and going over it is helpful.  Hopefully it becomes a
permanent record and we can all benefit :-)

-ryan

On Fri, Jan 21, 2011 at 5:33 PM, Matt Corgan <[email protected]> wrote:
> Ah - I see.  I didn't notice the difference between KeyValue.Type.Delete
> and KeyValue.Type.DeleteColumn.
> Sorry about that,
> Matt
>
>
> On Fri, Jan 21, 2011 at 8:24 PM, Ryan Rawson <[email protected]> wrote:
>>
>> Hi Matt,
>>
>>
>> This call, deleteColumns (plural!!!), when you do not specify a
>> timestamp, sends LATEST_TIMESTAMP as you say, but the server uses
>> System.currentTimeMillis and inserts the delete marker - which masks
>> ALL previous versions for that column.  So it does NOT use
>> get-before-delete; the only call that does this is 'deleteColumn'
>> (SINGULAR!!)
>>
>> Note the 2 calls are VERY similar: deleteColumn creates a KV of
>> Type.Delete, deleteColumns one of Type.DeleteColumn.
>>
>> Yes the API is confusing.  If you DO NOT use 'deleteColumn'
>> (SINGULAR!), you WON'T invoke the Get-before-Delete code.  Stack and I
>> both checked the code path, and it's the same as I remember :-)
>>
>> -ryan
>>
>>
>> On Fri, Jan 21, 2011 at 5:17 PM, Matt Corgan <[email protected]> wrote:
>> > Thanks for the replies.  My table is set to store only one version, but
>> > I'd probably delete all previous versions to be safe.  I'd therefore use
>> > one of these 2 methods:
>> > - Delete.deleteColumns(byte[] family, byte[]qualifier)
>> > - Delete.deleteColumns(byte[] family, byte[]qualifier, long timestamp)
>> > The problem is that both have the client generate the timestamp.  If you
>> > don't specify it, it uses the HConstants.LATEST_TIMESTAMP which causes
>> > the get-before-delete (10x slowdown in my use case).  If you do specify
>> > it, which is required because the method takes a primitive long, then
>> > you're relying on the client's clock to be perfect.  I chose the latter
>> > option for better performance, but was surprised to see there's not an
>> > option to let the server generate the currentTimeMillis, since that is
>> > what happens on a Put operation.  Not a big deal, but wanted to see if
>> > there was a technical reason behind it or if it's just that nobody's
>> > needed that functionality.
>> > Thanks again,
>> > Matt
>> >
>> > On Fri, Jan 21, 2011 at 6:41 PM, Bill Graham <[email protected]>
>> > wrote:
>> >>
>> >> Thanks Ryan, that clears it up.
>> >>
>> >>
>> >> On Fri, Jan 21, 2011 at 3:29 PM, Ryan Rawson <[email protected]>
>> >> wrote:
>> >> > No, the storage model does not work like that.  The storage model
>> >> > revolves around the KeyValue, which is roughly:
>> >> >
>> >> > rowid/family/qualifier/timestamp/data
>> >> >
>> >> > and we store sequences of these in sorted order in HFiles.
>> >> >
>> >> > Note, we store the row with every single version of every column/cell.
>> >> >
>> >> > Therefore there is no such thing as "removing the bytes that represent
>> >> > the actual row key", they are part of every cell, and once those cells
>> >> > go away, then so does the row key.
>> >> >
>> >> > I hope this helps,
>> >> > -ryan
>> >> >
>> >> > On Fri, Jan 21, 2011 at 3:26 PM, Bill Graham <[email protected]>
>> >> > wrote:
>> >> >> I follow the tombstone/compact/delete cycle of the column values, but
>> >> >> I'm still unclear on the row key life cycle.
>> >> >>
>> >> >> Is it that the bytes that represent the actual row key are associated
>> >> >> with and removed with each column value? Or are they removed upon
>> >> >> compaction when no column values exist for a given row key?
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Fri, Jan 21, 2011 at 2:26 PM, Ryan Rawson <[email protected]>
>> >> >> wrote:
>> >> >>> Any of the deletes merely inserts a 'tombstone', which doesn't delete
>> >> >>> the data immediately but does mark it so queries no longer return it.
>> >> >>>
>> >> >>> During the compactions we prune these delete values and they disappear
>> >> >>> for good.  (Barring other backups of course)
>> >> >>>
>> >> >>> Because of our variable length storage model, we don't store rows in
>> >> >>> particular blocks and rewrite said blocks, so notions of rows
>> >> >>> 'existing' or not don't even apply to HBase as they do to RDBMS
>> >> >>> systems.
>> >> >>>
>> >> >>> -ryan
>> >> >>>
>> >> >>> On Fri, Jan 21, 2011 at 2:21 PM, Bill Graham <[email protected]>
>> >> >>> wrote:
>> >> >>>> If you use some combination of delete requests and leave a row without
>> >> >>>> any column data, will the row/rowkey still exist? I'm thinking of the
>> >> >>>> use case where you want to prune all old data, including row keys,
>> >> >>>> from a table.
>> >> >>>>
>> >> >>>>
>> >> >>>> On Fri, Jan 21, 2011 at 2:04 PM, Ryan Rawson <[email protected]>
>> >> >>>> wrote:
>> >> >>>>> There are 3 kinds of deletes (with a 4th for win):
>> >> >>>>>
>> >> >>>>> - Delete.deleteFamily(byte [] family, [long])
>> >> >>>>> -- This removes all data from the given family before the given
>> >> >>>>> timestamp, or if none is given, System.currentTimeMillis()
>> >> >>>>> - Delete.deleteColumns(byte[] family, byte[]qualifier, [long])
>> >> >>>>> -- This removes all data from the given qualifier, before the given
>> >> >>>>> timestamp, or if none is given, System.currentTimeMillis()
>> >> >>>>> - Delete.deleteColumn(byte[]family, byte[]qualifier, [long])
>> >> >>>>> -- This removes A SINGLE VERSION at the given time, or if none is
>> >> >>>>> given, the most recent version is Get'ed and deleted.
>> >> >>>>> - Delete()
>> >> >>>>> -- Calls deleteFamily() on server side on every family.
>> >> >>>>>
>> >> >>>>> Stack is talking about the third delete form, deleteColumn().
>> >> >>>>>
>> >> >>>>> I think what you want is probably deleteColumns() (plural!), or
>> >> >>>>> perhaps deleteFamily().
>> >> >>>>>
>> >> >>>>> One rarely wants to call deleteColumn(), since it removes just a
>> >> >>>>> single version, thus exposing older versions, which MAY be what you
>> >> >>>>> want, but I'm guessing probably isn't.
>> >> >>>>>
>> >> >>>>> Only the deleteColumn (singular!) form calls Get; the rest do not
>> >> >>>>> call Get and are very fast.
>> >> >>>>>
>> >> >>>>> -ryan
>> >> >>>>>
>> >> >>>>> On Fri, Jan 21, 2011 at 1:51 PM, Stack <[email protected]> wrote:
>> >> >>>>>> On Fri, Jan 21, 2011 at 12:30 PM, Matt Corgan
>> >> >>>>>> <[email protected]>
>> >> >>>>>> wrote:
>> >> >>>>>>> Is there a way to issue a delete using the server's current
>> >> >>>>>>> timestamp?  I see methods using HConstants.LATEST_TIMESTAMP which
>> >> >>>>>>> is extremely expensive since it triggers a Get call.
>> >> >>>>>>
>> >> >>>>>> Yes.  Deleting latest version involves a Get to figure the most
>> >> >>>>>> recent timestamp.  And yes, in src code it says this is 'expensive'.
>> >> >>>>>> Seems like it does this lookup anytime LATEST_TIMESTAMP is passed,
>> >> >>>>>> whether column, columns, or family, only to ensure the delete goes
>> >> >>>>>> in ahead of whatever is currently in the Store.
>> >> >>>>>>
>> >> >>>>>> St.Ack
>> >> >>>>>>
>> >> >>>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >
>> >
>> >
>
>
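
For the FAQ record, the difference between the two marker types discussed
in this thread can be sketched with a toy model.  This is NOT HBase code:
the class DeleteMarkerSketch, the Cell type, and the visible() and
threeVersions() methods are hypothetical names made up for illustration,
and only one row/family/qualifier is modeled.  The assumption is that a
DELETE marker (what deleteColumn, singular, writes) hides only the version
at exactly its timestamp, while a DELETE_COLUMN marker (what deleteColumns,
plural, writes) hides every version at or before its timestamp.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one column's versions; an illustration, not the HBase implementation.
class DeleteMarkerSketch {
    // Stand-ins for the marker types named in the thread.
    enum Type { PUT, DELETE, DELETE_COLUMN }

    // One version of the column: a timestamp, a type, and (for puts) a value.
    static final class Cell {
        final long ts;
        final Type type;
        final String value;

        Cell(long ts, Type type, String value) {
            this.ts = ts;
            this.type = type;
            this.value = value;
        }
    }

    // A put is hidden by a DELETE at the same timestamp,
    // or by a DELETE_COLUMN at the same or a later timestamp.
    static boolean isMasked(Cell put, List<Cell> cells) {
        for (Cell m : cells) {
            if (m.type == Type.DELETE && m.ts == put.ts) return true;
            if (m.type == Type.DELETE_COLUMN && m.ts >= put.ts) return true;
        }
        return false;
    }

    // Values a reader would still see, newest version first.
    static List<String> visible(List<Cell> cells) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort((a, b) -> Long.compare(b.ts, a.ts));
        List<String> out = new ArrayList<>();
        for (Cell c : sorted) {
            if (c.type == Type.PUT && !isMasked(c, cells)) out.add(c.value);
        }
        return out;
    }

    // Three stored versions of the same column, at timestamps 1, 2 and 3.
    static List<Cell> threeVersions() {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell(1L, Type.PUT, "v1"));
        cells.add(new Cell(2L, Type.PUT, "v2"));
        cells.add(new Cell(3L, Type.PUT, "v3"));
        return cells;
    }

    public static void main(String[] args) {
        // deleteColumn-style marker at ts=3: only v3 is hidden, older versions show.
        List<Cell> single = threeVersions();
        single.add(new Cell(3L, Type.DELETE, null));
        System.out.println(visible(single)); // prints [v2, v1]

        // deleteColumns-style marker at ts=3: everything at or before ts=3 is hidden.
        List<Cell> plural = threeVersions();
        plural.add(new Cell(3L, Type.DELETE_COLUMN, null));
        System.out.println(visible(plural)); // prints []
    }
}
```

The first case shows why deleteColumn (singular) is rarely what one wants:
removing just the newest version exposes the older ones.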

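
The row key life cycle described in the thread (the row key is part of
every stored cell, so it goes away when the last cell for that row goes
away) can also be sketched as a toy model.  Again this is NOT HBase code:
CompactionSketch, KV, compact() and rowKeys() are hypothetical names, the
tombstone here is assumed to behave like a deleteColumns-style marker, and
the compaction is a deliberately simplified stand-in for the real pruning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of the rowid/family/qualifier/timestamp/data layout; not the HBase implementation.
class CompactionSketch {
    // Every stored cell carries its own copy of the row key.
    static final class KV {
        final String row, family, qualifier, value;
        final long ts;
        final boolean tombstone;

        KV(String row, String family, String qualifier, long ts, boolean tombstone, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.ts = ts;
            this.tombstone = tombstone;
            this.value = value;
        }

        boolean sameColumn(KV o) {
            return row.equals(o.row) && family.equals(o.family) && qualifier.equals(o.qualifier);
        }
    }

    // Simplified compaction: drop tombstones and every cell they mask
    // (same column, timestamp at or before the tombstone's timestamp).
    static List<KV> compact(List<KV> store) {
        List<KV> kept = new ArrayList<>();
        for (KV kv : store) {
            if (kv.tombstone) continue;
            boolean masked = false;
            for (KV t : store) {
                if (t.tombstone && t.sameColumn(kv) && t.ts >= kv.ts) masked = true;
            }
            if (!masked) kept.add(kv);
        }
        return kept;
    }

    // There is no separate row record: the set of rows is just
    // whatever row keys appear in the remaining cells.
    static TreeSet<String> rowKeys(List<KV> store) {
        TreeSet<String> rows = new TreeSet<>();
        for (KV kv : store) rows.add(kv.row);
        return rows;
    }

    public static void main(String[] args) {
        List<KV> store = new ArrayList<>();
        store.add(new KV("row1", "f", "q", 1L, false, "a"));
        store.add(new KV("row2", "f", "q", 1L, false, "b"));
        store.add(new KV("row1", "f", "q", 2L, true, null)); // tombstone for row1's only column

        System.out.println(rowKeys(store));          // prints [row1, row2]
        System.out.println(rowKeys(compact(store))); // prints [row2]
    }
}
```

Nothing ever deletes "row1" as such; it simply stops appearing once no
cell carrying that row key remains.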