It is no problem, the difference is subtle yet important. There is some notion that we will be collecting FAQ/docs from these threads, and going over it is helpful. Hopefully it becomes a permanent record and we can all benefit :-)
-ryan

On Fri, Jan 21, 2011 at 5:33 PM, Matt Corgan <[email protected]> wrote:
> Ah - I see. I didn't notice the difference between KeyValue.Type.Delete
> and KeyValue.Type.DeleteColumn.
> Sorry about that,
> Matt
>
>
> On Fri, Jan 21, 2011 at 8:24 PM, Ryan Rawson <[email protected]> wrote:
>>
>> Hi Matt,
>>
>>
>> This call, deleteColumns (plural!!!) when you do not specify a
>> timestamp, sends LATEST_TIMESTAMP as you say, but the server uses
>> System.currentTimeMillis and inserts the delete marker - which masks
>> ALL previous versions for that column. So it does NOT use
>> get-before-delete, the only call that does this is 'deleteColumn'
>> (SINGULAR!!)
>>
>> Note the 2 calls are VERY similar, one creates a KV of Type.Delete the
>> other of Type.DeleteColumn.
>>
>> Yes the API is confusing. If you DO NOT use 'deleteColumn'
>> (SINGULAR!), you WON'T invoke the Get-before-Delete code. Stack and I
>> both checked the code path, and it's the same as I remember :-)
>>
>> -ryan
>>
>>
>> On Fri, Jan 21, 2011 at 5:17 PM, Matt Corgan <[email protected]> wrote:
>> > Thanks for the replies. My table is set to store only one version, but
>> > I'd probably delete all previous versions to be safe. I'd therefore use
>> > one of these 2 methods:
>> > - Delete.deleteColumns(byte[] family, byte[] qualifier)
>> > - Delete.deleteColumns(byte[] family, byte[] qualifier, long timestamp)
>> > The problem is that both have the client generate the timestamp. If you
>> > don't specify it, it uses the HConstants.LATEST_TIMESTAMP which causes
>> > the get-before-delete (10x slowdown in my use case). If you do specify
>> > it, which is required because the method takes a primitive long, then
>> > you're relying on the client's clock to be perfect. I chose the latter
>> > option for better performance, but was surprised to see there's not an
>> > option to let the server generate the currentTimeMillis, since that is
>> > what happens on a Put operation. Not a big deal, but I wanted to see if
>> > there was a technical reason behind it or if it's just that nobody's
>> > needed that functionality.
>> > Thanks again,
>> > Matt
>> >
>> > On Fri, Jan 21, 2011 at 6:41 PM, Bill Graham <[email protected]>
>> > wrote:
>> >>
>> >> Thanks Ryan, that clears it up.
>> >>
>> >>
>> >> On Fri, Jan 21, 2011 at 3:29 PM, Ryan Rawson <[email protected]>
>> >> wrote:
>> >> > No, the storage model does not work like that. The storage model
>> >> > revolves around the KeyValue, which is roughly:
>> >> >
>> >> > rowid/family/qualifier/timestamp/data
>> >> >
>> >> > and we store sequences of these in sorted order in HFiles.
>> >> >
>> >> > Note, we store the row with every single version of every
>> >> > column/cell.
>> >> >
>> >> > Therefore there is no such thing as "removing the bytes that
>> >> > represent the actual row key", they are part of every cell, and once
>> >> > those cells go away, then so does the row key.
>> >> >
>> >> > I hope this helps,
>> >> > -ryan
>> >> >
>> >> > On Fri, Jan 21, 2011 at 3:26 PM, Bill Graham <[email protected]>
>> >> > wrote:
>> >> >> I follow the tombstone/compact/delete cycle of the column values,
>> >> >> but I'm still unclear on the row key life cycle.
>> >> >>
>> >> >> Is it that the bytes that represent the actual row key are
>> >> >> associated with and removed with each column value? Or are they
>> >> >> removed upon compaction when no column values exist for a given row
>> >> >> key?
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Fri, Jan 21, 2011 at 2:26 PM, Ryan Rawson <[email protected]>
>> >> >> wrote:
>> >> >>> Any of the deletes merely insert a 'tombstone' which doesn't delete
>> >> >>> the data immediately but does mark it so queries no longer return
>> >> >>> it.
>> >> >>>
>> >> >>> During the compactions we prune these delete values and they
>> >> >>> disappear for good. (Barring other backups of course)
>> >> >>>
>> >> >>> Because of our variable length storage model, we don't store rows in
>> >> >>> particular blocks and rewrite said blocks, so notions of rows
>> >> >>> 'existing' or not don't even apply to HBase as they do to RDBMS
>> >> >>> systems.
>> >> >>>
>> >> >>> -ryan
>> >> >>>
>> >> >>> On Fri, Jan 21, 2011 at 2:21 PM, Bill Graham <[email protected]>
>> >> >>> wrote:
>> >> >>>> If you use some combination of delete requests and leave a row
>> >> >>>> without any column data will the row/rowkey still exist? I'm
>> >> >>>> thinking of the use case where you want to prune all old data,
>> >> >>>> including row keys, from a table.
>> >> >>>>
>> >> >>>>
>> >> >>>> On Fri, Jan 21, 2011 at 2:04 PM, Ryan Rawson <[email protected]>
>> >> >>>> wrote:
>> >> >>>>> There are 3 kinds of deletes (with a 4th for win):
>> >> >>>>>
>> >> >>>>> - Delete.deleteFamily(byte[] family, [long])
>> >> >>>>> -- This removes all data from the given family before the given
>> >> >>>>> timestamp, or if none is given, System.currentTimeMillis()
>> >> >>>>> - Delete.deleteColumns(byte[] family, byte[] qualifier, [long])
>> >> >>>>> -- This removes all data from the given qualifier, before the
>> >> >>>>> given timestamp, or if none is given, System.currentTimeMillis()
>> >> >>>>> - Delete.deleteColumn(byte[] family, byte[] qualifier, [long])
>> >> >>>>> -- This removes A SINGLE VERSION at the given time, or if none is
>> >> >>>>> given, the most recent version is Get'ed and deleted.
>> >> >>>>> - Delete()
>> >> >>>>> -- Calls deleteFamily() on server side on every family.
>> >> >>>>>
>> >> >>>>> Stack is talking about the LAST delete form.
>> >> >>>>>
>> >> >>>>> I think what you want is probably deleteColumns() (plural!), or
>> >> >>>>> perhaps deleteFamily().
>> >> >>>>>
>> >> >>>>> One rarely wants to call deleteColumn(), since it removes just a
>> >> >>>>> single version, thus exposing older versions, which MAY be what
>> >> >>>>> you want, but I'm guessing probably isn't.
>> >> >>>>>
>> >> >>>>> Only the last form (deleteColumn (singular!)) calls Get, the rest
>> >> >>>>> do not call Get and are very fast.
>> >> >>>>>
>> >> >>>>> -ryan
>> >> >>>>>
>> >> >>>>> On Fri, Jan 21, 2011 at 1:51 PM, Stack <[email protected]> wrote:
>> >> >>>>>> On Fri, Jan 21, 2011 at 12:30 PM, Matt Corgan
>> >> >>>>>> <[email protected]> wrote:
>> >> >>>>>>> Is there a way to issue a delete using the server's current
>> >> >>>>>>> timestamp? I see methods using HConstants.LATEST_TIMESTAMP
>> >> >>>>>>> which is extremely expensive since it triggers a Get call.
>> >> >>>>>>
>> >> >>>>>> Yes. Deleting latest version involves a Get to figure the most
>> >> >>>>>> recent timestamp. And yes, in src code it says this is
>> >> >>>>>> 'expensive'. Seems like it does this lookup anytime
>> >> >>>>>> LATEST_TIMESTAMP is passed whether column, columns, or family
>> >> >>>>>> only to ensure the delete goes in ahead of whatever is currently
>> >> >>>>>> in the Store.
>> >> >>>>>>
>> >> >>>>>> St.Ack

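The difference between the three marker types discussed in this thread can be sketched with a small toy model. This is not HBase code and does not call the HBase client: the class `TombstoneModel`, its fields, and its methods are all hypothetical names made up for illustration. It assumes a marker masks versions at or before its timestamp (deleteColumns, deleteFamily) or exactly at its timestamp (deleteColumn), and it treats compaction as simply dropping masked cells and all markers, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of tombstone masking. NOT HBase code; every name here is hypothetical. */
final class TombstoneModel {

    /** The three marker kinds mentioned in the thread. */
    enum MarkerType { DELETE, DELETE_COLUMN, DELETE_FAMILY }

    /** One stored version of a cell. */
    static final class Cell {
        final String family, qualifier, value;
        final long timestamp;
        Cell(String family, String qualifier, long timestamp, String value) {
            this.family = family; this.qualifier = qualifier;
            this.timestamp = timestamp; this.value = value;
        }
    }

    /** A delete marker (tombstone); qualifier is null for DELETE_FAMILY. */
    static final class Marker {
        final MarkerType type;
        final String family, qualifier;
        final long timestamp;
        Marker(MarkerType type, String family, String qualifier, long timestamp) {
            this.type = type; this.family = family;
            this.qualifier = qualifier; this.timestamp = timestamp;
        }
    }

    private final List<Cell> cells = new ArrayList<>();
    private final List<Marker> markers = new ArrayList<>();

    void put(String family, String qualifier, long ts, String value) {
        cells.add(new Cell(family, qualifier, ts, value));
    }

    /** Singular: masks only the version whose timestamp equals ts. */
    void deleteColumn(String family, String qualifier, long ts) {
        markers.add(new Marker(MarkerType.DELETE, family, qualifier, ts));
    }

    /** Plural: masks every version of the column at or before ts (assumed <=). */
    void deleteColumns(String family, String qualifier, long ts) {
        markers.add(new Marker(MarkerType.DELETE_COLUMN, family, qualifier, ts));
    }

    /** Masks every column in the family at or before ts (assumed <=). */
    void deleteFamily(String family, long ts) {
        markers.add(new Marker(MarkerType.DELETE_FAMILY, family, null, ts));
    }

    /** True if any marker hides this cell from reads. */
    private boolean masked(Cell c) {
        for (Marker m : markers) {
            if (!m.family.equals(c.family)) continue;
            switch (m.type) {
                case DELETE_FAMILY:
                    if (c.timestamp <= m.timestamp) return true;
                    break;
                case DELETE_COLUMN:
                    if (m.qualifier.equals(c.qualifier) && c.timestamp <= m.timestamp) return true;
                    break;
                case DELETE:
                    if (m.qualifier.equals(c.qualifier) && c.timestamp == m.timestamp) return true;
                    break;
            }
        }
        return false;
    }

    /** Value of the newest unmasked version, or null if nothing is visible. */
    String latest(String family, String qualifier) {
        Cell best = null;
        for (Cell c : cells) {
            if (c.family.equals(family) && c.qualifier.equals(qualifier) && !masked(c)
                    && (best == null || c.timestamp > best.timestamp)) {
                best = c;
            }
        }
        return best == null ? null : best.value;
    }

    /** Simplified compaction: physically drop masked cells, then drop the markers. */
    void compact() {
        cells.removeIf(this::masked);
        markers.clear();
    }

    int storedCells() { return cells.size(); }

    public static void main(String[] args) {
        TombstoneModel singular = new TombstoneModel();
        singular.put("f", "q", 1L, "old");
        singular.put("f", "q", 2L, "new");
        singular.deleteColumn("f", "q", 2L);             // removes one version only
        System.out.println(singular.latest("f", "q"));   // prints "old"

        TombstoneModel plural = new TombstoneModel();
        plural.put("f", "q", 1L, "old");
        plural.put("f", "q", 2L, "new");
        plural.deleteColumns("f", "q", 2L);              // masks all versions <= 2
        System.out.println(plural.latest("f", "q"));     // prints "null"
    }
}
```

In this sketch the singular call exposes the older version while the plural call hides the whole column, which is the behaviour described in the thread. The cells are still stored until `compact()` runs, and since the row key lives inside each cell, nothing of the row remains once its cells are gone.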