Very cool, Anoop. I can definitely see how that would be useful. Lars - the bulk deletes do appear to work. I just wasn't sure if there was something I might be missing, since I haven't seen this documented elsewhere.
Coprocessors do seem a better fit for this in the long term. Thanks everyone.

On 10/7/12 11:55 PM, "Anoop Sam John" <[email protected]> wrote:

>We also did an implementation using compaction-time deletes (avoiding KVs).
>This works very well for us... As this would delay the deletes until the next
>major compaction, we also have an implementation to do real-time bulk deletes.
>[We have such a use case.] Here I am using an endpoint implementation to do
>the scan and delete at the server side only. Just raised a JIRA for this
>[HBASE-6942]. I will post a patch based on 0.94 there... Please have a look.
>I have noticed a big performance improvement over the normal way of scan() +
>delete(List<Delete>), as this avoids several network calls and traffic...
>
>-Anoop-
>________________________________________
>From: lars hofhansl [[email protected]]
>Sent: Saturday, October 06, 2012 1:09 AM
>To: [email protected]
>Subject: Re: bulk deletes
>
>Does it work? :)
>
>How did you do the deletes before? I assume you used the
>HTable.delete(List<Delete>) API?
>
>(Doesn't really help you, but) In 0.92+ you could hook up a coprocessor
>into the compactions and simply filter out any KVs you want to have
>removed.
>
>-- Lars
>
>________________________________
>From: Paul Mackles <[email protected]>
>To: "[email protected]" <[email protected]>
>Sent: Friday, October 5, 2012 11:17 AM
>Subject: bulk deletes
>
>We need to do deletes pretty regularly, and sometimes we could have
>hundreds of millions of cells to delete. TTLs won't work for us because
>we have a fair amount of business logic around the deletes.
>
>Given their current implementation (we are on 0.90.4), this delete process
>can take a really long time (half a day or more with 100 or so concurrent
>threads). From everything I can tell, the performance issues come down to
>each delete being an individual RPC call (even when using the batch API).
>In other words, I don't see any thrashing on HBase while this process is
>running, just lots of waiting for the RPC calls to return.
>
>The alternative we came up with is to use the standard bulk load
>facilities to handle the deletes. The code turned out to be surprisingly
>simple and appears to work in the small-scale tests we have tried so far.
>Is anyone else doing deletes in this fashion? Are there drawbacks that I
>might be missing? Here is a link to the code:
>
>https://gist.github.com/3841437
>
>Pretty simple, eh? I haven't seen much mention of this technique, which is
>why I am a tad paranoid about it.
>
>Thanks,
>Paul
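
[Editorial note] For anyone following along, the "normal way" being compared against - scan from the client and feed batches into HTable.delete(List<Delete>) - looks roughly like the sketch below. This is only a minimal sketch against the 0.90-era client API; the table name, batch size, and the spot where the delete business logic would go are placeholders, not anything from the thread.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class ClientSideDeletes {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");      // hypothetical table name
        try {
          Scan scan = new Scan();
          scan.setCaching(1000);                          // pull rows back in bigger chunks
          ResultScanner scanner = table.getScanner(scan);
          List<Delete> batch = new ArrayList<Delete>();
          try {
            for (Result r : scanner) {
              // The business logic deciding whether a row should go would live here.
              batch.add(new Delete(r.getRow()));
              if (batch.size() >= 1000) {                 // flush a batch of deletes
                table.delete(batch);
                batch.clear();
              }
            }
            if (!batch.isEmpty()) {
              table.delete(batch);
            }
          } finally {
            scanner.close();
          }
        } finally {
          table.close();
        }
      }
    }

Every batch here still turns into RPCs proportional to the number of rows being removed, which matches Paul's observation that the cluster mostly sits idle while the client waits on the network.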
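
[Editorial note] Paul's gist is not reproduced here, but a bulk-load-based delete along the lines he describes could look something like the following sketch: a MapReduce job that writes KeyValue delete markers through HFileOutputFormat and then hands the resulting files to the usual bulk load path. The text-file input, table name, and column family are assumptions, and it only emits family-level delete markers; it is not the code from the gist.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkDeleteMarkerJob {

      // Turns each input line (a row key to delete) into a family-level delete marker.
      static class DeleteMarkerMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        private static final byte[] FAMILY = Bytes.toBytes("cf");  // assumed column family

        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          byte[] row = Bytes.toBytes(line.toString().trim());
          // A DeleteFamily marker masks every cell in the family at or before its timestamp.
          KeyValue marker = new KeyValue(row, FAMILY, null,
              System.currentTimeMillis(), KeyValue.Type.DeleteFamily);
          context.write(new ImmutableBytesWritable(row), marker);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "bulk-delete-markers");
        job.setJarByClass(BulkDeleteMarkerJob.class);
        job.setMapperClass(DeleteMarkerMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // file of row keys to delete
        Path hfileDir = new Path(args[1]);                      // staging dir for the HFiles
        FileOutputFormat.setOutputPath(job, hfileDir);

        HTable table = new HTable(conf, "my_table");            // hypothetical target table
        // Wires in the sorting reducer, partitioner and HFileOutputFormat so the
        // generated HFiles line up with the table's current region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        if (!job.waitForCompletion(true)) {
          System.exit(1);
        }
        // Hand the finished HFiles to the region servers (same as the completebulkload tool).
        new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
      }
    }

Because the markers arrive as whole HFiles, nothing goes through the normal write path or per-row RPCs; the masked cells are then physically removed at the next major compaction, same as ordinary deletes.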
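
[Editorial note] On Lars's suggestion of hooking into compactions: below is a minimal sketch of a RegionObserver that wraps the compaction scanner and silently drops unwanted KeyValues, assuming the 0.92/0.94-era preCompact and InternalScanner signatures (KeyValue lists rather than Cells). The cutoff-timestamp predicate is purely illustrative; real business logic would replace it.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.InternalScanner;
    import org.apache.hadoop.hbase.regionserver.Store;

    // RegionObserver that drops unwanted cells while a compaction rewrites the store files.
    public class CompactionDeleteObserver extends BaseRegionObserver {

      // Purely illustrative predicate: everything older than this timestamp is dropped.
      private static final long CUTOFF_TS = 1349000000000L;

      public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e,
          Store store, final InternalScanner scanner) {
        // Wrap the compaction scanner so filtered KVs never make it into the new store file.
        return new InternalScanner() {
          public boolean next(List<KeyValue> results) throws IOException {
            return next(results, -1, null);
          }
          public boolean next(List<KeyValue> results, String metric) throws IOException {
            return next(results, -1, metric);
          }
          public boolean next(List<KeyValue> results, int limit) throws IOException {
            return next(results, limit, null);
          }
          public boolean next(List<KeyValue> results, int limit, String metric)
              throws IOException {
            boolean hasMore = (limit < 0) ? scanner.next(results) : scanner.next(results, limit);
            // Real business logic would decide here which cells to drop.
            Iterator<KeyValue> it = results.iterator();
            while (it.hasNext()) {
              if (it.next().getTimestamp() < CUTOFF_TS) {
                it.remove();
              }
            }
            return hasMore;
          }
          public void close() throws IOException {
            scanner.close();
          }
        };
      }
    }

The observer would be attached through the table descriptor (or the region coprocessor settings in hbase-site.xml). The deletes then cost almost nothing extra because they piggyback on compactions that run anyway, which is also why they only take effect when a compaction next rewrites the files - the delay Anoop mentions.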
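
[Editorial note] And on Anoop's endpoint idea (the actual HBASE-6942 patch will differ): a rough sketch of what a scan-plus-delete endpoint could look like on the 0.92/0.94 coprocessor framework. The protocol name, method, and whole-row delete policy are made up for illustration.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.coprocessor.BaseEndpointCoprocessor;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;
    import org.apache.hadoop.hbase.regionserver.HRegion;
    import org.apache.hadoop.hbase.regionserver.RegionScanner;

    // Hypothetical protocol: each region deletes whatever the scan matches and reports a count.
    interface ScanDeleteProtocol extends CoprocessorProtocol {
      long deleteMatching(Scan scan) throws IOException;
    }

    // Server side: the scan and the deletes both happen inside the region server,
    // so only the final count travels back over the wire.
    public class ScanDeleteEndpoint extends BaseEndpointCoprocessor
        implements ScanDeleteProtocol {

      public long deleteMatching(Scan scan) throws IOException {
        RegionCoprocessorEnvironment env =
            (RegionCoprocessorEnvironment) getEnvironment();
        HRegion region = env.getRegion();
        RegionScanner scanner = region.getScanner(scan);
        long deleted = 0;
        try {
          List<KeyValue> row = new ArrayList<KeyValue>();
          boolean more;
          do {
            row.clear();
            more = scanner.next(row);
            if (!row.isEmpty()) {
              // Delete the whole row locally; no per-row RPC from the client.
              region.delete(new Delete(row.get(0).getRow()), null, true);
              deleted++;
            }
          } while (more);
        } finally {
          scanner.close();
        }
        return deleted;
      }
    }

A client would invoke it with HTable.coprocessorExec(ScanDeleteProtocol.class, startRow, stopRow, callable), so each region scans and deletes locally and only a per-region count crosses the network - which is where the improvement over scan() + delete(List<Delete>) comes from.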
