Thanks for filing it, Jordan. Great writeup too.

Mike

On Thu, Apr 28, 2016 at 12:54 PM, Jordan Birdsell <jordan.birdsell.k...@statefarm.com> wrote:

> Opened KUDU-1431 <https://issues.apache.org/jira/browse/KUDU-1431>
>
> *From:* Mike Percy [mailto:mpe...@apache.org]
> *Sent:* Thursday, April 28, 2016 1:55 PM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
> Hey Jordan,
>
> It would definitely be helpful if you could file a JIRA to track this.
>
> The initial version of tablet history GC that I am currently working on as
> part of KUDU-236 won't yet support this type of SLA-based removal. The
> current changes are much simpler than that, since they are more in line
> with how we currently schedule background maintenance tasks: they
> prioritize work that is estimated to provide the greatest performance
> improvement or space improvement. Still, this is something we should look
> at more closely, to support compliance-based use cases like the one it
> sounds like you're describing.
>
> Mike
>
> On Thu, Apr 28, 2016 at 10:28 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> Todd,
>
> Should a JIRA be opened to track this?
>
> *From:* Jordan Birdsell
> *Sent:* Tuesday, April 26, 2016 2:07 PM
> *To:* user@kudu.incubator.apache.org
> *Subject:* RE: Weekly update 4/25
>
> Today we solve this on an RDBMS (DB2) platform, however when data is
> replicated to the cluster, we need to be able to address such deletes that
> occur after replication so that we don’t have to continue to replicate
> petabytes across the network. We’ve experimented with HBase and some HDFS
> solutions (Hive transactions), but neither really seem to be ideal.
>
> *From:* Todd Lipcon [mailto:t...@cloudera.com <t...@cloudera.com>]
> *Sent:* Tuesday, April 26, 2016 1:21 PM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
> On Tue, Apr 26, 2016 at 10:14 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> If we had to go less frequently than a day I’m sure it’d be acceptable.
> The volume of deletes is very low in this case. In some tables we can just
> “erase” a column’s data but in others, based on the data design, we must
> delete the entire row or group of rows.
>
> Thanks for the details.
>
> I'm curious, are you solving this use case with an existing system today?
> (eg HBase, HDFS, or some RDBMS?) Would like to compare our planned
> implementation with whatever that system is doing to make sure it's at
> least as good.
>
> -Todd
>
> *From:* Todd Lipcon [mailto:t...@cloudera.com]
> *Sent:* Tuesday, April 26, 2016 12:59 PM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
> On Tue, Apr 26, 2016 at 8:28 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> Yes, this is exactly what we need to do. Not immediately is ok for our
> current requirements, I’d say within a day would be ideal.
>
> Even within a day can be tricky for this kind of system if you have a
> fairly uniform random delete workload. That would imply that you're
> rewriting _all_ of your data every day, which uses a fair amount of IO.
>
> Are deletes extremely rare for your use case?
>
> Is it the entire row of data that has to be deleted or would it be
> sufficient to "X out" some particularly sensitive column?
>
> -Todd
>
> *From:* Jean-Daniel Cryans [mailto:jdcry...@apache.org]
> *Sent:* Tuesday, April 26, 2016 11:15 AM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
> Oh I see so this is in order to comply with asks such as "make sure that
> data for some user/customer is 100% deleted"? We'll still have the problem
> where we don't want to rewrite all the base data files (GBs/TBs) to clean
> up KBs of data, although since a single row is always only part of one row
> set, it means it's at most 64MB that you'd be rewriting.
>
> BTW is it ok if the data isn't immediately deleted? How long is it
> acceptable to wait for before it happens?
>
> J-D
>
> On Tue, Apr 26, 2016 at 8:04 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> Correct. As for the “latest version”, if a row is deleted in the latest
> version then removing the old versions where it existed is exactly what
> we’re looking to do. Basically, we need a way to physically get rid of
> select rows (or data within a column for that matter) and all versions of
> that row or column data.
>
> *From:* Jean-Daniel Cryans [mailto:jdcry...@apache.org]
> *Sent:* Tuesday, April 26, 2016 10:56 AM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
> Hi Jordan,
>
> In other words, you'd like to tag specific rows to be excluded from the
> default data history retention?
>
> Also, keep in mind that this improvement is about removing old versions of
> the data, it will not delete the latest version. If you are used to HBase,
> it's like specifying some TTL plus MIN_VERSIONS=1 so it doesn't completely
> age out a row.
>
> Hope this helps,
>
> J-D
>
> On Tue, Apr 26, 2016 at 4:29 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> Hi,
>
> Regarding row GC, I see in the design document that the tablet history
> max age will be set at the table level, would it be possible to make this
> something that can be overridden for specific transactions? We have some
> use cases that would require accelerated removal of data from disk and
> other use cases that would not have the same requirement. Unfortunately,
> these different use cases apply, often times, to the same tables.
>
> Thanks,
>
> Jordan Birdsell
>
> *From:* Todd Lipcon [mailto:t...@apache.org]
> *Sent:* Monday, April 25, 2016 1:54 PM
> *To:* d...@kudu.incubator.apache.org; user@kudu.incubator.apache.org
> *Subject:* Weekly update 4/25
>
> Hey Kudu-ers,
>
> For the last month and a half, I've been posting weekly summaries of
> community development activity on the Kudu blog. In case you aren't on
> twitter or slack you might not have seen the posts, so I'm going to start
> emailing them to the list as well.
>
> Here's this week's update:
>
> http://getkudu.io/2016/04/25/weekly-update.html
>
> Feel free to reply to this mail if you have any questions or would like to
> get involved in development.
>
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
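
For anyone following the retention semantics discussed in this thread (old versions age out, the latest version stays, and a deleted row can leave disk only once its delete is older than the history max age), here is a minimal Python sketch. It is not Kudu's implementation. The function name `gc_row_history`, the `(timestamp, value)` representation, and the use of `None` as a delete marker are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# One version of a row: (write timestamp, value).
# A value of None stands for a delete marker (hypothetical representation).
Version = Tuple[int, Optional[str]]


def gc_row_history(versions: List[Version], now: int, max_age: int) -> List[Version]:
    """Drop versions that no read inside the retention window could still need."""
    cutoff = now - max_age
    ordered = sorted(versions, key=lambda v: v[0])
    recent = [v for v in ordered if v[0] > cutoff]
    old = [v for v in ordered if v[0] <= cutoff]

    # Of the versions at or before the cutoff, keep only the newest one: it is
    # what a read at the cutoff time would see (the MIN_VERSIONS=1 analogy).
    kept = old[-1:] + recent

    # If the oldest surviving entry is a delete marker at or before the cutoff,
    # nothing older is visible any more, so the marker itself can go too.
    if kept and kept[0][1] is None and kept[0][0] <= cutoff:
        kept = kept[1:]
    return kept


if __name__ == "__main__":
    # Live row: older versions are dropped, the latest state is kept.
    print(gc_row_history([(1, "a"), (5, "b"), (9, "c")], now=10, max_age=3))
    # Row deleted before the cutoff: every trace of it is removed.
    print(gc_row_history([(1, "a"), (5, None)], now=10, max_age=3))
    # Row deleted inside the window: the old value must still be retained.
    print(gc_row_history([(1, "a"), (9, None)], now=10, max_age=3))
```

The third call shows the gap Jordan raises: with a single table-level max age, a recently deleted row stays on disk until the whole window has passed, and there is no way in this sketch to shorten that for one particular delete.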