Thanks for filing it, Jordan. Great writeup too.

Mike

On Thu, Apr 28, 2016 at 12:54 PM, Jordan Birdsell <
jordan.birdsell.k...@statefarm.com> wrote:

> Opened KUDU-1431 <https://issues.apache.org/jira/browse/KUDU-1431>
>
>
>
> *From:* Mike Percy [mailto:mpe...@apache.org]
> *Sent:* Thursday, April 28, 2016 1:55 PM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
>
>
> Hey Jordan,
>
> It would definitely be helpful if you could file a JIRA to track this.
>
>
>
> The initial version of tablet history GC that I am currently working on as
> part of KUDU-236 won't yet support this type of SLA-based removal. The
> current changes are much simpler than that and are more in line with how
> we currently schedule background maintenance tasks. They
> prioritize work that is estimated to provide the greatest performance
> improvement or space improvement. Still, this is something we should look
> at more closely, to support compliance-based use cases like the one it
> sounds like you're describing.
>
>
>
> Mike
>
>
>
> On Thu, Apr 28, 2016 at 10:28 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> Todd,
>
>
>
> Should a JIRA be opened to track this?
>
>
>
> *From:* Jordan Birdsell
> *Sent:* Tuesday, April 26, 2016 2:07 PM
> *To:* user@kudu.incubator.apache.org
> *Subject:* RE: Weekly update 4/25
>
>
>
> Today we solve this on an RDBMS (DB2) platform; however, when data is
> replicated to the cluster, we need to be able to address such deletes that
> occur after replication so that we don’t have to continue to replicate
> petabytes across the network.  We’ve experimented with HBase and some HDFS
> solutions (Hive transactions), but neither really seems to be ideal.
>
>
>
> *From:* Todd Lipcon [mailto:t...@cloudera.com]
> *Sent:* Tuesday, April 26, 2016 1:21 PM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
>
>
> On Tue, Apr 26, 2016 at 10:14 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> If we had to go less frequently than a day I’m sure it’d be acceptable.
> The volume of deletes is very low in this case.  In some tables we can just
> “erase” a column’s data but in others, based on the data design, we must
> delete the entire row or group of rows.
>
>
>
> Thanks for the details.
>
>
>
> I'm curious, are you solving this use case with an existing system today?
> (e.g. HBase, HDFS, or some RDBMS?) Would like to compare our planned
> implementation with whatever that system is doing to make sure it's at
> least as good.
>
>
>
> -Todd
>
>
>
>
>
> *From:* Todd Lipcon [mailto:t...@cloudera.com]
> *Sent:* Tuesday, April 26, 2016 12:59 PM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
>
>
> On Tue, Apr 26, 2016 at 8:28 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> Yes, this is exactly what we need to do.  Not immediately is ok for our
> current requirements, I’d say within a day would be ideal.
>
>
>
> Even within a day can be tricky for this kind of system if you have a
> fairly uniform random delete workload. That would imply that you're
> rewriting _all_ of your data every day, which uses a fair amount of IO.
>
>
>
> Are deletes extremely rare for your use case?
>
>
>
> Is it the entire row of data that has to be deleted or would it be
> sufficient to "X out" some particularly sensitive column?
>
>
>
> -Todd
>
>
>
>
>
> *From:* Jean-Daniel Cryans [mailto:jdcry...@apache.org]
> *Sent:* Tuesday, April 26, 2016 11:15 AM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
>
>
> Oh I see, so this is in order to comply with asks such as "make sure that
> data for some user/customer is 100% deleted"? We'll still have the problem
> where we don't want to rewrite all the base data files (GBs/TBs) to clean
> up KBs of data, although since a single row is always only part of one row
> set, it means it's at most 64MB that you'd be rewriting.
>
>
>
> BTW is it ok if the data isn't immediately deleted? How long is it
> acceptable to wait before it happens?
>
>
>
> J-D
>
>
>
> On Tue, Apr 26, 2016 at 8:04 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> Correct.  As for the “latest version”, if a row is deleted in the latest
> version then removing the old versions where it existed is exactly what
> we’re looking to do.  Basically, we need a way to physically get rid of
> select rows (or data within a column for that matter) and all versions of
> that row or column data.
>
>
>
> *From:* Jean-Daniel Cryans [mailto:jdcry...@apache.org]
> *Sent:* Tuesday, April 26, 2016 10:56 AM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
>
>
> Hi Jordan,
>
>
>
> In other words, you'd like to tag specific rows to be excluded from the
> default data history retention?
>
>
>
> Also, keep in mind that this improvement is about removing old versions of
> the data; it will not delete the latest version. If you are used to HBase,
> it's like specifying some TTL plus MIN_VERSIONS=1 so it doesn't completely
> age out a row.
>
>
>
> Hope this helps,
>
>
>
> J-D
>
>
>
> On Tue, Apr 26, 2016 at 4:29 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> Hi,
>
>
>
> Regarding row GC, I see in the design document that the tablet history
> max age will be set at the table level. Would it be possible to make this
> something that can be overridden for specific transactions?  We have some
> use cases that would require accelerated removal of data from disk and
> other use cases that would not have the same requirement. Unfortunately,
> these different use cases often apply to the same tables.
>
>
>
> Thanks,
>
> Jordan Birdsell
>
>
>
> *From:* Todd Lipcon [mailto:t...@apache.org]
> *Sent:* Monday, April 25, 2016 1:54 PM
> *To:* d...@kudu.incubator.apache.org; user@kudu.incubator.apache.org
> *Subject:* Weekly update 4/25
>
>
>
> Hey Kudu-ers,
>
>
>
> For the last month and a half, I've been posting weekly summaries of
> community development activity on the Kudu blog. In case you aren't on
> twitter or slack you might not have seen the posts, so I'm going to start
> emailing them to the list as well.
>
>
>
> Here's this week's update:
>
> http://getkudu.io/2016/04/25/weekly-update.html
>
>
>
> Feel free to reply to this mail if you have any questions or would like to
> get involved in development.
>
>
>
> -Todd
>
>
>
>
>
>
>
>
>
> --
>
> Todd Lipcon
> Software Engineer, Cloudera
