date:20231211

Re: ACID with Hive/Kylin

2023-12-11 Thread Xiaoxiang Yu

I don't know GDPR very well, so here is my understanding.

For Hive and HDFS, you can consider the following techniques, which support
ACID in Spark and Hive (I recommend the first one):
1) Delta Lake:
https://docs.databricks.com/en/security/privacy/gdpr-delta.html
2) Hive ACID tables:
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/migrate-hive-workloads/topics/hive-acid-migration-regulations.html
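
As a rough sketch of what an erasure request turns into with either option,
the helper below only builds the SQL text. The table name, the `uid` column,
and the helper itself are hypothetical; the statements (DELETE plus VACUUM
for Delta Lake, DELETE plus a major compaction for a Hive ACID table) are
the standard ones, but please check them against your own versions.

```python
import re
from typing import List


def erasure_statements(table: str, uid: int, engine: str) -> List[str]:
    """Build the SQL for erasing one user's rows (hypothetical helper).

    `table` and the `uid` column are placeholder names for illustration.
    """
    # Accept only plain identifiers and integer ids, since the values are
    # interpolated directly into the SQL text.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_.]*", table):
        raise ValueError("unexpected table name: %r" % table)
    if not isinstance(uid, int):
        raise TypeError("uid must be an integer")

    delete = f"DELETE FROM {table} WHERE uid = {uid}"
    if engine == "delta":
        # Delta Lake DELETE marks the old data files as removed in the
        # transaction log; VACUUM later drops those files from storage
        # once they are older than the retention threshold.
        return [delete, f"VACUUM {table}"]
    if engine == "hive":
        # A Hive ACID DELETE writes delete deltas; a major compaction
        # rewrites the base files without the deleted rows.
        return [delete, f"ALTER TABLE {table} COMPACT 'major'"]
    raise ValueError("engine must be 'delta' or 'hive'")


for statement in erasure_statements("dw.dim_user", 42, "hive"):
    print(statement)
```

Note that in both cases the DELETE alone does not remove the bytes from
HDFS right away; the second statement is what eventually clears the old
files.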

For Kylin, there are three places which may store data: index, snapshot,
and dict. Refreshing a snapshot costs less time and fewer resources, while
refreshing an index or dict costs much more. A snapshot refresh is
triggered automatically when you build an index every day.

I think you should consider centralizing user-sensitive columns (email,
phone, address) in dimension tables, so that your fact table only has the
foreign key (for example, uid) which refers to the primary key of the
dimension tables.
When you are modeling in Kylin, for the dim tables which contain
user-sensitive columns, try the following:

1. Set the dim tables as snapshots by disabling the precompute join
relation, so these columns won't be built into indexes; see
https://kylin.apache.org/5.0/docs/modeling/model_design/precompute_join_relations
2. Do not create a bitmap measure on these columns, so they won't be built
into the dict.
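
To illustrate the modeling idea only, here is a minimal sketch that uses an
in-memory SQLite database as a stand-in for the Hive tables. The table and
column names (`dim_user`, `fact_orders`, `amount`) are made up for the
example. It shows that when the sensitive columns live only in the
dimension table, deleting one dimension row removes them, while aggregates
over the fact table are unchanged.

```python
import sqlite3

# SQLite stands in for Hive here; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_user (uid INTEGER PRIMARY KEY, email TEXT, phone TEXT);
    CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, uid INTEGER, amount REAL);
    INSERT INTO dim_user VALUES
        (1, 'a@example.com', '555-0001'),
        (2, 'b@example.com', '555-0002');
    INSERT INTO fact_orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
    """
)


def erase_user(connection: sqlite3.Connection, uid: int) -> None:
    """Delete the sensitive columns of one user by removing the dim row."""
    connection.execute("DELETE FROM dim_user WHERE uid = ?", (uid,))
    connection.commit()


erase_user(conn, 1)

# Only the dimension table carried email and phone, so they are gone now.
remaining_pii = conn.execute(
    "SELECT uid, email FROM dim_user ORDER BY uid"
).fetchall()

# The fact table holds just the uid key, so the aggregate is unaffected.
total_amount = conn.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]

print(remaining_pii)
print(total_amount)
```

In Kylin, the equivalent of the deleted row disappearing is the snapshot
refresh of the dim table, which is the cheap operation mentioned above.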

With warm regards
Xiaoxiang Yu

On Tue, Dec 12, 2023 at 12:11 PM Nam Đỗ Duy wrote:

> Dear Xiaoxiang, Sirs/Madams
>
> I face an issue with deleting user data under a GDPR-like policy: when a
> user sends a request to delete their personal data, we need to delete it
> from all systems, which means deleting the data:
>
> 1- from Kylin index (cube)
> 2- from Hive
> 3- from HDFS
>
> Have you had the same use case before? Do you have any suggestions for
> handling this scenario?
>
> Thank you very much and best regards
>

Re: ACID with Hive/Kylin

2023-12-11 Thread ShaoFeng Shi

Hi Nam,

As Kylin is used to store aggregated data, there should be no PII in it.
(If you use Kylin to manage person-level data, that is not a good use
case.)

If you do need to delete certain personal data, refreshing the whole index
or some of its partitions is what we can do.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC,
Apache Incubator PMC,
Email: shaofeng...@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscr...@kylin.apache.org
Join Kylin dev mail group: dev-subscr...@kylin.apache.org

ACID with Hive/Kylin

2023-12-11 Thread Nam Đỗ Duy via user

Dear Xiaoxiang, Sirs/Madams

I face an issue with deleting user data under a GDPR-like policy: when a
user sends a request to delete their personal data, we need to delete it
from all systems, which means deleting the data:

1- from Kylin index (cube)
2- from Hive
3- from HDFS

Have you had the same use case before? Do you have any suggestions for
handling this scenario?

Thank you very much and best regards
