Re: ACID with Hive/Kylin

2023-12-11 Thread Xiaoxiang Yu
I don't know GDPR very well. Here is my understanding.

For Hive and HDFS, you can consider one of these techniques, which support
ACID in Spark and Hive (I recommend the first one):
1) Delta Lake:
https://docs.databricks.com/en/security/privacy/gdpr-delta.html
2) Hive ACID tables:
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/migrate-hive-workloads/topics/hive-acid-migration-regulations.html

For Kylin, there are three places which may store data: index, snapshot, and
dict. Refreshing a snapshot costs less time and fewer resources, while
refreshing an index or dict costs much more. A snapshot refresh is triggered
automatically when you build an index every day.

I think you should consider centralizing user-sensitive columns (email,
phone, address) in dimension tables, so that your fact table only has the
foreign key (for example, uid) which refers to the primary key of the
dimension tables.
When you are modeling in Kylin, for the dimension tables which contain
user-sensitive columns, try the following:

1. Set the dimension tables as snapshots by disabling the precomputed join
relation, so these columns won't be built into indexes; refer to
https://kylin.apache.org/5.0/docs/modeling/model_design/precompute_join_relations
2. Do not create a bitmap measure on these columns, so these columns won't be
built into the dict.
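
The modeling advice above can be sketched with a toy example. This is only an
illustration, using Python's built-in sqlite3 as a stand-in for Hive; the
table names dim_user and fact_order, their columns, and the erase_user helper
are all hypothetical and not taken from Kylin or from this thread. It shows
that when sensitive columns live only in the dimension table, an erasure
request touches one row there while the aggregates over the fact table stay
intact.

```python
import sqlite3

# In-memory database as a stand-in for the Hive warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Sensitive columns are kept only in the dimension table.
    CREATE TABLE dim_user (uid INTEGER PRIMARY KEY, email TEXT, phone TEXT);
    -- The fact table carries just the foreign key uid plus measures.
    CREATE TABLE fact_order (order_id INTEGER PRIMARY KEY, uid INTEGER, amount REAL);
    INSERT INTO dim_user VALUES (1, 'a@example.com', '555-0001'),
                                (2, 'b@example.com', '555-0002');
    INSERT INTO fact_order VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
    """
)


def erase_user(connection: sqlite3.Connection, uid: int) -> int:
    """Delete the sensitive row for one user and return the number of rows removed."""
    cursor = connection.execute("DELETE FROM dim_user WHERE uid = ?", (uid,))
    connection.commit()
    return cursor.rowcount


erased = erase_user(conn, 1)
remaining_pii = conn.execute(
    "SELECT COUNT(*) FROM dim_user WHERE uid = 1"
).fetchone()[0]
total_amount = conn.execute("SELECT SUM(amount) FROM fact_order").fetchone()[0]
```

In a real deployment the delete would run against an ACID-capable table
(Delta Lake or a Hive ACID table), followed by a snapshot refresh in Kylin.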


With warm regards
Xiaoxiang Yu



On Tue, Dec 12, 2023 at 12:11 PM Nam Đỗ Duy  wrote:

> Dear Xiaoxiang, Sirs/Madams
>
> I face an issue with deleting user data under a GDPR-like policy:
> when a user sends a request to delete their personal data, we need
> to delete it from all systems, which means deleting the data:
>
> 1- from Kylin index (cube)
> 2- from Hive
> 3- from HDFS
>
> Have you had the same use case before? Do you have any suggestions for
> achieving this?
>
> Thank you very much and best regards
>


Re: ACID with Hive/Kylin

2023-12-11 Thread ShaoFeng Shi
Hi Nam,

As Kylin is used to store aggregated data, there should be no PII
in it. (Using Kylin to manage person-level data is not a
good use case.)

If you do need to delete certain personal data, refreshing the whole index or
some partitions is what we can do.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC,
Apache Incubator PMC,
Email: shaofeng...@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscr...@kylin.apache.org
Join Kylin dev mail group: dev-subscr...@kylin.apache.org




On Tue, Dec 12, 2023 at 12:11, Nam Đỗ Duy wrote:

> Dear Xiaoxiang, Sirs/Madams
>
> I face an issue with deleting user data under a GDPR-like policy:
> when a user sends a request to delete their personal data, we need
> to delete it from all systems, which means deleting the data:
>
> 1- from Kylin index (cube)
> 2- from Hive
> 3- from HDFS
>
> Have you had the same use case before? Do you have any suggestions for
> achieving this?
>
> Thank you very much and best regards
>


ACID with Hive/Kylin

2023-12-11 Thread Nam Đỗ Duy
Dear Xiaoxiang, Sirs/Madams

I face an issue with deleting user data under a GDPR-like policy:
when a user sends a request to delete their personal data, we need
to delete it from all systems, which means deleting the data:

1- from Kylin index (cube)
2- from Hive
3- from HDFS

Have you had the same use case before? Do you have any suggestions for
achieving this?

Thank you very much and best regards


[jira] [Created] (KYLIN-5747) Calcite constant folding, adding strings to numbers, results not as expected when multiple plus signs are used together

2023-12-11 Thread zhong.zhu (Jira)
zhong.zhu created KYLIN-5747:


 Summary: Calcite constant folding, adding strings to numbers, 
results not as expected when multiple plus signs are used together
 Key: KYLIN-5747
 URL: https://issues.apache.org/jira/browse/KYLIN-5747
 Project: Kylin
  Issue Type: Bug
Affects Versions: 5.0-beta
Reporter: zhong.zhu
Assignee: zhong.zhu
 Fix For: 5.0.0


Phenomenon:
When more than one plus sign is used in a row and the operands on both sides
of each plus sign are constants, the result is not as expected:
'1' + 3 + 3 → 7 (correct)
'1' + 3 + '3' → 4.03 (wrong result)
'1' + '3' + 'a' → error

When multiple plus signs are used in a row, the operands on both sides of
each plus sign are constants, and the first plus sign evaluates to null,
chaining plus signs is not supported.
e.g. 'q' + 1 + 1 → error
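
For reference, the folding order can be modeled with a toy sketch. It assumes
Spark-like semantics that the report does not state explicitly: every operand
is cast to a number, a non-numeric string casts to null, and null propagates
through the rest of the chain. The names to_number and fold_plus are
hypothetical and are not Calcite or Kylin APIs.

```python
from typing import Optional, Union

Operand = Union[str, int, float, None]


def to_number(value: Operand) -> Optional[float]:
    """Cast an operand to a number; a non-numeric string becomes None (SQL NULL)."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def fold_plus(*operands: Operand) -> Optional[float]:
    """Fold a chain of '+' strictly left to right, propagating NULL."""
    if not operands:
        raise ValueError("at least one operand is required")
    result = to_number(operands[0])
    for operand in operands[1:]:
        rhs = to_number(operand)
        # Once either side is NULL, the whole chain stays NULL (assumed semantics).
        result = None if result is None or rhs is None else result + rhs
    return result
```

Under these assumptions, fold_plus('1', 3, '3') gives the same value as
fold_plus('1', 3, 3), and fold_plus('q', 1, 1) yields None instead of raising
an error.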






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KYLIN-5746) After taking an online model offline on the page, clicking to bring the model online again shows the model online button grayed out

2023-12-11 Thread zhong.zhu (Jira)
zhong.zhu created KYLIN-5746:


 Summary: After taking an online model offline on the page, clicking to 
bring the model online again shows the model online button grayed out
 Key: KYLIN-5746
 URL: https://issues.apache.org/jira/browse/KYLIN-5746
 Project: Kylin
  Issue Type: Bug
Affects Versions: 5.0-beta
Reporter: zhong.zhu
Assignee: zhong.zhu
 Fix For: 5.0.0
 Attachments: image-2023-12-11-17-33-31-889.png

Steps to reproduce:
1. Create a model, build data, and bring the model online.
2. Take the model offline.
3. Click the model again and choose to bring it online.

Actual result:
The model online button is grayed out, indicating that the model cannot be
brought online.
 !image-2023-12-11-17-33-31-889.png! 





[jira] [Created] (KYLIN-5745) The historical garbage cleanup task was not completed, causing the subsequent scheduled garbage cleanup task cannot be executed normally

2023-12-11 Thread zhong.zhu (Jira)
zhong.zhu created KYLIN-5745:


 Summary: The historical garbage cleanup task was not completed, 
causing the subsequent scheduled garbage cleanup task cannot be executed 
normally
 Key: KYLIN-5745
 URL: https://issues.apache.org/jira/browse/KYLIN-5745
 Project: Kylin
  Issue Type: Bug
Affects Versions: 5.0-beta
Reporter: zhong.zhu
Assignee: zhong.zhu
 Fix For: 5.0.0


*Problem description*:
The scheduled garbage cleanup cannot complete successfully.


*Background*:
The customer found that Kylin had a large number of small files occupying HDFS
storage that needed to be cleaned up. We checked the customer's environment and
found that the scheduled garbage cleanup had not been completing properly and
kept timing out.


*Troubleshooting:*
Investigation showed that the customer's garbage cleanup was first triggered
in the early morning of April 6, after KE was restarted on the night of
April 5. Once this cleanup was triggered, the query history cleanup thread
kept deleting from then on. As a result, subsequent scheduled garbage cleanup
tasks could not complete.

The cleanup deletes 2,000 rows at a time, and one of the customer's projects
needed to delete 550,000 query history records. According to kylin.log, table
locking made the deletes so slow that a single delete operation took more than
20 minutes.

The following log shows the main garbage cleanup thread waiting for the query
history cleanup to complete; the query history cleanup did not finish, so the
main thread timed out and exited.


{code:shell}
2023-04-06T00:00:00,015 INFO  [RoutineOpsWorker-287] service.ScheduleService : execute task MetadataBackup with remaining time: 1435 ms
2023-04-06T00:01:52,649 INFO  [RoutineOpsWorker-287] service.ScheduleService : execute task QueryHistoriesCleanup with remaining time: 14287361 ms
...
2023-04-06T04:00:00,012 WARN  [DefaultTaskScheduler-3] service.ScheduleService : Routine task execution timeout
java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_242]
    at org.apache.kylin.rest.service.ScheduleService.executeTask(ScheduleService.java:107) ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?]
    at org.apache.kylin.rest.service.ScheduleService.routineTask(ScheduleService.java:77) ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?]
    at org.apache.kylin.rest.service.ScheduleService$$FastClassBySpringCGLIB$$afbfc46c.invoke() ~[kylin-job-service-5.0.0-ke-4.6.2.0.jar:?]
{code}

The following log runs up to the latest time available in the logs: after
09:00 the query history cleanup is still deleting, and it did not stop when
the main thread exited.
{code:shell}
2023-04-06T00:08:43,015 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.selectByProject : <==  Total: 12
2023-04-06T00:08:43,016 INFO  [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it.
2023-04-06T00:08:43,016 INFO  [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it.
2023-04-06T00:08:43,016 INFO  [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it.
2023-04-06T00:08:43,016 INFO  [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Query histories of project is less than the maximum limit, so skip it.
2023-04-06T00:08:43,017 INFO  [QueryHistoryCleanWorker-23145] util.QueryHisStoreUtil : Start to delete query histories that are beyond max size for project, records:1551669
...
2023-04-06T09:03:54,974 INFO  [QueryHistoryCleanWorker-23145] query.JdbcQueryHistoryStore : Delete 2000 row query history for project [CXCZH] takes 938060 ms
2023-04-06T09:03:54,975 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.delete : ==>  Preparing: delete from ke4_instance_query_history_realization where query_time < ? and project_name = ?
2023-04-06T09:03:54,975 DEBUG [QueryHistoryCleanWorker-23145] QueryHistoryMapper.delete : ==> Parameters: 1678863450091(Long), CXCZH(String)
{code}
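
The behavior in the two logs above (the waiting thread times out while the
worker keeps deleting) can be reproduced in miniature. The sketch below is a
Python analogue of waiting on a future with a timeout, which is what the
FutureTask.get frame in the stack trace suggests; it is not Kylin's code, and
the names cleanup_query_history and run_routine_task, the batch size, and the
timings are hypothetical.

```python
import concurrent.futures
import time


def cleanup_query_history(batches: int, seconds_per_batch: float) -> int:
    """Hypothetical stand-in for the slow batched delete (2,000 rows per batch)."""
    deleted = 0
    for _ in range(batches):
        time.sleep(seconds_per_batch)  # simulates a delete slowed by table locking
        deleted += 2000
    return deleted


def run_routine_task(timeout_s: float, batches: int, seconds_per_batch: float):
    """Wait for the cleanup with a timeout, as the scheduling thread does."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cleanup_query_history, batches, seconds_per_batch)
    timed_out = False
    try:
        future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Only the wait is abandoned; the worker thread is not interrupted.
        timed_out = True
    still_running = not future.done()
    pool.shutdown(wait=False)  # do not block on the worker
    return timed_out, still_running, future


timed_out, still_running, future = run_routine_task(
    timeout_s=0.05, batches=5, seconds_per_batch=0.05
)
```

Giving up on the wait does not cancel the work, so a later scheduled run can
overlap with a cleanup that is still in progress unless the worker is
cancelled or guarded explicitly.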


 


