[ 
https://issues.apache.org/jira/browse/HUDI-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540154#comment-17540154
 ] 

Yue Zhang edited comment on HUDI-1575 at 5/20/22 2:58 PM:
----------------------------------------------------------

Eager conflict detection based on marker

Currently there are three base HoodieWriteHandle implementations: 
HoodieCreateHandle, HoodieMergeHandle and HoodieAppendHandle.
Each of them creates a new marker file during initialization, before actually 
writing data.

We can do this eager conflict detection before creating the marker file. The 
details are as follows:

First, we need to create a TaskTransactionManager at the task level; each one 
holds its own taskLocker. Let's take a ZK lock as an example.

Then we try to lock partitionPath + "/" + fileId on ZK before creating the 
marker file.
After that we need to do conflict detection:
    1. List the `.temp` directory and try to find all the marker files whose 
path contains the `partitionPath + "/" + fileId` prefix. (The listing can be 
optimized here so that we don't need to list the whole directory.)
    2. If the list result is not empty, there is a conflict caused by another 
inflight ingestion job, and we need to fail the current ingestion. Otherwise, 
there is no inflight conflict.
    3. We also need to make sure there is no committed conflict, i.e. a write 
that finished before conflict detection (so the corresponding marker files are 
already deleted). We need to reload the active timeline, call 
getLatestFileSlice/getLatestBaseFile and compare the result with the original 
one. If they are not equal, we also fail the current ingestion.

Then create the marker file.

Finally, release this file-group-level lock.
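
The steps above can be sketched roughly as follows. This is only an 
illustration, not Hudi code: the class, the method, the marker naming scheme 
(`<partitionPath>/<fileId>_<instantTime>.marker`) and the in-memory maps that 
stand in for the ZK lock, the `.temp` directory and the reloaded active 
timeline are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: in-memory stand-ins replace ZK, the .temp directory and the timeline.
class EagerConflictDetectionSketch {
    // Stand-in for the per-file-group ZK lock, keyed by partitionPath + "/" + fileId.
    static final Map<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();
    // Stand-in for the marker files of inflight writers under `.temp`.
    static final Set<String> MARKERS = ConcurrentHashMap.newKeySet();
    // Stand-in for the latest committed file slice per file group after a timeline reload.
    static final Map<String, String> LATEST_COMMITTED = new ConcurrentHashMap<>();

    static class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    // sliceSeenAtStart is the file slice the writer saw when it started (null if none).
    static String createMarkerEagerly(String partitionPath, String fileId,
                                      String instantTime, String sliceSeenAtStart) {
        String fileGroup = partitionPath + "/" + fileId;
        ReentrantLock lock = LOCKS.computeIfAbsent(fileGroup, k -> new ReentrantLock());
        lock.lock(); // lock the file group before touching markers
        try {
            // Steps 1 and 2: any existing marker with this prefix means an inflight conflict.
            String prefix = fileGroup + "_";
            for (String marker : MARKERS) {
                if (marker.startsWith(prefix)) {
                    throw new ConflictException("inflight conflict on " + fileGroup);
                }
            }
            // Step 3: a changed latest slice means another writer already committed.
            String latest = LATEST_COMMITTED.get(fileGroup);
            if (!Objects.equals(latest, sliceSeenAtStart)) {
                throw new ConflictException("committed conflict on " + fileGroup);
            }
            // No conflict found: create the marker while still holding the lock.
            String newMarker = prefix + instantTime + ".marker";
            MARKERS.add(newMarker);
            return newMarker;
        } finally {
            lock.unlock(); // finally release the file-group-level lock
        }
    }
}
```

Holding the lock across both checks and the marker creation is what makes the 
check-then-create sequence atomic per file group: a second writer can only run 
its checks after the first writer's marker is visible.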


was (Author: zhangyue19921010):
Eager conflict detection based on marker file

Currently there are three base HoodieWriteHandle implementations: 
HoodieCreateHandle, HoodieMergeHandle and HoodieAppendHandle.
Each of them creates a new marker file during initialization, before actually 
writing data.

We can do conflict detection while creating the marker file. The details are 
as follows:

First, we need to create a TaskTransactionManager at the task level; each one 
holds its own taskLocker. Let's take a ZK lock as an example.

Then we try to lock partitionPath + "/" + fileId on ZK before creating the 
marker file.
After that we need to do conflict detection:
    1. List the .temp directory and try to find all the marker files whose 
path contains the partitionPath + "/" + fileId prefix. (The listing can be 
optimized here so that we don't need to list the whole directory.)
    2. If the list result is not empty, there is a conflict caused by another 
inflight ingestion, and we need to fail the current ingestion job. Otherwise, 
there is no inflight conflict.
    3. We also need to make sure there is no committed conflict, i.e. a write 
that finished before conflict detection (so the corresponding marker files are 
deleted). We need to reload the active timeline, call 
getLatestFileSlice/getLatestBaseFile and compare the result with the original 
one. If they are not equal, we also fail the current ingestion.
    4. Then create the marker file.
    5. Finally, release this file-group-level lock.

> Early detection by periodically checking last written commit & active markers
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-1575
>                 URL: https://issues.apache.org/jira/browse/HUDI-1575
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: writer-core
>            Reporter: Nishith Agarwal
>            Assignee: Yue Zhang
>            Priority: Blocker
>             Fix For: 0.12.0
>
>
> Check if there are more commits, try to do resolution based on the current 
> markers, and abort the currently running job if we already found a commit 
> that happened in the meantime, to avoid using up resources on a concurrent 
> job that cannot succeed. This can give back much of the cluster early and 
> dramatically lower costs in the cloud.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
