Re: Manual deletion of a parquet file
Awesome. https://issues.apache.org/jira/browse/HUDI-796 tracks this.

On Wed, Apr 15, 2020 at 3:17 AM Vinoth Chandar wrote:

> Okay makes sense.. I think we can raise a PR for the tool, integrated into the CLI.. Then everyone can weigh in more as well?
>
> Thanks for taking this up, Pratyaksh!
Re: Manual deletion of a parquet file
Okay makes sense.. I think we can raise a PR for the tool, integrated into the CLI.. Then everyone can weigh in more as well?

Thanks for taking this up, Pratyaksh!

On Tue, Apr 14, 2020 at 2:58 AM Pratyaksh Sharma wrote:
Re: Manual deletion of a parquet file
Hi Vinoth,

Thank you for your guidance.

I went through the code for RepairsCommand in the hudi-cli package, which internally calls DedupeSparkJob.scala. The logic therein marks a file as bad based on the commit time of its records. However, in my case even the commit time is the same for the duplicates; the only fields that vary are `_hoodie_commit_seqno` and `_hoodie_file_name`. So I guess this class will not help me.

IIUC the logic in DedupeSparkJob can only work when the duplicates were created by an INSERT operation. If an UPDATE comes in for a duplicated record, then both files containing that record will have the same commit time from then on. Such cases cannot be handled by looking at `_hoodie_commit_time`, which is exactly what I am experiencing.

I have written a script to solve my use case. It is straightforward: it simply deletes the duplicate keys and rewrites the file. I wanted to check whether it would add any value to our code base and whether I should raise a PR for it. If the community agrees, we can work together to further improve it and make it generic enough.

On Mon, Apr 13, 2020 at 8:22 PM Vinoth Chandar wrote:
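The dedupe-and-rewrite idea described above can be sketched as follows. This is an illustrative stand-in only: it works over in-memory dicts rather than actual parquet files (a real job would read and rewrite the files with Spark), the helper name is made up, and "keep the lowest `_hoodie_commit_seqno`" is an assumed tie-breaking rule, chosen because the thread notes that commit time alone cannot distinguish the duplicates.

```python
# Sketch of deduplicating rows that share a record key but differ only in
# _hoodie_commit_seqno / _hoodie_file_name, as discussed in the thread.
# Rows are plain dicts standing in for parquet records.

def dedupe_records(records):
    """Keep one row per _hoodie_record_key, preferring the lowest
    _hoodie_commit_seqno (an assumed rule; commit time cannot break
    the tie here since duplicates share _hoodie_commit_time)."""
    best = {}
    for row in records:
        key = row["_hoodie_record_key"]
        if key not in best or row["_hoodie_commit_seqno"] < best[key]["_hoodie_commit_seqno"]:
            best[key] = row
    return list(best.values())
```

A real version of this would then overwrite only the affected parquet files in place, as the CLI tool discussed below-thread does.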
Re: Manual deletion of a parquet file
Hi Pratyaksh,

Your understanding is correct. There is a duplicate-fix tool in the CLI (I wrote it a while ago for COW, but did use it in production a few times for situations like these). Check that out? IIRC it will keep both the commits and their files, but simply get rid of the duplicate records and replace the parquet files in place.

>> Also once duplicates are written, you are not sure of which file the update will go to next since the record is already present in 2 different parquet files.

IIRC the bloom index will tag both files, and both will be updated.

The table could show many side effects depending on when exactly the race happened:

- The second commit may have rolled back the first inflight commit, mistaking it for a failed write. In this case some data may also be missing, though I would expect the first commit to actually fail, since its files got deleted midway through writing.
- If both of them indeed succeeded, then it's just the duplicates.

Thanks
Vinoth

On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma wrote:

> Hi,
>
> From my experience so far of working with Hudi, I understand that Hudi is not designed to handle concurrent writes from 2 different sources, for example 2 instances of HoodieDeltaStreamer simultaneously running and writing to the same dataset. I have experienced that such a case can result in duplicate writes in case of inserts. Also, once duplicates are written, you are not sure which file an update will go to next, since the record is already present in 2 different parquet files. Please correct me if I am wrong.
>
> Having experienced this in a few Hudi datasets, I now want to delete one of the parquet files which contains duplicates in some partition of a COW-type Hudi dataset. I want to know if deleting a parquet file manually can have any repercussions? If yes, what can the side effects be?
>
> Any leads will be highly appreciated.
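Since the bloom index tags both files, a quick way to confirm which record keys are affected is to group the table's meta columns by key and look for keys mapped to more than one file. A minimal sketch in plain Python (the function name is made up; in practice you would select `_hoodie_record_key` and `_hoodie_file_name` from the table with Spark, here plain tuples stand in for those two columns):

```python
from collections import defaultdict

def find_duplicated_keys(rows):
    """rows: iterable of (_hoodie_record_key, _hoodie_file_name) pairs.
    Returns the keys present in two or more distinct files."""
    files_per_key = defaultdict(set)
    for key, file_name in rows:
        files_per_key[key].add(file_name)
    return {k for k, files in files_per_key.items() if len(files) > 1}
```

Grouping by distinct file name (rather than just counting rows) matters: a key legitimately appears once per file, so only a key spanning multiple base files in the same file group indicates the duplicate situation described in this thread.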
