Pratyaksh Sharma created HUDI-796:
-------------------------------------

             Summary: Rewrite DedupeSparkJob.scala without considering the 
_hoodie_commit_time
                 Key: HUDI-796
                 URL: https://issues.apache.org/jira/browse/HUDI-796
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
            Reporter: Pratyaksh Sharma
            Assignee: Pratyaksh Sharma


`_hoodie_commit_time` can only be used to dedupe a partition path if the 
duplicates were caused by an INSERT operation. In the case of updates, the 
bloom filter tags every file in which the record is present for update, so 
from then on all such files carry the same `_hoodie_commit_time` for the 
duplicate record. 

Hence it makes sense to rewrite this class without considering the metadata 
field. 
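
To illustrate the idea, here is a minimal sketch in plain Scala collections 
(no Spark) of deduping by record key alone. It is not the DedupeSparkJob 
implementation: the `Row` case class, the `DedupeSketch` object, and the rule 
of keeping the copy in the lexicographically smallest file ID are all 
assumptions made for this example. It only handles duplicates that span 
files, not repeated keys inside a single file.

```scala
// Hypothetical, simplified view of a stored record; not a Hudi class.
final case class Row(recordKey: String, fileId: String, commitTime: String)

object DedupeSketch {
  // For every record key found in more than one file, keep the copy in one
  // deterministically chosen file (assumption: the smallest file ID) and
  // return the (recordKey, fileId) pairs of the copies to drop.
  // commitTime is deliberately ignored: after an update, every copy of a
  // duplicate record carries the same commit time, so it cannot break ties.
  def copiesToDrop(rows: Seq[Row]): Set[(String, String)] =
    rows
      .groupBy(_.recordKey)
      .toSeq
      .flatMap { case (key, copies) =>
        val files = copies.map(_.fileId).distinct.sorted
        files.drop(1).map(fileId => (key, fileId))
      }
      .toSet
}
```

For example, given `Row("k1", "f1", "003")`, `Row("k1", "f2", "003")` and 
`Row("k2", "f1", "001")`, the two copies of `k1` share commit time `003`, and 
`copiesToDrop` returns `Set(("k1", "f2"))`.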



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
