Indeed, Hudi seems promising, but is it suitable for a production environment while it is still incubating?
On Mon, Apr 15, 2019 at 22:46, Ivan Panico <iv.pan...@gmail.com> wrote:

> All right, but does that mean migrating everything to HBase / Kudu? That also
> kind of means that GDPR is killing HDFS? Is that what you are suggesting?
>
> On Mon, Apr 15, 2019 at 22:43, Wei-Chiu Chuang <weic...@cloudera.com>
> wrote:
>
>> Wow, Chao, didn't realize you guys are making Hudi into Apache :)
>> HDFS is generally not a good fit for this use case. I've seen people
>> using Kudu for GDPR compliance.
>>
>> On Mon, Apr 15, 2019 at 11:11 AM Chao Sun <sunc...@apache.org> wrote:
>>
>>> Check out Hudi (https://github.com/apache/incubator-hudi), which adds
>>> upsert functionality on top of columnar data such as Parquet.
>>>
>>> Chao
>>>
>>> On Mon, Apr 15, 2019 at 10:49 AM Vinod Kumar Vavilapalli <
>>> vino...@apache.org> wrote:
>>>
>>>> If one uses HDFS as raw file storage where a single file intermingles
>>>> data from all users, it's not easy to achieve what you are trying to do.
>>>>
>>>> Instead, using systems (e.g. HBase, Hive) that support updates and
>>>> deletes to individual records is the only way to go.
>>>>
>>>> +Vinod
>>>>
>>>> On Apr 15, 2019, at 1:32 AM, Ivan Panico <iv.pan...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> The recent GDPR introduced a new right for people: the right to be
>>>> forgotten. This right means that if an organization is asked by a customer
>>>> to delete all his data, the organization has to comply most of the time
>>>> (there are conditions which can suspend this right, but that's beside my
>>>> point).
>>>>
>>>> Now, HDFS being WORM (Write Once, Read Multiple times), I guess you see
>>>> where I'm going. What would be the best way to implement this line deletion
>>>> feature (supposing that when a customer asks for a delete of all his data,
>>>> the organization would have to delete some lines in some HDFS files)?
>>>>
>>>> Right now I'm going for the following:
>>>>
>>>>    - Create a key-value store (user, [files])
>>>>    - On file writing, feed this store with the users and file location
>>>>    (by appending or updating a key).
>>>>    - When the deletion is requested by the user "john", look in that
>>>>    store and rewrite all the files of the "john" key (read the file in
>>>>    memory, suppress the lines of "john", rewrite the files)
>>>>
>>>>
>>>> Would this be the most Hadoop way to do that?
>>>> I discarded crypto-shredding-like solutions because the HDFS data
>>>> has to be readable by multiple proprietary software packages and by users
>>>> at some point, and I'm not sure how to incorporate a deciphering step for
>>>> all those use cases.
>>>> Also, I came up with this table solution because a brute-force grep for
>>>> some key on the whole HDFS tree seemed unlikely to scale, but maybe I'm
>>>> mistaken?
>>>>
>>>> Thanks for your help,
>>>> Best regards
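
For reference, the (user, [files]) index and rewrite steps proposed in the original message could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the thread: files are JSON lines with a "user" field, the index lives in memory rather than in a real key-value store, and the local filesystem stands in for HDFS. The names UserFileIndex, write_records and erase_user are hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict


class UserFileIndex:
    """Hypothetical (user -> file paths) index; in-memory stand-in for a KV store."""

    def __init__(self):
        self._files = defaultdict(set)

    def record_write(self, user, path):
        # Called at write time: remember that `path` contains data for `user`.
        self._files[user].add(path)

    def files_for(self, user):
        return sorted(self._files.get(user, ()))

    def forget(self, user):
        self._files.pop(user, None)


def write_records(index, path, records):
    """Write JSON-lines records (assumed format) and feed the index."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
            index.record_write(record["user"], path)


def erase_user(index, user):
    """Rewrite every indexed file for `user` without that user's lines.

    Returns the number of lines removed.
    """
    removed = 0
    for path in index.files_for(user):
        directory = os.path.dirname(path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as out, \
                open(path, encoding="utf-8") as src:
            for line in src:  # stream line by line instead of loading the file
                if json.loads(line)["user"] == user:
                    removed += 1
                else:
                    out.write(line)
        # Swap the filtered copy into place (local filesystem only).
        os.replace(tmp_path, path)
    index.forget(user)
    return removed
```

The sketch streams each file rather than reading it fully into memory, which is a deliberate deviation from the steps as written; on a real cluster the rewrite would go through an HDFS client and the index would need to be durable, neither of which is shown here.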