Re: Right to be forgotten and HDFS

Ivan Panico Tue, 16 Apr 2019 01:10:55 -0700

Indeed, Hudi seems promising, but is it suitable for a production environment
given that it is still incubating?


On Mon, Apr 15, 2019 at 22:46, Ivan Panico <iv.pan...@gmail.com> wrote:

> All right, but does that mean migrating everything to HBase / Kudu? Does that
> also mean that GDPR is killing HDFS? Is that what you are suggesting?
>
> On Mon, Apr 15, 2019 at 22:43, Wei-Chiu Chuang <weic...@cloudera.com> wrote:
>
>> Wow, Chao, didn't realize you guys are making Hudi into Apache :)
>> HDFS is generally not a good fit for this use case. I've seen people
>> using Kudu for GDPR compliance.
>>
>> On Mon, Apr 15, 2019 at 11:11 AM Chao Sun <sunc...@apache.org> wrote:
>>
>>> Check out Hudi (https://github.com/apache/incubator-hudi), which adds
>>> upsert functionality on top of columnar data such as Parquet.
>>>
>>> Chao
>>>
>>> On Mon, Apr 15, 2019 at 10:49 AM Vinod Kumar Vavilapalli <
>>> vino...@apache.org> wrote:
>>>
>>>> If one uses HDFS as raw file storage where a single file intermingles
>>>> data from all users, it's not easy to achieve what you are trying to do.
>>>>
>>>> Instead, using systems (e.g. HBase, Hive) that support updates and
>>>> deletes to individual records is the only way to go.
>>>>
>>>> +Vinod
>>>>
>>>> On Apr 15, 2019, at 1:32 AM, Ivan Panico <iv.pan...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> The recent GDPR introduced a new right for individuals: the right to be
>>>> forgotten. This right means that if a customer asks an organization to
>>>> delete all of their data, the organization has to comply most of the time
>>>> (there are conditions that can suspend this right, but that's beside my
>>>> point).
>>>>
>>>> Now, HDFS being WORM (Write Once, Read Many times), I guess you see
>>>> where I'm going. What would be the best way to implement this line-deletion
>>>> feature (supposing that when a customer asks for all of their data to be
>>>> deleted, the organization has to delete some lines in some HDFS files)?
>>>>
>>>> Right now I'm going for the following:
>>>>
>>>>    - Create a key-value store (user, [files]).
>>>>    - On file writing, feed this store with the users and file location
>>>>    (by appending or updating a key).
>>>>    - When deletion is requested by the user "john", look in that
>>>>    store and rewrite all the files under the "john" key (read each file
>>>>    into memory, remove the lines of "john", rewrite the file).
>>>>
>>>>
>>>> Would this be the most Hadoop-like way to do that?
>>>> I discarded crypto-shredding-like solutions because the HDFS data
>>>> has to be readable by multiple proprietary software packages and by users
>>>> at some point, and I'm not sure how to incorporate a decryption step for
>>>> all those use cases.
>>>> Also, I came up with this table solution because a brute-force grep for
>>>> some key across the whole HDFS tree seemed unlikely to scale, but maybe
>>>> I'm mistaken?
>>>>
>>>> Thanks for your help,
>>>> Best regards
>>>>
>>>>
>>>>
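
For reference, the (user, [files]) index and rewrite-on-delete approach described in the original message could be sketched roughly as follows. This is only an illustration under stated assumptions, not an HDFS implementation: it uses the local filesystem as a stand-in for HDFS, assumes each record is a text line of the form "user,payload", and keeps the index in memory. The names UserFileIndex, write_records and erase_user are hypothetical. Unlike the description above, it streams each file line by line rather than loading it into memory.

```python
import os
import tempfile
from collections import defaultdict


class UserFileIndex:
    """Hypothetical in-memory index: user -> set of files holding that user's lines."""

    def __init__(self):
        self._files_by_user = defaultdict(set)

    def record_write(self, path, users):
        # Called at write time: remember that `path` contains data for these users.
        for user in users:
            self._files_by_user[user].add(path)

    def files_for(self, user):
        return set(self._files_by_user.get(user, ()))

    def forget(self, user):
        self._files_by_user.pop(user, None)


def write_records(index, path, records):
    """Write (user, payload) pairs as 'user,payload' lines and update the index.

    `records` is assumed to be a list, since it is iterated twice.
    """
    with open(path, "w", encoding="utf-8") as f:
        for user, payload in records:
            f.write(f"{user},{payload}\n")
    index.record_write(path, {user for user, _ in records})


def erase_user(index, user):
    """Rewrite every file listed for `user`, dropping that user's lines."""
    for path in index.files_for(user):
        directory = os.path.dirname(path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as dst, \
                open(path, encoding="utf-8") as src:
            for line in src:  # stream instead of reading the whole file
                if line.split(",", 1)[0] != user:
                    dst.write(line)
        # Local stand-in for "write a new file, then rename it over the old one".
        os.replace(tmp_path, path)
    index.forget(user)
```

Only the files listed under the user's key are touched, which is the point of maintaining the index instead of scanning the whole tree. A real deployment would also need the index itself to be durable and kept consistent with the files, which this sketch does not address.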
