What is your data source? Can you add a row identifier or use some combination of columns as a unique key?
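
To make the idea concrete, here is a minimal sketch in plain Python (not tied to any particular processing engine): attach a key to every row, send only the key plus the PII columns out for masking, then merge the results back by key. The column names, `fake_deidentify`, and `mask_pii` are hypothetical placeholders; `fake_deidentify` only stands in for the real DLP request and is not the Google DLP API.

```python
import hashlib
import uuid

# Hypothetical names for the three PII columns.
PII_COLUMNS = ["name", "email", "phone"]


def add_row_id(row):
    """Return a copy of the row with a synthetic unique key attached."""
    return {**row, "row_id": uuid.uuid4().hex}


def composite_key(row, key_columns):
    """Alternative key: hash of columns assumed to be jointly unique."""
    joined = "\x1f".join(str(row[c]) for c in key_columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def fake_deidentify(value):
    """Placeholder for the DLP call; a real job would call the DLP service here."""
    return "***"


def mask_pii(rows):
    # 1. Key every row once, before splitting, so both sides share the same key.
    keyed = [add_row_id(r) for r in rows]

    # 2. Narrow projection: only the key and the PII columns leave the wide dataset.
    narrow = [
        {"row_id": r["row_id"], **{c: r[c] for c in PII_COLUMNS}} for r in keyed
    ]

    # 3. Mask the narrow rows and index the results by key.
    masked = {
        n["row_id"]: {c: fake_deidentify(n[c]) for c in PII_COLUMNS} for n in narrow
    }

    # 4. Join back by key, overwrite the PII columns, drop the helper key.
    result = []
    for r in keyed:
        merged = {**r, **masked[r["row_id"]]}
        del merged["row_id"]
        result.append(merged)
    return result
```

In a distributed job the same pattern applies, with one caveat: a randomly generated id should be written out or otherwise materialized before the data is split, so that a recomputation does not hand the two sides different keys. A composite key built from existing columns avoids that problem if such a combination really is unique.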

On Thu, Mar 19, 2020 at 7:20 AM Aniruddh Sharma <[email protected]> wrote:
> Hi
>
> Need some advice on how to implement the following use case.
>
> I read a dataset which is 1+ TB in size and has 1000+ columns.
>
> Only 3 of these 1000+ columns contain PII, and I need to call the
> Google DLP API on them.
>
> I want to select only these 3 columns and submit only them to the DLP
> API. Once I get the results back from DLP, I want to change these 3
> columns in my original dataset.
>
> I don't have a UUID for each row, so I will not be able to join the
> original data (1000+ columns) with the other data (3 columns).
>
> Any suggestions on how to implement it?
>
> Thanks
> Aniruddh
