Re: De-identification_in Hive

Damien Carol Fri, 18 Mar 2016 20:07:47 -0700

For the record, see this ticket:
https://issues.apache.org/jira/browse/HIVE-13125


2016-03-17 17:02 GMT+01:00 Ajay Chander <[email protected]>:

> Thanks for your time Mich! I will try this one out.
>
>
> On Thursday, March 17, 2016, Mich Talebzadeh <[email protected]>
> wrote:
>
>> Then probably the easiest option would be in INSERT/SELECT from external
>> table to target table and make that column NULL
>>
>> Check the VAT column here that I made it NULL
>>
>> DROP TABLE IF EXISTS stg_t2;
>> CREATE EXTERNAL TABLE stg_t2 (
>>  INVOICENUMBER string
>> ,PAYMENTDATE string
>> ,NET string
>> ,VAT string
>> ,TOTAL string
>> )
>> COMMENT 'from csv file from excel sheet xxxx'
>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>> STORED AS TEXTFILE
>> LOCATION '/data/stg/table2'
>> TBLPROPERTIES ("skip.header.line.count"="1")
>> ;
>> --3)
>> DROP TABLE IF EXISTS t2;
>> CREATE TABLE t2 (
>>  INVOICENUMBER          INT
>> ,PAYMENTDATE            timestamp
>> ,NET                    DECIMAL(20,2)
>> ,VAT                    DECIMAL(20,2)
>> ,TOTAL                  DECIMAL(20,2)
>> )
>> COMMENT 'from csv file from excel sheet xxxx'
>> CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
>> STORED AS ORC
>> TBLPROPERTIES ( "orc.compress"="ZLIB",
>> "transactional"="true")
>> ;
>> --4) Put data in target table. do the conversion and ignore empty rows
>> INSERT INTO TABLE t2
>> SELECT
>>           INVOICENUMBER
>>         , CAST(UNIX_TIMESTAMP(paymentdate,'DD/MM/YYYY')*1000 as timestamp)
>>         , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
>>         , NULL
>>         , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
>> FROM
>> stg_t2
>> WHERE
>> --        INVOICENUMBER > 0 AND
>>         CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0
>> -- Exclude empty rows
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 March 2016 at 15:32, Ajay Chander <[email protected]> wrote:
>>
>>> Mich, I am okay with replacing the columns data with some characters
>>> like asterisk. Thanks
>>>
>>>
>>> On Thursday, March 17, 2016, Mich Talebzadeh <[email protected]>
>>> wrote:
>>>
>>>> Hi Ajay,
>>>>
>>>> Do you want to be able to unmask it (at any time) or just have it
>>>> totally scrambled (for example replace the column with random characters)
>>>> in Hive?
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 17 March 2016 at 15:14, Ajay Chander <[email protected]> wrote:
>>>>
>>>>> Mich thbaks for looking into this. I have a 'csvfile.txt ' on hdfs. I
>>>>> have created an external table 'xyz' to load that data into it. One of the
>>>>> columns data 'ssn' needs to be masked. Is there any built in function is
>>>>> give that I could use?
>>>>>
>>>>>
>>>>> On Thursday, March 17, 2016, Mich Talebzadeh <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Are you loading your CSV file from an External table into Hive table.?
>>>>>>
>>>>>> Basically you want to scramble that column before putting into Hive
>>>>>> table?
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * 
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17 March 2016 at 14:37, Ajay Chander <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Tustin, Is there anyway I can deidentify it in hive ?
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, March 17, 2016, Marcin Tustin <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This is a classic transform-load problem. You'll want to anonymise
>>>>>>>> it once before making it available for analysis.
>>>>>>>>
>>>>>>>> On Thursday, March 17, 2016, Ajay Chander <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Everyone,
>>>>>>>>>
>>>>>>>>> I have a csv.file which has some sensitive data in a particular
>>>>>>>>> column in it.  Now I have to create a table in hive and load the data 
>>>>>>>>> into
>>>>>>>>> it. But when loading the data I have to make sure that the data is 
>>>>>>>>> masked.
>>>>>>>>> Is there any built in function is used ch supports this or do I have 
>>>>>>>>> to
>>>>>>>>> write UDF ? Any suggestions are appreciated. Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>> Want to work at Handy? Check out our culture deck and open roles
>>>>>>>> <http://www.handy.com/careers>
>>>>>>>> Latest news <http://www.handy.com/press> at Handy
>>>>>>>> Handy just raised $50m
>>>>>>>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>>>>>>>>  led
>>>>>>>> by Fidelity
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

Re: De-identification_in Hive

Reply via email to