Re: De-identification in Hive

2016-03-18 Thread Mich Talebzadeh
Hi Ajay,

Do you want to be able to unmask it (at any time) or just have it totally
scrambled (for example replace the column with random characters) in Hive?
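For reference, the two options map onto different built-ins; a minimal
hedged sketch, assuming a hypothetical staging table with an ssn string
column (not from this thread):

-- Irreversible: scramble every digit (cannot be unmasked later).
SELECT REGEXP_REPLACE(ssn, '[0-9]', '*') FROM staging;
-- Reversible: encrypt, so the value can be unmasked with the key;
-- aes_encrypt/aes_decrypt are Hive built-ins from 1.3.0 onwards.
SELECT BASE64(AES_ENCRYPT(ssn, '1234567890123456')) FROM staging;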

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 17 March 2016 at 15:14, Ajay Chander  wrote:

> Mich, thanks for looking into this. I have a 'csvfile.txt' on HDFS. I
> have created an external table 'xyz' to load that data into it. One of the
> columns, 'ssn', needs to be masked. Is there any built-in function that I
> could use?
>
>
> On Thursday, March 17, 2016, Mich Talebzadeh 
> wrote:
>
>> Are you loading your CSV file from an external table into a Hive table?
>>
>> Basically, you want to scramble that column before putting it into the
>> Hive table?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 March 2016 at 14:37, Ajay Chander  wrote:
>>
>>> Tustin, is there any way I can de-identify it in Hive?
>>>
>>>
>>> On Thursday, March 17, 2016, Marcin Tustin 
>>> wrote:
>>>
 This is a classic transform-load problem. You'll want to anonymise it
 once before making it available for analysis.

 On Thursday, March 17, 2016, Ajay Chander 
 wrote:

> Hi Everyone,
>
> I have a csv file which has some sensitive data in a particular column
> in it. Now I have to create a table in Hive and load the data into it. But
> when loading the data I have to make sure that the data is masked. Is there
> any built-in function which supports this, or do I have to write a UDF?
> Any suggestions are appreciated. Thanks



>>


Re: Does Hive need to access the HDFS of HBase?

2016-03-18 Thread songj songj
Is hbase-site.xml in Hive's classpath?
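If it is not, a minimal hedged sketch of the session-level equivalents of
the two properties discussed in this thread (values taken from the messages
below; adjust them to your cluster):

-- Hedged sketch: set the HBase connection properties per session when
-- hbase-site.xml is not on Hive's classpath.
SET hbase.zookeeper.quorum=10.24.19.88;
SET zookeeper.znode.parent=/hbase;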

2016-03-17 17:17 GMT+08:00 Divya Gehlot :

>  Do you have hbase-site.xml in the classpath?
>
>
> On 17 March 2016 at 17:08, songj songj  wrote:
>
>> <property>
>>    <name>zookeeper.znode.parent</name>
>>    <value>/hbase</value>
>> </property>
>>
>> and I found that if I bind 'hbase-cluster' to any IP that Hive can
>> access, everything is OK!
>>
>>
>>
>> 2016-03-17 16:46 GMT+08:00 Divya Gehlot :
>>
>>> Hi,
>>> Please check your zookeeper.znode.parent property.
>>> Where is it pointing?
>>>
>>> On 17 March 2016 at 15:21, songj songj  wrote:
>>>
 Hi all:
 I have two clusters: a Hive cluster (2.0.0) and an HBase cluster (1.1.1).
 These two clusters have independent HDFS:

 hive cluster:
 <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hive-cluster</value>
 </property>

 hbase cluster:
 <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hbase-cluster</value>
 </property>

 1) But when I use the Hive shell to access the HBase cluster:
 >set hbase.zookeeper.quorum=10.24.19.88;
 >CREATE EXTERNAL TABLE IF NOT EXISTS pagecounts_hbase (rowkey STRING,
 pageviews STRING, bytes STRING) STORED BY
 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES
 ('hbase.columns.mapping' = ':key,cf:c1,cf:c2')
 TBLPROPERTIES ('hbase.table.name' = 'test');

 2) Then I got these exceptions:

 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.DDLTask.
 MetaException(message:MetaException(message:java.io.IOException:
 java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
  ...

Does Hive need to access the HDFS of HBase?

2016-03-18 Thread songj songj
Hi all:
I have two clusters: a Hive cluster (2.0.0) and an HBase cluster (1.1.1).
These two clusters have independent HDFS:

hive cluster:
<property>
   <name>fs.defaultFS</name>
   <value>hdfs://hive-cluster</value>
</property>

hbase cluster:
<property>
   <name>fs.defaultFS</name>
   <value>hdfs://hbase-cluster</value>
</property>

1) But when I use the Hive shell to access the HBase cluster:
>set hbase.zookeeper.quorum=10.24.19.88;
>CREATE EXTERNAL TABLE IF NOT EXISTS pagecounts_hbase (rowkey STRING,
pageviews STRING, bytes STRING) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES
('hbase.columns.mapping' = ':key,cf:c1,cf:c2')
TBLPROPERTIES ('hbase.table.name' = 'test');

2) Then I got these exceptions:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:MetaException(message:java.io.IOException:
java.lang.reflect.InvocationTargetException
 at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
 at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:420)
 at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:413)
 at org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:291)
 at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:222)
 at org.apache.hadoop.hive.hbase.HBaseStorageHandler.getHBaseAdmin(HBaseStorageHandler.java:102)
 at org.apache.hadoop.hive.hbase.HBaseStorageHandler.preCreateTable(HBaseStorageHandler.java:182)
 at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:608)
 at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:601)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90)
 at com.sun.proxy.$Proxy15.createTable(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:671)
 at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3973)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:295)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
 at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1604)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1364)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1177)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1004)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:994)
 at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:201)
 at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:153)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:364)
 at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:712)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:631)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:570)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
 ... 36 more
Caused by: java.lang.ExceptionInInitializerError
 at org.apache.hadoop.hbase.ClusterId.parseFrom(ClusterId.java:64)
 at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:75)
 at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
 at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:879)
 at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:635)
 ... 41 more
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: hbase-cluster
 at

Re: The built-in indexes in ORC files do not work.

2016-03-18 Thread Gopal Vijayaraghavan

> I have tried bloom filters, but they make no improvement. I know about
> Tez, but have never used it; I will try it later.
...
>select count(*) from gprs where terminal_type=25080;
>   will not scan data
>  Time taken: 353.345 seconds

CombineInputFormat does not do any split-elimination, so MapReduce does
not get container speedups there.

Most of your ~300s looks to be the fixed overheads of setting up each task.

We could not fix this in MRv2 due to historical compatibility issues with
merge-joins & schema evolution (see HiveSplitGenerator.java).

This is not recommended for regular use (other than in Tez), but you can
force split-elimination with


set hive.input.format=${hive.tez.input.format};
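Spelled out, a minimal hedged sketch of that setting alongside the ORC
predicate-pushdown switch (hive.tez.input.format defaults to
org.apache.hadoop.hive.ql.io.HiveInputFormat in this era of Hive; verify
against your version):

-- Hedged sketch: per-file splits let ORC's row-group indexes eliminate
-- splits under MapReduce; the query is the one quoted above.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.optimize.index.filter=true;  -- enable ORC predicate pushdown
SELECT COUNT(*) FROM gprs WHERE terminal_type = 25080;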

 So, has anyone used ORC's built-in indexes before (especially in
 Spark SQL)? What's my issue?

We work on SparkSQL perf issues as well - this has to do with OrcRelation

https://github.com/apache/spark/pull/10938

+
https://github.com/apache/spark/pull/10842


Cheers,
Gopal




Re: De-identification in Hive

2016-03-18 Thread Damien Carol
For the record, see this ticket:
https://issues.apache.org/jira/browse/HIVE-13125
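Since the thread below settles on asterisk masking, a minimal hedged sketch
of that variant of the INSERT/SELECT pattern Mich describes; table and
column names here are hypothetical:

-- Copy from the external staging table while masking the sensitive
-- column with asterisks.
INSERT INTO TABLE customers
SELECT
  name
, REGEXP_REPLACE(ssn, '[0-9]', '*')  -- e.g. 123-45-6789 -> ***-**-****
, city
FROM stg_customers;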

2016-03-17 17:02 GMT+01:00 Ajay Chander :

> Thanks for your time, Mich! I will try this one out.
>
>
> On Thursday, March 17, 2016, Mich Talebzadeh 
> wrote:
>
>> Then probably the easiest option would be an INSERT/SELECT from the
>> external table to the target table, making that column NULL.
>>
>> Check the VAT column here, which I made NULL:
>>
>> DROP TABLE IF EXISTS stg_t2;
>> CREATE EXTERNAL TABLE stg_t2 (
>>  INVOICENUMBER string
>> ,PAYMENTDATE string
>> ,NET string
>> ,VAT string
>> ,TOTAL string
>> )
>> COMMENT 'from csv file from excel sheet '
>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>> STORED AS TEXTFILE
>> LOCATION '/data/stg/table2'
>> TBLPROPERTIES ("skip.header.line.count"="1")
>> ;
>> --3)
>> DROP TABLE IF EXISTS t2;
>> CREATE TABLE t2 (
>>  INVOICENUMBER  INT
>> ,PAYMENTDATE    timestamp
>> ,NET            DECIMAL(20,2)
>> ,VAT            DECIMAL(20,2)
>> ,TOTAL          DECIMAL(20,2)
>> )
>> COMMENT 'from csv file from excel sheet '
>> CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
>> STORED AS ORC
>> TBLPROPERTIES ( "orc.compress"="ZLIB",
>> "transactional"="true")
>> ;
>> --4) Put data in the target table, do the conversion, and ignore empty rows
>> INSERT INTO TABLE t2
>> SELECT
>>   INVOICENUMBER
>> , CAST(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy')*1000 as timestamp)
>> , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
>> , NULL
>> , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
>> FROM
>> stg_t2
>> WHERE
>> --INVOICENUMBER > 0 AND
>> CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0
>> -- Exclude empty rows
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 March 2016 at 15:32, Ajay Chander  wrote:
>>
>>> Mich, I am okay with replacing the column's data with some characters
>>> like asterisks. Thanks
>>>
>>>
>>> On Thursday, March 17, 2016, Mich Talebzadeh 
>>> wrote:
>>>
 Hi Ajay,

 Do you want to be able to unmask it (at any time) or just have it
 totally scrambled (for example replace the column with random characters)
 in Hive?

 Dr Mich Talebzadeh



 LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



 http://talebzadehmich.wordpress.com



 On 17 March 2016 at 15:14, Ajay Chander  wrote:

> Mich, thanks for looking into this. I have a 'csvfile.txt' on HDFS. I
> have created an external table 'xyz' to load that data into it. One of the
> columns, 'ssn', needs to be masked. Is there any built-in function that I
> could use?
>
>
> On Thursday, March 17, 2016, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Are you loading your CSV file from an external table into a Hive table?
>>
>> Basically, you want to scramble that column before putting it into the
>> Hive table?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 March 2016 at 14:37, Ajay Chander 
>> wrote:
>>
>>> Tustin, is there any way I can de-identify it in Hive?
>>>
>>>
>>> On Thursday, March 17, 2016, Marcin Tustin 
>>> wrote:
>>>
 This is a classic transform-load problem. You'll want to anonymise
 it once before making it available for analysis.

 On Thursday, March 17, 2016, Ajay Chander 
 wrote:

> Hi Everyone,
>
> I have a csv file which has some sensitive data in a particular
> column in it. Now I have to create a table in Hive and load the data into
> it. But when loading the data I have to make sure that the data is masked.
> Is there any built-in function which supports this, or do I have to
> write a UDF? Any suggestions are appreciated. Thanks

