[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2020-08-19 Thread Jira



[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180514#comment-17180514 ]

彭佳 commented on HIVE-6147:
--

I have run into the same problem. How can it be solved?

 

> Support avro data stored in HBase columns
> -
>
> Key: HIVE-6147
> URL: https://issues.apache.org/jira/browse/HIVE-6147
> Project: Hive
>  Issue Type: Improvement
>  Components: HBase Handler
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Swarnim Kulkarni
>Assignee: Swarnim Kulkarni
>Priority: Major
> Fix For: 0.14.0
>
> Attachments: HIVE-6147.1.patch.txt, HIVE-6147.2.patch.txt, 
> HIVE-6147.3.patch.txt, HIVE-6147.3.patch.txt, HIVE-6147.4.patch.txt, 
> HIVE-6147.5.patch.txt, HIVE-6147.6.patch.txt
>
>
> Presently, the HBase Hive integration supports querying only primitive data 
> types in columns. It would be nice to be able to store and query Avro objects 
> in HBase columns by making them visible as structs to Hive. This will allow 
> Hive to perform ad hoc analysis of HBase data which can be deeply structured.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2020-06-18 Thread yeshwanth (Jira)



[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139492#comment-17139492 ]

yeshwanth commented on HIVE-6147:
-

We have schema-less Avro bytes written to HBase cells, with a schema ID prefixed 
to the Avro bytes, similar to the Kafka Avro serializer in the Confluent Schema 
Registry. How can I customize HBaseSerDe to read and query this data from Hive? 
I found the "hbase.struct.serialization.class" property but could not identify 
which class or method to implement. Has anyone had the same use case and solved 
it already?


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2018-08-27 Thread Swarnim Kulkarni (JIRA)



[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594095#comment-16594095 ]

Swarnim Kulkarni commented on HIVE-6147:


What error message do you see in the logs?


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2018-08-22 Thread Vamsi Subhash Achanta (JIRA)



[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588536#comment-16588536 ]

Vamsi Subhash Achanta commented on HIVE-6147:
-

[~swarnim]

Hi,

Is there any work currently going on to make the serialization work at the 
column level? If I have several columns with a different Avro schema for each 
column, the current code does not work. Example:

{{CREATE EXTERNAL TABLE txn_store.transactions_single3_1}}
{{ROW FORMAT SERDE "org.apache.hadoop.hive.hbase.HBaseSerDe"}}
{{STORED BY "org.apache.hadoop.hive.hbase.HBaseStorageHandler"}}
{{WITH SERDEPROPERTIES (}}
{{"hbase.columns.mapping" = ":key,nsp:scr_p_1,nsp:scr_m_1,nsp:scr_a_1",}}
{{"nsp.scr_p_1.serialization.type" = "avro",}}
{{"avro.schema.retriever" = "com.phonepe.hive.schema.PaymentNamespaceAvroSchemaRetriever",}}
{{"nsp.scr_p_1.avro.schema.url" = "hdfs://namenode:8020/user/admin/schemas/payment_namespace.avsc",}}
{{"nsp.scr_m_1.serialization.type" = "avro",}}
{{"avro.schema.retriever" = "com.phonepe.hive.schema.MerchantNamespaceAvroSchemaRetriever",}}
{{"nsp.scr_m_1.avro.schema.url" = "hdfs://namenode.nm1:8020/user/admin/schemas/merchant_namespace.avsc",}}
{{"nsp.scr_a_1.serialization.type" = "avro",}}
{{"avro.schema.retriever" = "com.phonepe.hive.schema.AccountingNamespaceAvroSchemaRetriever",}}
{{"nsp.scr_a_1.avro.schema.url" = "hdfs://namenode:8020/user/admin/schemas/accounting_namespace.avsc"}}
{{)}}
{{TBLPROPERTIES (}}
{{"hbase.table.name"="txn_store:transactions_single3_cf",}}
{{"hbase.table.default.storage.type"="binary",}}
{{"hbase.mapred.output.outputtable"="txn_store:transactions_single3_cf",}}
{{"hbase.struct.autogenerate"="true");}}

When executing a select query, it fails with the exception below:
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating nsp_scra1
 


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-02-20 Thread Lefty Leverenz (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155834#comment-15155834 ]

Lefty Leverenz commented on HIVE-6147:
--

[~swarnim], could you please review the docs in the wiki?

If they're okay, you can remove the TODOC14 label from this issue.  Thanks.


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-02-01 Thread Swarnim Kulkarni (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126803#comment-15126803 ]

Swarnim Kulkarni commented on HIVE-6147:


{quote}
Avro supports schema evolution that allows data to be written with one schema 
and read with another
{quote}

Yup, definitely agree. However, the point I was trying to make is that you would 
still need to provide the exact same schema that was used when writing the 
data. Let's take an example. Say you used schema S1 to write a billion 
rows to HBase. The schema then evolved to S2 (hopefully in a compatible way) and 
you wrote another billion rows with it. The schema evolved again to S3 and 
you wrote another billion rows. To be able to read all this data, this is 
what you would need to do.

1st billion rows:

Writer Schema: S1
Reader Schema: S3

2nd billion rows:

Writer Schema: S2
Reader Schema: S3

3rd billion rows:

Writer Schema: S3
Reader Schema: S3

So as you can see, you are still providing the *exact same version* of the 
schema that was used to write the data in order to read it back successfully. 
Without it, it would be extremely hard for Avro to make head or tail of 
the data. You "might" still get lucky and be able to deserialize the first 
billion rows using S3 as both reader and writer schema, but there are absolutely 
no guarantees whatsoever. That is why you would still need a way to track which 
schema was used to write and persist the data when you read it back, and the 
current design of the Hive/HBase Avro support closely follows that pattern.
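
To make the "head or tail" point concrete, here is a minimal sketch. It is not 
Hive or Avro library code: it hand-rolls the two Avro binary primitives involved 
(zigzag varint integers and length-prefixed UTF-8 strings). The record layout S1 
(string name, int age), the swapped layout, and all class and method names are 
hypothetical. Because the encoding carries no field names or tags, the same bytes 
read with a different field order silently produce a different value.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustration only: a tiny hand-written subset of Avro's binary encoding. */
final class WriterSchemaMatters {

    /** Avro int/long: zigzag-encode, then emit base-128 groups, low group first. */
    static void writeLong(ByteArrayOutputStream out, long value) {
        long z = (value << 1) ^ (value >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    static long readLong(ByteBuffer in) {
        long z = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    /** Avro string: byte length as a long, followed by the UTF-8 bytes. */
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[(int) readLong(in)];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Hypothetical writer schema S1: record { string name; int age }. */
    static byte[] encodeS1(String name, int age) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        writeLong(out, age);
        return out.toByteArray();
    }

    /** Decodes with the layout the data was actually written with. */
    static String decodeAsS1(byte[] cell) {
        ByteBuffer in = ByteBuffer.wrap(cell);
        String name = readString(in);
        long age = readLong(in);
        return name + ":" + age;
    }

    /** Pretends the writer layout was { int age; string name } and reads the first field. */
    static long firstFieldAsInt(byte[] cell) {
        return readLong(ByteBuffer.wrap(cell));
    }

    public static void main(String[] args) {
        byte[] cell = encodeS1("Al", 30);
        System.out.println(decodeAsS1(cell));
        System.out.println(firstFieldAsInt(cell));
    }
}
```

With the cell produced by encodeS1("Al", 30), decodeAsS1 recovers the name and 
age, while firstFieldAsInt reads the string's length prefix and reports it as the 
integer field.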


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-02-01 Thread Shannon Ladymon (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127394#comment-15127394 ]

Shannon Ladymon commented on HIVE-6147:
---

Thanks for writing the documentation, [~swarnim]. I have included what you 
wrote with some minor edits in the Hive wiki:
* [HBase Integration - Avro Data Stored in HBase Columns | 
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-AvroDataStoredinHBaseColumns]
* [AvroSerDe - HBase Integration | 
https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-HBaseIntegration]

Please let me know if there is anything you would like changed or clarified 
(for example, I added quotation marks around the 
"test_col_fam.test_col.avro.schema.url" property for consistency, but if the 
lack of quotation marks originally was intentional, this can be changed).


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-02-01 Thread Ilya Kats (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126719#comment-15126719 ]

Ilya Kats commented on HIVE-6147:
-

> How do you ensure the first billion rows are still correctly readable?

Avro supports schema evolution that allows data to be written with one schema 
and read with another, as described, for example, here: 
https://docs.oracle.com/cd/E26161_02/html/GettingStartedGuide/schemaevolution.html
(section "How Schema Evolution Works"). Of course, the feasibility of this 
feature depends on the application and on how drastically the schema can change, 
but many applications can commit to backward-compatible schema changes, which 
makes it appropriate to use the latest reader schema for all data items, 
including the old ones.


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-01-31 Thread Swarnim Kulkarni (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125676#comment-15125676 ]

Swarnim Kulkarni commented on HIVE-6147:


{quote}
It is pretty common to use schema-less avro objects in HBase.
{quote}

I am not sure that is true (or even possible). As far as I understand, you 
almost always have to provide the exact schema that was used to persist the 
data when attempting to deserialize it, and the best way to do that is to 
store the schema itself alongside the data. Plus, schema evolution is 
going to be a mess. Imagine writing a billion rows to HBase with one schema, 
which then evolves, and then writing another billion rows with the new schema. 
How do you ensure the first billion rows are still correctly readable?

{quote}
(if there are billions of rows with objects of the same type, it is not 
reasonable to store the same schema in all of them) and it is not convenient to 
write a custom schema retriever for each such case.
{quote}

Correct. I agree it is inefficient to store it in every single cell, although 
IMO that isn't a good excuse for not writing the schema at all. A better design 
in this case is to use some kind of schema registry: use a custom serializer, 
write the schema to the registry, generate an ID of some kind, and persist 
the ID along with the data. Then, when you read the data, use the ID to 
pull the schema from the registry and read the data. That is also where a custom 
implementation of AvroSchemaRetriever makes sense: your implementation would 
know how to read your schema from the schema registry and hand it to Hive, 
letting Hive handle the deserialization from there on.
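
The registry flow described above can be sketched roughly as follows. This is an 
in-memory stand-in under stated assumptions, not a real registry client or Hive 
code: the class name, the 4-byte big-endian ID prefix, and the use of plain 
strings for schemas are all hypothetical choices made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: an in-memory schema registry plus ID-prefixed cell framing. */
final class SchemaRegistrySketch {

    /** Size of the assumed ID prefix; the serialized payload starts at this offset. */
    static final int PAYLOAD_OFFSET = 4;

    private final Map<String, Integer> idsBySchema = new HashMap<>();
    private final List<String> schemasById = new ArrayList<>();

    /** Registers a schema once; registering the same text again returns the same ID. */
    int register(String schemaJson) {
        Integer existing = idsBySchema.get(schemaJson);
        if (existing != null) {
            return existing;
        }
        int id = schemasById.size();
        schemasById.add(schemaJson);
        idsBySchema.put(schemaJson, id);
        return id;
    }

    /** Returns the schema text stored under the given ID. */
    String lookup(int id) {
        return schemasById.get(id);
    }

    /** Write path: store only the schema ID in front of the serialized payload. */
    byte[] frame(String schemaJson, byte[] payload) {
        return ByteBuffer.allocate(PAYLOAD_OFFSET + payload.length)
                .putInt(register(schemaJson))
                .put(payload)
                .array();
    }

    /** Read path: recover the writer schema from the ID prefix of a stored cell. */
    String writerSchemaOf(byte[] cell) {
        return lookup(ByteBuffer.wrap(cell).getInt());
    }
}
```

Every cell then costs four extra bytes instead of a full schema copy, and a custom 
AvroSchemaRetriever would be the natural place to perform the lookup on the read 
path.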


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-01-30 Thread Ilya Kats (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125129#comment-15125129 ]

Ilya Kats commented on HIVE-6147:
-

Thank you for the quick reply, Swarnim. I'll try to implement a custom 
AvroSchemaRetriever. However, I'm not quite clear on why the writer schema is 
necessarily loaded if the reader schema is provided. It is pretty common to use 
schema-less avro objects in HBase (if there are billions of rows with objects 
of the same type, it is not reasonable to store the same schema in all of them) 
and it is not convenient to write a custom schema retriever for each such 
case. So I wonder whether it would be better to assume that the writer schema 
equals the reader schema if the former can be found neither in the data nor via 
a custom retriever?


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-01-30 Thread Swarnim Kulkarni (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125117#comment-15125117 ]

Swarnim Kulkarni commented on HIVE-6147:


{noformat}
 it tries to retrieve the write schema from data (ws = 
retrieveSchemaFromBytes(data)) even if the schema URL (reader schema) had been 
provided
{noformat}

Correct. That is the default behavior. The writer schema defaults to the reader 
schema if one has not been provided. If it has been (as in your case), it would 
use the reader schema from the given URL but still default to reading the 
writer schema from the data. If you want to provide the writer schema as 
well, I would recommend taking a look at AvroSchemaRetriever[1]. You 
can provide a custom implementation of it and supply both the reader and writer 
schemas from any custom source you like. A test implementation can be 
found here for reference[2], and the corresponding test that uses this 
implementation is here[3]. Once done, simply plug it in with the 
"avro.schema.retriever" property. One caveat is that this currently applies 
to the whole table and not to individual columns, so it assumes that 
there is a uniform schema across the table.
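
One way to read the behavior described above is as a small decision procedure. 
The sketch below is a simplified model of that description, not the Hive source: 
schemas are plain strings, and the class, the Resolved holder, and the two 
function parameters (stand-ins for a custom AvroSchemaRetriever and for reading 
an embedded schema out of the cell bytes) are hypothetical.

```java
import java.util.function.Function;

/** Illustration only: which writer and reader schema end up being used. */
final class SchemaResolutionSketch {

    /** The schema pair handed to the decoder. */
    static final class Resolved {
        final String writerSchema;
        final String readerSchema;

        Resolved(String writerSchema, String readerSchema) {
            this.writerSchema = writerSchema;
            this.readerSchema = readerSchema;
        }
    }

    /**
     * @param configuredReader reader schema from a property such as avro.schema.url, or null
     * @param retriever        stand-in for a custom schema retriever, or null if none is set
     * @param fromBytes        stand-in for extracting a schema embedded in the cell bytes
     * @param cell             the raw HBase cell value
     */
    static Resolved resolve(String configuredReader,
                            Function<byte[], String> retriever,
                            Function<byte[], String> fromBytes,
                            byte[] cell) {
        // Writer schema: ask the retriever if there is one, otherwise expect it in the data.
        String writer = (retriever != null) ? retriever.apply(cell) : fromBytes.apply(cell);
        // Reader schema: the configured one if present, otherwise the same as the writer.
        String reader = (configuredReader != null) ? configuredReader : writer;
        return new Resolved(writer, reader);
    }
}
```

In this model, a table with a schema URL but no retriever still calls fromBytes, 
which is the step that fails when the cell holds schema-less Avro bytes.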

Hope this helps. Let me know if there are any additional questions.

[1] 
https://github.com/apache/hive/blob/release-1.2.1/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSchemaRetriever.java
[2] 
https://github.com/apache/hive/blob/release-1.2.1/hbase-handler/src/test/org/apache/hadoop/hive/hbase/HBaseTestAvroSchemaRetriever.java
[3] 
https://github.com/apache/hive/blob/release-1.2.1/hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestHBaseSerDe.java#L1293-L1344


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-01-29 Thread Ilya Kats (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123690#comment-15123690 ]

Ilya Kats commented on HIVE-6147:
-

I'm trying to create a table in Hive 0.14 that points to an HBase table with 
one column family ("c") and one column ("b") that contains a schema-less 
Avro-serialized object:
{code:sql}
CREATE EXTERNAL TABLE customers
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,c:b", 
  "c.b.serialization.type"="avro", 
  "c.b.avro.schema.url"="hdfs:/../Customer.avsc") 
TBLPROPERTIES ("hbase.table.name" = "customers", 
"hbase.struct.autogenerate"="true", 
"hive.serialization.extend.nesting.levels"="true");
{code}

The DDL above creates the table successfully, but queries fail with the 
following error:
{code}
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
16/01/29 15:36:55 [main]: ERROR CliDriver: Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:152)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1621)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:267)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:82)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:571)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:563)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
    ... 12 more
Caused by: org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorException: An error occurred retrieving schema from bytes
    at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.retrieveSchemaFromBytes(AvroLazyObjectInspector.java:331)
    at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.deserializeStruct(AvroLazyObjectInspector.java:287)
    at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.getStructFieldData(AvroLazyObjectInspector.java:142)
    at org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:109)
    at org.apache.hadoop.hive.serde2.objectinspector.DelegatedStructObjectInspector.getStructFieldData(DelegatedStructObjectInspector.java:88)
    at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator._evaluate(ExprNodeColumnEvaluator.java:94)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)
    ... 17 more
Caused by: java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
    at org.apache.avro.file.DataFileStream.<init>(DataFileStream.java:84)
    at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.retrieveSchemaFromBytes(AvroLazyObjectInspector.java:328)
    ... 25 more
{code}

It seems that there is a problem in the following code in 
AvroLazyObjectInspector:
{code}
...
private Object deserializeStruct(Object struct, String fieldName) {
  ...
  if (readerSchema == null) {
    ...
  } else {
    // a reader schema was provided
    if (schemaRetriever != null) {
      // a schema retriever has been provided as well. Attempt to read the write schema from the
      // retriever
      ws =

[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2015-07-23 Thread Swarnim Kulkarni (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638965#comment-14638965 ]

Swarnim Kulkarni commented on HIVE-6147:


[~brocknoland] Apologies for getting back to you so late. Somehow I completely 
missed the notification on this.

Using this support is pretty straightforward. An example query looks like this:

{noformat}
CREATE EXTERNAL TABLE test_hbase_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,test_col_fam:test_col",
  "test_col_fam.test_col.serialization.type" = "avro",
  "test_col_fam.test_col.avro.schema.url" = "hdfs://testcluster/tmp/schema.avsc")
TBLPROPERTIES (
  "hbase.table.name" = "hbase_avro_table",
  "hbase.struct.autogenerate" = "true");
{noformat}

So basically this looks exactly like a query that we would use to query an 
HBase table. The only difference is:

{noformat}
test_col_fam.test_col.serialization.type=avro
{noformat}

Using this property, we are telling Hive that the given column under the given 
column family is an Avro column, so it needs to be deserialized accordingly.

{noformat}
test_col_fam.test_col.avro.schema.url=hdfs://testcluster/tmp/schema.avsc
{noformat}

This property specifies the location of the reader schema that will be used to 
deserialize the column. The schema can live on HDFS as shown here, or be 
provided inline using the test_col_fam.test_col.avro.schema.literal property. 
If you keep the schema in a custom store, you can write a custom implementation 
of AvroSchemaRetriever[1] and plug it in using the avro.schema.retriever 
property, for example test_col_fam.test_col.avro.schema.retriever. 
Of course, you would need to ensure that the jar containing this custom class 
is on the Hive classpath.

{noformat}
hbase.struct.autogenerate=true
{noformat}

Avro schemas can be complicated and deeply nested, so manually creating the 
columns and types for them is at times not feasible. Specifying this 
property lets Hive automatically deduce the columns and types from the schema 
that was provided.
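
The property names above follow a pattern: the column-family and column from the 
hbase.columns.mapping entry, joined with a dot, prefix each per-column setting. A 
small sketch of that naming rule, assuming a single Avro column and using a 
hypothetical helper class that is not part of Hive:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: builds the SERDEPROPERTIES entries for one Avro column. */
final class AvroColumnProperties {

    static Map<String, String> forColumn(String family, String qualifier, String schemaUrl) {
        // Per-column properties use "<family>.<qualifier>." as their prefix,
        // while the mapping itself uses "<family>:<qualifier>".
        String prefix = family + "." + qualifier + ".";
        Map<String, String> props = new LinkedHashMap<>();
        props.put("hbase.columns.mapping", ":key," + family + ":" + qualifier);
        props.put(prefix + "serialization.type", "avro");
        props.put(prefix + "avro.schema.url", schemaUrl);
        return props;
    }

    public static void main(String[] args) {
        forColumn("test_col_fam", "test_col", "hdfs://testcluster/tmp/schema.avsc")
                .forEach((key, value) -> System.out.println("\"" + key + "\" = \"" + value + "\""));
    }
}
```

The hbase.struct.autogenerate setting is left out on purpose, since the example 
places it under TBLPROPERTIES rather than SERDEPROPERTIES.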

Please do let me know if there are any more questions that I can help out with.

[1] 
https://github.com/cloudera/hive/blob/cdh5.3.2-release/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSchemaRetriever.java


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2015-07-08 Thread Brock Noland (JIRA)


[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618770#comment-14618770 ]

Brock Noland commented on HIVE-6147:


bq. documentation on the usage of this that we can add to the Hive wiki 

[~swarnim], someone just asked me how to use this. Did you happen to have those 
docs, even if in a rough draft form?

