[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180514#comment-17180514 ]

彭佳 commented on HIVE-6147:

I also encountered such problems, how to solve them?

> Support avro data stored in HBase columns
> -----------------------------------------
>
>                 Key: HIVE-6147
>                 URL: https://issues.apache.org/jira/browse/HIVE-6147
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Swarnim Kulkarni
>            Assignee: Swarnim Kulkarni
>            Priority: Major
>             Fix For: 0.14.0
>
>         Attachments: HIVE-6147.1.patch.txt, HIVE-6147.2.patch.txt,
> HIVE-6147.3.patch.txt, HIVE-6147.3.patch.txt, HIVE-6147.4.patch.txt,
> HIVE-6147.5.patch.txt, HIVE-6147.6.patch.txt
>
>
> Presently, the HBase Hive integration supports querying only primitive data
> types in columns. It would be nice to be able to store and query Avro objects
> in HBase columns by making them visible as structs to Hive. This will allow
> Hive to perform ad hoc analysis of HBase data which can be deeply structured.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139492#comment-17139492 ]

yeshwanth commented on HIVE-6147:

We have schema-less Avro bytes written to HBase cells, with a schema ID prefixed to the Avro bytes, similar to the Kafka Avro serializer used with the Confluent Schema Registry. How can I customize HBaseSerDe to read and query this data from Hive? I found the "hbase.struct.serialization.class" property but could not identify which class or method to implement. Has anyone had the same use case and solved it already?
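The comment above does not give the exact byte layout of the prefixed cells. As a minimal sketch, assuming the layout matches the Confluent wire format (one marker byte 0, a 4-byte big-endian schema ID, then the raw Avro bytes), the prefix could be split off as follows. The class and method names are hypothetical and are not part of Hive or HBase.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper, not a Hive class. Assumes Confluent-style framing:
// [marker byte 0][4-byte big-endian schema ID][raw Avro binary payload].
final class SchemaIdPrefix {
    static final byte MAGIC = 0x0;
    static final int HEADER_LEN = 1 + 4;

    /** Builds a framed cell value: marker, schema ID, then the Avro bytes. */
    static byte[] frame(int schemaId, byte[] avroBytes) {
        return ByteBuffer.allocate(HEADER_LEN + avroBytes.length)
                .put(MAGIC)
                .putInt(schemaId)   // ByteBuffer is big-endian by default
                .put(avroBytes)
                .array();
    }

    /** Reads the schema ID from bytes 1..4 of a framed cell value. */
    static int schemaId(byte[] cell) {
        checkHeader(cell);
        return ByteBuffer.wrap(cell, 1, 4).getInt();
    }

    /** Returns the Avro payload that follows the 5-byte header. */
    static byte[] avroPayload(byte[] cell) {
        checkHeader(cell);
        return Arrays.copyOfRange(cell, HEADER_LEN, cell.length);
    }

    private static void checkHeader(byte[] cell) {
        if (cell == null || cell.length < HEADER_LEN || cell[0] != MAGIC) {
            throw new IllegalArgumentException("not a schema-ID-prefixed value");
        }
    }
}
```

The sketch covers only the byte layout. Which Hive extension point should consume the ID and the payload is the open question in the comment.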
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594095#comment-16594095 ]

Swarnim Kulkarni commented on HIVE-6147:

What error message do you see in the logs?
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588536#comment-16588536 ]

Vamsi Subhash Achanta commented on HIVE-6147:

[~swarnim] Hi, is there any work currently going on to make the serialization work at a column level? If I have 2 columns with a different Avro schema for each column, the current code does not work. Example:

{{CREATE EXTERNAL TABLE txn_store.transactions_single3_1}}
{{ROW FORMAT SERDE "org.apache.hadoop.hive.hbase.HBaseSerDe"}}
{{STORED BY "org.apache.hadoop.hive.hbase.HBaseStorageHandler"}}
{{WITH SERDEPROPERTIES (}}
{{"hbase.columns.mapping" = ":key,nsp:scr_p_1,nsp:scr_m_1,nsp:scr_a_1",}}
{{"nsp.scr_p_1.serialization.type" = "avro",}}
{{"avro.schema.retriever" = "com.phonepe.hive.schema.PaymentNamespaceAvroSchemaRetriever",}}
{{"nsp.scr_p_1.avro.schema.url" = "hdfs://namenode:8020/user/admin/schemas/payment_namespace.avsc",}}
{{"nsp.scr_m_1.serialization.type" = "avro",}}
{{"avro.schema.retriever" = "com.phonepe.hive.schema.MerchantNamespaceAvroSchemaRetriever",}}
{{"nsp.scr_m_1.avro.schema.url" = "hdfs://namenode.nm1:8020/user/admin/schemas/merchant_namespace.avsc",}}
{{"nsp.scr_a_1.serialization.type" = "avro",}}
{{"avro.schema.retriever" = "com.phonepe.hive.schema.AccountingNamespaceAvroSchemaRetriever",}}
{{"nsp.scr_a_1.avro.schema.url" = "hdfs://namenode:8020/user/admin/schemas/accounting_namespace.avsc"}}
{{)}}
{{TBLPROPERTIES (}}
{{"hbase.table.name"="txn_store:transactions_single3_cf",}}
{{"hbase.table.default.storage.type"="binary",}}
{{"hbase.mapred.output.outputtable"="txn_store:transactions_single3_cf",}}
{{"hbase.struct.autogenerate"="true");}}

When executing a select query, it fails with the exception below:

Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating nsp_scra1
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155834#comment-15155834 ]

Lefty Leverenz commented on HIVE-6147:

[~swarnim], could you please review the docs in the wiki? If they're okay, you can remove the TODOC14 label from this issue. Thanks.
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126803#comment-15126803 ]

Swarnim Kulkarni commented on HIVE-6147:

{quote}
Avro supports schema evolution that allows data to be written with one schema and read with another
{quote}

Yup, definitely agree. However, the point I was trying to make is that you would still need to provide the exact schema that was used when writing the data. Let's take an example. Say you used schema S1 to write a billion rows to HBase. The schema then evolved to S2 (hopefully in a compatible way) and you wrote another billion rows with it. The schema evolved again to S3 and you wrote another billion rows. To be able to read all this data, this is what you would need to do:

1st billion rows: Writer Schema: S1, Reader Schema: S3
2nd billion rows: Writer Schema: S2, Reader Schema: S3
3rd billion rows: Writer Schema: S3, Reader Schema: S3

So as you see, you are still providing the *exact same version* of the schema that was used to write the data in order to read it back successfully. Without it, it would be extremely hard for Avro to make head or tail of the data. You "might" still get lucky and be able to deserialize the 1st billion rows using S3 as both reader and writer schema, but there are absolutely no guarantees whatsoever. This is why you still need a way to track which schema was used to persist the data when you read it back, and the current design of the Hive/HBase Avro support closely follows that pattern.
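The point about needing the exact writer schema follows from how Avro lays out bytes: a record is written as its field values in order, with no field names or tags. The sketch below does not use the Avro library. It hand-rolls the two encodings it needs as defined in the Avro specification (zig-zag varint for long, length-prefixed UTF-8 for string) and shows what happens when bytes written with one schema are decoded as if written with another. The two schemas (S1 with a single string field, S2 with a long followed by a string) are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustration only: minimal Avro-style binary encoding, not the Avro library.
final class WriterSchemaMismatch {
    // Avro long: zig-zag encode, then base-128 varint, least significant group first.
    static void writeLong(ByteArrayOutputStream out, long v) {
        long z = (v << 1) ^ (v >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    static long readLong(ByteBuffer in) {
        long z = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    // Avro string: byte length as a long, followed by the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[(int) readLong(in)];
        in.get(utf8); // throws BufferUnderflowException if too few bytes remain
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Hypothetical S1: record { string name }
    static byte[] writeWithS1(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        return out.toByteArray();
    }

    static String readAsS1(byte[] data) {
        return readString(ByteBuffer.wrap(data));
    }

    // Hypothetical S2: record { long id; string name }, wrongly assumed as writer schema.
    static String readAsS2(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        long id = readLong(in);
        return id + ":" + readString(in);
    }
}
```

With these definitions, readAsS1(writeWithS1("bob")) round-trips, while readAsS2 on the same bytes consumes the string length as the id and then runs off the end of the buffer. In less fortunate cases the wrong schema would decode without an error and silently return wrong values.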
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127394#comment-15127394 ]

Shannon Ladymon commented on HIVE-6147:

Thanks for writing the documentation, [~swarnim]. I have included what you wrote with some minor edits in the Hive wiki:

* [HBase Integration - Avro Data Stored in HBase Columns | https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-AvroDataStoredinHBaseColumns]
* [AvroSerDe - HBase Integration | https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-HBaseIntegration]

Please let me know if there is anything you would like changed or clarified (for example, I added quotation marks around the "test_col_fam.test_col.avro.schema.url" property for consistency, but if the lack of quotation marks originally was intentional, this can be changed).
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126719#comment-15126719 ]

Ilya Kats commented on HIVE-6147:

> How do you ensure the first billion rows are still correctly readable?

Avro supports schema evolution that allows data to be written with one schema and read with another, as described, for example, here: https://docs.oracle.com/cd/E26161_02/html/GettingStartedGuide/schemaevolution.html (section How Schema Evolution Works). Of course, the feasibility of this feature depends on the application and on how drastically the schema can change, but many applications can commit to backward-compatible schema changes, which makes it appropriate to use the latest reader schema for all data items, including the old ones.
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15125676#comment-15125676 ]

Swarnim Kulkarni commented on HIVE-6147:

{quote}
It is pretty common to use schema-less avro objects in HBase.
{quote}

I am not sure that is true (or possible at all). As far as my understanding goes, you will almost always have to provide the exact schema that was used while persisting the data when attempting to deserialize it, and the best way to do that is to store the schema alongside the data. Plus, schema evolution is going to be a mess. Imagine writing a billion rows to HBase with one schema, which then evolves, and then writing another billion rows with the new schema. How do you ensure the first billion rows are still correctly readable?

{quote}
(if there are billions of rows with objects of the same type, it is not reasonable to store the same schema in all of them) and it is not convenient to write a custom schema retriever for each such case.
{quote}

Correct. I agree it is inefficient to store it for every single cell, although IMO that isn't a good excuse to not write the schema at all. A better design in this case is to use some kind of schema registry: use a custom serializer, write the schema to the schema registry, generate an ID of some kind, and persist the ID along with the data. Then, when you are reading the data, use the ID to pull the schema from the store and read the data. That is also where a custom implementation of AvroSchemaRetriever makes sense: your custom implementation would know how to read your schema from the schema registry and hand it to Hive, and Hive handles the deserialization from there on.
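The registry design described in this comment can be sketched as a small in-memory store: the writer registers a schema once and persists only the returned ID with each cell, and the reader resolves the ID back to the schema. This is a sketch under stated assumptions, not a real registry client. The class name, the use of the schema JSON text as the lookup key, and the sequential ID scheme are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a schema registry.
final class InMemorySchemaRegistry {
    private final Map<String, Integer> idsBySchema = new HashMap<>();
    private final Map<Integer, String> schemasById = new HashMap<>();

    /** Returns the ID for a schema, assigning a new one on first sight. */
    synchronized int register(String schemaJson) {
        Integer existing = idsBySchema.get(schemaJson);
        if (existing != null) {
            return existing;            // same schema text, same ID
        }
        int id = schemasById.size() + 1; // simple sequential IDs
        idsBySchema.put(schemaJson, id);
        schemasById.put(id, schemaJson);
        return id;
    }

    /** Resolves an ID stored with the data back to the writer schema. */
    synchronized String lookup(int id) {
        String schemaJson = schemasById.get(id);
        if (schemaJson == null) {
            throw new IllegalArgumentException("unknown schema id " + id);
        }
        return schemaJson;
    }
}
```

A custom AvroSchemaRetriever, as suggested in the comment, would sit on the read side of such a store and return the looked-up schema to Hive.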
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15125129#comment-15125129 ]

Ilya Kats commented on HIVE-6147:

Thank you for the quick reply, Swarnim. I'll try to implement a custom AvroSchemaRetriever. However, I'm not quite clear on why the writer schema is necessarily loaded if the reader schema is provided. It is pretty common to use schema-less Avro objects in HBase (if there are billions of rows with objects of the same type, it is not reasonable to store the same schema in all of them), and it is not convenient to write a custom schema retriever for each such case. So I wonder whether it would be better to assume that the writer schema is equal to the reader schema if the former can be found neither in the data nor via a custom retriever?
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15125117#comment-15125117 ]

Swarnim Kulkarni commented on HIVE-6147:

{noformat}
it tries to retrieve the write schema from data (ws = retrieveSchemaFromBytes(data)) even if the schema URL (reader schema) had been provided
{noformat}

Correct. That is the default behavior. The writer schema defaults to the reader schema if one has not been provided. If it has been (as in your case), it uses the reader schema from the given URL but still defaults to the writer schema from the data. If you want to provide the writer schema as well, I would recommend you take a look at AvroSchemaRetriever[1]. You can write a custom implementation of it and provide both the reader and the writer schema from any custom source you like. A test implementation can be found here for reference[2], and the corresponding test that uses this implementation is here[3]. Once done, simply plug it in with the "avro.schema.retriever" property. One caveat is that this currently applies to the whole table and not to individual columns, so it assumes there is a uniform schema across the table.

Hope this helps. Let me know if there are any additional questions.
[1] https://github.com/apache/hive/blob/release-1.2.1/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSchemaRetriever.java
[2] https://github.com/apache/hive/blob/release-1.2.1/hbase-handler/src/test/org/apache/hadoop/hive/hbase/HBaseTestAvroSchemaRetriever.java
[3] https://github.com/apache/hive/blob/release-1.2.1/hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestHBaseSerDe.java#L1293-L1344
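The behavior described in this reply, for the case where a reader schema has been supplied, can be summarized in a small sketch: the writer schema comes from the retriever if one is configured, and otherwise from the cell bytes themselves. The sketch is not the Hive implementation. The names are hypothetical, and the embedded-schema check assumes the cell must start with the Avro object container file magic ('O', 'b', 'j', 1), which is consistent with the "Not a data file" error reported elsewhere in this thread.

```java
// Hypothetical model of the writer-schema lookup, not Hive code.
final class WriterSchemaLookup {
    enum Source { RETRIEVER, EMBEDDED_IN_DATA }

    // Avro object container files begin with these four bytes.
    private static final byte[] CONTAINER_MAGIC = {'O', 'b', 'j', 1};

    /** Where the writer schema is taken from once a reader schema is given. */
    static Source writerSchemaSource(boolean retrieverConfigured) {
        return retrieverConfigured ? Source.RETRIEVER : Source.EMBEDDED_IN_DATA;
    }

    /** True if the cell starts with the Avro container file magic. */
    static boolean hasEmbeddedSchema(byte[] cell) {
        if (cell == null || cell.length < CONTAINER_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < CONTAINER_MAGIC.length; i++) {
            if (cell[i] != CONTAINER_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Schema-less bytes can only be read if a retriever supplies the writer schema. */
    static boolean canResolveWriterSchema(boolean retrieverConfigured, byte[] cell) {
        return writerSchemaSource(retrieverConfigured) == Source.RETRIEVER
                || hasEmbeddedSchema(cell);
    }
}
```

Under these assumptions, a schema-less cell combined with only a schema URL falls into the EMBEDDED_IN_DATA branch and cannot be resolved, which is the situation the custom retriever is meant to address.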
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123690#comment-15123690 ]

Ilya Kats commented on HIVE-6147:

I'm trying to create a table in Hive 0.14 that points to an HBase table with one column family ("c") and one column ("b") that contains schema-less avro serialized object:

{code:sql}
CREATE EXTERNAL TABLE customers
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,c:b",
  "c.b.serialization.type"="avro",
  "c.b.avro.schema.url"="hdfs:/../Customer.avsc")
TBLPROPERTIES (
  "hbase.table.name" = "customers",
  "hbase.struct.autogenerate"="true",
  "hive.serialization.extend.nesting.levels"="true");
{code}

The DDL above creates the table successfully, but queries fail with the following error:

{code}
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
16/01/29 15:36:55 [main]: ERROR CliDriver: Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:152)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1621)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:267)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:82)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:571)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:563)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
    ... 12 more
Caused by: org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorException: An error occurred retrieving schema from bytes
    at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.retrieveSchemaFromBytes(AvroLazyObjectInspector.java:331)
    at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.deserializeStruct(AvroLazyObjectInspector.java:287)
    at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.getStructFieldData(AvroLazyObjectInspector.java:142)
    at org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:109)
    at org.apache.hadoop.hive.serde2.objectinspector.DelegatedStructObjectInspector.getStructFieldData(DelegatedStructObjectInspector.java:88)
    at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator._evaluate(ExprNodeColumnEvaluator.java:94)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)
    ... 17 more
Caused by: java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
    at org.apache.avro.file.DataFileStream.<init>(DataFileStream.java:84)
    at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.retrieveSchemaFromBytes(AvroLazyObjectInspector.java:328)
    ... 25 more
{code}

It seems that there is a problem in the following code in AvroLazyObjectInspector:

{code}
...
private Object deserializeStruct(Object struct, String fieldName) {
  ...
  if (readerSchema == null) {
    ...
  } else {
    // a reader schema was provided
    if (schemaRetriever != null) {
      // a schema retriever has been provided as well. Attempt to read the write schema from the
      // retriever
      ws =
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638965#comment-14638965 ]

Swarnim Kulkarni commented on HIVE-6147:

[~brocknoland] Apologies for getting back to you so late. Somehow I completely missed the notification on this.

Using this support is pretty straightforward. An example query looks like this:

{noformat}
CREATE EXTERNAL TABLE test_hbase_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,test_col_fam:test_col",
  "test_col_fam.test_col.serialization.type" = "avro",
  "test_col_fam.test_col.avro.schema.url" = "hdfs://testcluster/tmp/schema.avsc")
TBLPROPERTIES (
  "hbase.table.name" = "hbase_avro_table",
  "hbase.struct.autogenerate" = "true");
{noformat}

So basically this looks exactly like a query that we would use to query an HBase table. The only differences are:

{noformat}
"test_col_fam.test_col.serialization.type" = "avro"
{noformat}

Using this property, we tell Hive that the given column under the given column family is an Avro column, so it needs to be deserialized accordingly.

{noformat}
"test_col_fam.test_col.avro.schema.url" = "hdfs://testcluster/tmp/schema.avsc"
{noformat}

Using this property, you specify the location of the reader schema that will be used to deserialize the column. The schema can be on HDFS as shown here, or provided inline using something like the test_col_fam.test_col.avro.schema.literal property. If you keep the schema in a custom store, you can write a custom implementation of AvroSchemaRetriever[1] and plug it in via the avro.schema.retriever property, using a property like test_col_fam.test_col.avro.schema.retriever. Of course, you would need to ensure that the jar containing this custom class is on the Hive classpath.

{noformat}
"hbase.struct.autogenerate" = "true"
{noformat}

Avro schemas can be complicated and deeply nested, so at times manually creating the columns and types for them is not feasible. Specifying this property lets Hive automatically deduce the columns and types from the schema that was provided.

Please do let me know if there are any more questions that I can help out with.

[1] https://github.com/cloudera/hive/blob/cdh5.3.2-release/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSchemaRetriever.java
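The per-column properties in this comment follow one naming pattern: the column family and qualifier joined by a dot, followed by the property suffix. As a minimal sketch of that pattern, the hypothetical helper below (not part of Hive) builds the two Avro properties for one column and renders them as a SERDEPROPERTIES fragment. Only the property names that appear in the comment are used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assembles the per-column Avro properties.
final class AvroColumnProps {
    /** Builds "<family>.<qualifier>.*" properties for one Avro column. */
    static Map<String, String> forColumn(String family, String qualifier, String schemaUrl) {
        String prefix = family + "." + qualifier + ".";
        Map<String, String> props = new LinkedHashMap<>(); // keeps insertion order
        props.put(prefix + "serialization.type", "avro");
        props.put(prefix + "avro.schema.url", schemaUrl);
        return props;
    }

    /** Renders the properties as quoted key/value pairs, one per line. */
    static String toSerdeClause(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append(",\n");
            }
            sb.append('"').append(e.getKey()).append("\" = \"")
              .append(e.getValue()).append('"');
        }
        return sb.toString();
    }
}
```

For the column in the example, forColumn("test_col_fam", "test_col", "hdfs://testcluster/tmp/schema.avsc") yields the two test_col_fam.test_col.* entries shown in the comment.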
[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns
[ https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618770#comment-14618770 ]

Brock Noland commented on HIVE-6147:

bq. documentation on the usage of this that we can add to the Hive wiki

[~swarnim], someone just asked me how to use this. Did you happen to have those docs, even if in a rough draft form?