[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-02-01 Thread Ilya Kats (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126719#comment-15126719
 ] 

Ilya Kats commented on HIVE-6147:
---------------------------------

> How do you ensure the first billion rows are still correctly readable?

Avro supports schema evolution, which allows data to be written with one schema 
and read with another, as described, for example, here: 
https://docs.oracle.com/cd/E26161_02/html/GettingStartedGuide/schemaevolution.html
 (section "How Schema Evolution Works"). Of course, the feasibility of this 
approach depends on the application and on how drastically the schema can 
change, but many applications can commit to backward-compatible schema changes, 
which makes it appropriate to use the latest reader schema for all data items, 
including the old ones.  
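To illustrate the idea, here is a simplified sketch (not Avro's actual resolution algorithm, which also handles type promotion, unions, and aliases): under backward-compatible evolution, a record written with an old schema is projected onto a newer reader schema by filling the added fields from their declared defaults.

```python
# Simplified sketch of Avro-style record schema resolution: a field that
# exists only in the reader schema must carry a default; a field that exists
# only in the writer schema is ignored. Illustrative only.

def resolve_record(datum, writer_fields, reader_fields):
    """Project a record written with writer_fields onto reader_fields."""
    resolved = {}
    for name, spec in reader_fields.items():
        if name in writer_fields:
            resolved[name] = datum[name]      # field present in the old data
        elif "default" in spec:
            resolved[name] = spec["default"]  # new field: fall back to default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved

# The old (writer) schema had only 'id' and 'name'; the new (reader) schema
# adds 'email' with a default, so rows written long ago remain readable.
writer = {"id": {"type": "long"}, "name": {"type": "string"}}
reader = {"id": {"type": "long"}, "name": {"type": "string"},
          "email": {"type": "string", "default": ""}}

old_row = {"id": 1, "name": "alice"}
print(resolve_record(old_row, writer, reader))
# {'id': 1, 'name': 'alice', 'email': ''}
```

This is why "billions of old rows" stay readable as long as every schema change is additive with defaults (or otherwise resolvable under Avro's rules).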

> Support avro data stored in HBase columns
> -----------------------------------------
>
> Key: HIVE-6147
> URL: https://issues.apache.org/jira/browse/HIVE-6147
> Project: Hive
>  Issue Type: Improvement
>  Components: HBase Handler
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Swarnim Kulkarni
>Assignee: Swarnim Kulkarni
>  Labels: TODOC14
> Fix For: 0.14.0
>
> Attachments: HIVE-6147.1.patch.txt, HIVE-6147.2.patch.txt, 
> HIVE-6147.3.patch.txt, HIVE-6147.3.patch.txt, HIVE-6147.4.patch.txt, 
> HIVE-6147.5.patch.txt, HIVE-6147.6.patch.txt
>
>
> Presently, the HBase Hive integration supports querying only primitive data 
> types in columns. It would be nice to be able to store and query Avro objects 
> in HBase columns by making them visible as structs to Hive. This will allow 
> Hive to perform ad hoc analysis of HBase data which can be deeply structured.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-01-30 Thread Ilya Kats (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125129#comment-15125129
 ] 

Ilya Kats commented on HIVE-6147:
---------------------------------

Thank you for the quick reply, Swarnim. I'll try to implement a custom 
AvroSchemaRetriever. However, I'm not quite clear on why the writer schema is 
necessarily loaded even when a reader schema is provided. It is pretty common 
to use schema-less Avro objects in HBase (if there are billions of rows with 
objects of the same type, it is not reasonable to store the same schema in all 
of them), and it is not convenient to write a custom schema retriever for each 
such case. So, I wonder whether it would be better to assume that the writer 
schema equals the reader schema when the former can be found neither in the 
data nor via a custom retriever? 
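The fallback being proposed could be sketched roughly as follows (a hypothetical illustration of the decision order, not Hive's actual code; the function names here are made up):

```python
# Hypothetical sketch of the proposed writer-schema lookup order for an
# Avro value stored in an HBase cell: (1) a schema embedded in the data,
# (2) a custom retriever hook (in Hive, an AvroSchemaRetriever subclass),
# (3) the proposed fallback: assume writer schema == reader schema.

def resolve_writer_schema(embedded_schema, retriever, reader_schema):
    if embedded_schema is not None:
        return embedded_schema   # schema was stored alongside the data
    if retriever is not None:
        return retriever()       # custom retriever supplies the writer schema
    return reader_schema         # schema-less data: assume writer == reader

# With no embedded schema and no retriever, the reader schema is used as-is,
# so schema-less rows deserialize without a per-table custom retriever.
reader = {"type": "record", "name": "Customer", "fields": []}
print(resolve_writer_schema(None, None, reader) is reader)  # True
```

The trade-off is that step (3) silently misreads data whenever the writer schema actually differed from the current reader schema, which is presumably why Hive insists on locating a writer schema today.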



[jira] [Commented] (HIVE-6147) Support avro data stored in HBase columns

2016-01-29 Thread Ilya Kats (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123690#comment-15123690
 ] 

Ilya Kats commented on HIVE-6147:
---------------------------------

I'm trying to create a table in Hive 0.14 that points to an HBase table with 
one column family ("c") and one column ("b") that contains a schema-less, 
Avro-serialized object:
{code:sql}
CREATE EXTERNAL TABLE customers
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,c:b", 
  "c.b.serialization.type"="avro", 
  "c.b.avro.schema.url"="hdfs:/../Customer.avsc") 
TBLPROPERTIES ("hbase.table.name" = "customers", 
"hbase.struct.autogenerate"="true", 
"hive.serialization.extend.nesting.levels"="true");
{code}

The DDL above creates the table successfully, but queries fail with the 
following error:
{code}
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
16/01/29 15:36:55 [main]: ERROR CliDriver: Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:152)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1621)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:267)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating c_b
	at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:82)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:571)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:563)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
	... 12 more
Caused by: org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorException: An error occurred retrieving schema from bytes
	at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.retrieveSchemaFromBytes(AvroLazyObjectInspector.java:331)
	at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.deserializeStruct(AvroLazyObjectInspector.java:287)
	at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.getStructFieldData(AvroLazyObjectInspector.java:142)
	at org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:109)
	at org.apache.hadoop.hive.serde2.objectinspector.DelegatedStructObjectInspector.getStructFieldData(DelegatedStructObjectInspector.java:88)
	at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator._evaluate(ExprNodeColumnEvaluator.java:94)
	at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
	at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)
	... 17 more
Caused by: java.io.IOException: Not a data file.
	at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
	at org.apache.avro.file.DataFileStream.<init>(DataFileStream.java:84)
	at org.apache.hadoop.hive.serde2.avro.AvroLazyObjectInspector.retrieveSchemaFromBytes(AvroLazyObjectInspector.java:328)
	... 25 more
{code}
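The "Not a data file." message comes from Avro's DataFileStream, which expects the Avro object container file format: a container file starts with the 4-byte magic `Obj\x01` (per the Avro specification) and embeds the writer schema in its header, whereas a schema-less serialized datum has no such header. A quick sketch of the distinction:

```python
# An Avro object container file begins with the magic bytes b'Obj\x01' and
# carries the writer schema in its header. Raw, schema-less datum bytes have
# no header at all, which is why DataFileStream fails with "Not a data file."
# when handed the bare cell contents from HBase.

AVRO_CONTAINER_MAGIC = b"Obj\x01"

def is_avro_container(data: bytes) -> bool:
    return data[:4] == AVRO_CONTAINER_MAGIC

print(is_avro_container(b"Obj\x01...header..."))  # True: container file
print(is_avro_container(b"\x02\x0calice"))        # False: bare serialized datum
```

So the failure is consistent with the cell holding a bare datum while the code path unconditionally tries to read a writer schema out of the bytes.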

It seems that there is a problem in the following code in 
AvroLazyObjectInspector:
{code}
...
private Object deserializeStruct(Object struct, String fieldName) {
...
  if (readerSchema == null) {
  ...
  } else {
    // a reader schema was provided
    if (schemaRetriever != null) {
      // a schema retriever has been provided as well. Attempt to read the write schema from the
      // retriever
      ws =