On 07/23/2015 12:01 PM, David Rosenstrauch wrote:
> Just wondering what's the difference between these 2 classes. Is there
> a guideline as to when we should use one vs. the other?
>
> Thanks,
>
> DR
I have a follow-up question along the same lines:
What is VectorizedOrcInputFormat, and how does it relate to the two
classes above?
Also, a couple of other things I'm mulling over as we get a bit deeper
into our work with ORC:
* The docs state that "Seek to row number is implemented to support
secondary indexes". (See:
http://hive.apache.org/javadocs/r0.13.1/api/ql/org/apache/hadoop/hive/ql/io/orc/package-summary.html)
A colleague and I are working on exactly this use case (a secondary
index), and we were under the impression that we had to create our own
row numbering scheme to support it. Does ORC already write a row
number for each record? If so, how is it accessed?
* We're also thinking over how to structure our secondary index.
Although we can envision an ORC-based structure that would provide the
functionality we need, it would be clunky, complex, and verbose to
query using Hive. One option might be to implement a layer in front of
ORC that hides how the secondary index is physically structured and
makes it possible to query it using simple HQL. I know that Hive lets
developers use a custom InputFormat to implement custom storage
formats, so in theory we could write a wrapper around
OrcNewInputFormat and/or OrcSerDe to provide the functionality we're
looking for. Any suggestions or pointers for someone looking to go
this route? (E.g., specific code we might look at, or where we might
want to insert our own code?)
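
To make the row-number question concrete, this is roughly the lookup
we have in mind. It is only a minimal sketch in plain Java with no Hive
or ORC dependencies. All names in it (SecondaryIndexSketch, RowSource,
readRow) are hypothetical and ours, not ORC APIs. RowSource simply
stands in for whatever reader can return a record given its zero-based
row number, which is the part we're unsure ORC provides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SecondaryIndexSketch {
    /** Hypothetical stand-in for a reader that can fetch a record by row number. */
    public interface RowSource {
        String[] readRow(long rowNumber);
    }

    // Maps each value of the indexed column to the row numbers that contain it,
    // in ascending row order.
    private final Map<String, List<Long>> index = new TreeMap<>();
    private final RowSource source;

    /** Builds the index with one full pass over rows 0..rowCount-1. */
    public SecondaryIndexSketch(RowSource source, long rowCount, int indexedColumn) {
        this.source = source;
        for (long row = 0; row < rowCount; row++) {
            String key = source.readRow(row)[indexedColumn];
            index.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
    }

    /** Row numbers whose indexed column equals key (empty list if none). */
    public List<Long> rowNumbersFor(String key) {
        return index.getOrDefault(key, Collections.emptyList());
    }

    /** Fetches the matching records by going back to the source by row number. */
    public List<String[]> lookup(String key) {
        List<String[]> result = new ArrayList<>();
        for (long row : rowNumbersFor(key)) {
            result.add(source.readRow(row));
        }
        return result;
    }

    public static void main(String[] args) {
        // In-memory rows standing in for the records of a data file.
        String[][] data = {
            {"alice", "NY"}, {"bob", "CA"}, {"carol", "NY"}
        };
        SecondaryIndexSketch idx =
            new SecondaryIndexSketch(row -> data[(int) row], data.length, 1);
        for (String[] rec : idx.lookup("NY")) {
            System.out.println(rec[0]);
        }
    }
}
```

If ORC can already give us the row number of each record and seek back
to it, then something like RowSource would just delegate to the ORC
reader, and we wouldn't need to store our own row numbers in the data.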
Thanks!
DR