On 07/23/2015 12:01 PM, David Rosenstrauch wrote:
Just wondering what's the difference between these 2 classes.  Is there
a guideline as to when we should use one vs. the other?

Thanks,

DR

Had a follow-up question along the same lines:

What's VectorizedOrcInputFormat?


Also, a couple of other things I'm mulling over as we get a bit deeper into our work with ORC:

* The docs state that "Seek to row number is implemented to support secondary indexes". (See: http://hive.apache.org/javadocs/r0.13.1/api/ql/org/apache/hadoop/hive/ql/io/orc/package-summary.html) A colleague and I are working on this exact use case (a secondary index), and we were under the impression that we had to create our own row numbering scheme to support it. Does ORC already write a row number on each record? If so, how is that accessed?

* We're thinking over how to structure our secondary index. Although we can envision an ORC-based structure that would provide the functionality we need, it'd be a bit clunky/complex/verbose to query using Hive. One option might be to implement a layer in front of ORC that hides some of the complexity of how the secondary index is physically structured and makes it possible to query it using simple HQL. I know that Hive allows developers to use a custom InputFormat to implement custom storage formats, so theoretically we could write a wrapper around OrcNewInputFormat and/or OrcSerDe to provide the functionality we're looking for. Any suggestions or pointers for someone looking to go this route? (I.e., specific code we might look at? Where we might want to insert our own code? Etc.)

Thanks!

DR
