I'm not aware of anyone having written a tuning guide for ORC. If someone
has one, it would be great to add it to the ORC website.
Some of the top-level points:
* Stripe size is a tradeoff:
+ larger is better for throughput and compression
+ smaller is better for parallelism and memory consumption
* Sorting on the columns you filter by is a big win for predicate pushdown
with either equality or comparison operators, because it keeps the min/max
range recorded for each stride narrow.
* Stride size is the granularity of the index:
+ larger consumes less space
+ smaller provides faster seeks and finer-grained predicate pushdown
* Bloom filters are good for columns that are filtered with equality
predicates
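
To make the sorting and stride points concrete, here is a toy Python sketch.
It does not use the ORC reader or file format at all. It only assumes that
each stride keeps a min and max for a column and that a reader can skip any
stride whose range cannot match the predicate. The function name
strides_to_read and the data sizes are made up for illustration.

```python
import random

def strides_to_read(values, stride, lo, hi):
    """Count strides whose [min, max] range overlaps the predicate range [lo, hi].

    Toy model only: a stride is a run of `stride` consecutive rows, and a
    stride can be skipped when its min/max range cannot contain a match.
    """
    hits = 0
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        if min(chunk) <= hi and max(chunk) >= lo:
            hits += 1
    return hits

random.seed(0)
rows = list(range(100_000))   # column written in sorted order
shuffled = rows[:]
random.shuffle(shuffled)      # same values, written in random order

# Predicate: value == 42_000 (lo == hi for an equality test)
for stride in (1_000, 10_000):
    total = len(rows) // stride
    print(
        f"stride={stride}: total={total}, "
        f"sorted reads {strides_to_read(rows, stride, 42_000, 42_000)}, "
        f"shuffled reads {strides_to_read(shuffled, stride, 42_000, 42_000)}"
    )
```

With sorted data only one stride overlaps the predicate at either stride
size, so a smaller stride means fewer rows actually read. With shuffled data
nearly every stride's min/max range covers the value, so the index skips
almost nothing; that is the case where a bloom filter on the column can
still help for equality predicates.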
.. Owen
On Mon, Oct 3, 2016 at 6:04 AM, Rohit <[email protected]> wrote:
> Is there a design and tuning guide for ORC that may cover things like
> choosing and implications / impact of:
> - partitioning column
> - sorting column(s)
> - stripe size
> - stride size
> - bloom filters
> - anything else ...
>
> Rohit
>