A bloom filter index was recently added to ORC; it is much more accurate at row group elimination than the min/max based index.
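To make the difference concrete, here is a minimal Python sketch of the two pruning strategies. It is an illustration only, not ORC's reader code: `BloomFilter`, `build_index`, and `groups_to_read` are hypothetical names, the hashing scheme is an assumption, and the filter is deliberately oversized so false positives are negligible in a small demo. The 10,000-row stride is the default mentioned in the question below. Note that a bloom filter only helps equality lookups; range predicates still rely on min/max.

```python
import hashlib

ROW_GROUP_SIZE = 10_000  # default stride discussed in this thread


class BloomFilter:
    """Tiny illustrative bloom filter (hypothetical, not ORC's implementation)."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        # num_bits is a power of two and oversized on purpose for the demo.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Double hashing: derive all bit positions from two 64-bit hashes.
        digest = hashlib.sha256(repr(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def build_index(values, stride=ROW_GROUP_SIZE):
    """Collect min, max, and a bloom filter for every `stride` rows."""
    index = []
    for start in range(0, len(values), stride):
        group = values[start:start + stride]
        bloom = BloomFilter()
        for v in group:
            bloom.add(v)
        index.append({"min": min(group), "max": max(group), "bloom": bloom})
    return index


def groups_to_read(index, target, use_bloom):
    """Return the row groups that must be read for `column = target`."""
    selected = []
    for i, entry in enumerate(index):
        # Min/max check: skip only if target lies outside the group's range.
        if not (entry["min"] <= target <= entry["max"]):
            continue
        # Bloom check: skip if the filter says the value is definitely absent.
        if use_bloom and not entry["bloom"].might_contain(target):
            continue
        selected.append(i)
    return selected
```

On unsorted data, every row group tends to span almost the whole value range, so `groups_to_read(index, target, use_bloom=False)` usually selects nearly all of them. With `use_bloom=True`, groups that do not hold the value are pruned, up to the filter's false positive rate.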

Thanks
Prasanth

> On Jul 16, 2015, at 9:07 AM, Thomas Abeler <[email protected]> wrote:
>
> Hey,
>
> I have a question about how indexing in ORC works.
>
> The way I understood ORC indexing is that ORC keeps statistics (min, max,
> sum) about the rows every 10,000 rows (by default), and if I query the data
> it looks at the statistics to figure out whether it needs to read the row
> chunk or not.
>
> If that's true, is it possible to build an index on an ORC file that is more
> similar to a database index, meaning that I want to create another sorted
> data structure which holds the field value and a pointer to the record it
> relates to?
>
> The problem I have is that I have a huge dataset: >300 TB and 69 columns.
> There is no 'key' column that gets frequently queried, and I would like to
> perform ad hoc queries on nearly every one of these columns. I think building
> an index on every column would be a good approach to get this ability.
>
> Regards,
> Thomas
