Hey,


I have a question about how indexing in ORC works.



The way I understand ORC indexing is that ORC keeps statistics (min, max,
sum) for every group of 10,000 rows (by default), and when I query the data
it looks at those statistics to decide whether it needs to read that row
group or can skip it.
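
To make my understanding concrete, here is a minimal Python sketch of the
idea. It is not ORC's actual implementation: the names RowGroupStats,
build_stats and groups_to_read are made up for illustration, and it works on
an in-memory list of integers instead of a real file.

```python
from dataclasses import dataclass
from typing import List

ROWS_PER_GROUP = 10_000  # the default group size mentioned above


@dataclass
class RowGroupStats:
    """Hypothetical per-group statistics for one integer column."""
    start_row: int
    minimum: int
    maximum: int
    total: int


def build_stats(values: List[int], stride: int = ROWS_PER_GROUP) -> List[RowGroupStats]:
    """Compute min, max and sum for each consecutive group of `stride` rows."""
    stats = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        stats.append(RowGroupStats(start, min(chunk), max(chunk), sum(chunk)))
    return stats


def groups_to_read(stats: List[RowGroupStats], target: int) -> List[int]:
    """Return the start rows of groups whose [min, max] range could contain target."""
    return [s.start_row for s in stats if s.minimum <= target <= s.maximum]


values = list(range(30_000))           # sorted data, so the ranges do not overlap
stats = build_stats(values)
print(groups_to_read(stats, 15_000))   # only the middle group has to be read
```

On unsorted data the min/max ranges of the groups overlap heavily, so far
fewer groups can be skipped.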



If that's true, is it possible to build an index on an ORC file that is
more similar to a database index? By that I mean a separate sorted data
structure that holds the field value and a pointer to the record it
relates to.
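
As a rough sketch of the kind of structure I have in mind (plain Python, not
an existing ORC feature; build_index and lookup are names I made up, and the
"pointer" here is simply a row number):

```python
import bisect
from typing import List, Tuple


def build_index(values: List[str]) -> List[Tuple[str, int]]:
    """Build a sorted list of (field value, row number) pairs for one column."""
    return sorted((value, row) for row, value in enumerate(values))


def lookup(index: List[Tuple[str, int]], target: str) -> List[int]:
    """Binary-search the sorted index and return the rows holding target."""
    # (target,) sorts before every (target, row) pair, so this finds the first match.
    pos = bisect.bisect_left(index, (target,))
    rows = []
    while pos < len(index) and index[pos][0] == target:
        rows.append(index[pos][1])
        pos += 1
    return rows


column = ["b", "a", "c", "a"]
index = build_index(column)
print(lookup(index, "a"))  # rows 1 and 3
```

Each lookup is a binary search over the sorted pairs instead of a scan over
all rows, at the cost of storing one extra sorted structure per indexed
column.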



The problem I have is that I have a huge dataset: more than 300 TB and 69
columns. There is no 'key' column that gets queried frequently, and I would
like to run ad-hoc queries on nearly every one of these columns. I think
building an index on every column would be a good way to get this ability.



Regards,

Thomas
