Hey,
I have a question about how indexing works in ORC.

As I understand it, ORC keeps statistics (min, max, sum) for every 10,000 rows (by default), and when I query the data it checks those statistics to decide whether it needs to read that chunk of rows or not.

If that's true, is it possible to build an index on an ORC file that is more like a database index? By that I mean a separate sorted data structure that holds the field value and a pointer to the record it relates to.

The problem is that I have a huge dataset: more than 300 TB and 69 columns. There is no 'key' column that gets queried frequently, and I would like to run ad-hoc queries on nearly every one of these columns. I think building an index on every column would be a good way to get this ability.

Regards,
Thomas
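
P.S. To make the comparison concrete, here is a rough Python sketch of the two approaches as I understand them. It does not use any ORC library. The function names, the toy values, and the chunk size of 4 (standing in for the 10,000-row default) are all made up for illustration, and the "pointer" is simply a row number.

```python
from bisect import bisect_left, bisect_right

ROWS_PER_GROUP = 4  # toy stand-in for the 10,000-row default


def build_group_stats(values, rows_per_group=ROWS_PER_GROUP):
    """Return (first_row, min, max) for each chunk of rows."""
    stats = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        stats.append((start, min(chunk), max(chunk)))
    return stats


def groups_to_read(stats, target):
    """Return the first row of every chunk whose [min, max] range could hold target."""
    return [start for start, lo, hi in stats if lo <= target <= hi]


def build_sorted_index(values):
    """Return parallel lists: sorted field values and the row each one came from."""
    pairs = sorted((value, row) for row, value in enumerate(values))
    keys = [value for value, _ in pairs]
    rows = [row for _, row in pairs]
    return keys, rows


def lookup(keys, rows, target):
    """Binary-search the sorted values and return the matching row numbers."""
    lo = bisect_left(keys, target)
    hi = bisect_right(keys, target)
    return rows[lo:hi]


# One unsorted column with 12 rows, split into 3 chunks of 4 rows.
values = [7, 42, 3, 19, 42, 8, 1, 30, 5, 6, 2, 9]

stats = build_group_stats(values)
keys, rows = build_sorted_index(values)

print(groups_to_read(stats, 8))  # chunks that cannot be skipped for value 8
print(lookup(keys, rows, 8))     # exact rows holding value 8
```

With this unsorted toy column, the min/max check for the value 8 cannot rule out any of the three chunks, while the sorted structure points straight at the single matching row. That is the behaviour I am hoping to get for each of my columns.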
