> I am curious about how is map type serialized in ORC files? -- One simple 
> guess would be (conceptually) storing two arrays for key and value, 
> respectively :).

Close enough: it's stored as three streams.

https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L2280

            lengths.write(childLength);
            childrenWriters[0].writeBatch(vec.keys, childOffset, childLength);
            childrenWriters[1].writeBatch(vec.values, childOffset, childLength);

The first stream stores the number of entries in each map, and the other two 
store the keys and values, each in whatever encoding its type uses, in 
sequence (to be read back out and zipped into key/value tuples).
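
To make that layout concrete, here is a rough sketch of the idea in plain 
Java. It is not the ORC writer: MapStreamsSketch, Streams, write and read are 
names I made up for illustration, and I've assumed String keys and long values 
just to keep it short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical in-memory model of the three streams.
final class MapStreamsSketch {

    // Hypothetical holder for the three conceptual streams; not an ORC class.
    static final class Streams {
        final List<Integer> lengths = new ArrayList<>(); // entries per map
        final List<String> keys = new ArrayList<>();     // all keys, in order
        final List<Long> values = new ArrayList<>();     // all values, in order
    }

    // Flatten each map into one length plus its keys and values in sequence.
    static Streams write(List<Map<String, Long>> rows) {
        Streams s = new Streams();
        for (Map<String, Long> row : rows) {
            s.lengths.add(row.size());
            for (Map.Entry<String, Long> e : row.entrySet()) {
                s.keys.add(e.getKey());
                s.values.add(e.getValue());
            }
        }
        return s;
    }

    // Read the lengths back and zip the next `length` keys and values per row.
    static List<Map<String, Long>> read(Streams s) {
        List<Map<String, Long>> rows = new ArrayList<>();
        int offset = 0;
        for (int length : s.lengths) {
            Map<String, Long> row = new LinkedHashMap<>();
            for (int i = 0; i < length; i++) {
                row.put(s.keys.get(offset + i), s.values.get(offset + i));
            }
            offset += length;
            rows.add(row);
        }
        return rows;
    }
}
```

An empty map just contributes a length of 0 and nothing to the other two 
streams, which is why the lengths are needed to find the row boundaries.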

Actually, I've yet to confirm whether the String dictionary encoding kicks in 
for the key streams. For things like attributes, the keys usually repeat often 
enough for that to be a useful optimization.
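
For anyone wondering why repeated keys matter here, this is a generic 
dictionary-encoding sketch, not ORC's actual string writer (which I haven't 
checked for this case). KeyDictionarySketch and its methods are hypothetical 
names; the point is only that each distinct key is stored once and every 
occurrence becomes a small integer id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: generic dictionary encoding for a stream of string keys.
final class KeyDictionarySketch {
    final List<String> dictionary = new ArrayList<>(); // each distinct key once
    final List<Integer> ids = new ArrayList<>();       // one id per occurrence
    private final Map<String, Integer> index = new HashMap<>();

    // Append a key: reuse its id if seen before, otherwise assign the next id.
    void add(String key) {
        Integer id = index.get(key);
        if (id == null) {
            id = dictionary.size();
            dictionary.add(key);
            index.put(key, id);
        }
        ids.add(id);
    }

    // Decode the key at a given position in the stream.
    String get(int position) {
        return dictionary.get(ids.get(position));
    }
}
```

With a handful of attribute names repeated across millions of rows, the 
dictionary stays tiny and the id list compresses far better than the raw 
strings would.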

Cheers,
Gopal


