> I am curious about how is map type serialized in ORC files? -- One simple
> guess would be (conceptually) storing two arrays for key and value,
> respectively :).

Close enough, it's stored as 3 streams.

https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L2280

    lengths.write(childLength);
    childrenWriters[0].writeBatch(vec.keys, childOffset, childLength);
    childrenWriters[1].writeBatch(vec.values, childOffset, childLength);

The first stream stores the cardinality (number of entries) of each map, and
the other two store the keys and the values, in whatever type they might
represent, in sequence (to be read out & zipped to tuples).

Actually, I've yet to confirm if the String dictionary encoding kicks in for
the key streams - usually for things like attributes, the keys repeat often
enough for that to be a useful optimization.

Cheers,
Gopal
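
To make the three-stream layout concrete, here is a minimal Java sketch of the idea. It is not ORC code: the class `MapStreamsSketch`, its `Streams` holder and the `encode`/`decode` methods are hypothetical names made up for illustration, the streams are plain in-memory lists, and it assumes string keys and integer values. The real writer also applies its own encodings to each stream, which this ignores.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in ORC.
class MapStreamsSketch {

    // Three flat "streams": entry count per map, then all keys and all
    // values concatenated in row order.
    static final class Streams {
        final List<Integer> lengths = new ArrayList<>();
        final List<String> keys = new ArrayList<>();
        final List<Integer> values = new ArrayList<>();
    }

    // Flatten a column of maps into the three streams.
    static Streams encode(List<Map<String, Integer>> rows) {
        Streams s = new Streams();
        for (Map<String, Integer> row : rows) {
            s.lengths.add(row.size());
            for (Map.Entry<String, Integer> e : row.entrySet()) {
                s.keys.add(e.getKey());
                s.values.add(e.getValue());
            }
        }
        return s;
    }

    // Rebuild the maps: read a length, then zip that many keys and values.
    static List<Map<String, Integer>> decode(Streams s) {
        List<Map<String, Integer>> rows = new ArrayList<>();
        int offset = 0;
        for (int length : s.lengths) {
            Map<String, Integer> row = new LinkedHashMap<>();
            for (int i = 0; i < length; i++) {
                row.put(s.keys.get(offset + i), s.values.get(offset + i));
            }
            rows.add(row);
            offset += length;
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new LinkedHashMap<>();
        first.put("color", 1);
        first.put("size", 2);
        Map<String, Integer> second = new LinkedHashMap<>(); // empty map
        Map<String, Integer> third = new LinkedHashMap<>();
        third.put("color", 3);

        Streams s = encode(List.of(first, second, third));
        System.out.println("lengths = " + s.lengths);
        System.out.println("keys    = " + s.keys);
        System.out.println("values  = " + s.values);
        System.out.println("decoded = " + decode(s));
    }
}
```

Note how the key "color" shows up once per map in the flattened key stream; that repetition is what would make a dictionary encoding on the keys worthwhile.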
