Thank you, Gopal, for the explanation! Is there a JIRA ticket that tracks the String dictionary encoding? :) I can see it has huge potential value, for example when a map type is used as a flexible struct :)
Best,
Wenlei

On Wed, Nov 30, 2016 at 11:51 PM, Gopal Vijayaraghavan <[email protected]> wrote:

> > I am curious about how is map type serialized in ORC files? -- One
> > simple guess would be (conceptually) storing two arrays for key and value,
> > respectively :).
>
> Close enough, it's stored as 3 streams.
>
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L2280
>
>     lengths.write(childLength);
>     childrenWriters[0].writeBatch(vec.keys, childOffset, childLength);
>     childrenWriters[1].writeBatch(vec.values, childOffset, childLength);
>
> The first stream stores the cardinality of the map and the other two
> stores keys and values in whatever type they might represent, in sequence
> (to be read out & zipped to tuples).
>
> Actually, I've yet to confirm if the String dictionary encoding kicks in
> for the key streams - usually for things like attributes, the keys repeat
> in massive numbers for that to be a useful optimization.
>
> Cheers,
> Gopal

--
Best Regards,
Wenlei Xie (谢文磊)
Email: [email protected]
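
For anyone reading the thread later, here is a rough conceptual sketch of the three-stream layout Gopal describes: one stream of per-row map sizes, plus key and value streams that are read back in sequence and zipped into entries. It is not ORC's actual writer. The class and method names (MapStreamsSketch, Streams, write, read) are hypothetical, plain Java lists stand in for ORC's streams and column vectors, and keys and values are assumed to be strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: all names here are hypothetical, not ORC APIs.
class MapStreamsSketch {

    // Stand-in for the three streams: map sizes, keys, values.
    static final class Streams {
        final List<Integer> lengths = new ArrayList<>();
        final List<String> keys = new ArrayList<>();
        final List<String> values = new ArrayList<>();
    }

    // Flatten each row's map into the three streams.
    static Streams write(List<Map<String, String>> rows) {
        Streams s = new Streams();
        for (Map<String, String> row : rows) {
            s.lengths.add(row.size());          // cardinality of this row's map
            for (Map.Entry<String, String> e : row.entrySet()) {
                s.keys.add(e.getKey());         // keys, in sequence
                s.values.add(e.getValue());     // values, in the same order
            }
        }
        return s;
    }

    // Rebuild the rows by reading `length` keys and values and zipping them.
    static List<Map<String, String>> read(Streams s) {
        List<Map<String, String>> rows = new ArrayList<>();
        int offset = 0;
        for (int length : s.lengths) {
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < length; i++) {
                row.put(s.keys.get(offset + i), s.values.get(offset + i));
            }
            offset += length;
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("color", "red");
        first.put("size", "L");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("color", "blue");

        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(first);
        rows.add(second);

        Streams s = write(rows);
        System.out.println(s.lengths + " " + s.keys + " " + s.values);
        System.out.println(read(s).equals(rows));
    }
}
```

In this toy layout the key stream for attribute-style maps repeats the same few strings over and over, which is the situation where the dictionary encoding mentioned above would be expected to help.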
