Thank you, Gopal, for the explanation!

Is there a JIRA ticket that tracks String dictionary encoding for the map
key streams? :) I can see it having huge potential value, for example when a
map type is used as a flexible struct :)
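
To illustrate why this could matter, here is a minimal sketch of generic
dictionary encoding applied to a stream of map keys. It is not ORC's actual
implementation, and whether ORC applies it to key streams is exactly the open
question; the class and method names (KeyDictionarySketch, add, get) are made
up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of dictionary encoding; not an ORC class.
final class KeyDictionarySketch {
    // Each distinct key is stored once, in first-seen order.
    final List<String> dictionary = new ArrayList<>();
    // One small integer per key occurrence, pointing into the dictionary.
    final List<Integer> ids = new ArrayList<>();
    // Lookup from key to its dictionary position.
    private final Map<String, Integer> index = new HashMap<>();

    void add(String key) {
        Integer id = index.get(key);
        if (id == null) {
            id = dictionary.size();
            dictionary.add(key);
            index.put(key, id);
        }
        ids.add(id);
    }

    // Decode the key stored at the given position of the original stream.
    String get(int position) {
        int id = ids.get(position);
        return dictionary.get(id);
    }

    public static void main(String[] args) {
        KeyDictionarySketch enc = new KeyDictionarySketch();
        for (String k : Arrays.asList("color", "size", "color", "size", "color")) {
            enc.add(k);
        }
        System.out.println(enc.dictionary); // [color, size]
        System.out.println(enc.ids);        // [0, 1, 0, 1, 0]
    }
}
```

With a map used as a flexible struct, the same handful of attribute names
repeats in every row, so the dictionary stays tiny while the id list grows.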

Best,
Wenlei

On Wed, Nov 30, 2016 at 11:51 PM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

>
> > I am curious how the map type is serialized in ORC files. One simple
> > guess would be (conceptually) storing two arrays, for keys and values
> > respectively :).
>
> Close enough: it's stored as three streams.
>
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L2280
>
>             lengths.write(childLength);
>             childrenWriters[0].writeBatch(vec.keys, childOffset, childLength);
>             childrenWriters[1].writeBatch(vec.values, childOffset, childLength);
>
> The first stream stores the cardinality of each map (its number of
> entries), and the other two store the keys and values in whatever types
> they might represent, in sequence (to be read out & zipped into tuples).
>
> Actually, I've yet to confirm whether the String dictionary encoding kicks
> in for the key streams. For things like attributes, the keys usually repeat
> often enough for that to be a useful optimization.
>
> Cheers,
> Gopal
>
>
>
>
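
The three-stream layout described above can be sketched roughly as follows.
This is a conceptual model only, assuming String keys and long values; the
names MapStreamsSketch, Streams, write and read are hypothetical and are not
part of the ORC API, and real ORC streams are additionally encoded and
compressed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the layout; not an ORC class.
final class MapStreamsSketch {
    // The three conceptual streams: per-row entry counts, all keys, all values.
    static final class Streams {
        final List<Integer> lengths = new ArrayList<>();
        final List<String> keys = new ArrayList<>();
        final List<Long> values = new ArrayList<>();
    }

    // Flatten rows of maps into the three streams.
    static Streams write(List<Map<String, Long>> rows) {
        Streams s = new Streams();
        for (Map<String, Long> row : rows) {
            s.lengths.add(row.size());
            for (Map.Entry<String, Long> e : row.entrySet()) {
                s.keys.add(e.getKey());
                s.values.add(e.getValue());
            }
        }
        return s;
    }

    // Rebuild the rows: take `length` keys and values at a time and zip them.
    static List<Map<String, Long>> read(Streams s) {
        List<Map<String, Long>> rows = new ArrayList<>();
        int offset = 0;
        for (int length : s.lengths) {
            Map<String, Long> row = new LinkedHashMap<>();
            for (int i = 0; i < length; i++) {
                row.put(s.keys.get(offset + i), s.values.get(offset + i));
            }
            offset += length;
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> rows = new ArrayList<>();
        Map<String, Long> first = new LinkedHashMap<>();
        first.put("a", 1L);
        first.put("b", 2L);
        rows.add(first);
        rows.add(new LinkedHashMap<>()); // an empty map still gets a length of 0
        Map<String, Long> third = new LinkedHashMap<>();
        third.put("a", 3L);
        rows.add(third);

        Streams s = write(rows);
        System.out.println(s.lengths); // [2, 0, 1]
        System.out.println(s.keys);    // [a, b, a]
        System.out.println(s.values);  // [1, 2, 3]
    }
}
```

Note how the key stream repeats "a" across rows, which is where a dictionary
on the keys would help.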


-- 
Best Regards,
Wenlei Xie (谢文磊)

Email: wenlei....@gmail.com
