All complex types are flattened out and written as primitive column streams 
(string, long, double, float, etc.). String columns are dictionary encoded, so 
the string keys of a map get the same treatment as any other string column. If 
a column has too many distinct values, dictionary encoding is turned off 
automatically.
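
To make that concrete, here is a minimal sketch, not ORC's actual writer code. 
It assumes a map<string,long> column, flattens it into the three streams Gopal 
describes below (lengths, keys, values), and then dictionary encodes the key 
stream with a fallback when too many values are distinct. The names 
MapColumnSketch, writeRow, readRows, dictionaryEncode and maxDistinctRatio are 
all hypothetical, and the ratio passed in is an illustrative assumption rather 
than ORC's real setting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; this is NOT ORC's WriterImpl.
public class MapColumnSketch {
    // The three conceptual streams for a map<string,long> column.
    public final List<Integer> lengths = new ArrayList<>(); // entries per row
    public final List<String> keys = new ArrayList<>();     // all keys, in row order
    public final List<Long> values = new ArrayList<>();     // all values, in row order

    // Flatten one row (one map) into the three streams.
    public void writeRow(Map<String, Long> row) {
        lengths.add(row.size());
        for (Map.Entry<String, Long> e : row.entrySet()) {
            keys.add(e.getKey());
            values.add(e.getValue());
        }
    }

    // Rebuild the rows by slicing keys/values according to lengths
    // and zipping each slice back into (key, value) pairs.
    public List<Map<String, Long>> readRows() {
        List<Map<String, Long>> rows = new ArrayList<>();
        int offset = 0;
        for (int length : lengths) {
            Map<String, Long> row = new LinkedHashMap<>();
            for (int i = 0; i < length; i++) {
                row.put(keys.get(offset + i), values.get(offset + i));
            }
            rows.add(row);
            offset += length;
        }
        return rows;
    }

    // Dictionary encode a string stream. On success, appends the distinct
    // strings to dictionaryOut and returns one integer id per input value.
    // Returns null when distinct/total exceeds maxDistinctRatio (an assumed,
    // hypothetical knob), meaning the caller should write the strings directly.
    public static int[] dictionaryEncode(List<String> stream,
                                         List<String> dictionaryOut,
                                         double maxDistinctRatio) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int[] encoded = new int[stream.size()];
        for (int i = 0; i < stream.size(); i++) {
            String s = stream.get(i);
            Integer id = ids.get(s);
            if (id == null) {
                id = ids.size(); // next unused id, in first-seen order
                ids.put(s, id);
            }
            encoded[i] = id;
        }
        if (!stream.isEmpty()
                && (double) ids.size() / stream.size() > maxDistinctRatio) {
            return null; // too many distinct values: dictionary not worthwhile
        }
        dictionaryOut.addAll(ids.keySet());
        return encoded;
    }
}
```

For attribute-style maps, where the same few keys repeat in every row, the key 
stream collapses to a small dictionary plus a list of integer ids, which is 
where the space saving comes from.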

Thanks
Prasanth

> On Dec 1, 2016, at 11:35 AM, Wenlei Xie <wenlei....@gmail.com> wrote:
> 
> Thank you Gopal for the explanation!
> 
> Is there any JIRA ticket that tracks the String dictionary encoding? :) I 
> can see it has huge potential value, for example when a map type is used as 
> a flexible struct :) 
> 
> Best,
> Wenlei
> 
> On Wed, Nov 30, 2016 at 11:51 PM, Gopal Vijayaraghavan <gop...@apache.org> 
> wrote:
> 
> > I am curious how the map type is serialized in ORC files. One simple 
> > guess would be (conceptually) storing two arrays, for keys and values 
> > respectively :).
> 
> Close enough, it's stored as 3 streams.
> 
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L2280
> 
>             lengths.write(childLength);
>             childrenWriters[0].writeBatch(vec.keys, childOffset, childLength);
>             childrenWriters[1].writeBatch(vec.values, childOffset, childLength);
> 
> The first stream stores the cardinality (entry count) of each map, and the 
> other two store the keys and values, in whatever types they have, in 
> sequence (to be read out & zipped into tuples).
> 
> Actually, I've yet to confirm whether the String dictionary encoding kicks 
> in for the key streams. For things like attributes, the keys usually repeat 
> often enough for that to be a useful optimization.
> 
> Cheers,
> Gopal
> 
> 
> -- 
> Best Regards,
> Wenlei Xie (谢文磊)
> 
> Email: wenlei....@gmail.com
