Thanks. I forgot to consider the DOUBLE columns in the table. In the case of lineitem, ColumnarSerDe can store a double in fewer bytes than LazyBinaryColumnarSerDe, which uses 8 bytes.
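
A rough sketch of the size difference, not a measurement of Hive itself. It assumes the text form of a double is close to Python's `repr()` output and that the binary form is a plain 8-byte IEEE 754 value. The sample values are made up to look like lineitem's quantity, price, discount and tax columns, and per-field overhead added by RCFile or the SerDes is ignored.

```python
import struct

def text_size(value: float) -> int:
    """Bytes needed for the decimal text form (approximated with repr)."""
    return len(repr(value).encode("utf-8"))

def binary_size(value: float) -> int:
    """Bytes needed for a fixed-width big-endian IEEE 754 double."""
    return len(struct.pack(">d", value))

# Hypothetical values resembling lineitem's DOUBLE columns.
samples = [17.0, 21168.23, 0.04, 0.02]

text_total = sum(text_size(v) for v in samples)
binary_total = sum(binary_size(v) for v in samples)

for v in samples:
    print(f"{v!r}: text={text_size(v)} bytes, binary={binary_size(v)} bytes")
print(f"total: text={text_total} bytes, binary={binary_total} bytes")
```

Short values such as discounts and taxes take only a few characters as text, so the text encoding can come out smaller than the fixed 8 bytes per value.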
Yin

On Tue, Mar 6, 2012 at 2:42 PM, yongqiang he <heyongqiang...@gmail.com> wrote:
> I guess LazyBinaryColumnarSerDe does not save space, but is CPU efficient.
> Your tests align with our internal tests from a long time ago.
>
> On Tue, Mar 6, 2012 at 8:58 AM, Yin Huai <huaiyin....@gmail.com> wrote:
> > Hi,
> >
> > Is LazyBinaryColumnarSerDe more space efficient than ColumnarSerDe in
> > general?
> >
> > Let me make my question more specific.
> >
> > I generated two tables from the table lineitem of TPC-H
> > using ColumnarSerDe and LazyBinaryColumnarSerDe as follows...
> > CREATE TABLE lineitem_rcfile_lazybinary
> > ROW FORMAT SERDE
> > "org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe"
> > STORED AS RCFile AS
> > SELECT * from lineitem;
> >
> > CREATE TABLE lineitem_rcfile_lazy
> > ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
> > STORED AS RCFile AS
> > SELECT * from lineitem;
> >
> > Since the serialization of LazyBinaryColumnarSerDe is binary-based and that
> > of ColumnarSerDe is text-based, I expected the table
> > lineitem_rcfile_lazybinary to be smaller than lineitem_rcfile_lazy.
> > However, regardless of whether compression is enabled,
> > lineitem_rcfile_lazybinary is a little bit larger than
> > lineitem_rcfile_lazy. Did I use LazyBinaryColumnarSerDe in the wrong way?
> >
> > btw, the row group size of RCFile is 32MB.
> >
> > Thanks,
> >
> > Yin