Hi,

This has probably come up before, but I wanted to know if there is a 
recommendation on storing all attribute data as separate columns versus a 
hybrid approach in which most of the attribute data is stored as a blob in a 
single column and the rest as separate columns (for column filter searches). I 
am aware of the limitations of lumping the data into a blob, but I was curious 
whether it improves throughput or latency.

I am leaning towards there not being much of a difference, or this being a 
micro-optimization not worth the tradeoff. However, when we ran a set of 
benchmarks to test this (on version 0.94), the hybrid approach with the blob 
data seemed to show a 10-12% improvement in write throughput for the same 
number of client threads, with evenly distributed puts over a pre-split table 
on a 12-node cluster. I used Avro for serialization, and all the columns (about 
40 without the blob column and 10 with it) are part of one column family. The 
size of the data for a row is around 5 MB before serialization. Any thoughts on 
whether this is worth pursuing?
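
For reference, a rough back-of-the-envelope sketch of one thing that differs 
between the two layouts: every cell is stored as a KeyValue that repeats the 
row key, family and qualifier, so fewer columns means less key overhead per 
row. The row key length, family name and qualifier names below are 
hypothetical placeholders, not the ones from the benchmark.

```python
# Fixed bytes per cell in the serialized KeyValue layout (no tags in 0.94):
# key length (4) + value length (4) + row length (2) + family length (1)
# + timestamp (8) + key type (1)
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def key_overhead(row_key, family, qualifiers):
    """Bytes spent on everything except the values, summed over all cells."""
    per_cell_common = KV_FIXED_OVERHEAD + len(row_key) + len(family)
    return sum(per_cell_common + len(q) for q in qualifiers)


# Hypothetical schema: 16-byte row key, one family "d", 7-byte qualifiers.
ROW_KEY = b"r" * 16
FAMILY = b"d"

# Layout 1: 40 separate attribute columns.
wide = [f"attr_{i:02d}".encode() for i in range(40)]

# Layout 2: 9 filterable columns plus one blob column (10 in total).
hybrid = wide[:9] + [b"blob"]

wide_bytes = key_overhead(ROW_KEY, FAMILY, wide)
hybrid_bytes = key_overhead(ROW_KEY, FAMILY, hybrid)

print("40 columns:", wide_bytes, "bytes of key overhead per row")
print("10 columns:", hybrid_bytes, "bytes of key overhead per row")
print("difference:", wide_bytes - hybrid_bytes, "bytes per row")
```

With these assumed names the difference is roughly 1.3 KB per row, which is 
small next to a row of around 5 MB, so key overhead alone would probably not 
account for a 10-12% gap.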

Thanks,
Melvin
