My use of HBase is essentially what Stack describes: I serialize little log 
entry objects with (mostly) protobuf and store each one in a single cell in 
HBase.  I did this at first because it was easy, and made a note to go back 
and break the fields out into their own columns, and in some cases into 
multiple column families.  When I went back and did this, I found that the 
'exploded' schema was actually slower to scan than the 'blob' schema, and 
filters didn't seem to help all that much.  This was in the 0.20 days, IIRC.  
All of which is to say: +1 on storing blobs in HBase.
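For what it's worth, the write path for the blob schema is just one Put per 
entry.  Here's a minimal sketch against the current HBase Java client (the 
table name, the 'd' family, and the LogEntry protobuf class are all made-up 
stand-ins for whatever you actually have; back in 0.20 this was HTable and 
put.add(), but the idea is the same):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BlobWriter {
  // LogEntry is a hypothetical generated protobuf class, standing in for
  // whatever message you actually serialize.
  public static void store(LogEntry entry, byte[] rowKey) throws Exception {
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("log_entries"))) {
      Put put = new Put(rowKey);
      // The whole serialized entry lives in one cell: family 'd',
      // qualifier 'blob'.
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("blob"),
                    entry.toByteArray());
      table.put(put);
    }
  }
}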

I don't know if this would work for you, but what's worked well for me is to 
write side files for Hive to read as I ingest entries into HBase.  I like 
HBase for durability, random access, sorting, and scanning, and I'll continue 
to use it to store the golden copy for the foreseeable future, but for batch 
work I've found that Hive over plain text files is at least a couple of times 
faster than MapReduce over an HBase source.  If what you need from the Hive 
schema changes over time, you can simply nuke the side files and recreate 
them with a MapReduce job against the golden copy in HBase.  Rough sketches 
of both pieces follow.
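The ingest side just appends a delimited text line to a file in HDFS next to 
each HBase Put, and Hive reads the directory through an external table.  A 
sketch, assuming a tab-delimited layout (paths, field names, and the LogEntry 
class are again made up):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SideFileWriter {
  private final FSDataOutputStream out;

  public SideFileWriter(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    // One file per ingest process; Hive reads every file in the directory.
    out = fs.create(
        new Path("/warehouse/log_entries/part-" + System.nanoTime()));
  }

  // Append one '\t'-delimited, '\n'-terminated record.  This has to match
  // the Hive table's ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'.
  public void append(LogEntry entry) throws Exception {
    String line = entry.getTimestamp() + "\t" + entry.getHost() + "\t"
        + entry.getMessage() + "\n";
    out.write(line.getBytes(StandardCharsets.UTF_8));
  }

  public void close() throws Exception {
    out.close();
  }
}

On the Hive side it's just a matching CREATE EXTERNAL TABLE with ROW FORMAT 
DELIMITED FIELDS TERMINATED BY '\t' and LOCATION '/warehouse/log_entries', so 
new files show up in queries without any load step.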
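And the rebuild is a map-only job over the HBase table that deserializes 
each blob and re-emits the text lines in the new layout, using the stock 
TableMapper machinery.  Again a sketch with the same made-up names:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RebuildSideFiles {
  static class BlobMapper extends TableMapper<NullWritable, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      // Deserialize the golden copy and re-emit it in the new text layout.
      LogEntry entry = LogEntry.parseFrom(
          value.getValue(Bytes.toBytes("d"), Bytes.toBytes("blob")));
      ctx.write(NullWritable.get(), new Text(entry.getTimestamp() + "\t"
          + entry.getHost() + "\t" + entry.getMessage()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(),
        "rebuild-side-files");
    job.setJarByClass(RebuildSideFiles.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches for the full scan
    scan.setCacheBlocks(false);  // don't pollute the block cache
    TableMapReduceUtil.initTableMapperJob("log_entries", scan,
        BlobMapper.class, NullWritable.class, Text.class, job);
    job.setNumReduceTasks(0);    // map-only: just rewrite the text files
    job.setOutputFormatClass(TextOutputFormat.class);
    // Nuke the old side files first; the output dir must not exist.
    FileOutputFormat.setOutputPath(job, new Path("/warehouse/log_entries"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}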

Sandy
