On 01/07/11 10:23, Andrew Purtell wrote:
From: Stack<[email protected]>
3. The size of them varies like this:
70% of them have a length < 1 MB
29% of them have a length between 1 MB and 10 MB
1% of them have a length > 10 MB (they can also reach
100 MB)
What David says above, though Jack in his yfrog presentation today
talks of storing all images up to 5MB in size in HBase.
Karthick in his presentation at the Hadoop Summit talked about how once
cells cross a certain size -- he didn't say what the threshold was, I
believe -- only the metadata is stored in HBase and the content
goes to their "big stuff" system.
Try it, I'd say. If there are only a few 100MB instances, HBase might be fine.
I've seen problematic behavior in the past if you store values larger than 100
MB and then do concurrent scans over table(s) containing many such objects. The
default KeyValue size limit is 10 MB. This is usually sufficient. For
webtable-like applications I may raise it to 50 MB, and larger objects are not
interesting anyway (to me).
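For reference, a minimal sketch of raising that limit from the client side,
via the hbase.client.keyvalue.maxsize property:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    Configuration conf = HBaseConfiguration.create();
    // Default is 10 MB (10485760 bytes); Puts whose KeyValues exceed
    // this are rejected client side. Raise to 50 MB for webtable-like use.
    conf.setInt("hbase.client.keyvalue.maxsize", 50 * 1024 * 1024);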
One reasonable way to handle native storage of large objects in HBase would be
to introduce a layer of indirection.
Do you see this layer on the client side or on the server side?
(The considerations you raise a bit later about the round trips between
client and server don't let me tell. :)
Break the large object up into chunks.
Chunk size could be configured.
I was also thinking about updates: let's say we store a new version of
the large object that is smaller than the previous one (fewer chunks).
The previously created chunks would remain until their TimeToLive
expires, but they could potentially be removed earlier. Would the
indirection layer be responsible for this maintenance?
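To make the question concrete, a rough sketch of what such a cleanup could
look like once the index row has been rewritten (all names here are made up):

    import java.io.IOException;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Delete chunks the new (smaller) version no longer references,
    // instead of waiting for the TTL to reclaim them.
    void removeStaleChunks(HTable chunkTable,
        List<byte[]> oldChunkKeys, List<byte[]> newChunkKeys)
        throws IOException {
      Set<byte[]> stale = new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);
      stale.addAll(oldChunkKeys);
      for (byte[] k : newChunkKeys) {
        stale.remove(k);  // TreeSet.remove compares byte[] contents
      }
      for (byte[] chunkKey : stale) {
        chunkTable.delete(new Delete(chunkKey));
      }
    }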
Store the chunks in a manner that gets good distribution in the keyspace, maybe
by SHA-1 hash of the content.
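A minimal write-path sketch of that scheme, assuming the current client API
(table, family, and qualifier names are hypothetical):

    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Split a large value into fixed-size chunks, keyed by the SHA-1
    // of each chunk's content so keys spread across the keyspace.
    List<byte[]> writeChunks(HTable chunkTable, byte[] value, int chunkSize)
        throws IOException, NoSuchAlgorithmException {
      MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
      List<byte[]> chunkKeys = new ArrayList<byte[]>();
      for (int off = 0; off < value.length; off += chunkSize) {
        byte[] chunk = Arrays.copyOfRange(value, off,
            Math.min(off + chunkSize, value.length));
        byte[] key = sha1.digest(chunk);  // digest() also resets
        Put put = new Put(key);
        put.add(Bytes.toBytes("c"), Bytes.toBytes("d"), chunk);
        chunkTable.put(put);
        chunkKeys.add(key);
      }
      return chunkKeys;  // the caller stores these as the index
    }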
An alternative would be to append a "_chunk#" suffix to the original key.
I guess you prefer to distribute the chunks randomly across the available
regions?
Then store an index to the chunks with the key of your choice.
With "index", you mean a list of chunk keys?
Get the key to retrieve the index, then use a MultiAction to retrieve the
referenced chunks in parallel.
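A corresponding read-path sketch, using the client's batched get; the
index layout (a concatenation of fixed 20-byte SHA-1 keys) is an assumption:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Fetch the index row, then retrieve all referenced chunks in one
    // batched call and reassemble the original value.
    byte[] readLargeObject(HTable indexTable, HTable chunkTable, byte[] row)
        throws IOException {
      Result index = indexTable.get(new Get(row));
      byte[] keys = index.getValue(Bytes.toBytes("c"), Bytes.toBytes("d"));
      List<Get> gets = new ArrayList<Get>();
      for (int off = 0; off < keys.length; off += 20) {  // 20 = SHA-1 size
        gets.add(new Get(Arrays.copyOfRange(keys, off, off + 20)));
      }
      Result[] chunks = chunkTable.get(gets);  // grouped by regionserver
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      for (Result r : chunks) {
        out.write(r.getValue(Bytes.toBytes("c"), Bytes.toBytes("d")));
      }
      return out.toByteArray();
    }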
Given large objects you are going to need a number of round trips over the
network to pull all of the data anyway. Adding a couple more up front may
not push the result outside the performance bounds of your application.
However, you will put your client under heap pressure that way, as objects in
HBase are fully transferred to the client at once in the RPC response. Another
option is to store large objects directly in HDFS and keep only the path to
them in HBase. A benefit of this approach is that you can stream the data out
of HDFS with as little or as much buffering in your application as you like.
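A sketch of that variant (the /blobs path and column names are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Write the blob to HDFS and keep only its path in HBase.
    void writeBlob(FileSystem fs, HTable table, byte[] row, byte[] data)
        throws IOException {
      Path path = new Path("/blobs/" + Bytes.toStringBinary(row));
      FSDataOutputStream out = fs.create(path);
      out.write(data);
      out.close();
      Put put = new Put(row);
      put.add(Bytes.toBytes("meta"), Bytes.toBytes("path"),
          Bytes.toBytes(path.toString()));
      table.put(put);
    }

    // Look up the path, then let the caller stream from HDFS with as
    // little or as much buffering as it likes.
    FSDataInputStream openBlob(FileSystem fs, HTable table, byte[] row)
        throws IOException {
      Result r = table.get(new Get(row));
      String path = Bytes.toString(
          r.getValue(Bytes.toBytes("meta"), Bytes.toBytes("path")));
      return fs.open(new Path(path));
    }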
Storing the large ones in HDFS and simply keeping the pointer in HBase
lets us benefit from HDFS streaming.
I was wondering whether a StreamingPut (or StreamingGet) has already
been discussed?
Thx.
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein (via
Tom White)
--
Eric