On 01/07/11 10:23, Andrew Purtell wrote:
From: Stack <[email protected]>

  3. Their sizes vary like this:
            70% of them are smaller than 1 MB
            29% of them are between 1 MB and 10 MB
            1% of them are larger than 10 MB (some can reach 100 MB)

That is what David says above, though Jack in his yfrog presentation today
talked of storing all images up to 5MB in size in HBase.

Karthick in his presentation at the Hadoop Summit talked about how once
cells cross a certain size -- he didn't say what the threshold was, I
believe -- then only the metadata is stored in HBase and the content
goes to their "big stuff" system.

Try it I'd say.  If only a few instances of 100MB, HBase might be fine.


I've seen problematic behavior in the past if you store values larger than 100
MB and then do concurrent scans over table(s) containing many such objects. The
default KeyValue size limit is 10 MB. This is usually sufficient. For
webtable-like applications I may raise it to 50 MB; larger objects are not
interesting anyway (to me).
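
(For reference, a minimal sketch of raising that limit from client code; this
assumes the hbase.client.keyvalue.maxsize property, and the 50 MB figure is
just an illustration:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class RaiseKeyValueLimit {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Default KeyValue size limit is 10 MB; raise it to 50 MB for this client.
        conf.setLong("hbase.client.keyvalue.maxsize", 50L * 1024 * 1024);
        // ... pass conf to the HTable instances created afterwards ...
      }
    }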

One reasonable way to handle native storage of large objects in HBase would be 
to introduce a layer of indirection.

Do you see this layer on the client or on the server side?
(The considerations you describe a bit later about the round trips between client and server don't allow me to tell :)

Break the large object up into chunks.

Chunk size could be configured.
I was also thinking about updates: let's say we store a new version of the large object which is smaller than the previous one (fewer chunks). The previously created chunks would remain until their TimeToLive expires, but could potentially be removed earlier. Would the indirection layer be responsible for this maintenance?
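
(To make the chunking step concrete, a minimal sketch; the chunk size is
whatever you configure:)

    import java.util.ArrayList;
    import java.util.List;

    public class Chunker {
      // Split a large value into fixed-size chunks; chunkSize would come from configuration.
      public static List<byte[]> split(byte[] value, int chunkSize) {
        List<byte[]> chunks = new ArrayList<byte[]>();
        for (int off = 0; off < value.length; off += chunkSize) {
          int len = Math.min(chunkSize, value.length - off);
          byte[] chunk = new byte[len];
          System.arraycopy(value, off, chunk, 0, len);
          chunks.add(chunk);
        }
        return chunks;
      }
    }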

Store the chunks in a manner that gets good distribution in the keyspace, maybe 
by SHA-1 hash of the content.

An alternative would be to add a "_chunk#" suffix to the original key.
I guess you prefer to distribute the chunks randomly across the available regions?

Then store an index to the chunks with the key of your choice.

With "index", you mean a list of chunk keys?

Get the key to retrieve the index, then use a MultiAction to retrieve the 
referenced chunks in parallel.
Given large objects you are going to need a number of round trips over the
network to pull all of the data anyway. Adding a couple more up front may
not push the result outside the performance bounds of your application.
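
(And the corresponding read path, assuming the client's batched get(List<Get>)
call, which the client groups and sends per region server; same made-up layout
as the write sketch:)

    import java.io.ByteArrayOutputStream;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ChunkReader {
      public static byte[] read(HTable table, byte[] objectKey) throws Exception {
        // First round trip: the index row.
        Result index = table.get(new Get(objectKey));
        int n = index.getFamilyMap(Bytes.toBytes("index")).size();
        List<Get> gets = new ArrayList<Get>();
        for (int i = 0; i < n; i++) {
          byte[] chunkKey = index.getValue(Bytes.toBytes("index"), Bytes.toBytes(i));
          gets.add(new Get(chunkKey));
        }
        // Remaining round trips: all chunks in one batched call, results in request order.
        Result[] chunkResults = table.get(gets);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Result r : chunkResults) {
          out.write(r.getValue(Bytes.toBytes("data"), Bytes.toBytes("chunk")));
        }
        return out.toByteArray();
      }
    }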

However you will put your client under heap pressure that way, as objects in 
HBase are fully transferred at once to the client in the RPC response. Another 
option is to store large objects directly into HDFS and keep only the path to 
it in HBase. A benefit of this approach is you can stream the data out of HDFS 
with as little or as much buffering in your application as you would like.

Storing the large ones in HDFS and simply keeping the pointer in HBase lets us benefit from HDFS streaming.
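
(A minimal sketch of that variant, with made-up "meta"/"path" names: keep only
the HDFS path in HBase and stream the bytes from HDFS:)

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HdfsPointer {
      // Store only the HDFS path of the large object in HBase.
      public static void writePointer(HTable table, byte[] key, Path file) throws Exception {
        Put put = new Put(key);
        put.add(Bytes.toBytes("meta"), Bytes.toBytes("path"), Bytes.toBytes(file.toString()));
        table.put(put);
      }

      // Look up the path, then stream the content straight from HDFS.
      public static InputStream open(Configuration conf, HTable table, byte[] key)
          throws Exception {
        byte[] path = table.get(new Get(key))
            .getValue(Bytes.toBytes("meta"), Bytes.toBytes("path"));
        FileSystem fs = FileSystem.get(conf);
        return fs.open(new Path(Bytes.toString(path)));
      }
    }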

I was wondering whether a StreamingPut (or StreamingGet) has already been discussed?

Thx.


Best regards,


    - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)


--
Eric
