> > One reasonable way to handle native storage of large objects in HBase would 
> > be to introduce a layer of indirection.
> 
> Do you see this layer on the client or on the server side?


Client side.

> I was also thinking about the "update" case: let's say we store a new 
> version of the large object which is smaller than the previous one (fewer 
> chunks). The previously created chunks will remain until their TimeToLive 
> expires, but could potentially be removed earlier. Would the indirection 
> layer be responsible for this maintenance?


Yes.
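
As a rough illustration of that maintenance step, the client-side layer could 
compare the chunk key list it read before the update with the one it is about 
to write, and delete whatever is no longer referenced. This is only a sketch: 
the class and method names are hypothetical, and it assumes the index is an 
ordered list of chunk row keys held as strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for the client-side indirection layer (not an HBase API).
class StaleChunks {
    /**
     * Returns the chunk keys that appear in the old index but not in the new
     * one, in their original order and without duplicates. These are the
     * candidates for an explicit delete instead of waiting for the TTL.
     */
    static List<String> staleKeys(List<String> oldIndex, List<String> newIndex) {
        Set<String> live = new HashSet<>(newIndex);
        Set<String> seen = new HashSet<>();
        List<String> stale = new ArrayList<>();
        for (String key : oldIndex) {
            // Keep a key only if the new version no longer references it
            // and we have not already listed it.
            if (!live.contains(key) && seen.add(key)) {
                stale.add(key);
            }
        }
        return stale;
    }
}
```

If chunk keys can be shared between different large objects, deleting on this 
basis alone would be unsafe, so the layer would need some form of reference 
tracking or would have to fall back on the TTL.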

> > Store the chunks in a manner that gets good distribution in the keyspace, 
> > maybe by SHA-1 hash of the content.
> 
> An alternative would be to add a "_chunk#" to the original key value.
> I guess you prefer to randomly distribute the chunks in the available 
> regions?


Yes. This increases the probability that a MultiAction<Get> for the chunks 
is parallelized over multiple region servers, which helps distribute load. 
Conversely, if most or all of the chunks are in the same region -- as would be 
the case when appending "_chunk#" to the key -- performance will suffer 
because they will be retrieved serially.

> With "index", you mean a list of chunk keys?


Yes.


> > Storing the large ones in HDFS and simply having the pointer in HBase 
> > allows you to benefit from HDFS streaming.
> 
> I was wondering whether a StreamingPut (or StreamingGet) has already been 
> discussed?


The way HBase RPC currently works, it's not possible to stream data out of 
HBase. The objects that satisfy your Get or Scanner.next request are marshalled 
fully into the RPC response, which is sent all at once.

You could use the HBase REST gateway and stream the response through it. In 
that case your client-side access to the HBase cluster is via your favorite 
HTTP client library. But then your actions transit a gateway, which adds 
latency (and the gateway must buffer the objects fully in memory), and if you 
address resources in a RESTful manner there are HTTP transaction overheads to 
consider. This type of configuration would work best for supporting 
user-facing services that are RESTful in nature themselves: API services, 
websites.


Best regards,


  - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)
