Hey,

Streaming is one of those things that would require major wholesale changes... good ones, but needless to say, reworking the fundamentals of how the RPC system, the storage system, and the file format work is not really an overnight project.
If you are storing extremely large cells, the best bet is HDFS. Most systems end up having to do mixed storage, and it might be difficult to make HBase useful for both 10 byte cells and 10 GB cells. With some good API layers on your app side it shouldn't be too hard.

-ryan

On Wed, Aug 18, 2010 at 9:02 PM, Stack <[email protected]> wrote:
> On Wed, Aug 18, 2010 at 4:47 PM, Stuart Smith <[email protected]> wrote:
>>
>> Hello,
>>
>> I was wondering if there are any plans for a stream interface to Cell data.
>> I saw this:
>>
>>> > or they are using large client write buffers so big payloads are being
>>> > passed to the server in each RPC request. Our RPC is not streaming.
>>
>> So I'm guessing there's not one now (and I couldn't find one in 0.20.6
>> either). HDFS does seem to provide a stream interface (I'm about to try it
>> out).
>>
>> So is there a fundamental limitation on hbase that prevents a streaming
>> interface to Cells, is it possible but distasteful for some reason, or is it
>> just a TODO item?
>>
>
> Our RPC doesn't do streaming.
>
> A streaming/chunking protocol would be nice -- there is even an old
> issue to do it -- but I think general consensus is that it's low
> priority (do you think different)?
>
> Also, if your cells are large, you might consider keeping the content
> in hdfs and their location up in hbase. If the cell is 100MB, the
> lookup in hbase pales beside the time to stream from hdfs.
>
>> I'm thinking this could help alleviate the Big Cell OOME situation. This
>> would be especially handy if you just have a few outlier cells that are
>> really big, and lots of smaller ones.
>>
>
> Big cell OOME is rare, unless I'm mistaken. Or saying it another way,
> it's rare in my experience that hbase is used hosting big cells. We
> should add better cell size checks out on client and, like the
> speed-limiter on your Hertz Ferrari, it'll keep you safe at least
> until you go out of your way to dismantle the check.
>
>> Right now I'm just going with the solution of putting a layer on top of my
>> system that writes file metadata and most (smaller) files to hbase, and the
>> occasional big file to HDFS. This should work, and is probably best in the
>> long run, but a streaming interface would be handy!
>>
>
> Oh, yeah, this is a bit of a pain having to handle two sources for
> data. Does your dataset fluctuate wildly in its size? Is there a
> way you can separate the big from the small? If so, perhaps you could
> model it so the big was in one column family and the small in another.
> The big column family held the hdfs location where the small-data
> column family actually carried the data?
>
> St.Ack
>
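
For illustration, the mixed-storage layer discussed in this thread (small values kept directly in HBase, big ones written to HDFS with only their location kept in HBase) might look roughly like the sketch below. It is only a sketch under stated assumptions: plain dicts stand in for the HBase table and the HDFS file system, and the names MixedStore, SIZE_THRESHOLD_BYTES, the "small:data" and "big:hdfs_path" columns, and the /blobs/ path prefix are all made up for the example. None of this is HBase or HDFS API, and the thread does not name a size cutoff.

```python
from dataclasses import dataclass, field

# Hypothetical cutoff; the thread does not say where "big" starts.
SIZE_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class MixedStore:
    """Route small payloads to an HBase-like table and big ones to an HDFS-like store.

    Both backends are plain dicts standing in for the real clients.
    """

    threshold: int = SIZE_THRESHOLD_BYTES
    # row key -> {"family:qualifier": bytes}, a stand-in for an HBase table
    table: dict = field(default_factory=dict)
    # path -> bytes, a stand-in for files in HDFS
    files: dict = field(default_factory=dict)

    def put(self, row_key: str, payload: bytes) -> None:
        if len(payload) <= self.threshold:
            # Small payload: the "small" column family carries the data itself.
            self.table[row_key] = {"small:data": payload}
        else:
            # Big payload: write the bytes to the file store and keep only
            # the location in the "big" column family.
            path = f"/blobs/{row_key}"  # hypothetical path layout
            self.files[path] = payload
            self.table[row_key] = {"big:hdfs_path": path.encode("utf-8")}

    def get(self, row_key: str) -> bytes:
        row = self.table[row_key]
        if "small:data" in row:
            return row["small:data"]
        # Follow the stored location to fetch the big payload.
        path = row["big:hdfs_path"].decode("utf-8")
        return self.files[path]
```

With a tiny threshold for demonstration, MixedStore(threshold=16) keeps a 4-byte value in the table itself and sends a 100-byte value to the file store, leaving only its path in the row. The caller uses put and get either way and never has to know which backend holds the bytes, which is the "good API layer on your app side" idea.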
