Todd, I could not get Stargate to work on 0.89 for some reason; that's why we are running 0.20.6. Also, regarding bloom filters, I thought they were mainly for column seeking. In our case we have this schema:
row  att:data  filename  file_data

-Jack

On Mon, Sep 20, 2010 at 11:53 AM, Todd Lipcon <[email protected]> wrote:
> Hey Jack,
>
> This sounds like a very exciting project! A few thoughts that might help you:
> - Check out the Bloom filter support that is in the 0.89 series. It
>   sounds like all of your access is going to be random key gets - adding
>   blooms will save you lots of disk seeks.
> - I might even bump the region size up to 1G or more given the planned
>   capacity.
> - The "HA" setup will be tricky - we don't have a great HA story yet.
>   Given you have two DCs, you may want to consider running separate
>   HBase clusters, one in each, and either using the new replication
>   support, or simply doing "client replication" by writing all images to
>   both.
>
> Good luck with the project, and keep us posted how it goes.
>
> Thanks
> -Todd
>
> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <[email protected]> wrote:
>>
>> Greetings all. My name is Jack and I work for an image hosting
>> company, ImageShack; we also have a property that's widely used as a
>> Twitter app, called yfrog (yfrog.com).
>>
>> ImageShack gets close to two million image uploads per day, which are
>> usually stored on regular servers (we have about 700) as regular
>> files, and each server has its own host name, such as (img55). I've
>> been researching how to improve our backend design in terms of data
>> safety and stumbled onto the HBase project.
>>
>> We have been running Hadoop for data access log analysis for a while
>> now, quite successfully. We are receiving about 2 billion hits per
>> day and store all of that data in RCFiles (attribution to Facebook
>> applies here), which are loadable into Hive (thanks to FB again). So
>> we know how to manage HDFS and run MapReduce jobs.
>>
>> Now, I think HBase is the most beautiful thing that has happened to the
>> distributed DB world :). The idea is to store image files (about
>> 400KB on average) into HBase.
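As a side note for readers of the archive: on the 0.89 series that Todd mentions, a row-level bloom filter is set per column family at table-creation time. A minimal sketch, assuming an illustrative table name `images` and family name `att` (neither name is confirmed by the thread), with the ~400KB block size Jack describes:

```shell
# Hypothetical DDL sketch for the HBase 0.89 shell; the table and family
# names are assumptions, not taken from the thread. A ROW bloom filter
# helps the random-key-get access pattern; BLOCKSIZE is in bytes
# (409600 ~= 400KB, matching the average image size mentioned above).
hbase shell <<'EOF'
create 'images', {NAME => 'att', BLOOMFILTER => 'ROW', BLOCKSIZE => 409600}
EOF
```

This is a config fragment that requires a running 0.89 cluster; the exact descriptor keys may differ slightly between releases.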
>> The setup will include the following configuration:
>>
>> 50 servers total (2 datacenters), with 8 GB RAM, dual-core CPUs, 6 x
>> 2TB disks each.
>> 3 to 5 ZooKeepers
>> 2 Masters (one in each datacenter)
>> 10 to 20 Stargate REST instances (one per server, hash load-balanced)
>> 40 to 50 RegionServers (will probably keep the masters separate on
>> dedicated boxes).
>> 2 Namenode servers (one backup, highly available; will do fsimage and
>> edits snapshots also)
>>
>> So far I have about 13 servers running, doing about 20 insertions /
>> second (file sizes ranging from a few KB to 2-3MB, avg. 400KB) via the
>> Stargate API. Our frontend servers receive files, and I just
>> fork-insert them into Stargate via HTTP (curl).
>> The inserts are humming along nicely, without any noticeable load on
>> the regionservers; so far I have inserted about 2 TB worth of images.
>> I have adjusted the region file size to 512MB, and the table block size
>> to about 400KB, trying to match the average access size to limit HDFS
>> trips. So far the read performance has been more than adequate, and of
>> course write performance is nowhere near capacity.
>> Right now, all newly uploaded images go to HBase. But we do plan
>> to insert about 170 million images (about 100 days' worth), which is
>> only about 64 TB, or 10% of the planned cluster size of 600TB.
>> The end goal is to have a storage system that provides data safety,
>> i.e. the system may go down but data cannot be lost. Our front-end
>> servers will continue to serve images from their own file systems (we
>> are serving about 16 Gbit/s at peak); however, should we need to bring
>> any of them down for maintenance, we will redirect all their traffic to
>> HBase (should be no more than a few hundred Mbps) while the front-end
>> server is repaired (for example, having a disk replaced). After the
>> repairs, we quickly repopulate it with the missing files, while serving
>> the remaining missing ones off HBase.
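The capacity figures in the message above can be sanity-checked with quick arithmetic (decimal units; the thread's "about 64 TB" is in the same ballpark):

```shell
# Back-of-envelope check of the numbers in the thread.
images=170000000   # ~170 million images (~100 days of uploads)
avg_kb=400         # average image size, in KB
# total KB divided by 1e9 gives TB (decimal)
total_tb=$(awk -v n="$images" -v kb="$avg_kb" 'BEGIN { printf "%d", n * kb / 1e9 }')
echo "raw: ${total_tb} TB"                        # 68 TB before replication
echo "at HDFS default 3x: $((total_tb * 3)) TB"   # 204 TB of raw disk
```

Note that the "10% of 600TB" figure in the thread counts raw image bytes only; at HDFS's default 3x replication the on-disk footprint is roughly triple that.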
>>
>> All in all it should be a very interesting project, and I am hoping not to
>> run into any snags; however, should that happen, I am pleased to know
>> that such a great and vibrant tech group exists that supports and uses
>> HBase :).
>>
>> -Jack
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
