You'll need a major compaction to generate the blooms for the existing data.
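[Editor's note: for readers following along, the sequence being described (alter the family to enable row-key blooms, re-enable the table, then major-compact so old storefiles are rewritten with blooms) might look roughly like this in the HBase shell. The table name 'images' and family 'att' are taken from Jack's schema below; exact shell syntax varied somewhat across versions of this era, so treat this as a sketch, not canonical.]

```shell
# Sketch only: enable row-key blooms on an existing table and
# rewrite old storefiles so the blooms cover pre-existing data.
hbase shell <<'EOF'
disable 'images'
alter 'images', {NAME => 'att', BLOOMFILTER => 'ROW'}
enable 'images'
# Storefiles written before the alter have no blooms; a major
# compaction rewrites them, generating blooms for old data too.
major_compact 'images'
EOF
```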
On Mon, Sep 20, 2010 at 2:15 PM, Alexey Kovyrin <[email protected]> wrote:
> When one enables blooms for rows, is a major compaction or something
> else required (aside from re-enabling the table after the alter)?
>
> On Mon, Sep 20, 2010 at 5:06 PM, Todd Lipcon <[email protected]> wrote:
>> On Mon, Sep 20, 2010 at 1:13 PM, Jack Levin <[email protected]> wrote:
>>> Todd, I could not get stargate to work on 0.89 for some reason; that's
>>> why we are running 0.20.6. Also, regarding bloom filters, I thought
>>> they were mainly for column seeking. In our case we have this schema:
>>>
>>> row       att:data
>>> filename  file_data
>>>
>>
>> The bloom filters work on either a ROW basis or a ROW_COL basis. If
>> you turn on row-key blooms, then a get of a particular filename will
>> avoid looking in the storefiles that don't have any data for that
>> row.
>>
>> Regarding stargate in 0.89: it has been renamed to "rest" since the
>> old REST server was removed. I haven't used it much, but hopefully
>> someone can give you a pointer (or, even better, update the
>> wiki/docs!)
>>
>> -Todd
>>
>>>
>>> -Jack
>>>
>>> On Mon, Sep 20, 2010 at 11:53 AM, Todd Lipcon <[email protected]> wrote:
>>>> Hey Jack,
>>>>
>>>> This sounds like a very exciting project! A few thoughts that might
>>>> help you:
>>>> - Check out the Bloom filter support in the 0.89 series. It sounds
>>>>   like all of your access is going to be random key gets; adding
>>>>   blooms will save you lots of disk seeks.
>>>> - I might even bump the region size up to 1 GB or more given the
>>>>   planned capacity.
>>>> - The "HA" setup will be tricky; we don't have a great HA story yet.
>>>>   Given that you have two DCs, you may want to consider running
>>>>   separate HBase clusters, one in each, and either using the new
>>>>   replication support or simply doing "client replication" by
>>>>   writing all images to both.
>>>>
>>>> Good luck with the project, and keep us posted on how it goes.
>>>>
>>>> Thanks
>>>> -Todd
>>>>
>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <[email protected]> wrote:
>>>>>
>>>>> Greetings all. My name is Jack, and I work for an image hosting
>>>>> company, ImageShack; we also have a property that's widely used as
>>>>> a Twitter app called yfrog (yfrog.com).
>>>>>
>>>>> ImageShack gets close to two million image uploads per day, which
>>>>> are usually stored on regular servers (we have about 700) as
>>>>> regular files, and each server has its own host name, such as
>>>>> img55. I've been researching how to improve our backend design in
>>>>> terms of data safety and stumbled onto the HBase project.
>>>>>
>>>>> We have been running Hadoop for access-log analysis for a while
>>>>> now, quite successfully. We receive about 2 billion hits per day
>>>>> and store all of that data in RCFiles (attribution to Facebook
>>>>> applies here), which are loadable into Hive (thanks to FB again).
>>>>> So we know how to manage HDFS and run MapReduce jobs.
>>>>>
>>>>> Now, I think HBase is the most beautiful thing that has happened to
>>>>> the distributed-DB world :). The idea is to store image files
>>>>> (about 400 KB on average) in HBase. The setup will include the
>>>>> following configuration:
>>>>>
>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual-core CPUs,
>>>>> and 6 x 2 TB disks each.
>>>>> 3 to 5 ZooKeepers
>>>>> 2 Masters (one in each datacenter)
>>>>> 10 to 20 Stargate REST instances (one per server, hash
>>>>> load-balanced)
>>>>> 40 to 50 RegionServers (will probably keep the masters on separate
>>>>> dedicated boxes)
>>>>> 2 Namenode servers (one backup, highly available; will do fsimage
>>>>> and edits snapshots also)
>>>>>
>>>>> So far I have about 13 servers running, doing about 20 insertions
>>>>> per second (file sizes ranging from a few KB to 2-3 MB, averaging
>>>>> 400 KB) via the Stargate API. Our frontend servers receive files,
>>>>> and I just fork-insert them into Stargate via HTTP (curl).
>>>>> The inserts are humming along nicely, without any noticeable load
>>>>> on the regionservers; so far I have inserted about 2 TB worth of
>>>>> images.
>>>>> I have adjusted the region file size to 512 MB and the table block
>>>>> size to about 400 KB, trying to match the average accessed object
>>>>> size in order to limit HDFS trips. So far the read performance has
>>>>> been more than adequate, and of course write performance is nowhere
>>>>> near capacity.
>>>>> So right now all newly uploaded images go to HBase. But we do plan
>>>>> to insert about 170 million images (about 100 days' worth), which
>>>>> is only about 64 TB, or 10% of the planned cluster size of 600 TB.
>>>>> The end goal is to have a storage system that provides data safety,
>>>>> i.e., the system may go down, but data cannot be lost. Our
>>>>> front-end servers will continue to serve images from their own
>>>>> file systems (we serve about 16 Gbit/s at peak); however, should we
>>>>> need to bring any of them down for maintenance, we will redirect
>>>>> their traffic to HBase (should be no more than a few hundred Mbps)
>>>>> while the front-end server is repaired (for example, having a disk
>>>>> replaced). After the repairs, we quickly repopulate it with the
>>>>> missing files while serving the remainder off HBase.
>>>>> All in all, it should be a very interesting project, and I am
>>>>> hoping not to run into any snags; however, should that happen, I am
>>>>> pleased to know that such a great and vibrant tech group exists
>>>>> that supports and uses HBase :).
>>>>>
>>>>> -Jack
>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Alexey Kovyrin
> http://kovyrin.net/
>
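[Editor's note: for anyone reproducing the setup Jack describes, the fork-insert via curl might look roughly like the sketch below. The Stargate host, port (8080 was its common default), table name 'images', and column 'att:data' are assumptions drawn from the schema quoted in the thread, not from any configuration Jack posted.]

```shell
#!/bin/sh
# Rough sketch of a curl fork-insert into Stargate/REST.
# Host, port, and table/column names are assumptions.
FILE="$1"                       # local image file to upload
ROW=$(basename "$FILE")         # row key = filename, per the schema
curl -s -X PUT \
  -H "Content-Type: application/octet-stream" \
  --data-binary @"$FILE" \
  "http://stargate-host:8080/images/$ROW/att:data" &
# The trailing '&' forks the upload so the frontend isn't blocked.
```

Writing to two datacenters ("client replication," as Todd suggests) would amount to issuing the same PUT against an endpoint in each cluster.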
