Todd, I could not get Stargate to work on 0.89 for some reason; that's why we are running 0.20.6. Also, regarding bloom filters, I thought they were mainly for column seeking. In our case we have this schema:
row  att:data  filename  file_data

-Jack

On Mon, Sep 20, 2010 at 11:53 AM, Todd Lipcon <[email protected]> wrote:
> Hey Jack,
>
> This sounds like a very exciting project! A few thoughts that might help you:
> - Check out the Bloom filter support that is in the 0.89 series. It
>   sounds like all of your access is going to be random key gets - adding
>   blooms will save you lots of disk seeks.
> - I might even bump the region size up to 1G or more given the planned
>   capacity.
> - The "HA" setup will be tricky - we don't have a great HA story yet.
>   Given you have two DCs, you may want to consider running separate
>   HBase clusters, one in each, and either using the new replication
>   support, or simply doing "client replication" by writing all images to
>   both.
>
> Good luck with the project, and keep us posted how it goes.
>
> Thanks
> -Todd
>
> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <[email protected]> wrote:
>>
>> Greetings all. My name is Jack and I work for an image hosting
>> company, ImageShack; we also have a property that's widely used as a
>> Twitter app, called yfrog (yfrog.com).
>>
>> ImageShack gets close to two million image uploads per day, which are
>> usually stored on regular servers (we have about 700) as regular
>> files, and each server has its own host name, such as (img55). I've
>> been researching how to improve our backend design in terms of data
>> safety and stumbled onto the HBase project.
>>
>> We have been running Hadoop for data access log analysis for a while
>> now, quite successfully. We are receiving about 2 billion hits per
>> day and store all of that data in RCFiles (attribution to Facebook
>> applies here), which are loadable into Hive (thanks to FB again). So
>> we know how to manage HDFS and run MapReduce jobs.
>>
>> Now, I think HBase is the most beautiful thing that has happened to the
>> distributed DB world :). The idea is to store image files (about
>> 400KB on average) into HBase.
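As a side note for readers of the archive: on the 0.89 series that Todd mentions, a row-level bloom filter is set per column family at table-creation time. A minimal sketch, assuming an illustrative table name `images` and family name `att` (neither name is confirmed by the thread), with the ~400KB block size Jack describes:

```shell
# Hypothetical DDL sketch for the HBase 0.89 shell; the table and family
# names are assumptions, not taken from the thread. A ROW bloom filter
# helps the random-key-get access pattern; BLOCKSIZE is in bytes
# (409600 ~= 400KB, matching the average image size mentioned above).
hbase shell <<'EOF'
create 'images', {NAME => 'att', BLOOMFILTER => 'ROW', BLOCKSIZE => 409600}
EOF
```

This is a config fragment that requires a running 0.89 cluster; the exact descriptor keys may differ slightly between releases.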
>> The setup will include the following configuration:
>>
>> 50 servers total (2 datacenters), with 8 GB RAM, dual-core CPUs, 6 x
>> 2TB disks each.
>> 3 to 5 ZooKeepers
>> 2 Masters (one in each datacenter)
>> 10 to 20 Stargate REST instances (one per server, hash load-balanced)
>> 40 to 50 RegionServers (will probably keep the masters separate on
>> dedicated boxes).
>> 2 Namenode servers (one backup, highly available; will do fsimage and
>> edits snapshots also)
>>
>> So far I have about 13 servers running, doing about 20 insertions /
>> second (file sizes ranging from a few KB to 2-3MB, avg. 400KB) via the
>> Stargate API. Our frontend servers receive files, and I just
>> fork-insert them into Stargate via HTTP (curl).
>> The inserts are humming along nicely, without any noticeable load on
>> the regionservers; so far I have inserted about 2 TB worth of images.
>> I have adjusted the region file size to 512MB, and the table block size
>> to about 400KB, trying to match the average access size to limit HDFS
>> trips. So far the read performance has been more than adequate, and of
>> course write performance is nowhere near capacity.
>> Right now, all newly uploaded images go to HBase. But we do plan
>> to insert about 170 million images (about 100 days' worth), which is
>> only about 64 TB, or 10% of the planned cluster size of 600TB.
>> The end goal is to have a storage system that provides data safety,
>> i.e. the system may go down but data cannot be lost. Our front-end
>> servers will continue to serve images from their own file systems (we
>> are serving about 16 Gbit/s at peak); however, should we need to bring
>> any of them down for maintenance, we will redirect all their traffic to
>> HBase (should be no more than a few hundred Mbps) while the front-end
>> server is repaired (for example, having a disk replaced). After the
>> repairs, we quickly repopulate it with the missing files, while serving
>> the remaining missing ones off HBase.
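The capacity figures in the message above can be sanity-checked with quick arithmetic (decimal units; the thread's "about 64 TB" is in the same ballpark):

```shell
# Back-of-envelope check of the numbers in the thread.
images=170000000   # ~170 million images (~100 days of uploads)
avg_kb=400         # average image size, in KB
# total KB divided by 1e9 gives TB (decimal)
total_tb=$(awk -v n="$images" -v kb="$avg_kb" 'BEGIN { printf "%d", n * kb / 1e9 }')
echo "raw: ${total_tb} TB"                        # 68 TB before replication
echo "at HDFS default 3x: $((total_tb * 3)) TB"   # 204 TB of raw disk
```

Note that the "10% of 600TB" figure in the thread counts raw image bytes only; at HDFS's default 3x replication the on-disk footprint is roughly triple that.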
>>
>> All in all it should be a very interesting project, and I am hoping not to
>> run into any snags; however, should that happen, I am pleased to know
>> that such a great and vibrant tech group exists that supports and uses
>> HBase :).
>>
>> -Jack
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
