You'll need a major compaction to generate the blooms on the existing data.
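For anyone following along, the alter-and-compact sequence being discussed might look roughly like this in an HBase shell of that era (the table name 'images' and family 'att' are placeholders taken from the schema Jack posted below, not commands from the thread):

```shell
# In the HBase shell: disable the table, turn on row-key blooms for the
# column family, re-enable it, then trigger a major compaction so the
# blooms get built for the already-written store files.
disable 'images'
alter 'images', {NAME => 'att', BLOOMFILTER => 'ROW'}
enable 'images'
major_compact 'images'
```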



On Mon, Sep 20, 2010 at 2:15 PM, Alexey Kovyrin <[email protected]> wrote:
> When one enables blooms for rows, is major compaction or something
> else required (aside from enabling the table after the alter)?
>
> On Mon, Sep 20, 2010 at 5:06 PM, Todd Lipcon <[email protected]> wrote:
>> On Mon, Sep 20, 2010 at 1:13 PM, Jack Levin <[email protected]> wrote:
>>> Todd, I could not get Stargate to work on 0.89 for some reason; that's
>>> why we are running 0.20.6.  Also, regarding bloom filters, I
>>> thought they were mainly for column seeking; in our case we have this
>>> schema:
>>>
>>> row           att:data
>>> filename    file_data
>>>
>>
>> The bloom filters work on either a ROW basis or a ROW_COL basis. If
>> you turn on row-key blooms, then your get of a particular filename
>> will avoid looking in the storefiles that don't have any data for that
>> row.
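A sketch of the two settings in HBase shell terms (table and family names are placeholders; note the shell spells the second option ROWCOL):

```shell
# ROW: bloom keyed on the row key alone -- the right choice when each
# get fetches a whole row, as with the filename -> file_data schema here.
alter 'images', {NAME => 'att', BLOOMFILTER => 'ROW'}

# ROWCOL (the ROW_COL basis mentioned above): bloom keyed on row key plus
# column qualifier -- only pays off when gets target specific columns
# inside wide rows.
alter 'images', {NAME => 'att', BLOOMFILTER => 'ROWCOL'}
```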
>>
>> Regarding stargate in 0.89, it's been renamed to "rest" since the old
>> rest server got removed. I haven't used it much but hopefully someone
>> can give you a pointer (or even better, update the wiki/docs!)
>>
>> -Todd
>>
>>>
>>> -Jack
>>>
>>> On Mon, Sep 20, 2010 at 11:53 AM, Todd Lipcon <[email protected]> wrote:
>>>> Hey Jack,
>>>>
>>>> This sounds like a very exciting project! A few thoughts that might help 
>>>> you:
>>>> - Check out the Bloom filter support that is in the 0.89 series. It
>>>> sounds like all of your access is going to be random key gets - adding
>>>> blooms will save you lots of disk seeks.
>>>> - I might even bump the region size up to 1G or more given the planned 
>>>> capacity.
>>>> - The "HA" setup will be tricky - we don't have a great HA story yet.
>>>> Given you have two DCs, you may want to consider running separate
>>>> HBase clusters, one in each, and either using the new replication
>>>> support, or simply doing "client replication" by writing all images to
>>>> both.
>>>>
>>>> Good luck with the project, and keep us posted how it goes.
>>>>
>>>> Thanks
>>>> -Todd
>>>>
>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <[email protected]> wrote:
>>>>>
>>>>> Greetings all.  My name is Jack and I work for an image hosting
>>>>> company, ImageShack; we also have a property that's widely used as a
>>>>> Twitter app called yfrog (yfrog.com).
>>>>>
>>>>> ImageShack gets close to two million image uploads per day, which are
>>>>> usually stored on regular servers (we have about 700) as regular
>>>>> files, and each server has its own hostname, such as img55.  I've
>>>>> been researching how to improve our backend design in terms of data
>>>>> safety and stumbled onto the HBase project.
>>>>>
>>>>> We have been running Hadoop for data access log analysis for a while
>>>>> now, quite successfully.  We receive about 2 billion hits per
>>>>> day and store all of that data in RCFiles (attribution to Facebook
>>>>> applies here), which are loadable into Hive (thanks to FB again).  So
>>>>> we know how to manage HDFS and run MapReduce jobs.
>>>>>
>>>>> Now, I think HBase is the most beautiful thing that has happened to
>>>>> the distributed DB world :).  The idea is to store image files (about
>>>>> 400KB on average) in HBase.  The setup will include the following
>>>>> configuration:
>>>>>
>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x
>>>>> 2TB disks each.
>>>>> 3 to 5 Zookeepers
>>>>> 2 Masters (in a datacenter each)
>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated 
>>>>> boxes).
>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>> edits snapshots also)
>>>>>
>>>>> So far I have about 13 servers running, doing about 20 insertions per
>>>>> second (file sizes ranging from a few KB to 2-3MB, avg. 400KB) via the
>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>> fork-insert them into Stargate via HTTP (curl).
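A rough sketch of what such a fork-insert could look like; the host, port, table, and column names are assumptions, not details from the thread. The Stargate/REST cell-write endpoint has the shape PUT /table/row/column, and accepts the raw bytes with an application/octet-stream content type:

```shell
# Build the Stargate REST URL for one image file, using the bare
# filename as the row key (host, port, table, and column are placeholders).
stargate_url() {
  file="$1"
  printf 'http://stargate-host:8080/images/%s/att:data' "$(basename "$file")"
}

# Fork one background curl PUT per file; the image bytes go as the raw
# request body.
insert_image() {
  curl -s -X PUT -H 'Content-Type: application/octet-stream' \
       --data-binary @"$1" "$(stargate_url "$1")" &
}
```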
>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>> the regionservers; so far we have inserted about 2 TB worth of images.
>>>>> I have adjusted the region file size to 512MB, and the table block size
>>>>> to about 400KB, trying to match the average access block size to limit
>>>>> HDFS trips.  So far read performance has been more than adequate, and of
>>>>> course write performance is nowhere near capacity.
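The two tunings mentioned above might be expressed roughly as follows; the table and family names are placeholders, and the region size is a site-wide setting rather than a shell command:

```shell
# Region size goes in hbase-site.xml (512MB = 536870912 bytes):
#   <property>
#     <name>hbase.hregion.max.filesize</name>
#     <value>536870912</value>
#   </property>

# Column-family block size set near the ~400KB average object size
# (409600 bytes), so a typical get touches about one HFile block:
alter 'images', {NAME => 'att', BLOCKSIZE => 409600}
```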
>>>>> So right now, all newly uploaded images go to HBase.  But we do plan
>>>>> to insert about 170 million images (about 100 days' worth), which is
>>>>> only about 64 TB, or 10% of the planned cluster size of 600TB.
>>>>> The end goal is a storage system that provides data safety,
>>>>> i.e. the system may go down but data cannot be lost.  Our front-end
>>>>> servers will continue to serve images from their own file systems (we
>>>>> are serving about 16 Gbit/s at peak), but should we need to bring
>>>>> any of them down for maintenance, we will redirect all their traffic to
>>>>> HBase (should be no more than a few hundred Mbps) while the front-end
>>>>> server is repaired (for example, having a disk replaced).  After the
>>>>> repairs, we quickly repopulate it with the missing files, while serving
>>>>> the remaining missing ones off HBase.
>>>>> All in all, it should be a very interesting project, and I am hoping not
>>>>> to run into any snags; however, should that happen, I am pleased to know
>>>>> that such a great and vibrant tech group exists that supports and uses
>>>>> HBase :).
>>>>>
>>>>> -Jack
>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Alexey Kovyrin
> http://kovyrin.net/
>
