Re: Millions of photos into Hbase

Ryan Rawson Mon, 20 Sep 2010 22:16:38 -0700

So we are running this code in production:

http://github.com/stumbleupon/hbase


The branch off point is 8dc5a1a353ffc9fa57ac59618f76928b5eb31f6c, and
everything past that is our rebase and cherry-picked changes.

We use git to manage this internally, and don't use svn.  Included are
the LZO libraries we use, checked directly into the code, and the
assembly changes to publish them.

So when we are ready to do a deploy, we do this:
mvn install assembly:assembly
(or add -DskipTests to make it go faster)

and then we have a new tarball to deploy.
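
As a rough sketch of that build step (an illustration only, not our
actual deploy tooling): the Python below just assembles the Maven
command line shown above, with -DskipTests as the optional flag.  The
helper names build_command and build_tarball, and the checkout_dir
argument, are made up for this example.

```python
import subprocess


def build_command(skip_tests=False):
    """Return the Maven invocation from this message as an argument list."""
    cmd = ["mvn", "install", "assembly:assembly"]
    if skip_tests:
        # -DskipTests skips running the unit tests, which shortens the build.
        cmd.append("-DskipTests")
    return cmd


def build_tarball(checkout_dir, skip_tests=False):
    """Run the build inside a local checkout (hypothetical helper).

    Raises subprocess.CalledProcessError if Maven exits non-zero.
    """
    subprocess.run(build_command(skip_tests), cwd=checkout_dir, check=True)


# Show the two command lines without actually running Maven.
print(" ".join(build_command()))
print(" ".join(build_command(skip_tests=True)))
```

Calling build_tarball("/path/to/checkout") would run the build in that
directory; the path is a placeholder.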

Note there is absolutely NO warranty here, not even that it will run
for a microsecond... furthermore this is NOT an ASF release, just a
courtesy.  If there ever were to be a release it would look
different, because ASF releases can't include GPL code or depend on
commercial releases of Hadoop (this does both).

Enjoy,
-ryan

On Mon, Sep 20, 2010 at 9:57 PM, Ryan Rawson <[email protected]> wrote:
> no no, 20 GB heap per node.  each node with 24-32gb ram, etc.
>
> we can't rely on the linux buffer cache to save us, so we have to cache
> in hbase ram.
>
> :-)
>
> -ryan
>
> On Mon, Sep 20, 2010 at 9:44 PM, Jack Levin <[email protected]> wrote:
>> 20GB+?, hmmm..... I do plan to run 50 regionserver nodes though, with
>> 3 GB Heap likely, this should be plenty to rip through say, 350TB of
>> data.
>>
>> -Jack
>>
>> On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson <[email protected]> wrote:
>>> yes that is the new ZK based coordination.  when i publish the SU code
>>> we have a patch which limits that and is faster.  2GB is a little
>>> small for a regionserver memory... in my ideal world we'll be putting
>>> 20GB+ of ram to regionserver.
>>>
>>> I just figured you were using the DEB/RPMs because your files were in
>>> /usr/local... I usually run everything out of /home/hadoop b/c it
>>> allows me to easily rsync as user hadoop.
>>>
>>> but you are on the right track yes :-)
>>>
>>> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <[email protected]> wrote:
>>>> Who said anything about deb :). I do use tarballs.... Yes, so what did
>>>> it was copying that jar to hbase/lib, and then a full restart.
>>>> Now here is a funny thing: the master shuddered for about 10 minutes,
>>>> spewing these messages:
>>>>
>>>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>> Event NodeCreated with state SyncConnected with path
>>>> /hbase/UNASSIGNED/97999366
>>>> 2010-09-20 21:23:45,827 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>> NodeCreated with path /hbase/UNASSIGNED/97999366
>>>> 2010-09-20 21:23:45,827 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>> Got zkEvent NodeCreated state:SyncConnected
>>>> path:/hbase/UNASSIGNED/97999366
>>>> 2010-09-20 21:23:45,827 DEBUG
>>>> org.apache.hadoop.hbase.master.RegionManager: Created/updated
>>>> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
>>>> M2ZK_REGION_OFFLINE
>>>> 2010-09-20 21:23:45,828 INFO
>>>> org.apache.hadoop.hbase.master.RegionServerOperation:
>>>> img13,p1000319tq.jpg,1284952655960.812544765 open on
>>>> 10.103.2.3,60020,1285042333293
>>>> 2010-09-20 21:23:45,828 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
>>>> M2ZK_REGION_OFFLINE ] for region 97999366
>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>> Event NodeChildrenChanged with state SyncConnected with path
>>>> /hbase/UNASSIGNED
>>>> 2010-09-20 21:23:45,828 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
>>>> NodeChildrenChanged with path /hbase/UNASSIGNED
>>>> 2010-09-20 21:23:45,828 DEBUG
>>>> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
>>>> Got zkEvent NodeChildrenChanged state:SyncConnected
>>>> path:/hbase/UNASSIGNED
>>>> 2010-09-20 21:23:45,830 DEBUG
>>>> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
>>>> img150,,1284859678248.3116007 is not valid;
>>>> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>>>
>>>>
>>>> Does anyone know what they mean?   At first it would kill one of my
>>>> datanodes.  But what helped was changing the heap size to 4GB for the
>>>> master and 2GB for the datanode that was dying, and after 10 minutes I
>>>> got into a clean state.
>>>>
>>>> -Jack
>>>>
>>>>
>>>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <[email protected]> wrote:
>>>>> yes, on every single machine as well, and restart.
>>>>>
>>>>> again, not sure how you'd do this in a scalable manner with your
>>>>> deb packages... on the source tarball you can just replace it, rsync
>>>>> it out and done.
>>>>>
>>>>> :-)
>>>>>
>>>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <[email protected]> wrote:
>>>>>> ok, I found that file, do I replace hadoop-core.*.jar under
>>>>>> /usr/lib/hbase/lib?  Then restart, etc?  All regionservers too?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>> Well I don't really run CDH, I disagree with their rpm/deb packaging
>>>>>>> policies and I have to highly recommend not using DEBs to install
>>>>>>> software...
>>>>>>>
>>>>>>> So normally installing from tarball, the jar is in
>>>>>>> <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>>>
>>>>>>> On CDH/DEB edition, it's somewhere silly ... locate and find will be
>>>>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>>>>
>>>>>>> I'm working on a github publish of SU's production system, which uses
>>>>>>> the cloudera maven repo to install the correct JAR in hbase so when
>>>>>>> you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>>>>> (the * being whatever version you specified in pom.xml) the cdh3b2 jar
>>>>>>> comes pre-packaged.
>>>>>>>
>>>>>>> Stay tuned :-)
>>>>>>>
>>>>>>> -ryan
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <[email protected]> wrote:
>>>>>>>> Ryan, hadoop jar, what is the usual path to the file? I just want to be
>>>>>>>> sure, and where do I put it?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>> you need 2 more things:
>>>>>>>>>
>>>>>>>>> - restart hdfs
>>>>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch), even though I added
>>>>>>>>>>  <name>dfs.support.append</name> as true to both hdfs-site.xml and
>>>>>>>>>> hbase-site.xml, the master still reports this:
>>>>>>>>>>
>>>>>>>>>> You are currently running the HMaster without HDFS append support
>>>>>>>>>> enabled. This may result in data loss. Please see the HBase wiki for
>>>>>>>>>> details.
>>>>>>>>>>
>>>>>>>>>> Master Attributes
>>>>>>>>>> Attribute Name        Value                                        Description
>>>>>>>>>> HBase Version         0.89.20100726, r979826                       HBase version and svn revision
>>>>>>>>>> HBase Compiled        Sat Jul 31 02:01:58 PDT 2010, stack          When HBase version was compiled and by whom
>>>>>>>>>> Hadoop Version        0.20.2, r911707                              Hadoop version and svn revision
>>>>>>>>>> Hadoop Compiled       Fri Feb 19 08:07:34 UTC 2010, chrisdo        When Hadoop version was compiled and by whom
>>>>>>>>>> HBase Root Directory  hdfs://namenode-rd.imageshack.us:9000/hbase  Location of HBase home directory
>>>>>>>>>>
>>>>>>>>>> Any ideas what's wrong?
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>>>>> 0.89 "developer releases" in hopes that people would try them out and
>>>>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>>>>> is running prod on.
>>>>>>>>>>>
>>>>>>>>>>> At this point tracking 0.89 and which ones are the 'best' patch sets
>>>>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>>>>> a few minor patches of our own concoction brought in.  It currently
>>>>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>>>>> top of that.  I'll poke about and see if it's possible to publish to a
>>>>>>>>>>> github branch or something.
>>>>>>>>>>>
>>>>>>>>>>> -ryan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>> Sounds good, the only reason I ask is because of this:
>>>>>>>>>>>>
>>>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>>>
>>>>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>>>>>      patches for bug fixes only. This release series does not support
>>>>>>>>>>>>      HDFS durability - edits may be lost in the case of node failure.
>>>>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>>>>>      stability development, not currently recommended for production
>>>>>>>>>>>>      use. This release does support HDFS durability - cases in which
>>>>>>>>>>>>      edits are lost are considered serious bugs.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Are we talking about data loss in case of a datanode going down while
>>>>>>>>>>>> being written to, or a RegionServer going down?
>>>>>>>>>>>>
>>>>>>>>>>>> -jack
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 committers...
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>>>>> release you will lose data.  You must be on 0.89 and
>>>>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <[email protected]> wrote:
>>>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are
>>>>>>>>>>>>>>>> usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).  I've
>>>>>>>>>>>>>>>> been researching how to improve our backend design in terms of data
>>>>>>>>>>>>>>>> safety and stumbled onto the HBase project.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>>>>> very short tail, and can produce a 95% cache hit rate, so I assume this
>>>>>>>>>>>>>> would really put cache into good use.  Some other services, however,
>>>>>>>>>>>>>> have about a 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>>>>> 'adequate', e.g. if it's slightly worse than getting data off raw disk,
>>>>>>>>>>>>>> then it's good enough.   Safety is supremely important, then
>>>>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now, I think HBase is the most beautiful thing that has happened to the
>>>>>>>>>>>>>>>> distributed DB world :).   The idea is to store image files (about
>>>>>>>>>>>>>>>> 400KB on average) in HBase.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The setup will include the following configuration:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 x 2TB disks each.
>>>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>>>> 2 Masters (one in each datacenter)
>>>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What's your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check the
>>>>>>>>>>>>>>> src yourself).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>>>>> For reading, it's an nginx proxy that does Content-type modification
>>>>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa,
>>>>>>>>>>>>>> it then hits Haproxy again, which hits balanced REST.
>>>>>>>>>>>>>> Why REST? It was the simplest thing to run, given that it supports
>>>>>>>>>>>>>> HTTP. Potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>>>>> can still use http to send and receive data (has anyone written
>>>>>>>>>>>>>> anything like that, say in Python, C or Java?)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage and
>>>>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So far I got about 13 servers running, and doing about 20 insertions /
>>>>>>>>>>>>>>>> second (file size ranging from a few KB to 2-3MB, avg. 400KB) via the
>>>>>>>>>>>>>>>> Stargate API.  Our frontend servers receive files, and I just
>>>>>>>>>>>>>>>> fork-insert them into stargate via http (curl).
>>>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>>>>> regionservers, so far inserted about 2 TB worth of images.
>>>>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and table block size
>>>>>>>>>>>>>>>> to about 400KB, trying to match average access block to limit HDFS
>>>>>>>>>>>>>>>> trips.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yep, I will adjust to 1G.  I thought flush was controlled by a
>>>>>>>>>>>>>> function of memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So far the read performance was more than adequate, and of
>>>>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBase.  But we do plan
>>>>>>>>>>>>>>>> to insert about 170 million images (about 100 days' worth), which is
>>>>>>>>>>>>>>>> only about 64 TB, or 10% of the planned cluster size of 600TB.
>>>>>>>>>>>>>>>> The end goal is to have a storage system that provides data safety,
>>>>>>>>>>>>>>>> e.g. the system may go down but data cannot be lost.   Our front-end
>>>>>>>>>>>>>>>> servers will continue to serve images from their own file system (we
>>>>>>>>>>>>>>>> are serving about 16 Gbit/s at peak), however should we need to bring
>>>>>>>>>>>>>>>> any of those down for maintenance, we will redirect all traffic to
>>>>>>>>>>>>>>>> HBase (should be no more than a few hundred Mbps) while the front-end
>>>>>>>>>>>>>>>> server is repaired (for example having its disk replaced). After the
>>>>>>>>>>>>>>>> repairs, we quickly repopulate it with missing files, while serving
>>>>>>>>>>>>>>>> the remaining missing files off HBase.
>>>>>>>>>>>>>>>> All in all it should be a very interesting project, and I am hoping
>>>>>>>>>>>>>>>> not to run into any snags; however, should that happen, I am pleased
>>>>>>>>>>>>>>>> to know that such a great and vibrant tech group exists that supports
>>>>>>>>>>>>>>>> and uses HBase :).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We're definitely interested in how your project progresses.  If you
>>>>>>>>>>>>>>> are ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>>>> P.P.S I updated the wiki on stargate REST:
>>>>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cool, I assume if we move to that it won't kill existing meta tables
>>>>>>>>>>>>>> and data?  i.e. cross compatible?
>>>>>>>>>>>>>> Is 0.89 ready for a production environment?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
