Yes, that is the new ZK-based coordination.  When I publish the SU code
you'll get a patch which limits that and is faster.  2GB is a little
small for regionserver memory... in my ideal world we'd be giving 20GB+
of RAM to each regionserver.

I just figured you were using the DEB/RPMs because your files were in
/usr/local... I usually run everything out of /home/hadoop b/c it
allows me to easily rsync as user hadoop.

but you are on the right track yes :-)
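
fwiw, that heap setting lives in conf/hbase-env.sh; a minimal sketch,
with 20000 MB just echoing the "ideal world" figure above:

  # conf/hbase-env.sh
  # Heap size, in MB, for the HBase daemons launched from this install
  # (master and regionserver alike, unless you override per daemon).
  export HBASE_HEAPSIZE=20000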

On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <[email protected]> wrote:
> Who said anything about deb :). I do use tarballs.... Yes, so what did
> it was copying that jar into hbase/lib, and then a full restart.
> Now here is a funny thing: the master shuddered for about 10 minutes,
> spewing these messages:
>
> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Event NodeCreated with state SyncConnected with path
> /hbase/UNASSIGNED/97999366
> 2010-09-20 21:23:45,827 DEBUG
> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
> NodeCreated with path /hbase/UNASSIGNED/97999366
> 2010-09-20 21:23:45,827 DEBUG
> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
> Got zkEvent NodeCreated state:SyncConnected
> path:/hbase/UNASSIGNED/97999366
> 2010-09-20 21:23:45,827 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Created/updated
> UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state
> M2ZK_REGION_OFFLINE
> 2010-09-20 21:23:45,828 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation:
> img13,p1000319tq.jpg,1284952655960.812544765 open on
> 10.103.2.3,60020,1285042333293
> 2010-09-20 21:23:45,828 DEBUG
> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [
> M2ZK_REGION_OFFLINE ] for region 97999366
> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster:
> Event NodeChildrenChanged with state SyncConnected with path
> /hbase/UNASSIGNED
> 2010-09-20 21:23:45,828 DEBUG
> org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event
> NodeChildrenChanged with path /hbase/UNASSIGNED
> 2010-09-20 21:23:45,828 DEBUG
> org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS:
> Got zkEvent NodeChildrenChanged state:SyncConnected
> path:/hbase/UNASSIGNED
> 2010-09-20 21:23:45,830 DEBUG
> org.apache.hadoop.hbase.master.BaseScanner: Current assignment of
> img150,,1284859678248.3116007 is not valid;
> serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>
>
> Does anyone know what they mean?   At first it would kill one of my
> datanodes.  But what helped was changing the heap size to 4GB for the
> master and 2GB for the datanode that was dying, and after 10 minutes I
> got into a clean state.
>
> -Jack
>
>
> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <[email protected]> wrote:
>> yes, on every single machine as well, and restart.
>>
>> again, not sure how you'd do this in a scalable manner with your
>> deb packages... on the source tarball you can just replace it, rsync
>> it out and done.
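>>
>> Roughly, assuming a /home/hadoop layout like mine (the paths below are
>> just examples, adjust them to your install):
>>
>>   # drop the hadoop-core jar your cluster actually runs over the one
>>   # hbase ships with
>>   rm /home/hadoop/hbase/lib/hadoop-core-*.jar
>>   cp /home/hadoop/hadoop-0.20.2+320/hadoop-core-0.20.2+320.jar \
>>      /home/hadoop/hbase/lib/
>>   # push it to every node as user hadoop, then restart hbase everywhere
>>   for h in $(cat /home/hadoop/hbase/conf/regionservers); do
>>     rsync -a /home/hadoop/hbase/lib/ $h:/home/hadoop/hbase/lib/
>>   done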
>>
>> :-)
>>
>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <[email protected]> wrote:
>>> OK, I found that file.  Do I replace hadoop-core.*.jar under
>>> /usr/lib/hbase/lib, then restart, etc.?  All regionservers too?
>>>
>>> -Jack
>>>
>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <[email protected]> wrote:
>>>> Well, I don't really run CDH; I disagree with their rpm/deb packaging
>>>> policies, and I strongly recommend against using DEBs to install the
>>>> software...
>>>>
>>>> So normally, installing from the tarball, the jar is at
>>>> <installpath>/hadoop-0.20.2+320/hadoop-core-0.20.2+320.jar
>>>>
>>>> On the CDH/DEB edition it's somewhere silly... locate and find will be
>>>> your friend.  It should be called hadoop-core-0.20.2+320.jar though!
>>>>
>>>> I'm working on a github publish of SU's production system.  It uses
>>>> the Cloudera maven repo to pull the correct Hadoop JAR into hbase, so
>>>> when you run 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz
>>>> (the * being whatever version you specified in pom.xml), the cdh3b2 jar
>>>> comes pre-packaged.
>>>>
>>>> Stay tuned :-)
>>>>
>>>> -ryan
>>>>
>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <[email protected]> wrote:
>>>>> Ryan, about the hadoop jar: what is the usual path to the file?  I just
>>>>> want to be sure.  And where do I put it?
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <[email protected]> wrote:
>>>>>> you need 2 more things:
>>>>>>
>>>>>> - restart hdfs
>>>>>> - make sure the hadoop jar from your install replaces the one we ship 
>>>>>> with
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <[email protected]> wrote:
>>>>>>> So, I switched to 0.89, and we already had CDH3
>>>>>>> (hadoop-0.20-datanode-0.20.2+320-3.noarch).  Even though I added
>>>>>>> dfs.support.append set to true in both hdfs-site.xml and hbase-site.xml
>>>>>>> (the stanza is pasted below the report), the master still reports this:
>>>>>>>
>>>>>>>  You are currently running the HMaster without HDFS append support
>>>>>>> enabled. This may result in data loss. Please see the HBase wiki  for
>>>>>>> details.
>>>>>>> Master Attributes
>>>>>>> Attribute Name        Value                                        Description
>>>>>>> HBase Version         0.89.20100726, r979826                       HBase version and svn revision
>>>>>>> HBase Compiled        Sat Jul 31 02:01:58 PDT 2010, stack          When HBase version was compiled and by whom
>>>>>>> Hadoop Version        0.20.2, r911707                              Hadoop version and svn revision
>>>>>>> Hadoop Compiled       Fri Feb 19 08:07:34 UTC 2010, chrisdo        When Hadoop version was compiled and by whom
>>>>>>> HBase Root Directory  hdfs://namenode-rd.imageshack.us:9000/hbase  Location of HBase home directory
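>>>>>>>
>>>>>>> For reference, the stanza I added to both hdfs-site.xml and
>>>>>>> hbase-site.xml looks like this (standard Hadoop config form):
>>>>>>>
>>>>>>>   <property>
>>>>>>>     <name>dfs.support.append</name>
>>>>>>>     <value>true</value>
>>>>>>>   </property>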
>>>>>>>
>>>>>>> Any ideas what's wrong?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89
>>>>>>>> release, which is based on 'trunk'.  We have snapshotted a series of
>>>>>>>> 0.89 "developer releases" in hopes that people would try them out and
>>>>>>>> start thinking about the next major version.  One of these is what SU
>>>>>>>> is running prod on.
>>>>>>>>
>>>>>>>> At this point, tracking 0.89 and which ones are the 'best' patch sets
>>>>>>>> to run is a bit of a contact sport, but if you are serious about not
>>>>>>>> losing data it is worthwhile.  SU is based on the most recent DR with
>>>>>>>> a few minor patches of our own concoction brought in.  The current DR
>>>>>>>> works, but some Master ops are slow, and there are a few patches on
>>>>>>>> top of that.  I'll poke about and see if it's possible to publish to a
>>>>>>>> github branch or something.
>>>>>>>>
>>>>>>>> -ryan
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>> Sounds good; the only reason I ask is because of this:
>>>>>>>>>
>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>
>>>>>>>>>    * 0.20 - the current stable release series, being maintained with
>>>>>>>>> patches for bug fixes only. This release series does not support HDFS
>>>>>>>>> durability - edits may be lost in the case of node failure.
>>>>>>>>>    * 0.89 - a development release series with active feature and
>>>>>>>>> stability development, not currently recommended for production use.
>>>>>>>>> This release does support HDFS durability - cases in which edits are
>>>>>>>>> lost are considered serious bugs.
>>>>>>>>>
>>>>>>>>> Are we talking about data loss in the case of a datanode going down
>>>>>>>>> while being written to, or a RegionServer going down?
>>>>>>>>>
>>>>>>>>> -jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <[email protected]> 
>>>>>>>>> wrote:
>>>>>>>>>> We run 0.89 in production @ Stumbleupon.  We also employ 3 
>>>>>>>>>> committers...
>>>>>>>>>>
>>>>>>>>>> As for safety, you have no choice but to run 0.89.  If you run a 0.20
>>>>>>>>>> release you will lose data.  You must be on 0.89 and
>>>>>>>>>> CDH3/append-branch to achieve data durability, and there really is no
>>>>>>>>>> argument around it.  If you are doing your tests with 0.20.6 now, I'd
>>>>>>>>>> stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <[email protected]> 
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <[email protected]> wrote:
>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>
>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <[email protected]> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which
>>>>>>>>>>>>> are usually stored on regular servers (we have about 700), as regular
>>>>>>>>>>>>> files, and each server has its own host name, such as (img55).  I've
>>>>>>>>>>>>> been researching how to improve our backend design in terms of data
>>>>>>>>>>>>> safety and stumbled onto the HBase project.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>
>>>>>>>>>>> Latency is the second requirement.  We have some services that are
>>>>>>>>>>> very short-tail and can produce a 95% cache hit rate, so I assume this
>>>>>>>>>>> would really put the cache to good use.  Some other services, however,
>>>>>>>>>>> have about a 25% cache hit ratio, in which case the latency should be
>>>>>>>>>>> 'adequate', e.g. if it's only slightly worse than getting data off raw
>>>>>>>>>>> disk, then it's good enough.  Safety is supremely important, then
>>>>>>>>>>> availability, then speed.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> Now, I think HBase is the most beautiful thing that has happened to
>>>>>>>>>>>>> the distributed DB world :).  The idea is to store image files (about
>>>>>>>>>>>>> 400KB on average) in HBase.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'd guess some images are much bigger than this.  Do you ever limit
>>>>>>>>>>>> the size of images folks can upload to your service?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The setup will include the following
>>>>>>>>>>>>> configuration:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual core cpu, 6 
>>>>>>>>>>>>> x
>>>>>>>>>>>>> 2TB disks each.
>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>> 2 Masters (in a datacenter each)
>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash 
>>>>>>>>>>>>> loadbalanced)
>>>>>>>>>>>>
>>>>>>>>>>>> What's your frontend?  Why REST?  It might be more efficient if you
>>>>>>>>>>>> could run with thrift given REST base64s its payload IIRC (check 
>>>>>>>>>>>> the
>>>>>>>>>>>> src yourself).
>>>>>>>>>>>
>>>>>>>>>>> For insertion we use Haproxy and balance curl PUTs across multiple
>>>>>>>>>>> REST APIs.
>>>>>>>>>>> For reading, it's an nginx proxy that does Content-Type modification
>>>>>>>>>>> from image/jpeg to octet-stream, and vice versa; it then hits Haproxy
>>>>>>>>>>> again, which hits the balanced REST instances.
>>>>>>>>>>> Why REST?  It was the simplest thing to run, given that it supports
>>>>>>>>>>> HTTP.  Potentially we could rewrite something for thrift, as long as we
>>>>>>>>>>> can still use HTTP to send and receive data (has anyone written
>>>>>>>>>>> anything like that, say in Python, C or Java?)
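>>>>>>>>>>>
>>>>>>>>>>> For the curious, the write side is basically one raw PUT per image,
>>>>>>>>>>> something like this (the table and column names are placeholders, not
>>>>>>>>>>> our real schema, and resthost:8080 stands in for one of the
>>>>>>>>>>> load-balanced REST instances):
>>>>>>>>>>>
>>>>>>>>>>>   curl -X PUT -H 'Content-Type: application/octet-stream' \
>>>>>>>>>>>        --data-binary @normal052q.jpg \
>>>>>>>>>>>        http://resthost:8080/images/normal052q.jpg/image:data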
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on 
>>>>>>>>>>>>> dedicated boxes).
>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available, will do fsimage 
>>>>>>>>>>>>> and
>>>>>>>>>>>>> edits snapshots also)
>>>>>>>>>>>>>
>>>>>>>>>>>>> So far I've got about 13 servers running and am doing about 20
>>>>>>>>>>>>> insertions/second (file sizes ranging from a few KB to 2-3MB, avg.
>>>>>>>>>>>>> 400KB) via the Stargate API.  Our frontend servers receive files, and
>>>>>>>>>>>>> I just fork-insert them into Stargate via http (curl).
>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on
>>>>>>>>>>>>> the regionservers; so far I've inserted about 2 TB worth of images.
>>>>>>>>>>>>> I have adjusted the region file size to be 512MB, and the table block
>>>>>>>>>>>>> size to about 400KB, trying to match the average access block size to
>>>>>>>>>>>>> limit HDFS trips.
>>>>>>>>>>>>
>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least.  You'll
>>>>>>>>>>>> probably want to up your flush size from 64MB to 128MB or maybe 
>>>>>>>>>>>> 192MB.
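>>>>>>>>>>>>
>>>>>>>>>>>> That is, in hbase-site.xml, something along these lines (values in
>>>>>>>>>>>> bytes, purely illustrative):
>>>>>>>>>>>>
>>>>>>>>>>>>   <property>
>>>>>>>>>>>>     <name>hbase.hregion.max.filesize</name>
>>>>>>>>>>>>     <value>1073741824</value>  <!-- 1G regions -->
>>>>>>>>>>>>   </property>
>>>>>>>>>>>>   <property>
>>>>>>>>>>>>     <name>hbase.hregion.memstore.flush.size</name>
>>>>>>>>>>>>     <value>134217728</value>   <!-- 128MB flushes -->
>>>>>>>>>>>>   </property>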
>>>>>>>>>>>
>>>>>>>>>>> Yep, I will adjust to 1G.  I thought the flush was controlled by a
>>>>>>>>>>> function of the memstore HEAP, something like 40%?  Or are you talking
>>>>>>>>>>> about HDFS block size?
>>>>>>>>>>>
>>>>>>>>>>>>  So far the read performance was more than adequate, and of
>>>>>>>>>>>>> course write performance is nowhere near capacity.
>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE.  But we do 
>>>>>>>>>>>>> plan
>>>>>>>>>>>>> to insert about 170 Million images (about 100 days worth), which 
>>>>>>>>>>>>> is
>>>>>>>>>>>>> only about 64 TB, or 10% of planned cluster size of 600TB.
>>>>>>>>>>>>> The end goal is to have a storage system that provides data safety,
>>>>>>>>>>>>> e.g. the system may go down but data cannot be lost.  Our front-end
>>>>>>>>>>>>> servers will continue to serve images from their own file systems (we
>>>>>>>>>>>>> are serving about 16 Gbits at peak); however, should we need to bring
>>>>>>>>>>>>> any of those down for maintenance, we will redirect their traffic to
>>>>>>>>>>>>> HBase (should be no more than a few hundred Mbps) while the front-end
>>>>>>>>>>>>> server is repaired (for example, having its disk replaced).  After
>>>>>>>>>>>>> the repairs, we quickly repopulate it with the missing files, while
>>>>>>>>>>>>> serving the remaining missing ones off HBase.
>>>>>>>>>>>>> All in all it should be a very interesting project, and I am hoping
>>>>>>>>>>>>> not to run into any snags; however, should that happen, I am pleased
>>>>>>>>>>>>> to know that such a great and vibrant tech group exists that supports
>>>>>>>>>>>>> and uses HBase :).
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> We're definitely interested in how your project progresses.  If you
>>>>>>>>>>>> are ever up in the city, you should drop by for a chat.
>>>>>>>>>>>
>>>>>>>>>>> Cool.  I'd like that.
>>>>>>>>>>>
>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>
>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>> P.P.S I updated the wiki on stargate REST:
>>>>>>>>>>>> http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>
>>>>>>>>>>> Cool, I assume that if we move to it, it won't kill the existing meta
>>>>>>>>>>> tables and data?  e.g. is it cross-compatible?
>>>>>>>>>>> Is 0.89 ready for a production environment?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
