No fat rows. We have kept the default HBase client limit of 10mb, and most values
are quite small (< 5k).

We haven't tried raising the memory limit; we can try raising it on one of the servers and see how it does. However, looking at the graphs, I don't think it will help... but it is worth a try.
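
If we do try it, I believe the heap can be raised per region server in conf/hbase-env.sh,
something like the following (the 12g value is just illustrative, not a recommendation):

    # conf/hbase-env.sh -- illustrative value, not a recommendation
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms12g -Xmx12g"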

~Jeff


On 10/30/2012 10:45 PM, ramkrishna vasudevan wrote:
Are you writing fat cells?

Did you try raising the heap size to see if it still crashes?

Regards
Ram

On Wed, Oct 31, 2012 at 6:10 AM, Jeff Whiting <[email protected]> wrote:

    So I'm looking at ganglia, so the numbers are somewhat approximate (this is for a
    server that just crashed about half an hour ago due to running out of memory):

    Store files are hovering just below 1k. Over the last 24 hours it has varied by
    about 100 files (I'm looking at hbase.regionserver.storefiles).

    Block cache count is about 24k, varying by about 2k. Our block cache free goes
    between 0.7G and 0.4G. It looks like we have almost 3G free after restarting a
    region server.

    The evicted block count went from 210k to 320k over a 24-hour period. Hit ratio is
    close to 100% (the graph isn't very detailed, so I'm guessing it is around 98-99%).

    Block cache size stays at about 2GB.
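
    (All of the above is from ganglia; the same counters should also be visible in the
    region server web UI on port 60030, and possibly as JSON if this build exposes the
    /jmx servlet. A sketch, using one of our hostnames:

        # may or may not exist on 0.92.1-cdh4.0.1; just a sketch
        curl http://ds6.h1.ut1.qprod.net:60030/jmx
    )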

    ~Jeff



    On 10/30/2012 6:21 PM, Jeff Whiting wrote:

        We have no coprocessors.  We are running replication from this cluster to
        another one.

        What is the best way to see how many store files we have, or to check on the
        block cache?
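
        (Two ways I can think of to check; both are sketches, and the table name below
        is made up:

            # per-region info, including store file counts, from the shell
            echo "status 'detailed'" | hbase shell
            # rough file count under a table's directory in HDFS
            hadoop fs -ls -R /hbase/mytable | grep -v '^d' | wc -l
        )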

        ~Jeff

        On 10/30/2012 12:43 AM, ramkrishna vasudevan wrote:

            Hi

            Are you using any coprocessors? Can you see how many store files are
            created?

            The number of blocks getting cached will give you an idea too.

            Regards
            Ram

            On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <[email protected]> wrote:

                We have 6 region servers, each given 10G of memory for HBase. Each
                region server has an average of about 100 regions, and across the
                cluster we are averaging about 100 requests/second with a pretty even
                read/write load. We are running CDH4 (0.92.1-cdh4.0.1, rUnknown).

                Looking over our load and our requests, I feel that the 10GB of memory
                should be enough to handle the load and that we shouldn't really be
                pushing the memory limits.

                However, what we are seeing is that our memory usage goes up slowly
                until the region server starts sputtering due to GC issues; it will
                eventually get timed out by ZooKeeper and be killed.
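
                To confirm the pauses, one thing we can do is turn on GC logging for
                the region servers in conf/hbase-env.sh. These are the standard HotSpot
                flags; the log path is just an example:

                    # conf/hbase-env.sh -- standard HotSpot GC logging, example log path
                    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
                      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
                      -Xloggc:/var/log/hbase/gc-regionserver.log"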

                We'll see aborts like this in the log:
                2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
                ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
                Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
                Server REPORT rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547
                as dead server
                2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
                RegionServer abort: loaded coprocessors are: []
                2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
                ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
                regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf
                regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf
                received expired from ZooKeeper, aborting
                2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
                RegionServer abort: loaded coprocessors are: []

                Which are "caused" by:
                2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 29014ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
                2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 28121ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
                2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 31124ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
                2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32209ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
                2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32557ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
                2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 33741ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
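
                (These pauses are longer than the ZooKeeper session timeout, which is
                why the master declares the server dead. As a stopgap, since it masks
                the GC problem rather than fixing it, the timeout can be raised in
                hbase-site.xml; the value below is only illustrative. Note the ZooKeeper
                server also caps session timeouts at 20 * tickTime by default, so the
                server-side cap may need raising too:

                    <!-- hbase-site.xml: stopgap, illustrative value -->
                    <property>
                      <name>zookeeper.session.timeout</name>
                      <value>90000</value>
                    </property>
                )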


                We'll also see a bunch of responseTooSlow and operationTooSlow warnings
                as GC kicks in and really kills the region server's performance.


                We have the JVM metrics going out to ganglia, and looking at
                jvm.RegionServer.metrics.memHeapUsedM you can see that it will go up
                over time and eventually run out of memory. I can also see in
                hmaster:60010/master-status that usedHeapMB just goes up, so I can make
                a pretty educated guess as to which server will go down next. It takes
                several days to a week of continuous running (after restarting a region
                server) before we have a potential problem.

                Our next one to go will probably be ds6, and jmap -heap shows:
                concurrent mark-sweep generation:
                    capacity = 10398531584 (9916.8125MB)
                    used     = 9036165000 (8617.558479309082MB)
                    free     = 1362366584 (1299.254020690918MB)
                    86.89847145248619% used

                So we are using 86% of the 10GB heap allocated to the concurrent
                mark-sweep generation. Looking at ds6 in the web interface, where it
                has information about tasks (not counting RPC), it doesn't show any
                compactions or any other background tasks happening. Nor are there any
                active RPC calls longer than 0 seconds (it seems to be handling the
                requests just fine).

                At this point I feel somewhat lost as to how to debug the problem. I'm
                not sure what to do next to figure out what is going on. Any suggestions
                as to what to look for, or how to debug where the memory is being used?
                I can generate heap dumps via jmap (although it effectively kills the
                region server), but I don't really know what to look for to see where
                the memory is going. I also have JMX set up on each region server and
                can connect to it that way.
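
                For reference, the dump and analysis commands would be something like
                the following (the PID and file path are placeholders; jhat ships with
                the JDK, and Eclipse MAT reads the same .hprof format):

                    # take the dump (stop-the-world, so it effectively kills the RS as noted)
                    jmap -dump:live,format=b,file=/tmp/rs-heap.hprof <regionserver-pid>
                    # quick class histogram without a full dump
                    jmap -histo:live <regionserver-pid>
                    # browse the dump offline
                    jhat -J-Xmx4g /tmp/rs-heap.hprof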

                Thanks,
                ~Jeff

                --
                Jeff Whiting
                Qualtrics Senior Software Engineer
                [email protected]




    --
    Jeff Whiting
    Qualtrics Senior Software Engineer
    [email protected]



--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]
