Re: substantial performance degradation when using WAL

Stack Sun, 19 Dec 2010 10:51:14 -0800

That was a great thread Todd... it almost got somewhere.  Looks like
you were owed a response by the hotspot-gc crew.


Friso, I wonder if u23 is better?  There are a bunch of G1 fixes in
it: http://www.oracle.com/technetwork/java/javase/2col/6u23bugfixes-191074.html

St.Ack

On Sat, Dec 18, 2010 at 10:39 PM, Todd Lipcon <[email protected]> wrote:
> I also had long GC pauses on G1 when I tried it last summer. Check out
> this thread for the gory details:
>
> http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-July/000653.html
>
> -Todd
>
> On Sat, Dec 18, 2010 at 10:37 PM, Stack <[email protected]> wrote:
>> Setting swappiness to zero is going to an extreme, but changing it so it's not
>> the default -- 60, IIRC -- could help.
>>
>> I made HBASE-3376 for the NPE.  I'll take a look into it.
>>
>> Congrats on the 113-second G1 pause.  I thought G1 was to be the end of
>> such pauses? (Smile).  It's good to hear that it's staying up for you.
>> What do your JVM options look like for G1?  Are you running with
>> -XX:MaxGCPauseMillis?
>>
>> Good on you Friso,
>> St.Ack
>>
>> On Sat, Dec 18, 2010 at 12:12 AM, Friso van Vollenhoven
>> <[email protected]> wrote:
>>> Hi J-D,
>>>
>>> I redid the job as before to once more verify. The real problem appears to 
>>> be something I had not noticed, because I never expected it. The machines 
>>> started swapping during the job. I did not expect that, because there is a 
>>> total of about 45GB heap allocated on a 64GB machine and nothing much else 
>>> running, so I had not thought of that immediately (although I should have 
>>> checked nonetheless). Network and IO graphs look normal. We run on 10 disks 
>>> / datanode, so there are some spindles available. I will try to get the 
>>> swapping out of the way and then try again and see if it fixes it.
>>>
>>> Now the GC log shows some obvious evildoers:
>>> 2010-12-18T00:17:42.274+0000: 14275.374: [Full GC (System.gc()) 6913M->4722M(15742M), 113.1350880 secs]
>>>  [Times: user=6.99 sys=1.80, real=113.12 secs]
>>>
>>> A (partial) log of one of the RS is here:
>>> GC log: http://pastebin.com/GPRVj8u5
>>> RS log: http://pastebin.com/Vj5K26ss
>>>
>>> The GC is G1, so it may look different from what you expect from a GC log. 
>>> I know it is considered experimental, but I like the concept of it and 
>>> think it's nice to gain some experience with it.
>>>
>>> Also, in the RS log I see some NPEs:
>>> 2010-12-18 04:31:24,043 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_RS_OPEN_REGION
>>> java.lang.NullPointerException
>>>     at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
>>>     at org.apache.hadoop.hbase.executor.RegionTransitionData.fromBytes(RegionTransitionData.java:198)
>>>     at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:672)
>>>     at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.java:552)
>>>     at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.java:545)
>>>     at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.transitionZookeeperOfflineToOpening(OpenRegionHandler.java:208)
>>>     at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:89)
>>>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>     at java.lang.Thread.run(Thread.java:619)
>>>
>>> These I had not seen before. Any idea?
>>>
>>>
>>> Friso
>>>
>>>
>>>
>>> On 17 dec 2010, at 20:11, Jean-Daniel Cryans wrote:
>>>
>>> If it wasn't GC it would be core dumps and whatnot, nothing's free :)
>>>
>>> I will reiterate what I said in my first answer, I'd like to see your
>>> GC log since at this point I haven't seen direct evidence of GC
>>> activity.
>>>
>>> J-D
>>>
>>> On Fri, Dec 17, 2010 at 1:27 AM, Friso van Vollenhoven
>>> <[email protected]> wrote:
>>> Hi J-D,
>>>
>>> Thanks for your comments and clarification. I guess GC does blow 
>>> (especially when writing things like databases and filesystems).
>>>
>>> Right now I will dive into GC tuning once more and probably lower the 
>>> number of reducers on the insert jobs.
>>>
>>>
>>> Thanks,
>>> Friso
>>>
>>>
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
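
For anyone scanning their own GC logs for the pattern Friso quoted in this thread (a Full GC with user=6.99 sys=1.80 but real=113.12 secs), here is a minimal Python sketch. It assumes log entries shaped like the one quoted above, with a "[Times: user=... sys=..., real=... secs]" block following each Full GC line. The function name and the two thresholds are hypothetical, chosen only for illustration. The idea is that when CPU time (user + sys) is a small fraction of wall-clock time, the collector was probably waiting on something, for example paging, rather than doing collection work.

```python
import re

# Matches a Full GC entry and the [Times: ...] block that follows it.
# \s+ and \s* tolerate entries that were wrapped across lines.
FULL_GC = re.compile(
    r"\[Full GC.*?(?P<before>\d+)M->(?P<after>\d+)M\((?P<total>\d+)M\),\s+"
    r"(?P<pause>[\d.]+)\s+secs\]\s*"
    r"\[Times:\s+user=(?P<user>[\d.]+)\s+sys=(?P<sys>[\d.]+),"
    r"\s+real=(?P<real>[\d.]+)\s+secs\]",
    re.DOTALL,
)


def strip_quote_markers(text):
    """Drop leading '>' email quote markers so pasted mail text can be parsed."""
    return re.sub(r"(?m)^>+ ?", "", text)


def stalled_full_gcs(log_text, min_pause=1.0, max_cpu_ratio=0.5):
    """Return Full GC entries whose wall time far exceeds their CPU time.

    min_pause and max_cpu_ratio are illustrative thresholds, not tuned values.
    """
    results = []
    for m in FULL_GC.finditer(strip_quote_markers(log_text)):
        real = float(m.group("real"))
        cpu = float(m.group("user")) + float(m.group("sys"))
        ratio = cpu / real if real > 0 else 0.0
        if real >= min_pause and ratio <= max_cpu_ratio:
            results.append({
                "real_secs": real,
                "cpu_secs": cpu,
                "cpu_ratio": ratio,
                "freed_mb": int(m.group("before")) - int(m.group("after")),
            })
    return results
```

Fed the entry quoted in this thread, the CPU-to-wall ratio comes out at roughly 8%, which is consistent with the swapping Friso reported. A low ratio alone does not prove swapping; it only says the JVM spent most of the pause off-CPU.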
