Hi guys, the conversation went off the list briefly while I sent
stack dumps to Stack. We've moved back to hdfs 0.20.2, but I want to
post this back here to summarize events, as well as our experiences
with 0.23 and our concerns.

Quick summary: after having some issues with 0.20.2 (since resolved),
we tried moving to the full cdh4b1 (which consists of hadoop 0.23 and
hbase 0.92.0).

Upon running tests on hadoop 0.23, while we found read performance to
be great, we found that autoFlushed write throughput was pretty bad
(about 1,000 wps; as a point of reference, with 0.20.2 we are now
getting 10,000 wps with autoFlush on).

We used several different tests: HBase's PerformanceEvaluation class,
YCSB, and our own custom test, which allowed us to compare throughput
numbers for increments, reads, unbuffered/buffered writes, etc.

What we found from these tests is that if we flushed writes
immediately (for example by having autoFlush set to true), our
throughput was poor. However, if we disabled the WAL, or just used a
moderately sized writeBuffer (the default of 2mb) with autoFlush
disabled (resulting in appends of roughly 200kb of data per server,
of which there were 11), we got good throughput.

The numbers we got for 0.20.2 and 0.23 are from the same settings and
the same tests. Once we moved back to 0.20.2 we found that our test
client would become the bottleneck for some tests, and we had to run
two simultaneously to get the cluster to work up any sweat at all.
With 0.23, adding further clients only increased the latency on
individual puts, leading us to believe there was something wrong with
appends. Stack suggested that the changes in sync behavior in 0.23
may have something to do with it; I have been unable to confirm this
as I am not very familiar with the hdfs codebase.

I include the backlog (edited for clarity) of my conversation with
Stack, along with some inline comments; hopefully this can be of help
to people trying to figure out how to debug this kind of thing.
Most of this happened over the weekend when I couldn't get at the
servers; I've added details where I can.

Stack takes a look at the thread dumps:
>>>>>>> The servers seem to be doing nothing.  In the last thread dump,
>>>>>>> 2012-03-30 08:48:09, all is idle in essence but for one thread doing
>>>>>>> this:
>>>>>>> "IPC Server handler 14 on 60020" daemon prio=10 tid=0x00002aaac80f0000
>>>>>>> nid=0x2eae runnable [0x0000000046259000]
>>>>>>>   java.lang.Thread.State: RUNNABLE
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1499)
>>>>>>>        - locked <0x00000004038d1cd8> (a 
>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1417)
>>>>>>>        - locked <0x00000004038d1cd8> (a 
>>>>>>> org.apache.hadoop.hdfs.DFSOutputStream)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:1199)
>>>>>>>        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>>>>>>>        at 
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:228)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1084)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1195)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:1032)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1853)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1723)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3062)
>>>>>>>        at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
>>>>>>>        at 
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
>>>>>>>        at 
>>>>>>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1326)
>>>>>>>
>>>>>>
>>>>>>
>>>>>>During testing we only tried it with a few clients. At one point we
>>>>>>were splitting multiple clients over the network, but latency was
>>>>>>still increasing proportionally to the number of threads. I had a poke
>>>>>>around and confirmed that each client only has one connection to each
>>>>>>region server, so I did expect more clients to do a better job, but
>>>>>>they didn't. I haven't tried initializing additional tables off
>>>>>>configuration clones to force the creation of additional HConnections.
>>>>>>
>>>>>Yeah, each client has one connection only and multiplexes over the 
>>>>>connection.
>>>>>>

Essentially, since the client only has one rpc connection going out
to each region server (see the implementation of
HConnection/HConnectionManager), only a single handler thread on each
region server ends up doing work for that client. We did try more
clients, but interestingly enough this did not result in better
throughput, just the same throughput with higher latency per request
(since there were more threads sharing the same throughput).

This single-connection-per-regionserver behavior can also be changed
with the following configuration settings: hbase.client.ipc.pool.size
and hbase.client.ipc.pool.type.
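
For anyone who wants to try this, below is a rough sketch of setting
those keys on the client Configuration. The key names are the ones
listed above, but the pool size and the pool type value are just
example values I've filled in; check what your HBase version actually
accepts.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class PooledClientConf {
    public static Configuration create() {
      Configuration conf = HBaseConfiguration.create();
      // Keep more than one IPC connection per region server (the default is one).
      conf.setInt("hbase.client.ipc.pool.size", 5);    // example value, not tuned
      // Accepted pool type names vary by version; "RoundRobin" is an assumption.
      conf.set("hbase.client.ipc.pool.type", "RoundRobin");
      return conf;
    }
  }

Any HTable created from that Configuration should then spread its
RPCs over the pool instead of funnelling everything through a single
connection.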

>>>>>>> No other thread seems blocked waiting on the sync to complete.
>>>>>>>
>>>>>>> The second to last thread dump is doing no work only waiting.
>>>>>>>
>>>>>>> The third to last thread dump is again doing nothing.
>>>>>>>
>>>>>>> There is then a bunch of class unloading going on.   Not enough 
>>>>>>> permanent gen?
>>>>>>>
>>>>>>
>>>>>>Hmm, I'd have to confirm this but this shouldn't be a problem... The
>>>>>>regionservers are allocated 16gb memory, and I've never seen them use
>>>>>>more than 8gb. I haven't looked at the gc settings though so maybe
>>>>>>there's something weird in there.
>>>>>>

Our memory settings should have been solid, and they aren't causing
any weirdness on 0.20.2 now... I have no idea what was going on here.

>>>>>>> The fourth to last, and last thread dump, is again doing nothing but
>>>>>>> waiting for work.
>>>>>>>
>>>>>>> Perhaps there was no load at the time of the thread dumps? (The thread
>>>>>>> dumps looks like they were made just before the load showed up)
>>>>>>>
>>>>>>
>>>>>>There was load, but the throughput was terrible(1000 wps, 256 bytes each)
>>>>>>
>>>>>
>>>>>Well I'd have expected to have seen more handlers occupied and them
>>>>>blocked waiting to get in and sync.  In all four thread dumps, I saw
>>>>>evidence of one write only which would seem to indicate an issue w/
>>>>>putting load on the servers (or load was not yet on).
>>>>>
>>>>
>>>>With this particular load test there was only one client(running 100
>>>>threads but still sharing one HConnection), sorry. That being said,
>>>>even with more clients engaging more RPC's the throughput figures were
>>>>the same :/
>>>>
Again, this was due to a shortage of clients, but even with more
clients throughput didn't go up.

>>>>>>> Block cache is disabled?  Its very small.  Are you skirting it
>>>>>>> intentionally?  You are just interested in the write throughput?
>>>>>>>
>>>>>>
>>>>>>It shouldn't be disabled... However right now we really are only
>>>>>>interested in the (autoflushing) write throughput as that is the only
>>>>>>area that is doing poorly(if we set autoflush to false, using the
>>>>>>default 2mb buffer, we can get near 200k wps over the cluster).
>>>>>>
The block cache was enabled, set to 35%. The tables in the test were
set to IN_MEMORY=>TRUE.
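
For reference, here is a rough sketch of how those two settings can
be expressed. The block cache fraction normally lives in
hbase-site.xml on the region servers, and IN_MEMORY is an attribute
on the column family set when the table is created; the table and
family names below are made up, and the exact API depends on your
HBase version.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class InMemoryTableSetup {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // 35% of the region server heap for the block cache; this key is
      // normally set in hbase-site.xml rather than in client code.
      conf.setFloat("hfile.block.cache.size", 0.35f);

      HTableDescriptor desc = new HTableDescriptor("perf_test"); // made-up table name
      HColumnDescriptor family = new HColumnDescriptor("f");     // made-up family name
      family.setInMemory(true); // same as IN_MEMORY => 'true' in the shell
      desc.addFamily(family);

      new HBaseAdmin(conf).createTable(desc);
    }
  }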

>>>>>
>>>>>So autoflush is disabled? (I hate the name of this config; I have to
>>>>>look it up every time.  I don't get it).  So, you are having it rpc
>>>>>for every individual put?
>>>>>
>>>>>With autoflushing enabled, you get 'good' throughput?
>>>>>
>>>>
>>>>Errh, it's on... So it flushes every time a put is made(meaning lots
>>>>of tiny rpc messages).
>>>>With it disabled(so waiting until we have 2mb of puts stored), the
>>>>throughput is good... When it does flush, those get split into groups
>>>>of puts for each region server which can then just make one large
>>>>append rather than lots of small ones.
>>>>
>>>
>>>Ok.
>>>

autoFlush is accessible from HTable.setAutoFlush. If it's set to
false, puts are stored on the client side until either they are
manually flushed or the writeBuffer is full, at which point they are
flushed. By storing up a moderate number of items they can be batched
together before being sent to the targeted regionserver (this is all
done automatically in multiPut, which divvies up the writeBuffer by
regionserver and then sends the blocks to each server, resulting in
larger appends and fewer small packets). With autoFlush disabled,
however, there is a delay before data is stored (until the
writeBuffer is full), and if the client fails before a flush, puts
can be lost. Finally, as an aside, reads, increments and deletes do
not go into the writeBuffer and are processed immediately.
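
To make that concrete, here is a minimal sketch of the buffered-write
pattern described above; the table, family and row names are made up
and error handling is omitted.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BufferedWriter {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "perf_test"); // made-up table name

      // Buffer puts client-side instead of issuing one rpc per put.
      table.setAutoFlush(false);
      table.setWriteBufferSize(2 * 1024 * 1024); // 2mb, the default mentioned above

      for (int i = 0; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
        table.put(put); // goes into the write buffer; flushed once it fills
      }

      // Push out whatever is still buffered; until this (or a full buffer),
      // unflushed puts are lost if the client dies.
      table.flushCommits();
      table.close();
    }
  }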

>>>So, there is something wrong w/ the multiplexing over single
>>>connection.  Its not putting up enough load.  You need this case to
>>>work well?   Can we concentrate then on making it so when you do a
>>>thread dump all of the handlers are blocked trying to get into the
>>>sync block on the WAL?  When we have that, we know that the puts are
>>>coming in at maximum possible rate.  Are the clients all blocked when
>>>you thread dump?  Maybe they are doing the puts in series?  The
>>>high-latency is over at client-side?
>>>
>>The only way that state would happen is if we have a number of clients
>>equal to the number of RPC connections provided by the region
>>server (since each client only opens up one connection to the RS)?
>>Unfortunately we didn't try this, since even with 2-3 clients we were
>>seeing no improvement in performance.
>>
>>>If you run w/ bulk writing, you see the servers sweating?  Working hard?
>>>
>>We really never saw the servers working hard. We might get stuff to
>>25% load on ganglia at the most, HD usage% on iostat -x wouldn't go
>>over 20% ish, network was well under capacity.
>>

Since moving to 0.20.2 we are finally seeing our region servers work
up a sweat; no matter what we did before, they would never be taxed
much. Disabling the WAL increases throughput significantly, but that
workload is so much easier on the server that it still won't tax it
much (I think the best gains here can be made by fiddling with
flushing and compaction parameters).
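
For completeness, skipping the WAL is done per edit on the client
side. A sketch is below (family/qualifier names are made up); note
the durability trade-off: edits that haven't been flushed from the
memstore are lost if the region server dies.

  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class NoWalPut {
    public static Put build(byte[] row) {
      Put put = new Put(row);
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      // Skip the write-ahead log for this edit: much cheaper for the region
      // server, but the edit only lives in the memstore until a flush.
      put.setWriteToWAL(false);
      return put;
    }
  }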

>>>>>>> 2012-03-30 08:44:06,928 DEBUG
>>>>>>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=40.29
>>>>>>> MB, free=4.76 GB, max=4.8 GB, blocks=0, accesses=0, hits=0,
>>>>>>> hitRatio=0cachingAccesses=0, cachingHits=0,
>>>>>>> cachingHitsRatio=0evictions=0, evicted=0, evictedPerRun=NaN
>>>>>>>
>>>>>>> Then I see this: 2012-03-30 08:48:11,222 WARN
>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 9 on
>>>>>>> 60020 took 1956 ms appending an edit to hlog; editcount=153058,
>>>>>>> len~=288
>>>>>>>
>>>>>>> Minor: Maybe try without blooms enabled?  Maybe slowing your writes?
>>>>>>>
>>>>>>
>>>>>>I can give it a poke without bloom filters... Not sure what slowing
>>>>>>writes will do if we want to get better throughput. Do you mean
>>>>>>batching them(that certainly improves throughput a lot, but isn't
>>>>>>really compatible with our use case)
>>>>>>
>>>>>This is a long shot.  FB made and run w/ blooms so its probably fine.
>>>>>I was just calling out the non-normal stuff.
>>>>>
>>>>ok, I doubt it too, we did play around a bit with using different
>>>>compression algorithms and bf on/off and weren't getting big
>>>>differences.
>>>>
>>>Makes sense if problem is elsewhere.. we're not putting up the load.
>>>

We've had no problems with bloom filters since moving back to 0.20.2,
so we don't think they had anything to do with it.

>>>>>>> What is your flush size?  Could double it or quadruple it given you
>>>>>>> are constraining region growth by having large ones.  We're flushing
>>>>>>> smallish files (Put up your configs if you don't mind -- the
>>>>>>> hbase-site.xml and hbase-env.sh).
>>>>>>>
>>>>>>
>>>>>>I did quadruple the flush size as someone suggested earlier(64mb->256
>>>>>>I think?). I'm pretty sure that dump was under the higher flush size
>>>>>>setting.
>>>>>>
>>>>>Ok.  We are flushing in this log sequence to free up log files.  That
>>>>>might be why the stores were smaller (not REAL small.. but smaller...
>>>>>they could be bigger .. it helps w/ throughput... you seemed to be
>>>>>doing a flush or more a second...of about 50MB or so).
>>>>>
>>>>
>>>>Maybe something wrong there then... I'll have another look asap.
>>>>
>>>>>>> Log doesn't seem untoward.
>>>>>>>
>>>>>>> I see that the odd read is failing because client is timing out it
>>>>>>> seems.  Some block cache could help here.
>>>>>>>
>>>>>>> Try running w/o blooms, upped flush size, and some block cache.  See
>>>>>>> if that makes a difference.  A few thread dumps while its under load
>>>>>>> would be appreciated.
>>>>>>>
>>>>>>
>>>>>>I'll try to. We're on the verge of moving back to 0.20.2 and the final
>>>>>>decision isn't mine to make :/
>>>>>>

Unfortunately we weren't able to run these tests because we were
moving back to 0.20.2; however, we had been running with 256mb
flushes and a 35% block cache for the last few days of testing.
During our initial testing we had run without bloom filters, but that
was with a 64mb flush size.

>>>>>
>>>>>CDH3u3, etc., are better known quantities.  The 0.23 sync pipeline is
>>>>>different code altogether that has not had the same amount of scrutiny
>>>>>done.
>>
>>Back on 0.20.2 we're getting much better numbers, with no modification
>>to our settings. Throughput with autoFlush on is at least 10,000
>>writes per rs, and significantly more when we run with a moderate
>>sized writeBuffer and autoFlush disabled, giving much more
>>satisfactory results. I'd have to dive into the hdfs code to figure
>>out exactly how/if the new sync policy is affecting things. Hopefully
>>I can get some  time on the side somewhere to look at it, or we can
>>get another test cluster onto 0.23
>>

Based on our experience, I would advise anyone who isn't feeling very
adventurous to stick with 0.20.2. We lost over a week playing around
on 0.23 and made very little progress.

>>Also, when I mailed the attachments this conversation came off the
>>mailing list, if you don't have a problem with it I'd like to post it
>>back there in case it might be of some use to others
>
>Please post back to the main list.   More than a few folks are
>interested in your story at this stage.

Many thanks to everyone who helped, especially Stack and tlipcon who
were both very helpful. Hopefully some of the stuff from our
experiences may help others.
