Thank you to JD as well for the first response!
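For anyone else who finds this thread with the same symptom (load average climbing while the CPUs sit idle), here is a minimal sketch of the kind of sanity check discussed below. It is plain Python reading /proc, nothing Hadoop- or HBase-specific; the script name, the 10-second window and the thresholds are only placeholders:

#!/usr/bin/env python
# loadcheck.py (name is just a placeholder): compare the 1-minute load
# average against CPU time actually consumed over a short window.
# Linux-only (reads /proc), standard library only.

import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq steal ..."
    with open('/proc/stat') as f:
        fields = [float(x) for x in f.readline().split()[1:]]
    return fields[3] + fields[4], sum(fields)   # (idle + iowait, total)

idle1, total1 = cpu_times()
time.sleep(10)                                  # 10s sample window, arbitrary
idle2, total2 = cpu_times()

busy_pct = 100.0 * (1.0 - (idle2 - idle1) / (total2 - total1))
load1 = float(open('/proc/loadavg').read().split()[0])

print('1-min load average  : %.2f' % load1)
print('CPU busy over 10s   : %.1f%%' % busy_pct)
print('load high while idle: %s' % (load1 > 1.0 and busy_pct < 10.0))

On a mostly idle box both numbers should be low; the thread below is about the case where the first stays high while the second is near zero.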
On Tue, Aug 17, 2010 at 8:59 PM, Ryan Rawson <ryano...@gmail.com> wrote:
> You're welcome. Glad you figured it out! Very strange, and it
> totally makes sense that this is a kernel/OS bug.
>
> -ryan
>
> On Tue, Aug 17, 2010 at 5:28 PM, George P. Stathis <gstat...@traackr.com> wrote:
>> OK, high load averages while idle is a confirmed Lucid issue:
>> https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910
>>
>> Will downgrade back to 9.10 and try again. Thanks for helping out,
>> Ryan. If you hadn't confirmed that this is not normal, I would have
>> wasted a lot more time looking in the wrong places. Now I just have to
>> rebuild us a new cluster...
>>
>> -GS
>>
>> On Tue, Aug 17, 2010 at 7:39 PM, George P. Stathis <gstat...@traackr.com> wrote:
>>> Yeap. Load average definitely goes back down to 0.10 or below once
>>> hadoop/hbase is shut off. The same issue is happening on the slaves as
>>> well, BTW. We are running on Ubuntu 10.04, which is the only recent
>>> change other than moving to a cluster setup. We were running a
>>> pseudo-distributed setup before on one machine running Ubuntu 9.10.
>>> Top shows just as much activity under %id, but load average was close
>>> to zero when idle! The environment is hypervisor'ed as well, so I suspect
>>> that's not really the issue... :-) Know anyone running on an Ubuntu
>>> 10.04 setup?
>>>
>>> -GS
>>>
>>> On Tue, Aug 17, 2010 at 7:08 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>>>> This is odd: the run queue has quite a few processes in it, but the
>>>> actual CPU data says 'nothing's happening'.
>>>>
>>>> Can't tell which processes are in the run queue here. Does the load
>>>> average go down when you take hadoop/hbase down?
>>>>
>>>> If so, that seems to identify the locus of the problem, but it doesn't
>>>> really explain WHY this is so. Considering I've never really seen this
>>>> before, I'd just write it off to using a hypervisor'ed environment and
>>>> blame other people stealing all your CPUs :-)
>>>>
>>>> On Tue, Aug 17, 2010 at 4:03 PM, George P. Stathis <gstat...@traackr.com> wrote:
>>>>> vmstat 2 for 2 mins below.
>>>>> Looks like everything is idle (github gist paste if it's easier to read: http://gist.github.com/532512):
>>>>>
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy  id wa
>>>>>  0  0      0 15097116 248428 1398444    0    0     0    50    5   24  0  0 100  0
>>>>>  0  0      0 15096948 248428 1398444    0    0     0     0  281  281  0  0 100  0
>>>>>  0  0      0 15096948 248428 1398444    0    0     0     0  279  260  0  0 100  0
>>>>>  0  0      0 15096948 248428 1398444    0    0     0     0  199  216  0  0 100  0
>>>>>  4  0      0 15096612 248428 1398448    0    0     0     0  528  467  0  0 100  0
>>>>>  0  0      0 15096612 248428 1398448    0    0     0     0  208  213  0  0 100  0
>>>>>  4  0      0 15096460 248428 1398448    0    0     0     0  251  261  0  0 100  0
>>>>>  0  0      0 15096460 248428 1398448    0    0     0    12  242  248  0  0 100  0
>>>>>  0  0      0 15096460 248428 1398448    0    0     0    34  228  230  0  0 100  0
>>>>>  0  0      0 15096476 248428 1398448    0    0     0     0  266  272  0  0 100  0
>>>>>  0  0      0 15096324 248428 1398448    0    0     0    10  179  206  0  0 100  0
>>>>>  1  0      0 15096340 248428 1398448    0    0     0     0  225  254  0  0 100  0
>>>>>  1  0      0 15096188 248428 1398448    0    0     0     0  263  245  0  0 100  0
>>>>>  0  0      0 15096188 248428 1398448    0    0     0     2  169  210  0  0 100  0
>>>>>  0  0      0 15096188 248428 1398448    0    0     0     0  201  238  0  0 100  0
>>>>>  0  0      0 15096036 248428 1398448    0    0     0    10  174  202  0  0 100  0
>>>>>  0  0      0 15096036 248428 1398448    0    0     0     6  207  222  0  0 100  0
>>>>>  0  0      0 15095884 248428 1398448    0    0     0     0  198  242  0  0 100  0
>>>>>  2  0      0 15095884 248428 1398448    0    0     0     0  177  215  0  0 100  0
>>>>>  0  0      0 15095884 248428 1398448    0    0     0     0  244  265  0  0 100  0
>>>>>  0  0      0 15095732 248428 1398448    0    0     0     4  197  222  0  0 100  0
>>>>>  6  0      0 15095732 248428 1398448    0    0     0     6  267  260  0  0 100  0
>>>>>  0  0      0 15095732 248428 1398448    0    0     0     0  240  239  0  0 100  0
>>>>>  0  0      0 15095580 248428 1398448    0    0     0     8  180  210  0  0 100  0
>>>>>  5  0      0 15095580 248428 1398448    0    0     0     0  193  224  0  0 100  0
>>>>>  1  0      0 15095580 248428 1398448    0    0     0    36  161  191  0  0 100  0
>>>>>  0  0      0 15095428 248428 1398448    0    0     0     0  176  216  0  0 100  0
>>>>>  4  0      0 15095428 248428 1398448    0    0     0     0  202  236  0  0 100  0
>>>>>  0  0      0 15095428 248428 1398448    0    0     0     6  191  220  0  0 100  0
>>>>>  1  0      0 15095428 248428 1398448    0    0     0     0  188  238  0  0 100  0
>>>>>  2  0      0 15095276 248428 1398448    0    0     0     0  174  206  0  0 100  0
>>>>>  1  0      0 15095276 248428 1398448    0    0     0     0  225  249  0  0 100  0
>>>>>  0  0      0 15095124 248428 1398448    0    0     0     0  222  263  0  0 100  0
>>>>>  1  0      0 15095124 248428 1398448    0    0     0     6  187  236  0  0 100  0
>>>>>  5  0      0 15094940 248428 1398448    0    0     0     0  453  434  0  0 100  0
>>>>>  4  0      0 15094788 248428 1398448    0    0     0     2  227  225  0  0 100  0
>>>>>  0  0      0 15094788 248428 1398448    0    0     0     0  213  236  0  0 100  0
>>>>>  4  0      0 15094788 248428 1398448    0    0     0     6  257  253  0  0 100  0
>>>>>  0  0      0 15094636 248428 1398448    0    0     0     0  215  230  0  0 100  0
>>>>>  0  0      0 15094652 248428 1398448    0    0     0     0  259  285  0  0 100  0
>>>>>  0  0      0 15094500 248428 1398448    0    0     0    14  194  219  0  0 100  0
>>>>>  0  0      0 15094516 248428 1398448    0    0     0     0  227  257  0  0 100  0
>>>>>  4  0      0 15094516 248428 1398448    0    0     0    36  266  263  0  0 100  0
>>>>>  1  0      0 15094516 248428 1398448    0    0     0     0  202  213  0  0 100  0
>>>>>  0  0      0 15094364 248428 1398448    0    0     0     0  204  240  0  0 100  0
>>>>>  0  0      0 15094212 248428 1398448    0    0     0     6  161  194  0  0 100  0
>>>>>  0  0      0 15094212 248428 1398448    0    0     0     0  191  215  0  0 100  0
>>>>>  1  0      0 15094212 248428 1398448    0    0     0     0  216  238  0  0 100  0
>>>>>  5  0      0 15094212 248428 1398448    0    0     0     6  169  202  0  0 100  0
>>>>>  0  0      0 15094060 248428 1398448    0    0     0     0  172  216  0  0 100  0
>>>>>  2  0      0 15094060 248428 1398448    0    0     0     6  201  196  0  0 100  0
>>>>>  1  0      0 15094060 248428 1398448    0    0     0     0  196  218  0  0 100  0
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy  id wa
>>>>>  0  0      0 15093908 248428 1398448    0    0     0     0  206  236  0  0 100  0
>>>>>  4  0      0 15093908 248428 1398448    0    0     0     0  197  219  0  0 100  0
>>>>>  0  0      0 15093908 248428 1398448    0    0     0     0  186  227  0  0 100  0
>>>>>  0  0      0 15093756 248428 1398448    0    0     0     0  168  182  0  0 100  0
>>>>>  0  0      0 15093756 248428 1398448    0    0     0     0  206  239  0  0 100  0
>>>>>  0  0      0 15093604 248428 1398448    0    0     0     6  281  248  0  0 100  0
>>>>>  0  0      0 15093452 248428 1398448    0    0     0     0  185  198  0  0 100  0
>>>>>  5  0      0 15093452 248428 1398448    0    0     0     0  265  253  0  0 100  0
>>>>>  0  0      0 15093300 248428 1398448    0    0     0    36  194  211  0  0 100  0
>>>>>  0  0      0 15093300 248428 1398448    0    0     0     0  228  242  0  0 100  0
>>>>>  0  0      0 15093300 248428 1398448    0    0     0     0  290  262  0  0 100  0
>>>>>  0  0      0 15093300 248428 1398448    0    0     0     6  187  207  0  0 100  0
>>>>>
>>>>>
>>>>> On Tue, Aug 17, 2010 at 6:54 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>>>>>> What does vmstat say? Run it like 'vmstat 2' for a minute or two and
>>>>>> paste the results.
>>>>>>
>>>>>> With no CPU being consumed by Java, it seems like there must be
>>>>>> another hidden variable here. Some zombied process perhaps, or some
>>>>>> kind of super IO wait or something else.
>>>>>>
>>>>>> Since you are running in a hypervisor environment, I can't really say
>>>>>> what is happening to your instance, although one would think the load
>>>>>> average numbers would be unaffected by outside processes.
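For anyone skimming the archive who wants to double-check a capture like the one above without counting columns by eye, here is a rough sketch of a summary script. It is plain Python reading the pasted vmstat text on stdin; the script name, file name and thresholds are only illustrative. It reports how often the run queue (the r column) was non-empty while the CPU was essentially idle, which is exactly the mismatch being pointed at here.

#!/usr/bin/env python
# vmstat_summary.py (placeholder name): summarize a pasted "vmstat 2" capture.
# Usage sketch:  vmstat 2 60 > vm.txt ; python vmstat_summary.py < vm.txt

import sys

runq, idle = [], []
for line in sys.stdin:
    parts = line.split()
    # Data rows have 16+ numeric columns; the two header lines are skipped.
    if len(parts) >= 16 and parts[0].isdigit():
        runq.append(int(parts[0]))    # r  = processes waiting for run time
        idle.append(int(parts[14]))   # id = percent of CPU time idle

if runq:
    n = len(runq)
    print('samples                     : %d' % n)
    print('avg run queue (r)           : %.2f' % (float(sum(runq)) / n))
    print('max run queue (r)           : %d' % max(runq))
    print('avg %%idle (id)              : %.1f' % (float(sum(idle)) / n))
    print('samples with r > 0, id >= 99: %d' %
          sum(1 for r, i in zip(runq, idle) if r > 0 and i >= 99))

Run against the capture above, it should show a number of samples where r is greater than zero even though id is pinned at 100.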
>>>>>> On Tue, Aug 17, 2010 at 3:49 PM, George P. Stathis <gstat...@traackr.com> wrote:
>>>>>>> Actually, there is nothing in %wa, but a ton sitting in %id. This is
>>>>>>> from the Master:
>>>>>>>
>>>>>>> top - 18:30:24 up 5 days, 20:10,  1 user,  load average: 2.55, 1.99, 1.25
>>>>>>> Tasks:  89 total,   1 running,  88 sleeping,   0 stopped,   0 zombie
>>>>>>> Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
>>>>>>> Mem:  17920228k total,  2795464k used, 15124764k free,   248428k buffers
>>>>>>> Swap:        0k total,        0k used,        0k free,  1398388k cached
>>>>>>>
>>>>>>> I have atop installed, which is reporting the hadoop/hbase java daemons
>>>>>>> as the most active processes (barely taking any CPU time though, and
>>>>>>> most of the time in sleep mode):
>>>>>>>
>>>>>>> ATOP - domU-12-31-39-18-1    2010/08/17  18:31:46    10 seconds elapsed
>>>>>>> PRC | sys   0.01s | user  0.00s | #proc     89 | #zombie    0 | #exit      0 |
>>>>>>> CPU | sys      0% | user     0% | irq       0% | idle    200% | wait      0% |
>>>>>>> cpu | sys      0% | user     0% | irq       0% | idle    100% | cpu000 w  0% |
>>>>>>> CPL | avg1   2.55 | avg5   2.12 | avg15   1.35 | csw     2397 | intr    2034 |
>>>>>>> MEM | tot   17.1G | free  14.4G | cache   1.3G | buff  242.6M | slab  193.1M |
>>>>>>> SWP | tot    0.0M | free   0.0M |              | vmcom   1.6G | vmlim   8.5G |
>>>>>>> NET | transport   | tcpi    330 | tcpo    169 | udpi     566 | udpo     147 |
>>>>>>> NET | network     | ipi     896 | ipo     316 | ipfrw      0 | deliv    896 |
>>>>>>> NET | eth0   ---- | pcki    777 | pcko    197 | si  248 Kbps | so   70 Kbps |
>>>>>>> NET | lo     ---- | pcki    119 | pcko    119 | si    9 Kbps | so    9 Kbps |
>>>>>>>
>>>>>>>   PID  CPU  COMMAND-LINE                                                1/1
>>>>>>> 17613   0%  atop
>>>>>>> 17150   0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>>>>>> 16527   0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>>>>>> 16839   0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>>>>>> 16735   0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
>>>>>>> 17083   0%  /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
>>>>>>>
>>>>>>> Same with htop:
>>>>>>>
>>>>>>>   PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>>>>>> 16527 ubuntu    20   0 2352M   98M 10336 S  0.0  0.6  0:42.05 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>>>>>> 16735 ubuntu    20   0 2403M 81544 10236 S  0.0  0.5  0:01.56 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
>>>>>>> 17083 ubuntu    20   0 4557M 45388 10912 S  0.0  0.3  0:00.65 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX:+Heap
>>>>>>>     1 root      20   0 23684  1880  1272 S  0.0  0.0  0:00.23 /sbin/init
>>>>>>>   587 root      20   0  247M  4088  2432 S  0.0  0.0 -596523h-14:-8 /usr/sbin/console-kit-daemon --no-daemon
>>>>>>>  3336 root      20   0 49256  1092   540 S  0.0  0.0  0:00.36 /usr/sbin/sshd
>>>>>>> 16430 nobody    20   0 34408  3704  1060 S  0.0  0.0  0:00.01 gmond
>>>>>>> 17150 ubuntu    20   0 2519M  112M 11312 S  0.0  0.6 -596523h-14:-8 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX
>>>>>>>
>>>>>>> So I'm a bit perplexed. Are there any hadoop / hbase specific tricks
>>>>>>> that can reveal what these processes are doing?
>>>>>>>
>>>>>>> -GS
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 17, 2010 at 6:14 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>
>>>>>>>> It's not normal, but then again I don't have access to your machines,
>>>>>>>> so I can only speculate.
>>>>>>>>
>>>>>>>> Does "top" show you which process is in %wa? If so, and it's a java
>>>>>>>> process, can you figure out what's going on in there?
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Tue, Aug 17, 2010 at 11:03 AM, George Stathis <gstat...@gmail.com> wrote:
>>>>>>>> > Hello,
>>>>>>>> >
>>>>>>>> > We have just set up a new cluster on EC2 using Hadoop 0.20.2 and HBase
>>>>>>>> > 0.20.3. Our small setup as of right now consists of one master and four
>>>>>>>> > slaves with a replication factor of 2:
>>>>>>>> >
>>>>>>>> > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode,
>>>>>>>> > 1 secondarynamenode, 1 jobtracker, 1 hbasemaster, 1 zookeeper (uses its
>>>>>>>> > own dedicated EBS drive)
>>>>>>>> > Slaves: xLarge instances with 2 CPUs and 17.5 GB RAM each - run 1 datanode,
>>>>>>>> > 1 tasktracker, 1 regionserver
>>>>>>>> >
>>>>>>>> > We have also installed Ganglia to monitor the cluster stats as we are about
>>>>>>>> > to run some performance tests but, right out of the box, we are noticing
>>>>>>>> > high system loads (especially on the master node) without any activity
>>>>>>>> > happening on the cluster. Of course, the CPUs are not being utilized at all,
>>>>>>>> > but Ganglia is reporting almost all nodes in the red, as the 1, 5 and 15
>>>>>>>> > minute load averages are all above 100% most of the time (i.e. there are
>>>>>>>> > more than two processes at a time competing for the 2 CPUs' time).
>>>>>>>> >
>>>>>>>> > Question 1: is this normal?
>>>>>>>> > Question 2: does it matter, since each process barely uses any of the CPU time?
>>>>>>>> >
>>>>>>>> > Thank you in advance, and pardon the noob questions.
>>>>>>>> >
>>>>>>>> > -GS
>>>>>>>> >
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
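Coming back to the question buried in the middle of the thread (any hadoop/hbase specific tricks to see what these processes are doing?): the most general answer is a thread dump of each daemon JVM. The java-6-sun JDK used above also ships jps and jstack, so a dump of every Hadoop/HBase process on a node is a few lines of script. A rough sketch in plain Python; it assumes jps and jstack are on the PATH and that it runs as the user owning the daemons (ubuntu in this setup), and the script name is only a placeholder:

#!/usr/bin/env python
# stack_dump.py (placeholder name): take a thread dump of every JVM that
# jps can see, e.g. the Hadoop/HBase daemons on one node.

import subprocess

def run(cmd):
    # Run a command and return its stdout as text.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = p.communicate()
    return out.decode('utf-8', 'replace')

# jps prints one JVM per line as "<pid> <main class>", e.g. "16527 NameNode".
for line in run(['jps']).splitlines():
    parts = line.split()
    if len(parts) == 2 and parts[1] != 'Jps':
        pid, name = parts
        print('===== %s (pid %s) =====' % (name, pid))
        print(run(['jstack', pid]))   # full thread dump: shows where each thread is parked

If the JDK tools are not handy, kill -QUIT <pid> makes the JVM print the same kind of thread dump to its stdout, which the Hadoop/HBase start scripts redirect to the daemon's .out file in the log directory. In the idle-load case discussed above the dumps just show the daemons parked in their usual wait loops, which is what points the finger at the kernel rather than at Hadoop or HBase.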