Your nodes are almost 50% idle... Might be something else. Sounds like it's not your disks nor your CPU... Maybe too many RPCs?
Have you investigated your network side? netperf might be a good help for you.

JM

2013/10/24 Harry Waye <[email protected]>

> p.s. I guess this is turning more into a general Hadoop issue, but I'll
> keep the discussion here seeing that I have an audience, unless there are
> objections.
>
> On 24 October 2013 22:02, Harry Waye <[email protected]> wrote:
>
> > So just a short update; I'll read into it a little more tomorrow. This
> > is from three of the nodes:
> > https://gist.github.com/hazzadous/1264af7c674e1b3cf867
> >
> > The first is the grey guy. Just glancing at it, it looks to fluctuate
> > more than the others. I guess that could suggest there are some issues
> > with reading from the disks. Interestingly, it's the only one that
> > doesn't have smartd installed, which alerts us on changes for the other
> > nodes. I suspect there's probably some mileage in checking its SMART
> > attributes. Will do that tomorrow though.
> >
> > Out of curiosity, how do people normally monitor disk issues? I'm going
> > to set up collectd to push various things from smartctl tomorrow; at the
> > moment all we do is receive emails, which are mostly noise about problem
> > sector counts increasing by 1.
> >
> > On 24 October 2013 19:40, Jean-Marc Spaggiari <[email protected]> wrote:
> >
> > > Can you try vmstat 2? 2 is the interval in seconds at which it will
> > > refresh the display. In the extract here, nothing is running; only 8%
> > > is used (1% disk IO, 6% user, 1% sys).
> > >
> > > Run it on 2 or 3 different nodes while you are putting the load on
> > > the cluster, and take a look at the last 4 numbers: what is the value
> > > of the last one?
> > >
> > > On the usercpu0 graph, who is the gray guy showing high?
> > >
> > > JM
> > >
> > > 2013/10/24 Harry Waye <[email protected]>
> > >
> > > > OK, I'm running a load job atm; I've added some possibly
> > > > incomprehensible coloured lines to the graph: http://goo.gl/cUGCGG
> > > >
> > > > This is actually with one fewer node due to decommissioning to
> > > > replace a disk, hence I guess the reason for one squiggly line
> > > > showing no disk activity. I've included only the CPU stats for CPU0
> > > > from each node. The last graph should read "Memory Used". vmstat
> > > > from one of the nodes:
> > > >
> > > > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > > >  r  b   swpd   free   buff    cache   si   so    bi    bo   in   cs us sy id wa
> > > >  6  0      0 392448 524668 43823900    0    0   501  1044    0    0  6  1 91  1
> > > >
> > > > To me the wait doesn't seem that high. Job stats are
> > > > http://goo.gl/ZYdUKp, and the job setup is
> > > > https://gist.github.com/hazzadous/ac57a384f2ab685f07f6
> > > >
> > > > Does anything jump out at you?
> > > >
> > > > Cheers
> > > > H
> > > >
> > > > On 24 October 2013 16:16, Harry Waye <[email protected]> wrote:
> > > >
> > > > > Hi JM
> > > > >
> > > > > I took a snapshot on the initial run, before the changes:
> > > > > https://www.evernote.com/shard/s95/sh/b8e1516d-7c49-43f0-8b5f-d16bbdd3fe13/00d7c6cd6dd9fba92d6f00f90fb54fc1/res/4f0e20a2-1ecb-4085-8bc8-b3263c23afb5/screenshot.png
> > > > >
> > > > > Good timing: disks appear to be exploding (ATA errors) atm, so I'm
> > > > > decommissioning and reprovisioning with new disks. I'll be
> > > > > reprovisioning without RAID (it's software RAID, just to compound
> > > > > the issue), although I'm not sure how I'll go about migrating all
> > > > > nodes. I guess I'd need to put more correctly specced nodes in the
> > > > > rack and decommission the existing ones.
> > > > >
> > > > > We're using Hetzner at the moment, which may not have been a good
> > > > > choice.
> > > > > Has anyone had any experience with them wrt. Hadoop? They offer
> > > > > 7- and 15-disk options, but are low on the CPU front (quad core).
> > > > > Our workload will, I assume, be on the high side. There's also an
> > > > > 8-disk Dell PowerEdge which is a little more powerful. What
> > > > > hosting providers would people recommend? (And what would be the
> > > > > strategy for migrating?)
> > > > >
> > > > > Anyhow, when I have things more stable I'll have a look at
> > > > > checking out what's using the CPU. In the meantime, can anything
> > > > > be gleaned from the above snap?
> > > > >
> > > > > Cheers
> > > > > H
> > > > >
> > > > > On 24 October 2013 15:14, Jean-Marc Spaggiari <[email protected]> wrote:
> > > > >
> > > > > > Hi Harry,
> > > > > >
> > > > > > Do you have more details on the exact load? Can you run vmstat
> > > > > > and see what kind of load it is? Is it user? CPU? wio?
> > > > > >
> > > > > > I suspect your disks to be the issue. There are two things here.
> > > > > >
> > > > > > First, we don't recommend RAID for the HDFS/HBase disks. The
> > > > > > best is to simply mount the disks on 2 mount points and give
> > > > > > them to HDFS. Second, 2 disks per node is very low. Even on a
> > > > > > dev cluster it's not recommended. In production, you should go
> > > > > > with 12 or more.
> > > > > >
> > > > > > So with only 2 disks in RAID, I suspect your WIO to be high,
> > > > > > which is what might be slowing your process.
> > > > > >
> > > > > > Can you take a look in that direction? If it's not that, we
> > > > > > will continue to investigate ;)
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > JM
> > > > > >
> > > > > > 2013/10/23 Harry Waye <[email protected]>
> > > > > >
> > > > > > > I'm trying to load data into HBase using HFileOutputFormat
> > > > > > > and incremental bulk load, but am getting rather lackluster
> > > > > > > performance: 10h for ~0.5TB of data, ~50000 blocks.
> > > > > > > This is being loaded into a table that has 2 families, 9
> > > > > > > columns, 2500 regions and is ~10TB in size. Keys are md5
> > > > > > > hashes and regions are pretty evenly spread. The majority of
> > > > > > > the time appears to be spent in the reduce phase, with the
> > > > > > > map phase completing very quickly. The network doesn't appear
> > > > > > > to be saturated, but the load is consistently at 6, which is
> > > > > > > the number of reduce tasks per node.
> > > > > > >
> > > > > > > 12 hosts (6 cores, 2 disks as RAID0, 1GB eth, no one else on
> > > > > > > the rack).
> > > > > > >
> > > > > > > MR conf: 6 mappers, 6 reducers per node.
> > > > > > >
> > > > > > > I spoke to someone on IRC and they recommended reducing job
> > > > > > > output replication to 1, and reducing the number of mappers,
> > > > > > > which I reduced to 2. Reducing replication appeared not to
> > > > > > > make any difference; reducing reducers appeared just to slow
> > > > > > > the job down. I'm going to have a look at running the
> > > > > > > benchmarks mentioned on Michael Noll's blog and see what that
> > > > > > > turns up. I guess some questions I have are:
> > > > > > >
> > > > > > > How does the global number/size of blocks affect perf.? (I
> > > > > > > have a lot of 10MB files, which are the input files.)
> > > > > > >
> > > > > > > How does the job-local number/size of input blocks affect
> > > > > > > perf.?
> > > > > > >
> > > > > > > What is actually happening in the reduce phase that requires
> > > > > > > so much CPU? I assume the actual construction of HFiles isn't
> > > > > > > intensive.
> > > > > > >
> > > > > > > Ultimately, how can I improve performance?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > >
> > > > > --
> > > > > Harry Waye, Co-founder/CTO
> > > > > [email protected]
> > > > > +44 7890 734289
> > > > >
> > > > > Follow us on Twitter: @arachnys <https://twitter.com/#!/arachnys>
> > > > >
> > > > > ---
> > > > > Arachnys Information Services Limited is a company registered in
> > > > > England & Wales. Company number: 7269723. Registered office: 40
> > > > > Clarendon St, Cambridge, CB1 1JX.
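A footnote for readers of the archive: JM's advice above amounts to watching the last four columns of vmstat output (us/sy/id/wa). A minimal sketch of pulling them out of a vmstat data line, using the sample Harry posted and assuming the standard procps column order:

```python
# Sketch: extract the us/sy/id/wa CPU columns from a vmstat data line.
# Assumes the classic 16-column procps layout shown in the thread
# (newer vmstat versions may append an extra "st" column).

def cpu_fields(vmstat_line):
    cols = vmstat_line.split()
    us, sy, idle, wa = (int(c) for c in cols[-4:])
    return {"us": us, "sy": sy, "id": idle, "wa": wa}

# The sample line from Harry's email:
sample = "6 0 0 392448 524668 43823900 0 0 501 1044 0 0 6 1 91 1"
print(cpu_fields(sample))  # {'us': 6, 'sy': 1, 'id': 91, 'wa': 1}
```

As JM notes, that line shows the node 91% idle with only 1% in IO wait, which is why he steers the investigation away from disk.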
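On the "load consistently at 6" observation: HFileOutputFormat's incremental load configures one reduce partition per region, so the 2500 regions imply many more reduce tasks than the cluster has slots. Back-of-envelope arithmetic using the numbers quoted in the thread (a sketch, not a measurement):

```python
# Rough numbers from the thread; illustrative only.
nodes = 12
reducers_per_node = 6
regions = 2500  # configureIncrementalLoad sets one reduce task per region

slots = nodes * reducers_per_node  # 72 concurrent reduce slots
waves = -(-regions // slots)       # ceiling division: waves of reduce tasks
print(slots, waves)                # 72 35
```

So the reduce phase runs roughly 35 waves of tasks, which is one reason the job's wall-clock time is dominated by the reduce side even if each task is cheap.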
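Similarly, the headline figure (~0.5TB in 10h across 12 nodes) implies an aggregate write rate far below what either the gigabit links (~125MB/s each) or the disks should sustain, which fits JM's "almost 50% idle" observation. The arithmetic, assuming decimal units (1TB = 10**12 bytes):

```python
# Implied throughput from the original post: ~0.5 TB in 10 hours, 12 nodes.
# A sanity check, not a benchmark; assumes decimal TB.
data_bytes = 0.5 * 10**12
seconds = 10 * 3600

cluster_mb_s = data_bytes / seconds / 10**6  # aggregate MB/s across the cluster
node_mb_s = cluster_mb_s / 12                # per-node MB/s
print(round(cluster_mb_s, 1), round(node_mb_s, 2))  # 13.9 1.16
```

At roughly 1.2MB/s per node, the bottleneck is clearly not raw disk or network bandwidth, pointing instead at per-task overhead such as the reduce-wave count above or RPC latency.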
