P.S. I guess this is turning into more of a general Hadoop issue, but I'll keep the discussion here, seeing as I have an audience, unless there are objections.
On 24 October 2013 22:02, Harry Waye <[email protected]> wrote:

> So just a short update; I'll read into it a little more tomorrow. This
> is from three of the nodes:
> https://gist.github.com/hazzadous/1264af7c674e1b3cf867
>
> The first is the grey guy. Just glancing at it, it looks to fluctuate
> more than the others, which I guess could suggest some issues with
> reading from the disks. Interestingly, it's the only node that doesn't
> have smartd installed, which alerts us to changes on the other nodes.
> I suspect there's some mileage in checking its SMART attributes; I'll
> do that tomorrow, though.
>
> Out of curiosity, how do people normally monitor disk issues? I'm
> going to set up collectd to push various metrics from smartctl
> tomorrow. At the moment all we do is receive emails, which are mostly
> noise about problem sector counts increasing by one.
>
> On 24 October 2013 19:40, Jean-Marc Spaggiari <[email protected]> wrote:
>
>> Can you try vmstat 2? The 2 is the interval in seconds at which it
>> will display the disk usage. In the extract here, nothing is running;
>> only 8% is used (1% disk IO, 6% user, 1% sys).
>>
>> Run it on 2 or 3 different nodes while you are putting load on the
>> cluster, and take a look at the last 4 numbers, in particular the
>> value of the last one.
>>
>> On the usercpu0 graph, who is the grey guy showing high?
>>
>> JM
>>
>> 2013/10/24 Harry Waye <[email protected]>
>>
>>> OK, I'm running a load job at the moment. I've added some possibly
>>> incomprehensible coloured lines to the graph: http://goo.gl/cUGCGG
>>>
>>> This is actually with one fewer node, due to decommissioning one to
>>> replace a disk; hence, I guess, the one squiggly line showing no
>>> disk activity. I've included only the CPU stats for CPU0 from each
>>> node. The last graph should read "Memory Used".
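On the smartctl monitoring point above: a minimal sketch of pulling a raw attribute value out of `smartctl -A` output, the kind of thing that could be fed into collectd. The sample line here is made up, and real column layouts can vary slightly between smartmontools versions:

```shell
# Hypothetical smartctl -A line; on a live box you'd pipe the output of
# `smartctl -A /dev/sda` in instead of echoing a canned sample.
sample="  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0"

# Column 2 is the attribute name and the last column is RAW_VALUE.
echo "$sample" | awk '$2 == "Reallocated_Sector_Ct" { print $2 "=" $NF }'
# → Reallocated_Sector_Ct=0
```

A rising raw value for Reallocated_Sector_Ct or Current_Pending_Sector is usually the attribute worth alerting on, rather than emailing every +1 change.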
>>> vmstat from one of the nodes:
>>>
>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>  r  b   swpd   free   buff    cache    si  so  bi   bo  in  cs us sy id wa
>>>  6  0      0 392448 524668 43823900    0   0  501 1044   0   0  6  1 91  1
>>>
>>> To me the wait doesn't seem that high. Job stats are at
>>> http://goo.gl/ZYdUKp, and the job setup is at
>>> https://gist.github.com/hazzadous/ac57a384f2ab685f07f6
>>>
>>> Does anything jump out at you?
>>>
>>> Cheers
>>> H
>>>
>>> On 24 October 2013 16:16, Harry Waye <[email protected]> wrote:
>>>
>>>> Hi JM
>>>>
>>>> I took a snapshot on the initial run, before the changes:
>>>> https://www.evernote.com/shard/s95/sh/b8e1516d-7c49-43f0-8b5f-d16bbdd3fe13/00d7c6cd6dd9fba92d6f00f90fb54fc1/res/4f0e20a2-1ecb-4085-8bc8-b3263c23afb5/screenshot.png
>>>>
>>>> Good timing: disks appear to be exploding (ATA errors) at the
>>>> moment, so I'm decommissioning and reprovisioning with new disks.
>>>> I'll be reprovisioning without RAID (it's software RAID, just to
>>>> compound the issue), although I'm not sure how I'll go about
>>>> migrating all the nodes. I guess I'd need to put more correctly
>>>> specced nodes in the rack and decommission the existing ones.
>>>>
>>>> We're using Hetzner at the moment, which may not have been a good
>>>> choice. Has anyone had any experience with them w.r.t. Hadoop?
>>>> They offer 7- and 15-disk options, but are low on the CPU front
>>>> (quad core). Our workload will, I assume, be on the high side.
>>>> There's also an 8-disk Dell PowerEdge which is a little more
>>>> powerful. What hosting providers would people recommend? (And what
>>>> would be the strategy for migrating?)
>>>>
>>>> Anyhow, when I have things more stable I'll have a look at checking
>>>> out what's using the CPU. In the meantime, can anything be gleaned
>>>> from the above snap?
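For reference, the last four vmstat columns JM asks about (us, sy, id, wa) can be pulled out mechanically. Here it's done against the sample line from this thread; on a live node you'd run `vmstat 2` and watch the stream instead:

```shell
# The vmstat sample from the thread above; field positions follow the
# standard Linux vmstat layout, where us/sy/id/wa are the last four.
line="6 0 0 392448 524668 43823900 0 0 501 1044 0 0 6 1 91 1"
echo "$line" | awk '{ printf "us=%s sy=%s id=%s wa=%s\n", $(NF-3), $(NF-2), $(NF-1), $NF }'
# → us=6 sy=1 id=91 wa=1
```

A sustained wa well above single digits under load would support the disk theory; in this sample it's only 1%, with the box 91% idle.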
>>>> Cheers
>>>> H
>>>>
>>>> On 24 October 2013 15:14, Jean-Marc Spaggiari <[email protected]> wrote:
>>>>
>>>>> Hi Harry,
>>>>>
>>>>> Do you have more details on the exact load? Can you run vmstat and
>>>>> see what kind of load it is? Is it user? CPU? WIO?
>>>>>
>>>>> I suspect your disks are the issue. There are two things here.
>>>>>
>>>>> First, we don't recommend RAID for the HDFS/HBase disks. The best
>>>>> is to simply mount the disks on two mount points and give them
>>>>> both to HDFS. Second, 2 disks per node is very low; it's not even
>>>>> recommended on a dev cluster. In production, you should go with
>>>>> 12 or more.
>>>>>
>>>>> So with only 2 disks in RAID, I suspect your WIO is high, which is
>>>>> what might be slowing your process.
>>>>>
>>>>> Can you take a look in that direction? If it's not that, we will
>>>>> continue to investigate ;)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> JM
>>>>>
>>>>> 2013/10/23 Harry Waye <[email protected]>
>>>>>
>>>>>> I'm trying to load data into HBase using HFileOutputFormat and
>>>>>> incremental bulk load, but am getting rather lacklustre
>>>>>> performance: 10h for ~0.5TB of data, ~50000 blocks. This is
>>>>>> being loaded into a table that has 2 families, 9 columns and
>>>>>> 2500 regions, and is ~10TB in size. Keys are MD5 hashes and
>>>>>> regions are pretty evenly spread. The majority of the time
>>>>>> appears to be spent in the reduce phase, with the map phase
>>>>>> completing very quickly. The network doesn't appear to be
>>>>>> saturated, but the load is consistently at 6, which is the
>>>>>> number of reduce tasks per node.
>>>>>>
>>>>>> 12 hosts (6 cores, 2 disks as RAID0, 1GbE, no one else on the
>>>>>> rack).
>>>>>>
>>>>>> MR conf: 6 mappers, 6 reducers per node.
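A minimal sketch of the JBOD setup JM recommends: mount each disk on its own mount point and hand HDFS a comma-separated list of per-disk directories instead of striping with RAID0. The mount points below are hypothetical:

```shell
# Build the dfs.data.dir value from one directory per physical disk.
# /data/1 and /data/2 stand in for the two separately-mounted disks.
dirs=""
for d in /data/1 /data/2; do
  dirs="${dirs:+$dirs,}$d/dfs/data"
done
echo "$dirs"
# → /data/1/dfs/data,/data/2/dfs/data
```

That value goes into hdfs-site.xml as dfs.data.dir (dfs.datanode.data.dir on newer releases); the DataNode then round-robins blocks across the disks, and a single failed disk doesn't take out the whole volume the way it does with RAID0.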
>>>>>> I spoke to someone on IRC and they recommended reducing job
>>>>>> output replication to 1, and reducing the number of mappers,
>>>>>> which I reduced to 2. Reducing replication appeared not to make
>>>>>> any difference; reducing reducers appeared just to slow the job
>>>>>> down. I'm going to have a look at running the benchmarks
>>>>>> mentioned on Michael Noll's blog and see what that turns up.
>>>>>> Some questions I have are:
>>>>>>
>>>>>> How does the global number/size of blocks affect performance?
>>>>>> (I have a lot of 10MB files, which are the input files.)
>>>>>>
>>>>>> How does the job-local number/size of input blocks affect
>>>>>> performance?
>>>>>>
>>>>>> What is actually happening in the reduce phase that requires so
>>>>>> much CPU? I assume the actual construction of HFiles isn't
>>>>>> intensive.
>>>>>>
>>>>>> Ultimately, how can I improve performance?
>>>>>>
>>>>>> Thanks
>>>>
>>>> --
>>>> Harry Waye, Co-founder/CTO
>>>> [email protected]
>>>> +44 7890 734289
>>>>
>>>> Follow us on Twitter: @arachnys <https://twitter.com/#!/arachnys>
>>>>
>>>> ---
>>>> Arachnys Information Services Limited is a company registered in
>>>> England & Wales. Company number: 7269723. Registered office: 40
>>>> Clarendon St, Cambridge, CB1 1JX.
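Stepping back from the thread, the headline numbers deserve a back-of-envelope check: 0.5 TB in 10 hours across 12 nodes is a very low write rate, which fits the observation that the reducers are busy while the network and disks are not. A rough calculation (assuming the ~0.5TB and 10h figures quoted above):

```shell
# ~0.5 TB loaded in 10 h; convert to MB/s, aggregate and per node.
awk 'BEGIN {
  agg = (0.5 * 1024 * 1024) / (10 * 3600)  # MB across the whole job / seconds
  printf "aggregate: %.1f MB/s\n", agg
  printf "per node:  %.1f MB/s\n", agg / 12
}'
# → aggregate: 14.6 MB/s
# → per node:  1.2 MB/s
```

Roughly 1 MB/s per node is far below what even two disks in RAID0 can sustain, so the bottleneck is plausibly CPU work in the reduce phase (sorting, serialization, compression while writing HFiles) rather than raw I/O.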
