mapred.tasktracker.map.tasks.maximum is read when the TaskTracker starts, so I don't think you can configure it per job. More granular resource control is a job for Mesos or YARN; plain MapReduce doesn't support this sort of thing.
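For reference, the job-side wiring for the multithreaded-mapper approach discussed below would look roughly like this. This is only a sketch against the Hadoop 1.x "new" mapreduce API; CVBTopicMapper is a placeholder name (sketched at the bottom of this mail), not the actual Mahout class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class MultithreadedCVBDriverSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JVM reuse is scoped to a single job; -1 means "no limit on tasks per JVM"
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        Job job = new Job(conf, "cvb0-multithreaded-sketch");
        job.setJarByClass(MultithreadedCVBDriverSketch.class);

        // Run one multithreaded map task with N worker threads instead of
        // N single-threaded map tasks, so the term-topic table is held once.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, CVBTopicMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);

        // input/output formats, paths and key/value classes omitted here
        job.waitForCompletion(true);
      }
    }

Note that the thread-pool size is per map task; how many map tasks run concurrently on a node is still capped by the TaskTracker's mapred.tasktracker.map.tasks.maximum, which is why that setting matters at all.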
I think for our deployment we'll carve out a chunk of the cluster, sized and configured exclusively to do ML 24/7. If we don't need that much capacity, it might be better to spin up an Elastic Map Reduce cluster for a few hours every day.

On Thu, Jun 13, 2013 at 4:20 PM, Sebastian Schelter <[email protected]> wrote:

> On 13.06.2013 22:12, Dmitriy Lyubimov wrote:
> > On Thu, Jun 13, 2013 at 1:00 PM, Sebastian Schelter <[email protected]> wrote:
> >
> >> This table is readonly, right? We could try to apply the trick from our
> >> ALS code: Instead of running one mapper per core (and thus having one
> >> copy of the table per core), run a multithreaded mapper and share the
> >> table between its threads. Works very well for ALS.
> >
> > Just out of my ignorance, how will you tell MR that your mapper is using
> > more than 1 core and that it doesn't have to run more than 1 mapepr of
> > that time per box?
>
> You need to use a MultithreadedMapper for which you can set the size of
> the thread pool via MultithreadedMapper.setNumberOfThreads(...)
>
> You can configure the maximum number of mappers to run per task tracker
> with -Dmapred.tasktracker.map.tasks.maximum=x
>
> >> We can also cache
> >> the table in a static variable and make Hadoop reuse JVMs, which
> >> increases performance if the number of blocks to process is larger than
> >> the number of map slots.
> >
> > This usually (or might be) something the admin doesn't let us override.
> > Also i am not sure if jvm reuse in hadoop is isolated between different
> > jobs (so other tasks may inherit that stuff they probably don't want)
>
> The jvm is only reused during a single job.
>
> >> -sebastian
> >>
> >> On 13.06.2013 21:56, Ted Dunning wrote:
> >>> On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix <[email protected]> wrote:
> >>>
> >>>> Andy, note that he said he's running with a 1.6M-term dictionary. That's
> >>>> going to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic matrices.
> >>>> Still not hitting 8GB, but getting closer.
> >>>
> >>> It will likely be even worse unless this table is shared between mappers.
> >>> With 8 mappers per node, this goes to 41GB. The OP didn't mention machine
> >>> configuration, but this could easily cause swapping.

--
Alan Gardner
Solutions Architect - CTO Office
[email protected] | LinkedIn: http://www.linkedin.com/profile/view?id=65508699 | @alanctgardner <https://twitter.com/alanctgardner>
Tel: +1 613 565 8696 x1218
Mobile: +1 613 897 5655
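For completeness, the mapper side of the shared-table trick Sebastian describes would look roughly like this. Again only a sketch: CVBTopicMapper and loadTermTopicTable are made-up stand-ins, not the actual Mahout CVB0 code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.VectorWritable;

    public class CVBTopicMapper
        extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

      // One copy per JVM: shared by all threads of a MultithreadedMapper task,
      // and by later tasks of the same job when JVM reuse is enabled.
      private static volatile double[][] termTopicCounts;

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        synchronized (CVBTopicMapper.class) {
          if (termTopicCounts == null) {
            termTopicCounts = loadTermTopicTable(context.getConfiguration());
          }
        }
      }

      private static double[][] loadTermTopicTable(Configuration conf) {
        // Placeholder: in reality the term-topic matrix would be read from
        // HDFS or the distributed cache.
        return new double[0][0];
      }

      @Override
      protected void map(IntWritable docId, VectorWritable doc, Context context)
          throws IOException, InterruptedException {
        // Per-document inference against the shared termTopicCounts would go here.
      }
    }

MultithreadedMapper instantiates one mapper object per thread and calls setup() on each, so the lazy load has to be synchronized; the table itself is read-only, so the threads can then use it without locking.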
