Split size can be set through mapred.min.split.size; e.g., to set the split size to 1 MB: -Dmapred.min.split.size=1048576

--Konstantin
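For context on how that interacts with the number of mappers: in the 0.20-era old-API FileInputFormat, the split size is roughly max(minSize, min(totalSize / mapred.map.tasks, blockSize)), so mapred.min.split.size only sets a floor, and the split count is driven mainly by mapred.map.tasks. Below is a minimal sketch of that computation, not the actual Hadoop source; the class and variable names are illustrative, the 64 MB block size is an assumption, and the 46 MB input size comes from the thread below.

    // Sketch of how the 0.20-era org.apache.hadoop.mapred.FileInputFormat
    // chooses a split size (illustrative names, not the real source).
    public final class SplitSizeSketch {

        // totalSize: bytes of input; numSplits: mapred.map.tasks;
        // minSize: mapred.min.split.size; blockSize: dfs.block.size
        static long splitSize(long totalSize, int numSplits, long minSize, long blockSize) {
            long goalSize = totalSize / Math.max(numSplits, 1);
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024L;
            long input = 46 * mb;   // input size from the thread
            long block = 64 * mb;   // assumed HDFS block size
            // Default mapred.map.tasks=2: 23 MB splits -> 2 map tasks.
            System.out.println(splitSize(input, 2, 1, block) / mb);
            // -Dmapred.map.tasks=20: ~2 MB splits -> ~20 map tasks.
            System.out.println(splitSize(input, 20, 1, block) / mb);
            // mapred.min.split.size=1048576 alone: still 23 MB splits.
            System.out.println(splitSize(input, 2, 1 * mb, block) / mb);
        }
    }

With the numbers from this thread, the default of two map tasks is exactly what the formula predicts, and raising mapred.map.tasks is what actually shrinks the splits.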
On Wed, Sep 7, 2011 at 1:39 AM, Sean Owen <[email protected]> wrote:
> I see. On EMR, I think the setting you need is
> mapred.tasktracker.map.tasks.minimum. At least that's what I see digging
> through my old EMR code.
>
> Dhruv, yes, a lot of these settings are just suggestions to the framework. I
> am not entirely clear on the heuristics used, but I do know that Jake is
> right, that it's driven primarily off the input size and how much input it
> thinks should go to a worker. You can override these things, but do beware:
> you're probably incurring more overhead than is sensible. It might still
> make sense if you're running on a dedicated cluster where those resources
> are otherwise completely idle, but it's not in general a good idea on a
> shared cluster.
>
> Chris, are you sure one mapper was running in your last example? I don't see
> an indication of that from the log output one way or the other.
>
> I don't know LDA well. It sounds like you are saying that LDA mappers take a
> long time on a little input, which would suggest that's a bottleneck. I
> don't know one way or the other there... but if that's true, you are right
> that we could bake in settings to force an unusually small input split size.
>
> And Jake's last point echoes Sebastian's and Ted's: on EMR, fewer big
> machines are better. One of their biggest instances is probably more
> economical than 20 small ones. And, as a bonus, all of the data and
> processing will stay on one machine. (Of course, the master is still a
> separate instance. I use one small machine for the master and make it a
> reserved instance, not a spot instance, so it's really unlikely to die.) Of
> course you're vulnerable to that one machine dying, but for all practical
> purposes it's going to be a win for you.
>
> Definitely use the spot instance market! Ted's right that the pricing is
> crazy good.
>
> On Tue, Sep 6, 2011 at 11:57 PM, Chris Lu <[email protected]> wrote:
>
>> Thanks. Very helpful to me!
>>
>> I tried to change the setting of "mapred.map.tasks". However, the number
>> of map tasks is still just one, on one of the 20 machines.
>>
>> ./elastic-mapreduce --create --alive \
>>   --num-instances 20 --name "LDA" \
>>   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
>>   --bootstrap-name "Configuring number of map tasks per job" \
>>   --args "-m,mapred.map.tasks=40"
>>
>> Does anyone know how to configure the number of mappers?
>> Again, the input size is only 46M.
>>
>> Chris
>>
>> On 09/06/2011 12:09 PM, Ted Dunning wrote:
>>
>>> Well, I think that using small instances is a disaster in general. The
>>> performance that you get from them can easily vary by an order of
>>> magnitude. My own preference for real work is either m2xl or cc14xl. The
>>> latter machines give you nearly bare-metal performance and no noisy
>>> neighbors. The m2xl is typically very much underpriced on the spot market.
>>>
>>> Sean is right about your job being misconfigured. The Hadoop overhead is
>>> considerable, and you have only given it two threads to overcome that
>>> overhead.
>>>
>>> On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen <[email protected]> wrote:
>>>
>>>> That's your biggest issue, certainly. Only 2 mappers are running, even
>>>> though you have 20 machines available. Hadoop determines the number of
>>>> mappers based on input size, and your input isn't so big that it thinks
>>>> you need 20 workers. It's launching 33 reducers, so your cluster is put
>>>> to use there. But it's no wonder you're not seeing anything like a 20x
>>>> speedup in the mapper.
>>>>
>>>> You can of course force it to use more mappers, and that's probably a
>>>> good idea here: -Dmapred.map.tasks=20, perhaps. More mappers means more
>>>> overhead of spinning up mappers to process less data, and Hadoop's guess
>>>> indicates that it thinks it's not efficient to use 20 workers. If you
>>>> know that those other 18 are otherwise idle, my guess is you'd benefit
>>>> from just making it use 20.
>>>>
>>>> If this were a general large cluster where many people are taking
>>>> advantage of the workers, then I'd trust Hadoop's guesses until you are
>>>> sure you want to do otherwise.
>>>>
>>>> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu <[email protected]> wrote:
>>>>
>>>>> Thanks for all the suggestions!
>>>>>
>>>>> All the inputs are the same. It takes 85 hours for 4 iterations on 20
>>>>> Amazon small machines. On my local single node, it got to iteration 19
>>>>> in the same 85 hours.
>>>>>
>>>>> Here is a section of the Amazon log output. It covers the start of
>>>>> iteration 1 and the gap between iteration 4 and iteration 5.
>>>>>
>>>>> The number of map tasks is set to 2. Should it be larger, or related to
>>>>> the number of CPU cores?
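For anyone who wants to set these knobs per job rather than cluster-wide through the configure-hadoop bootstrap action quoted above, here is a minimal sketch of the programmatic equivalents, assuming the old "mapred" API; the class and method names are hypothetical, and the values simply mirror the suggestions in this thread.

    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical helper: per-job equivalents of -Dmapred.map.tasks=20 and
    // -Dmapred.min.split.size=1048576, set on the job's own configuration.
    public final class LdaJobTuning {
        public static JobConf tune(JobConf conf) {
            conf.setNumMapTasks(20);                         // a hint only; actual count comes from the splits
            conf.setLong("mapred.min.split.size", 1048576L); // floor on the split size (1 MB)
            return conf;
        }
    }

As Sean notes above, mapred.map.tasks is only a hint to the InputFormat, so the resulting number of map tasks can still differ from what you ask for.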
