Galit, yes, this does sound related, and as Matt said, you can test it by setting the max split size on the CLI. I didn't personally find that to be a reliable or efficient method, so I wrote the -m parameter into my job to set it correctly every time. It seems like this would be useful to have as a general parameter for Mahout jobs; is there agreement on that, and if so, can I get some guidance on how to contribute?

Ryan
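For concreteness, the command-line test described above might look roughly like the following for one of Galit's kmeans runs. This is only a sketch: the paths, the ~10 MB input size, and the kmeans arguments (-k, -x, -ow, -cl) are illustrative placeholders, not details from this thread.

    # hypothetical ~10 MB of input vectors; ~100 KB splits should yield ~100 mappers
    mahout kmeans -Dmapred.max.split.size=100000 \
      -i /path/to/vectors \
      -c /path/to/initial-centroids \
      -o /path/to/clusters \
      -k 20 -x 10 -ow -cl

Lowering mapred.max.split.size below the HDFS block size is what produces the extra splits; raising it above the block size has no effect, which matches the max(min, min(max, block)) calculation in the getSplitSize() snippet quoted further down.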
On Aug 1, 2013, at 8:00, Matt Molek <[email protected]> wrote:

> One trick to getting more mappers on a job when running from the command
> line is to pass a '-Dmapred.max.split.size=xxxx' argument. The xxxx is a
> size in bytes. So if you have some hypothetical 10MB input set, but you
> want to force ~100 mappers, use '-Dmapred.max.split.size=100000'
>
>
> On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit
> <[email protected]> wrote:
>
>> Hi,
>>
>> It sounds to me like this could be related to one of the questions I
>> posted several days ago (is it?):
>> My mahout clustering processes seem to be running very slowly (several
>> hours on just ~1M items), and I'm wondering if there's anything that
>> needs to be changed in the settings/configuration (and how?).
>> I'm running on a large cluster and could potentially use thousands
>> of nodes (mappers/reducers). However, my mahout processes (kmeans/canopy)
>> are only using at most 5 mappers (I tried it on several data sets).
>> I've tried to set the number of mappers with something like
>> -Dmapred.map.tasks=100, but this didn't seem to have an effect; it still
>> only uses <=5 mappers.
>> Is there a different way to set the number of mappers/reducers for
>> a mahout process?
>> Or is there another configuration issue I need to consider?
>>
>> I'd definitely be happy to use such a parameter; does it not exist?
>> (I'm running mahout as installed on the cluster.)
>>
>> Is there currently a workaround, besides running a mahout jar as a
>> Hadoop job?
>> When I originally tried to run a mahout jar that uses KMeansDriver (and
>> that runs great on my local machine), it did not even initiate a job on
>> the Hadoop cluster. It seemed to be running in parallel, but in fact it
>> was running only on the local node. Is this a known issue? Is there a
>> fix for it? (I ended up dropping it and calling mahout step by step from
>> the command line, but I'd be happy to know if there is a fix.)
>>
>> Thanks,
>>
>> Galit.
>>
>> -----Original Message-----
>> From: Ryan Josal [mailto:[email protected]]
>> Sent: Monday, July 29, 2013 9:33 PM
>> To: Adam Baron
>> Cc: Ryan Josal; [email protected]
>> Subject: Re: Run more than one mapper for TestForest?
>>
>> If you're running mahout from the CLI, you'll have to modify the Hadoop
>> config file or your env manually for each job. This is code I put into
>> my custom job executions so I didn't have to calculate and set that up
>> every time. Maybe that's your best route in that position. You could
>> just provide your own mahout jar and run it as you would any other
>> Hadoop job and ignore the installed Mahout. I do think this could be a
>> useful parameter for a number of standard mahout jobs though; I know I
>> would use it. Does anyone in the mahout community see this as a
>> generally useful feature for a Mahout job?
>>
>> Ryan
>>
>> On Jul 29, 2013, at 10:25, Adam Baron <[email protected]> wrote:
>>
>>> Ryan,
>>>
>>> Thanks for the fix; the code looks reasonable to me. Which version of
>>> Mahout will this be in? 0.9?
>>>
>>> Unfortunately, I'm using a large shared Hadoop cluster which is not
>>> administered by my team. So I'm not in a position to push the latest
>>> from the Mahout dev trunk into our environment; the admins will only
>>> install official releases.
>>>
>>> Regards,
>>> Adam
>>>
>>> On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal <[email protected]> wrote:
>>>> Late reply, but for what it's still worth, since I've seen a couple
>>>> other threads here on the topic of too few mappers, I added a
>>>> parameter to set a minimum number of mappers. Some of my mahout jobs
>>>> needed more mappers, but were not given many because of the small
>>>> input file size.
>>>>
>>>> addOption("minMapTasks", "m", "Minimum number of map tasks to run",
>>>>     String.valueOf(1));
>>>>
>>>> int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
>>>> int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
>>>> log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
>>>> if (minMapTasks > mapTasksThatWouldRun) {
>>>>     String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
>>>>     log.info("Forcing mapred.max.split.size to " + splitSizeBytes
>>>>         + " to ensure minimum map tasks = " + minMapTasks);
>>>>     hadoopConf.set("mapred.max.split.size", splitSizeBytes);
>>>> }
>>>>
>>>> // there is actually a private method in hadoop to calculate this
>>>> private long getSplitSize() {
>>>>     long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
>>>>     long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
>>>>     int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
>>>>     long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
>>>>     log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
>>>>         minSize, blockSize, maxSize, splitSize));
>>>>     return splitSize;
>>>> }
>>>>
>>>> It seems like there should be a more straightforward way to do this,
>>>> but it works for me and I've used it on a lot of jobs to set a minimum
>>>> number of mappers.
>>>>
>>>> Ryan
>>>>
>>>> On Jul 5, 2013, at 2:00 PM, Adam Baron wrote:
>>>>
>>>>> I'm attempting to run
>>>>> org.apache.mahout.classifier.df.mapreduce.TestForest
>>>>> on a CSV with 200,000 rows that have 500,000 features per row.
>>>>> However, TestForest is running extremely slow, likely because only
>>>>> 1 mapper was assigned to the job. This seems strange because the
>>>>> org.apache.mahout.classifier.df.mapreduce.BuildForest step on the
>>>>> same data used 1772 mappers and took about 6 minutes. (BTW: I know I
>>>>> *shouldn't* use the same data set for the training and the testing
>>>>> steps; this is purely a technical experiment to see if Mahout's
>>>>> Random Forest can handle the data sizes we typically deal with.)
>>>>>
>>>>> Any idea on how to get
>>>>> org.apache.mahout.classifier.df.mapreduce.TestForest
>>>>> to use more mappers? Glancing at the code (and thinking about what
>>>>> is happening intuitively), it should be ripe for parallelization.
>>>>>
>>>>> Thanks,
>>>>> Adam
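For anyone in Adam's or Galit's position who cannot rebuild Mahout with a patch like the one quoted above, a rough command-line equivalent of the same calculation is: measure the total input size, divide it by the number of mappers you want, and pass the result as the max split size. The sketch below assumes a TestForest run; the paths, the dataset/model file names, and the target of 100 mappers are placeholders, and whether TestForest's input format honors mapred.max.split.size at all is exactly what this thread suggests testing.

    INPUT=/path/to/test/data
    WANT=100    # desired minimum number of mappers
    # the third column of 'hadoop fs -count' is the total content size in bytes
    BYTES=$(hadoop fs -count "$INPUT" | awk '{print $3}')
    SPLIT=$(( BYTES / WANT + 1 ))
    mahout org.apache.mahout.classifier.df.mapreduce.TestForest \
      -Dmapred.max.split.size=$SPLIT \
      -i "$INPUT" -ds /path/to/dataset.info -m /path/to/forest-model \
      -a -mr -o /path/to/predictions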
