Oops, one of the settings should read "yarn.nodemanager.vmem-check-enabled". The blog post has a typo, and a comment pointed that out as well.

Thanks,
Calvin
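For reference, a minimal yarn-site.xml sketch of the fix the thread arrives at (the property names are the standard YARN ones; the values assume a cluster with no swap, as described below):

    <!-- With no swap, a task's virtual memory cannot usefully exceed its
         physical allocation, so pin the ratio at 1... -->
    <property>
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>1</value>
    </property>
    <!-- ...and disable the NodeManager's virtual-memory check entirely. -->
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>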
On Mon, Aug 18, 2014 at 4:45 PM, Calvin <[email protected]> wrote:
> OK, I figured out exactly what was happening.
>
> I had set the configuration value "yarn.nodemanager.vmem-pmem-ratio"
> to 10. Since there is no swap space available for use, every task
> which is requesting 2 GB of memory is also requesting an additional
> 20 GB of memory. This 20 GB isn't represented in the "Memory Used"
> column on the YARN applications status page, and thus it seemed like
> I was underutilizing the YARN cluster (when in actuality I had
> allocated all the memory available).
>
> (The cluster "underutilization" occurs regardless of using HDFS or
> LocalFileSystem; I must have made this configuration change after
> testing HDFS and before testing the local filesystem.)
>
> The solution is to set "yarn.nodemanager.vmem-pmem-ratio" to 1 (since
> I have no swap) *and* "yarn.nodemanager.vmem-check.enabled" to false.
>
> Part of the reason I had set such a high ratio was that containers
> were being killed because of virtual memory usage. The Cloudera folks
> have a good blog post [1] on this topic (see #6), and I wish I had
> read it sooner.
>
> With the above configuration values, I can now utilize the cluster at 100%.
>
> Thanks for everyone's input!
>
> Calvin
>
> [1] http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
>
> On Fri, Aug 15, 2014 at 2:11 PM, java8964 <[email protected]> wrote:
>> Interesting to know that.
>>
>> I would also like to know what underlying logic limits the job to
>> only 25-35 parallel containers, instead of up to 1300.
>>
>> Another suggestion I can give is the following:
>>
>> 1) In your driver, generate a text file listing all 1300 of your .bz2
>> file names with absolute paths.
>> 2) In your MR job, use NLineInputFormat; with the default setting,
>> each line will trigger one mapper task.
>> 3) In your mapper, the key/value pair will be the byte offset and the
>> line content; just start processing that file, as it should be
>> available from the mount path on the local data nodes.
>> 4) I assume you are using YARN. In this case, at least 1300 container
>> requests will be issued to the cluster. You generate 1300 parallel
>> requests, and it is then up to the cluster to decide how many
>> containers can run concurrently.
>>
>> Yong
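A minimal driver sketch of the approach above (the manifest path, output path, and mapper body are illustrative, not from the thread; the NLineInputFormat calls are the standard Hadoop 2.x API):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FileListDriver {

      // Receives (byte offset, manifest line); each line is the absolute
      // path of one input file. Real per-file processing would go here;
      // this placeholder just echoes the path.
      public static class FileListMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text filePath, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(filePath, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process-file-list");
        job.setJarByClass(FileListDriver.class);

        // One manifest line per split => one mapper per file, so ~1300
        // map tasks are requested from the cluster.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path("file:///scratch/file-list.txt"));
        NLineInputFormat.setNumLinesPerSplit(job, 1);

        job.setMapperClass(FileListMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileOutputFormat.setOutputPath(job, new Path("file:///scratch/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }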
>>
>>> Date: Fri, 15 Aug 2014 12:30:09 -0600
>>> Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> Thanks for the responses!
>>>
>>> To clarify, I'm not using any special FileSystem implementation. An
>>> example input parameter to a MapReduce job would be something like
>>> "-input file:///scratch/data". Thus I think (any clarification would
>>> be helpful) Hadoop is then utilizing LocalFileSystem
>>> (org.apache.hadoop.fs.LocalFileSystem).
>>>
>>> The input data is large enough and splittable (1300 .bz2 files,
>>> 274 MB each, 350 GB total). Even if the input data weren't
>>> splittable, Hadoop should be able to parallelize up to 1300 map
>>> tasks if capacity is available. In my case, I find that the Hadoop
>>> cluster is not fully utilized (i.e., ~25-35 containers running when
>>> it can scale up to ~80 containers) when not using HDFS, while
>>> achieving maximum use when using HDFS.
>>>
>>> I'm wondering if Hadoop is "holding back" or throttling the I/O when
>>> LocalFileSystem is being used, and what changes I can make to have
>>> the Hadoop tasks scale.
>>>
>>> In the meantime, I'll take a look at the API calls that Harsh mentioned.
>>>
>>> On Fri, Aug 15, 2014 at 10:15 AM, Harsh J <[email protected]> wrote:
>>> > The split configurations in FIF mentioned earlier would work for
>>> > local files as well. They aren't deemed unsplittable, just
>>> > considered as one single block.
>>> >
>>> > If the FS in use has its advantages, it's better to implement a
>>> > proper interface to it making use of them than to rely on the LFS
>>> > by mounting it. This is what we do with HDFS.
>>> >
>>> > On Aug 15, 2014 8:52 PM, "java8964" <[email protected]> wrote:
>>> >>
>>> >> I believe Calvin mentioned before that this parallel filesystem is
>>> >> mounted into the local file system.
>>> >>
>>> >> In this case, will Hadoop just use java.io.File as the local file
>>> >> system, treating the files as local and not splitting them?
>>> >>
>>> >> I just want to know the logic in how Hadoop handles local files.
>>> >>
>>> >> One suggestion I can think of is to split the files manually
>>> >> outside of Hadoop; for example, generate lots of small files of
>>> >> 128 MB or 256 MB in size.
>>> >>
>>> >> In this case, each mapper will process one small file, so you can
>>> >> get good utilization of your cluster, assuming you have a lot of
>>> >> small files.
>>> >>
>>> >> Yong
>>> >>
>>> >> > From: [email protected]
>>> >> > Date: Fri, 15 Aug 2014 16:45:02 +0530
>>> >> > Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems
>>> >> > To: [email protected]
>>> >> >
>>> >> > Does your non-HDFS filesystem implement a getBlockLocations API,
>>> >> > which MR relies on to know how to split files?
>>> >> >
>>> >> > The API is at
>>> >> > http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus, long, long),
>>> >> > and MR calls it at
>>> >> > https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L392
>>> >> >
>>> >> > If not, perhaps you can enforce manual chunking by asking MR to
>>> >> > use custom min/max split size values via config properties:
>>> >> > https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L66
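A sketch of that manual chunking (these are the Hadoop 2.x property names behind the constants at the link above; the 128 MB/256 MB bounds are illustrative, not from the thread):

    <!-- mapred-site.xml (or per-job -D properties): bound the split size
         so large single-block files are still carved into multiple map
         tasks, provided the input format and codec are splittable -->
    <property>
      <name>mapreduce.input.fileinputformat.split.minsize</name>
      <value>134217728</value>  <!-- 128 MB -->
    </property>
    <property>
      <name>mapreduce.input.fileinputformat.split.maxsize</name>
      <value>268435456</value>  <!-- 256 MB -->
    </property>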
>>> >> >
>>> >> > On Fri, Aug 15, 2014 at 10:16 AM, Calvin <[email protected]> wrote:
>>> >> > > I've looked a bit into this problem some more, and from what
>>> >> > > another person has written, HDFS is tuned to scale
>>> >> > > appropriately [1] given the number of input splits, etc.
>>> >> > >
>>> >> > > In the case of utilizing the local filesystem (which is really
>>> >> > > a network share on a parallel filesystem), the settings might
>>> >> > > be set conservatively in order not to thrash the local disks
>>> >> > > or present a bottleneck in processing.
>>> >> > >
>>> >> > > Since this isn't a big concern, I'd rather tune the settings
>>> >> > > to efficiently utilize the local filesystem.
>>> >> > >
>>> >> > > Are there any pointers to where in the source code I could
>>> >> > > look in order to tweak such parameters?
>>> >> > >
>>> >> > > Thanks,
>>> >> > > Calvin
>>> >> > >
>>> >> > > [1] https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems
>>> >> > >
>>> >> > > On Tue, Aug 12, 2014 at 12:29 PM, Calvin <[email protected]> wrote:
>>> >> > >> Hi all,
>>> >> > >>
>>> >> > >> I've instantiated a Hadoop 2.4.1 cluster, and I've found that
>>> >> > >> running MapReduce applications will parallelize differently
>>> >> > >> depending on what kind of filesystem the input data is on.
>>> >> > >>
>>> >> > >> Using HDFS, a MapReduce job will spawn enough containers to
>>> >> > >> maximize use of all available memory. For example, on a
>>> >> > >> 3-node cluster with 172 GB of memory and each map task
>>> >> > >> allocating 2 GB, about 86 application containers will be
>>> >> > >> created.
>>> >> > >>
>>> >> > >> On a filesystem that isn't HDFS (like NFS or, in my use case,
>>> >> > >> a parallel filesystem), a MapReduce job will only allocate a
>>> >> > >> subset of the available tasks (e.g., with the same 3-node
>>> >> > >> cluster, about 25-40 containers are created). Since I'm using
>>> >> > >> a parallel filesystem, I'm not as concerned with the
>>> >> > >> bottlenecks one would find if one were to use NFS.
>>> >> > >>
>>> >> > >> Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml)
>>> >> > >> configuration that will allow me to effectively maximize
>>> >> > >> resource utilization?
>>> >> > >>
>>> >> > >> Thanks,
>>> >> > >> Calvin
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Harsh J
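For reference, the container arithmetic in the original question is governed by two standard settings (a sketch; the per-node value is illustrative, ~57 GB per NodeManager x 3 nodes ≈ 172 GB, so 2 GB maps give roughly 86 containers):

    <!-- yarn-site.xml on each node: memory the NodeManager offers to YARN -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>58368</value>  <!-- ~57 GB; illustrative value -->
    </property>

    <!-- mapred-site.xml: memory requested per map-task container -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>  <!-- 2 GB per map, as in the thread -->
    </property>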
