Oops, one of the settings should read "yarn.nodemanager.vmem-check-enabled". The blog post has a typo, and a comment pointed that out as well.

Thanks,
Calvin
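For reference, a minimal yarn-site.xml sketch of the fix the thread arrives at (the property names are the standard YARN ones; the values assume a cluster with no swap, as described below):

    <!-- With no swap, a task's virtual memory cannot usefully exceed its
         physical allocation, so pin the ratio at 1... -->
    <property>
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>1</value>
    </property>
    <!-- ...and disable the NodeManager's virtual-memory check entirely. -->
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>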
On Mon, Aug 18, 2014 at 4:45 PM, Calvin <[email protected]> wrote:
> OK, I figured out exactly what was happening.
>
> I had set the configuration value "yarn.nodemanager.vmem-pmem-ratio"
> to 10. Since there is no swap space available for use, every task
> which is requesting 2 GB of memory is also requesting an additional
> 20 GB of memory. This 20 GB isn't represented in the "Memory Used"
> column on the YARN applications status page, and thus it seemed like
> I was underutilizing the YARN cluster (when in actuality I had
> allocated all the memory available).
>
> (The cluster "underutilization" occurs regardless of using HDFS or
> LocalFileSystem; I must have made this configuration change after
> testing HDFS and before testing the local filesystem.)
>
> The solution is to set "yarn.nodemanager.vmem-pmem-ratio" to 1 (since
> I have no swap) *and* "yarn.nodemanager.vmem-check.enabled" to false.
>
> Part of the reason I had set such a high ratio was that containers
> were being killed because of virtual memory usage. The Cloudera folks
> have a good blog post [1] on this topic (see #6), and I wish I had
> read it sooner.
>
> With the above configuration values, I can now utilize the cluster at 100%.
>
> Thanks for everyone's input!
>
> Calvin
>
> [1] http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
>
> On Fri, Aug 15, 2014 at 2:11 PM, java8964 <[email protected]> wrote:
>> Interesting to know that.
>>
>> I would also like to know what underlying logic limits the job to
>> only 25-35 parallel containers, instead of up to 1300.
>>
>> Another suggestion I can give is the following:
>>
>> 1) In your driver, generate a text file listing all 1300 of your .bz2
>> file names with absolute paths.
>> 2) In your MR job, use NLineInputFormat; with the default setting,
>> each line will trigger one mapper task.
>> 3) In your mapper, the key/value pair will be the byte offset and the
>> line content; just start processing that file, as it should be
>> available from the mount path on the local data nodes.
>> 4) I assume you are using YARN. In this case, at least 1300 container
>> requests will be issued to the cluster. You generate 1300 parallel
>> requests, and it is then up to the cluster to decide how many
>> containers can run concurrently.
>>
>> Yong
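A minimal driver sketch of the approach above (the manifest path, output path, and mapper body are illustrative, not from the thread; the NLineInputFormat calls are the standard Hadoop 2.x API):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FileListDriver {

      // Receives (byte offset, manifest line); each line is the absolute
      // path of one input file. Real per-file processing would go here;
      // this placeholder just echoes the path.
      public static class FileListMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text filePath, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(filePath, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process-file-list");
        job.setJarByClass(FileListDriver.class);

        // One manifest line per split => one mapper per file, so ~1300
        // map tasks are requested from the cluster.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path("file:///scratch/file-list.txt"));
        NLineInputFormat.setNumLinesPerSplit(job, 1);

        job.setMapperClass(FileListMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileOutputFormat.setOutputPath(job, new Path("file:///scratch/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }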
>>
>>> Date: Fri, 15 Aug 2014 12:30:09 -0600
>>> Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> Thanks for the responses!
>>>
>>> To clarify, I'm not using any special FileSystem implementation. An
>>> example input parameter to a MapReduce job would be something like
>>> "-input file:///scratch/data". Thus I think (any clarification would
>>> be helpful) Hadoop is then utilizing LocalFileSystem
>>> (org.apache.hadoop.fs.LocalFileSystem).
>>>
>>> The input data is large enough and splittable (1300 .bz2 files,
>>> 274 MB each, 350 GB total). Even if the input data weren't
>>> splittable, Hadoop should be able to parallelize up to 1300 map
>>> tasks if capacity is available. In my case, I find that the Hadoop
>>> cluster is not fully utilized (i.e., ~25-35 containers running when
>>> it can scale up to ~80 containers) when not using HDFS, while
>>> achieving maximum use when using HDFS.
>>>
>>> I'm wondering if Hadoop is "holding back" or throttling the I/O when
>>> LocalFileSystem is being used, and what changes I can make to have
>>> the Hadoop tasks scale.
>>>
>>> In the meantime, I'll take a look at the API calls that Harsh mentioned.
>>>
>>> On Fri, Aug 15, 2014 at 10:15 AM, Harsh J <[email protected]> wrote:
>>> > The split configurations in FIF mentioned earlier would work for
>>> > local files as well. They aren't deemed unsplittable, just
>>> > considered as one single block.
>>> >
>>> > If the FS in use has its advantages, it's better to implement a
>>> > proper interface to it making use of them than to rely on the LFS
>>> > by mounting it. This is what we do with HDFS.
>>> >
>>> > On Aug 15, 2014 8:52 PM, "java8964" <[email protected]> wrote:
>>> >>
>>> >> I believe Calvin mentioned before that this parallel filesystem is
>>> >> mounted into the local file system.
>>> >>
>>> >> In this case, will Hadoop just use java.io.File as the local file
>>> >> system, treating the files as local and not splitting them?
>>> >>
>>> >> I just want to know the logic in how Hadoop handles local files.
>>> >>
>>> >> One suggestion I can think of is to split the files manually
>>> >> outside of Hadoop; for example, generate lots of small files of
>>> >> 128 MB or 256 MB in size.
>>> >>
>>> >> In this case, each mapper will process one small file, so you can
>>> >> get good utilization of your cluster, assuming you have a lot of
>>> >> small files.
>>> >>
>>> >> Yong
>>> >>
>>> >> > From: [email protected]
>>> >> > Date: Fri, 15 Aug 2014 16:45:02 +0530
>>> >> > Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems
>>> >> > To: [email protected]
>>> >> >
>>> >> > Does your non-HDFS filesystem implement a getBlockLocations API,
>>> >> > which MR relies on to know how to split files?
>>> >> >
>>> >> > The API is at
>>> >> > http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus, long, long),
>>> >> > and MR calls it at
>>> >> > https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L392
>>> >> >
>>> >> > If not, perhaps you can enforce manual chunking by asking MR to
>>> >> > use custom min/max split size values via config properties:
>>> >> > https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L66
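A sketch of that manual chunking (these are the Hadoop 2.x property names behind the constants at the link above; the 128 MB/256 MB bounds are illustrative, not from the thread):

    <!-- mapred-site.xml (or per-job -D properties): bound the split size
         so large single-block files are still carved into multiple map
         tasks, provided the input format and codec are splittable -->
    <property>
      <name>mapreduce.input.fileinputformat.split.minsize</name>
      <value>134217728</value>  <!-- 128 MB -->
    </property>
    <property>
      <name>mapreduce.input.fileinputformat.split.maxsize</name>
      <value>268435456</value>  <!-- 256 MB -->
    </property>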
>>> >> >
>>> >> > On Fri, Aug 15, 2014 at 10:16 AM, Calvin <[email protected]> wrote:
>>> >> > > I've looked a bit into this problem some more, and from what
>>> >> > > another person has written, HDFS is tuned to scale
>>> >> > > appropriately [1] given the number of input splits, etc.
>>> >> > >
>>> >> > > In the case of utilizing the local filesystem (which is really
>>> >> > > a network share on a parallel filesystem), the settings might
>>> >> > > be set conservatively in order not to thrash the local disks
>>> >> > > or present a bottleneck in processing.
>>> >> > >
>>> >> > > Since this isn't a big concern, I'd rather tune the settings
>>> >> > > to efficiently utilize the local filesystem.
>>> >> > >
>>> >> > > Are there any pointers to where in the source code I could
>>> >> > > look in order to tweak such parameters?
>>> >> > >
>>> >> > > Thanks,
>>> >> > > Calvin
>>> >> > >
>>> >> > > [1] https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems
>>> >> > >
>>> >> > > On Tue, Aug 12, 2014 at 12:29 PM, Calvin <[email protected]> wrote:
>>> >> > >> Hi all,
>>> >> > >>
>>> >> > >> I've instantiated a Hadoop 2.4.1 cluster, and I've found that
>>> >> > >> running MapReduce applications will parallelize differently
>>> >> > >> depending on what kind of filesystem the input data is on.
>>> >> > >>
>>> >> > >> Using HDFS, a MapReduce job will spawn enough containers to
>>> >> > >> maximize use of all available memory. For example, on a
>>> >> > >> 3-node cluster with 172 GB of memory and each map task
>>> >> > >> allocating 2 GB, about 86 application containers will be
>>> >> > >> created.
>>> >> > >>
>>> >> > >> On a filesystem that isn't HDFS (like NFS or, in my use case,
>>> >> > >> a parallel filesystem), a MapReduce job will only allocate a
>>> >> > >> subset of the available tasks (e.g., with the same 3-node
>>> >> > >> cluster, about 25-40 containers are created). Since I'm using
>>> >> > >> a parallel filesystem, I'm not as concerned with the
>>> >> > >> bottlenecks one would find if one were to use NFS.
>>> >> > >>
>>> >> > >> Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml)
>>> >> > >> configuration that will allow me to effectively maximize
>>> >> > >> resource utilization?
>>> >> > >>
>>> >> > >> Thanks,
>>> >> > >> Calvin
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Harsh J
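For reference, the container arithmetic in the original question is governed by two standard settings (a sketch; the per-node value is illustrative, ~57 GB per NodeManager x 3 nodes ≈ 172 GB, so 2 GB maps give roughly 86 containers):

    <!-- yarn-site.xml on each node: memory the NodeManager offers to YARN -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>58368</value>  <!-- ~57 GB; illustrative value -->
    </property>

    <!-- mapred-site.xml: memory requested per map-task container -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>  <!-- 2 GB per map, as in the thread -->
    </property>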
