In my particular case (to make the Spark launch asynchronous), I launch a Hadoop job which consists of only one Spark job - and that Spark job is launched via SparkLauncher#startApplication().
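For context, this is roughly the launch pattern I use inside the map task (the jar path and main class below are placeholders, not my real job):

import java.io.IOException;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchSparkFromMapTask {
  public static void main(String[] args) throws IOException {
    // Spawns SparkSubmit as a child process of the current (map task) JVM and
    // returns a handle for monitoring the Spark application asynchronously.
    SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/my-spark-job.jar")   // placeholder jar
        .setMainClass("com.example.MySparkJob")        // placeholder main class
        .setMaster("yarn")
        .setDeployMode("client")
        .setConf(SparkLauncher.DRIVER_MEMORY, "1g")    // if omitted, Spark's 1g default applies
        .startApplication(new SparkAppHandle.Listener() {
          @Override
          public void stateChanged(SparkAppHandle h) {
            System.out.println("Spark application state: " + h.getState());
          }

          @Override
          public void infoChanged(SparkAppHandle h) {
            // nothing to do here
          }
        });
    // handle.getState() / handle.stop() can be used later to monitor or stop the job.
  }
}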
My App
  --- launches map task ---> the map task runs in a YARN container in the cluster and launches the Spark job from that container
  --- SparkLauncher.startApplication() ---> a new child process is spawned (SparkSubmit is this child process)

I was not sure whether, in this case, the map task configs and the YARN configs impose any restrictions on the SparkSubmit process, because SparkLauncher#startApplication() launches SparkSubmit as a new child process of the map task's YARN container.

If I understood it correctly, the driver will use Spark's default memory config (1g), or the value specified by the user via spark.driver.memory.

I have not used Spark since the 1.5 version last year and am now transitioning directly to the 2.0 version, so I will read up on the Unified Memory Manager.
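If the Unified Memory Manager works the way I think it does, a back-of-the-envelope calculation along these lines would roughly explain the 366.3 MB MemoryStore capacity in my logs below. All the constants in this sketch are my assumptions about the 2.x defaults (about 300 MB reserved for the system, spark.memory.fraction = 0.6, and Runtime.getRuntime().maxMemory() reporting somewhat less than the 1g -Xmx), not values I have verified:

public class DriverUnifiedMemorySketch {
  public static void main(String[] args) {
    // All numbers below are assumptions, not values read from a running driver.
    long runtimeMax = 910L * 1024 * 1024;      // assumed Runtime.getRuntime().maxMemory() for -Xmx1g
    long reservedSystem = 300L * 1024 * 1024;  // memory assumed to be reserved for the system
    double memoryFraction = 0.6;               // assumed spark.memory.fraction default

    long unified = (long) ((runtimeMax - reservedSystem) * memoryFraction);
    System.out.printf("unified memory ~ %.1f MB%n", unified / (1024.0 * 1024.0));
    // (910 MB - 300 MB) * 0.6 = 366 MB, which is close to the 366.3 MB that MemoryStore reports
  }
}

Please correct me if that is not how the 2.0 memory manager arrives at that number.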
Thanks, Owen.

On Sat, Nov 12, 2016 at 1:40 AM Sean Owen <so...@cloudera.com> wrote:

> Indeed, you get default values if you don't specify concrete values otherwise. Yes, you should see the docs for the version you're using.
>
> Note that there are different configs for the new 'unified' memory manager since 1.6, and so some older resources may be correctly explaining the older 'legacy' memory manager configs.
>
> Yes, all containers would have to be smaller than YARN's max allowed size. The driver just consumes what the driver consumes; I don't know of any extra 'appmaster' component.
>
> What do you mean by 'launched by the map task'? Jobs are launched by the driver only.
>
> On Sat, Nov 12, 2016 at 9:14 AM Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>
> @Sean Owen,
>
> Thanks for your reply.
>
> I put the wrong link to the blog post. Here is the correct link <https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-4-memory-settings/>, which describes Spark memory settings on YARN. I guess they have misused the terms Spark driver/BlockManager and described the driver's memory usage incorrectly.
>
> 1) Does that mean that, if nothing is specified, Spark will use the defaults listed on the Spark config site <http://spark.apache.org/docs/latest/running-on-yarn.html>?
>
> 2) Let me clarify whether I understood it correctly (due to YARN restrictions):
>
> *Yarn-cluster mode*:
> SparkAppMaster + Driver memory < YARN container max size allocation
> SparkExecutor memory < YARN container max size allocation
>
> *Yarn-client mode* (assume the Spark job is launched from the map task):
> Driver memory is independent of any YARN properties, limited only by the machine's memory.
> SparkAppMaster memory < YARN container max size allocation
> SparkExecutor memory < YARN container max size allocation
>
> Did I get that right?
>
> 3) Is there any resource for calculating Spark component memory on a YARN cluster, other than this site, which describes the default config values: http://spark.apache.org/docs/latest/running-on-yarn.html ?
>
> Thanks.
>
> On Sat, Nov 12, 2016 at 12:24 AM, Sean Owen <so...@cloudera.com> wrote:
>
> If you're pointing at the 366MB, then it's not really related to any of the items you cite here. This is the memory managed internally by MemoryStore. The blog post refers to the legacy memory manager. You can see a bit of how it works in the code, but this is the sum of the on-heap and off-heap memory it can manage. See the memory config docs, however, to understand what user-facing settings you can make; you don't really need to worry about this value.
>
> MapReduce settings are irrelevant to Spark.
> Spark doesn't pay attention to the YARN settings, but YARN does. It enforces them, yes. It is not exempt from YARN.
>
> 896MB is correct there. yarn-client mode does not ignore driver properties, no.
>
> On Sat, Nov 12, 2016 at 2:18 AM Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>
> Hi,
>
> The Spark website <http://spark.apache.org/docs/latest/running-on-yarn.html> lists the default Spark properties as follows (I did not override any properties in the spark-defaults.conf file, and I launch Spark in yarn-client mode):
>
> spark.driver.memory 1g
> spark.yarn.am.memory 512m
> spark.yarn.am.memoryOverhead : max(spark.yarn.am.memory * 0.10, 384m)
> spark.yarn.driver.memoryOverhead : max(spark.driver.memory * 0.10, 384m)
>
> I launch the Spark job via SparkLauncher#startApplication() in *yarn-client mode from the map task of a Hadoop job*.
>
> *My cluster settings*:
> yarn.scheduler.minimum-allocation-mb 256
> yarn.scheduler.maximum-allocation-mb 2048
> yarn.app.mapreduce.am.resource.mb 512
> mapreduce.map.memory.mb 640
> mapreduce.map.java.opts -Xmx400m
> yarn.app.mapreduce.am.command-opts -Xmx448m
>
> *Logs of the Spark job*:
>
> INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
> INFO Client: Will allocate *AM container*, with 896 MB memory including 384 MB overhead
>
> INFO MemoryStore: MemoryStore started with capacity 366.3 MB
>
> ./application_1478727394310_0005/container_1478727394310_0005_01_000002/stderr:INFO: 16/11/09 14:18:42 INFO BlockManagerMasterEndpoint: Registering block manager <machine-ip>:57246 with *366.3* MB RAM, BlockManagerId(driver, <machine-ip>, 57246)
>
> *Questions*:
>
> 1) How is driver memory calculated?
>
> How did Spark decide on 366 MB for the driver based on the properties described above?
>
> I thought the memory allocation was based on this formula (https://www.altiscale.com/blog/spark-on-hadoop/):
>
> "Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction, where memoryFraction=0.6 and safetyFraction=0.9. This is 1024MB x 0.6 x 0.9 = 552.96MB. However, 552.96MB is a little larger than the value as shown in the log. This is because of the runtime overhead imposed by Scala, which is usually around 3-7%, more or less. If you do the calculation using 982MB x 0.6 x 0.9 (982MB being approximately 4% less than 1024MB), then you will derive the number 530.28MB, which is what is indicated in the log file after rounding up to 530.30MB."
>
> 2) If the Spark job is launched from the map task via SparkLauncher#startApplication(), will driver memory respect (mapreduce.map.memory.mb and mapreduce.map.java.opts) or (yarn.scheduler.maximum-allocation-mb) when the Spark job is launched as a child process?
>
> The confusion is that SparkSubmit is a new JVM process - because it is launched as a child process of the map task, it does not depend on the YARN configs. But if it does not obey any limits (if that is the case), that will make things tricky when the NodeManager reports memory usage back.
>
> 3) Is this the correct formula for calculating AM memory?
>
> For the AM it matches the calculation here (https://www.altiscale.com/blog/spark-on-hadoop/): how much memory to allocate to the AM is amMemory + amMemoryOverhead. amMemoryOverhead is set to 384MB via spark.yarn.driver.memoryOverhead, and args.amMemory is fixed at 512MB by Spark when it's running in yarn-client mode. Adding 384MB of overhead to 512MB gives the 896MB figure requested by Spark.
>
> 4) For Spark yarn-client mode, are all spark.driver properties ignored, and only the spark.yarn.am properties used?
>
> Thanks.
>
> --
>
> Best regards,
> Elkhan Dadashov