In my particular case (to make the Spark launch asynchronous), I launch a Hadoop job which consists of only one Spark job - and that Spark job is launched via SparkLauncher#startApplication().
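For context, this is roughly the launch pattern I use inside the map task (the jar path and main class below are placeholders, not my real job):

import java.io.IOException;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchSparkFromMapTask {
  public static void main(String[] args) throws IOException {
    // Spawns SparkSubmit as a child process of the current (map task) JVM and
    // returns a handle for monitoring the Spark application asynchronously.
    SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/my-spark-job.jar")   // placeholder jar
        .setMainClass("com.example.MySparkJob")        // placeholder main class
        .setMaster("yarn")
        .setDeployMode("client")
        .setConf(SparkLauncher.DRIVER_MEMORY, "1g")    // if omitted, Spark's 1g default applies
        .startApplication(new SparkAppHandle.Listener() {
          @Override
          public void stateChanged(SparkAppHandle h) {
            System.out.println("Spark application state: " + h.getState());
          }

          @Override
          public void infoChanged(SparkAppHandle h) {
            // nothing to do here
          }
        });
    // handle.getState() / handle.stop() can be used later to monitor or stop the job.
  }
}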
My App
  --- launches map task ---> the map task runs in a YARN container in the cluster and launches the Spark job from that container
  --- SparkLauncher.startApplication() ---> a new child process is spawned (SparkSubmit is this child process)

I was not sure whether, in this case, the map task configs and the YARN configs impose any restrictions on the SparkSubmit process, because SparkLauncher#startApplication() launches SparkSubmit as a new child process of the map task's YARN container.

If I understood it correctly, the driver will use Spark's default memory config (1g), or the value specified by the user via spark.driver.memory.

I have not used Spark since the 1.5 version last year and am now transitioning directly to the 2.0 version, so I will read up on the Unified Memory Manager.
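If the Unified Memory Manager works the way I think it does, a back-of-the-envelope calculation along these lines would roughly explain the 366.3 MB MemoryStore capacity in my logs below. All the constants in this sketch are my assumptions about the 2.x defaults (about 300 MB reserved for the system, spark.memory.fraction = 0.6, and Runtime.getRuntime().maxMemory() reporting somewhat less than the 1g -Xmx), not values I have verified:

public class DriverUnifiedMemorySketch {
  public static void main(String[] args) {
    // All numbers below are assumptions, not values read from a running driver.
    long runtimeMax = 910L * 1024 * 1024;      // assumed Runtime.getRuntime().maxMemory() for -Xmx1g
    long reservedSystem = 300L * 1024 * 1024;  // memory assumed to be reserved for the system
    double memoryFraction = 0.6;               // assumed spark.memory.fraction default

    long unified = (long) ((runtimeMax - reservedSystem) * memoryFraction);
    System.out.printf("unified memory ~ %.1f MB%n", unified / (1024.0 * 1024.0));
    // (910 MB - 300 MB) * 0.6 = 366 MB, which is close to the 366.3 MB that MemoryStore reports
  }
}

Please correct me if that is not how the 2.0 memory manager arrives at that number.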
Thanks, Owen.

On Sat, Nov 12, 2016 at 1:40 AM Sean Owen <so...@cloudera.com> wrote:

> Indeed, you get default values if you don't specify concrete values otherwise. Yes, you should see the docs for the version you're using.
>
> Note that there are different configs for the new 'unified' memory manager since 1.6, and so some older resources may be correctly explaining the older 'legacy' memory manager configs.
>
> Yes, all containers would have to be smaller than YARN's max allowed size. The driver just consumes what the driver consumes; I don't know of any extra 'appmaster' component.
>
> What do you mean by 'launched by the map task'? Jobs are launched by the driver only.
>
> On Sat, Nov 12, 2016 at 9:14 AM Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>
> @Sean Owen,
>
> Thanks for your reply.
>
> I put the wrong link to the blog post. Here is the correct link <https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-4-memory-settings/>, which describes Spark memory settings on YARN. I guess they have misused the terms Spark driver/BlockManager and described the driver's memory usage incorrectly.
>
> 1) Does that mean that, if nothing is specified, Spark will use the defaults listed on the Spark config site <http://spark.apache.org/docs/latest/running-on-yarn.html>?
>
> 2) Let me clarify whether I understood it correctly (due to YARN restrictions):
>
> *Yarn-cluster mode*:
> SparkAppMaster + Driver memory < YARN container max size allocation
> SparkExecutor memory < YARN container max size allocation
>
> *Yarn-client mode* (assume the Spark job is launched from the map task):
> Driver memory is independent of any YARN properties, limited only by the machine's memory.
> SparkAppMaster memory < YARN container max size allocation
> SparkExecutor memory < YARN container max size allocation
>
> Did I get that right?
>
> 3) Is there any resource for calculating Spark component memory on a YARN cluster, other than this site, which describes the default config values: http://spark.apache.org/docs/latest/running-on-yarn.html ?
>
> Thanks.
>
> On Sat, Nov 12, 2016 at 12:24 AM, Sean Owen <so...@cloudera.com> wrote:
>
> If you're pointing at the 366MB, then it's not really related to any of the items you cite here. This is the memory managed internally by MemoryStore. The blog post refers to the legacy memory manager. You can see a bit of how it works in the code, but this is the sum of the on-heap and off-heap memory it can manage. See the memory config docs, however, to understand what user-facing settings you can make; you don't really need to worry about this value.
>
> MapReduce settings are irrelevant to Spark.
> Spark doesn't pay attention to the YARN settings, but YARN does. It enforces them, yes. It is not exempt from YARN.
>
> 896MB is correct there. yarn-client mode does not ignore driver properties, no.
>
> On Sat, Nov 12, 2016 at 2:18 AM Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>
> Hi,
>
> The Spark website <http://spark.apache.org/docs/latest/running-on-yarn.html> lists the default Spark properties as follows (I did not override any properties in the spark-defaults.conf file, and I launch Spark in yarn-client mode):
>
> spark.driver.memory 1g
> spark.yarn.am.memory 512m
> spark.yarn.am.memoryOverhead : max(spark.yarn.am.memory * 0.10, 384m)
> spark.yarn.driver.memoryOverhead : max(spark.driver.memory * 0.10, 384m)
>
> I launch the Spark job via SparkLauncher#startApplication() in *yarn-client mode from the map task of a Hadoop job*.
>
> *My cluster settings*:
> yarn.scheduler.minimum-allocation-mb 256
> yarn.scheduler.maximum-allocation-mb 2048
> yarn.app.mapreduce.am.resource.mb 512
> mapreduce.map.memory.mb 640
> mapreduce.map.java.opts -Xmx400m
> yarn.app.mapreduce.am.command-opts -Xmx448m
>
> *Logs of the Spark job*:
>
> INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
> INFO Client: Will allocate *AM container*, with 896 MB memory including 384 MB overhead
>
> INFO MemoryStore: MemoryStore started with capacity 366.3 MB
>
> ./application_1478727394310_0005/container_1478727394310_0005_01_000002/stderr:INFO: 16/11/09 14:18:42 INFO BlockManagerMasterEndpoint: Registering block manager <machine-ip>:57246 with *366.3* MB RAM, BlockManagerId(driver, <machine-ip>, 57246)
>
> *Questions*:
>
> 1) How is driver memory calculated?
>
> How did Spark decide on 366 MB for the driver based on the properties described above?
>
> I thought the memory allocation was based on this formula (https://www.altiscale.com/blog/spark-on-hadoop/):
>
> "Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction, where memoryFraction=0.6 and safetyFraction=0.9. This is 1024MB x 0.6 x 0.9 = 552.96MB. However, 552.96MB is a little larger than the value as shown in the log. This is because of the runtime overhead imposed by Scala, which is usually around 3-7%, more or less. If you do the calculation using 982MB x 0.6 x 0.9 (982MB being approximately 4% less than 1024MB), then you will derive the number 530.28MB, which is what is indicated in the log file after rounding up to 530.30MB."
>
> 2) If the Spark job is launched from the map task via SparkLauncher#startApplication(), will driver memory respect (mapreduce.map.memory.mb and mapreduce.map.java.opts) or (yarn.scheduler.maximum-allocation-mb) when the Spark job is launched as a child process?
>
> The confusion is that SparkSubmit is a new JVM process - because it is launched as a child process of the map task, it does not depend on the YARN configs. But if it does not obey any limits (if that is the case), that will make things tricky when the NodeManager reports memory usage back.
>
> 3) Is this the correct formula for calculating AM memory?
>
> For the AM it matches the calculation here (https://www.altiscale.com/blog/spark-on-hadoop/): how much memory to allocate to the AM is amMemory + amMemoryOverhead. amMemoryOverhead is set to 384MB via spark.yarn.driver.memoryOverhead, and args.amMemory is fixed at 512MB by Spark when it's running in yarn-client mode. Adding 384MB of overhead to 512MB gives the 896MB figure requested by Spark.
>
> 4) For Spark yarn-client mode, are all spark.driver properties ignored, and only the spark.yarn.am properties used?
>
> Thanks.
>
> --
>
> Best regards,
> Elkhan Dadashov