I cut https://issues.apache.org/jira/browse/SPARK-10790 for this issue.
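For anyone skimming the thread below: the hang comes down to the executor-target delta computing to zero in ExecutorAllocationManager. Here is a simplified sketch of that decision (an illustrative paraphrase with made-up names, not the actual Spark source):

  // Simplified sketch of the executor-request decision analyzed in the
  // thread below (names and structure are illustrative, not Spark's code).
  object AllocationHangSketch {

    // The target starts at spark.dynamicAllocation.initialExecutors, which
    // defaults to spark.dynamicAllocation.minExecutors.
    def executorsToRequest(pendingTasks: Int,
                           currentTarget: Int,
                           minExecutors: Int,
                           maxExecutors: Int): Int = {
      val maxNeeded = pendingTasks // assume one executor per pending task
      val newTarget = math.max(math.min(maxNeeded, maxExecutors), minExecutors)
      // This delta is what actually gets requested from YARN. When the first
      // stage has <= minExecutors tasks, newTarget == currentTarget, so the
      // delta is 0, nothing is ever requested, and the app hangs.
      newTarget - currentTarget
    }

    def main(args: Array[String]): Unit = {
      val max = Int.MaxValue
      // JavaWordCount: 1 task, minExecutors = 1 -> requests 0 executors (hangs)
      println(executorsToRequest(pendingTasks = 1, currentTarget = 1,
        minExecutors = 1, maxExecutors = max))
      // SparkPi with 101 tasks, minExecutors = 100 -> requests 1 (works)
      println(executorsToRequest(pendingTasks = 101, currentTarget = 100,
        minExecutors = 100, maxExecutors = max))
    }
  }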
On Wed, Sep 23, 2015 at 8:38 PM, Jonathan Kelly <jonathaka...@gmail.com> wrote:

> AHA! I figured it out, but it required some tedious remote debugging of the Spark ApplicationMaster. (But now I understand the Spark codebase a little better than before, so I guess I'm not too put out. =P)
>
> Here's what's happening...
>
> I am setting spark.dynamicAllocation.minExecutors=1 but am not setting spark.dynamicAllocation.initialExecutors, so it remains at its default of spark.dynamicAllocation.minExecutors. However, ExecutorAllocationManager doesn't actually request any executors while the application is still initializing (see the comment here <https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L292>), but it still sets numExecutorsTarget to spark.dynamicAllocation.initialExecutors (i.e., 1).
>
> The JavaWordCount example I've been trying to run operates on only a very small file, so its first stage has just a single task and thus should request a single executor once the polling loop comes along.
>
> Then on this line <https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L308>, it returns numExecutorsTarget (1) - oldNumExecutorsTarget (still 1, even though there aren't any executors running yet) = 0 as the number of executors it should request. The app then hangs forever because it never requests any executors.
>
> I verified this further by setting spark.dynamicAllocation.minExecutors=100 and trying to run the SparkPi example I mentioned earlier (which runs 100 tasks in its first stage, because that's the number I'm passing to the driver). It hung in the same way as my JavaWordCount example. If I run it again passing 101 (so that it has 101 tasks), it works, and if I pass 99, it hangs again.
>
> So it seems I have found a bug: if you set spark.dynamicAllocation.minExecutors (or, presumably, spark.dynamicAllocation.initialExecutors), and the number of tasks in your first stage is less than or equal to this min/initial number of executors, the application never actually requests any executors and just hangs indefinitely.
>
> I can't seem to find a JIRA for this, so shall I file one, or has anybody else seen anything like this?
>
> ~ Jonathan
>
> On Wed, Sep 23, 2015 at 7:08 PM, Jonathan Kelly <jonathaka...@gmail.com> wrote:
>
>> Another update that doesn't make much sense:
>>
>> The SparkPi example does work in yarn-cluster mode with dynamicAllocation.
>>
>> That is, the following command works (as it does in yarn-client mode):
>>
>> spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi spark-examples.jar 100
>>
>> But the following one does not (in either yarn-cluster or yarn-client mode):
>>
>> spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount spark-examples.jar /tmp/word-count-input.txt
>>
>> So this JavaWordCount example hangs on requesting executors, while SparkPi and spark-shell both work.
>>
>> ~ Jonathan
>>
>> On Wed, Sep 23, 2015 at 6:22 PM, Jonathan Kelly <jonathaka...@gmail.com> wrote:
>>
>>> Thanks for the quick response!
>>>
>>> spark-shell is indeed using yarn-client. I forgot to mention that I also have "spark.master yarn-client" in my spark-defaults.conf file.
>>>
>>> The working spark-shell and my non-working example application both display spark.scheduler.mode=FIFO on the Spark UI.
>>> Is that what you are asking about? I haven't actually experimented with different scheduler modes yet.
>>>
>>> One more thing I should mention: the YARN ResourceManager reports the following for my 5-node cluster, where one node is the master and does not run a NodeManager:
>>>
>>> Memory Used: 1.50 GB (this is the running ApplicationMaster, waiting and waiting for the executors to start up)
>>> Memory Total: 45 GB (11.25 GB from each of the 4 slave nodes)
>>> VCores Used: 1
>>> VCores Total: 32
>>> Active Nodes: 4
>>>
>>> ~ Jonathan
>>>
>>> On Wed, Sep 23, 2015 at 6:10 PM, Andrew Duffy <andrewedu...@gmail.com> wrote:
>>>
>>>> What pool is the spark-shell being put into? (You can see this in the YARN UI under Scheduler.)
>>>>
>>>> Are you certain you're starting spark-shell on YARN? By default it uses a local Spark executor, so if it "just works", that may be because it's not using dynamic allocation at all.
>>>>
>>>> On Wed, Sep 23, 2015 at 18:04 Jonathan Kelly <jonathaka...@gmail.com> wrote:
>>>>
>>>>> I'm running into a problem with YARN dynamicAllocation on Spark 1.5.0 after using it successfully on an identically configured cluster with Spark 1.4.1.
>>>>>
>>>>> I'm getting the dreaded warning "YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources", though there's nothing else running on my cluster, and the nodes should have plenty of resources to run my application.
>>>>>
>>>>> Here are the applicable properties in spark-defaults.conf:
>>>>>
>>>>> spark.dynamicAllocation.enabled true
>>>>> spark.dynamicAllocation.minExecutors 1
>>>>> spark.shuffle.service.enabled true
>>>>>
>>>>> When trying out my example application (just the JavaWordCount example that comes with Spark), I had not actually set spark.executor.memory or any CPU core-related properties, but setting spark.executor.memory to a low value like 64m doesn't help either.
>>>>>
>>>>> I've tried both a 5-node cluster and a 1-node cluster of m3.xlarges, so each node has 15 GB of memory and 4 cores.
>>>>>
>>>>> I've also tried both yarn-cluster and yarn-client mode and get the same behavior in both, except that in yarn-client mode the application never even shows up in the YARN ResourceManager. However, spark-shell seems to work just fine (when I run commands, it starts up executors dynamically as expected), which makes no sense to me.
>>>>>
>>>>> What settings/logs should I look at to debug this, and what more information can I provide? Your help would be very much appreciated!
>>>>>
>>>>> Thanks,
>>>>> Jonathan
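In the meantime, assuming the analysis at the top of this thread is right, a possible workaround is to leave spark.dynamicAllocation.minExecutors at its default of 0 (and not set spark.dynamicAllocation.initialExecutors), so that the first target update computes a strictly positive delta, e.g. in spark-defaults.conf:

  spark.dynamicAllocation.enabled true
  spark.dynamicAllocation.minExecutors 0
  spark.shuffle.service.enabled true

I haven't verified this beyond the reasoning above, so treat it as a guess until the JIRA is resolved.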