I cut https://issues.apache.org/jira/browse/SPARK-10790 for this issue.

On Wed, Sep 23, 2015 at 8:38 PM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> AHA! I figured it out, but it required some tedious remote debugging of
> the Spark ApplicationMaster. (But now I understand the Spark codebase a
> little better than before, so I guess I'm not too put out. =P)
>
> Here's what's happening...
>
> I am setting spark.dynamicAllocation.minExecutors=1 but am not setting
> spark.dynamicAllocation.initialExecutors, so it's remaining at the default
> of spark.dynamicAllocation.minExecutors. However, ExecutorAllocationManager
> doesn't actually request any executors while the application is still
> initializing (see comment here
> <https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L292>),
> but it still sets numExecutorsTarget to
> spark.dynamicAllocation.initialExecutors (i.e., 1).
>
> The JavaWordCount example I've been trying to run is only operating on a
> very small file, so its first stage only has a single task and thus should
> request a single executor once the polling loop comes along.
>
> Then on this line
> <https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L308>,
> it computes numExecutorsTarget (1) - oldNumExecutorsTarget (still 1, even
> though no executors are actually running yet) = 0 as the number of
> executors to request. The app then hangs forever because it never
> requests any executors.
>
> I verified this further by setting
> spark.dynamicAllocation.minExecutors=100 and trying to run my SparkPi
> example I mentioned earlier (which runs 100 tasks in its first stage
> because that's the number I'm passing to the driver). Then it would hang in
> the same way as my JavaWordCount example. If I run it again, passing 101
> (so that it has 101 tasks), it works, and if I pass 99, it hangs again.
>
> So it seems I have found a bug: if you set
> spark.dynamicAllocation.minExecutors (or, presumably,
> spark.dynamicAllocation.initialExecutors), and the number of tasks in your
> first stage is less than or equal to that min/initial number of executors,
> the application never actually requests any executors and just hangs
> indefinitely.
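The failure condition I'm inferring can be written out as a small predicate (a sketch under my assumptions about the 1.5.0 behavior, not Spark API; it reproduces the 99/100/101 results from the SparkPi experiment above):

```python
# Sketch of the inferred hang condition; names are illustrative, not Spark's.

def hangs(first_stage_tasks: int, min_executors: int,
          initial_executors: int = None) -> bool:
    # spark.dynamicAllocation.initialExecutors defaults to minExecutors.
    initial_target = (initial_executors
                      if initial_executors is not None else min_executors)
    # The new target never drops below minExecutors, so the delta is zero
    # whenever the first stage needs no more than the initial target...
    new_target = max(first_stage_tasks, min_executors)
    delta = new_target - initial_target
    # ...and with a zero delta no request is ever sent: the app waits forever.
    return delta <= 0

# Matches the SparkPi runs with minExecutors=100:
assert hangs(100, 100)        # 100 tasks: hangs
assert not hangs(101, 100)    # 101 tasks: works
assert hangs(99, 100)         # 99 tasks: hangs
# And the single-task JavaWordCount with minExecutors=1:
assert hangs(1, 1)
```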
>
> I can't seem to find a JIRA for this, so shall I file one, or has anybody
> else seen anything like this?
>
> ~ Jonathan
>
> On Wed, Sep 23, 2015 at 7:08 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> Another update that doesn't make much sense:
>>
>> The SparkPi example does work in yarn-cluster mode with dynamicAllocation.
>>
>> That is, the following command works (as well as with yarn-client mode):
>>
>> spark-submit --deploy-mode cluster --class
>> org.apache.spark.examples.SparkPi spark-examples.jar 100
>>
>> But the following one does not work (nor does it work for yarn-client
>> mode):
>>
>> spark-submit --deploy-mode cluster --class
>> org.apache.spark.examples.JavaWordCount spark-examples.jar
>> /tmp/word-count-input.txt
>>
>> So this JavaWordCount example hangs on requesting executors, while
>> SparkPi and spark-shell do work.
>>
>> ~ Jonathan
>>
>> On Wed, Sep 23, 2015 at 6:22 PM, Jonathan Kelly <jonathaka...@gmail.com>
>> wrote:
>>
>>> Thanks for the quick response!
>>>
>>> spark-shell is indeed using yarn-client. I forgot to mention that I also
>>> have "spark.master yarn-client" in my spark-defaults.conf file too.
>>>
>>> The working spark-shell and my non-working example application both
>>> display spark.scheduler.mode=FIFO on the Spark UI. Is that what you are
>>> asking about? I haven't actually messed around with different scheduler
>>> modes yet.
>>>
>>> One more thing I should mention is that the YARN ResourceManager tells
>>> me the following on my 5-node cluster, with one node being the master and
>>> not running a NodeManager:
>>> Memory Used: 1.50 GB (this is the running ApplicationMaster that's
>>> waiting and waiting for the executors to start up)
>>> Memory Total: 45 GB (11.25 from each of the 4 slave nodes)
>>> VCores Used: 1
>>> VCores Total: 32
>>> Active Nodes: 4
>>>
>>> ~ Jonathan
>>>
>>> On Wed, Sep 23, 2015 at 6:10 PM, Andrew Duffy <andrewedu...@gmail.com>
>>> wrote:
>>>
>>>> What pool is the spark shell being put into? (You can see this through
>>>> the YARN UI under scheduler)
>>>>
>>>> Are you certain you're starting spark-shell up on YARN? By default it
>>>> uses a local spark executor, so if it "just works" then it's because it's
>>>> not using dynamic allocation.
>>>>
>>>>
>>>> On Wed, Sep 23, 2015 at 18:04 Jonathan Kelly <jonathaka...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm running into a problem with YARN dynamicAllocation on Spark 1.5.0
>>>>> after using it successfully on an identically configured cluster with 
>>>>> Spark
>>>>> 1.4.1.
>>>>>
>>>>> I'm getting the dreaded warning "YarnClusterScheduler: Initial job has
>>>>> not accepted any resources; check your cluster UI to ensure that workers
>>>>> are registered and have sufficient resources", though there's nothing else
>>>>> running on my cluster, and the nodes should have plenty of resources to 
>>>>> run
>>>>> my application.
>>>>>
>>>>> Here are the applicable properties in spark-defaults.conf:
>>>>> spark.dynamicAllocation.enabled  true
>>>>> spark.dynamicAllocation.minExecutors 1
>>>>> spark.shuffle.service.enabled true
>>>>>
>>>>> When trying out my example application (just the JavaWordCount example
>>>>> that comes with Spark), I had not actually set spark.executor.memory or 
>>>>> any
>>>>> CPU core-related properties, but setting the spark.executor.memory to a 
>>>>> low
>>>>> value like 64m doesn't help either.
>>>>>
>>>>> I've tried a 5-node cluster and 1-node cluster of m3.xlarges, so each
>>>>> node has 15.0GB and 4 cores.
>>>>>
>>>>> I've also tried both yarn-cluster and yarn-client mode and get the
>>>>> same behavior for both, except that for yarn-client mode the application
>>>>> never even shows up in the YARN ResourceManager. However, spark-shell 
>>>>> seems
>>>>> to work just fine (when I run commands, it starts up executors dynamically
>>>>> just fine), which makes no sense to me.
>>>>>
>>>>> What settings/logs should I look at to debug this, and what more
>>>>> information can I provide? Your help would be very much appreciated!
>>>>>
>>>>> Thanks,
>>>>> Jonathan
>>>>>
>>>>
>>>
>>
>
