[ https://issues.apache.org/jira/browse/YARN-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Agam updated YARN-10941:
------------------------
    Description: 
Does anyone have experience with YARN node labels on AWS EMR? If so, please share your thoughts. We want to run all Spark executors on Task (Spot) machines and all Spark ApplicationMasters/drivers on Core (On-Demand) machines. Previously we ran both the Spark executors and the Spark driver on the CORE (On-Demand) machines.

To achieve this, we create a "TASK" YARN node label as part of a custom AWS EMR bootstrap action, and in a separate bootstrap action we map that "TASK" label to every Spot instance when it registers with AWS EMR. Since "CORE" is the default node label expression, we simply map it to each On-Demand instance when that node registers, again from the bootstrap action.

We use the Spark configuration `"spark.yarn.executor.nodeLabelExpression": "TASK"` to launch Spark executors on Task nodes.
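For illustration, this is roughly how that setting is passed at submit time (the class and jar names below are placeholders, not our actual job):

```
# Illustrative spark-submit invocation; the nodeLabelExpression conf is the
# relevant part. Because "CORE" is the cluster's default node label
# expression, the ApplicationMaster/driver stays on On-Demand (CORE) nodes.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.executor.nodeLabelExpression=TASK \
  --class com.example.MyJob \
  my-job.jar
```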

The problem we are facing is that, for a short time, the YARN node labels are mapped to the wrong machine type: for roughly 1-2 minutes the "TASK" label is mapped to On-Demand instances and the "CORE" label is mapped to Spot instances. During this window of wrong labeling, YARN launches Spark executors on On-Demand instances and Spark drivers on Spot instances.

This wrong label-to-machine-type mapping persists until the bootstrap actions complete; after that, the mapping automatically resolves to its correct state.
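For reference, the label-to-node mapping can be inspected while the cluster is coming up (the node address below is a placeholder; 8041 is the NodeManager port on our nodes):

```
# Labels defined on the cluster.
yarn cluster --list-node-labels

# Labels currently applied to a specific NodeManager.
yarn node -status ip-10-0-0-12.ec2.internal:8041
```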

The script we run as part of the bootstrap action is shown below. It runs on every new machine to assign a label to that machine, and it is launched as a background process because the `yarn` command only becomes available after all custom bootstrap actions have completed:
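```
#!/usr/bin/env bash
set -ex

# Wait until the `yarn` command is available on this node.
function waitTillYarnComesUp() {
  IS_YARN_EXIST=$(which yarn | grep -i yarn | wc -l)
  while [ "$IS_YARN_EXIST" != "1" ]; do
    echo "Yarn not exist"
    sleep 15
    IS_YARN_EXIST=$(which yarn | grep -i yarn | wc -l)
  done
  echo "Yarn exist.."
}

# Wait until the TASK label created on the master is visible to this node.
function waitTillTaskLabelSyncs() {
  LABEL_EXIST=$(yarn cluster --list-node-labels | grep -i TASK | wc -l)
  while [ "$LABEL_EXIST" -eq 0 ]; do
    sleep 15
    LABEL_EXIST=$(yarn cluster --list-node-labels | grep -i TASK | wc -l)
  done
}

# Read the hostname and lifecycle (spot / on-demand) from the EC2 instance
# metadata service and apply the matching YARN node label to this node.
function getHostInstanceTypeAndApplyLabel() {
  HOST_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
  echo "host ip is ${HOST_IP}"
  INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-life-cycle)
  echo "instance type is ${INSTANCE_TYPE}"
  PORT_NUMBER=8041
  spot="spot"
  onDemand="on-demand"

  if [ "$INSTANCE_TYPE" == "$spot" ]; then
    yarn rmadmin -replaceLabelsOnNode "${HOST_IP}:${PORT_NUMBER}=TASK"
  elif [ "$INSTANCE_TYPE" == "$onDemand" ]; then
    yarn rmadmin -replaceLabelsOnNode "${HOST_IP}:${PORT_NUMBER}=CORE"
  fi
}

waitTillYarnComesUp
# holding for resource manager sync
sleep 100
waitTillTaskLabelSyncs
getHostInstanceTypeAndApplyLabel
exit 0
```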

The following command is run on the master instance at cluster-creation time to create the new TASK YARN node label:
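```
# exclusive=false makes TASK a non-exclusive partition, so idle TASK capacity
# can also serve container requests made against the default partition.
yarn rmadmin -addToClusterNodeLabels "TASK(exclusive=false)"
```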

Does anyone have a clue how to prevent this wrong mapping of labels?

  was:
Does anyone have experience with YARN node labels on AWS EMR? If so, please share your thoughts. We want to run all Spark executors on Task (Spot) machines and all Spark ApplicationMasters/drivers on Core (On-Demand) machines. Previously we ran both the Spark executors and the Spark driver on the CORE (On-Demand) machines.

To achieve this, we create a "TASK" YARN node label as part of a custom AWS EMR bootstrap action, and in a separate bootstrap action we map that "TASK" label to every Spot instance when it registers with AWS EMR. Since "CORE" is the default node label expression, we simply map it to each On-Demand instance when that node registers, again from the bootstrap action.

We use the Spark configuration `"spark.yarn.executor.nodeLabelExpression": "TASK"` to launch Spark executors on Task nodes.

The problem we are facing is that, for a short time, the YARN node labels are mapped to the wrong machine type: for roughly 1-2 minutes the "TASK" label is mapped to On-Demand instances and the "CORE" label is mapped to Spot instances. During this window of wrong labeling, YARN launches Spark executors on On-Demand instances and Spark drivers on Spot instances.

This wrong label-to-machine-type mapping persists until the bootstrap actions complete; after that, the mapping automatically resolves to its correct state.

The script we run as part of the bootstrap action is shown below. It runs on every new machine to assign a label to that machine, and it is launched as a background process because the `yarn` command only becomes available after all custom bootstrap actions have completed:

```
#!/usr/bin/env bash
set -ex

# Wait until the `yarn` command is available on this node.
function waitTillYarnComesUp() {
  IS_YARN_EXIST=$(which yarn | grep -i yarn | wc -l)
  while [ "$IS_YARN_EXIST" != "1" ]; do
    echo "Yarn not exist"
    sleep 15
    IS_YARN_EXIST=$(which yarn | grep -i yarn | wc -l)
  done
  echo "Yarn exist.."
}

# Wait until the TASK label created on the master is visible to this node.
function waitTillTaskLabelSyncs() {
  LABEL_EXIST=$(yarn cluster --list-node-labels | grep -i TASK | wc -l)
  while [ "$LABEL_EXIST" -eq 0 ]; do
    sleep 15
    LABEL_EXIST=$(yarn cluster --list-node-labels | grep -i TASK | wc -l)
  done
}

# Read the hostname and lifecycle (spot / on-demand) from the EC2 instance
# metadata service and apply the matching YARN node label to this node.
function getHostInstanceTypeAndApplyLabel() {
  HOST_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
  echo "host ip is ${HOST_IP}"
  INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-life-cycle)
  echo "instance type is ${INSTANCE_TYPE}"
  PORT_NUMBER=8041
  spot="spot"
  onDemand="on-demand"

  if [ "$INSTANCE_TYPE" == "$spot" ]; then
    yarn rmadmin -replaceLabelsOnNode "${HOST_IP}:${PORT_NUMBER}=TASK"
  elif [ "$INSTANCE_TYPE" == "$onDemand" ]; then
    yarn rmadmin -replaceLabelsOnNode "${HOST_IP}:${PORT_NUMBER}=CORE"
  fi
}

waitTillYarnComesUp
# holding for resource manager sync
sleep 100
waitTillTaskLabelSyncs
getHostInstanceTypeAndApplyLabel
exit 0
```

 

`yarn rmadmin -addToClusterNodeLabels "TASK(exclusive=false)"`

The command above is run on the master instance at cluster-creation time to create the new TASK YARN node label.

Does anyone have a clue how to prevent this wrong mapping of labels?


> Wrong Yarn node label mapping with AWS EMR machine types
> --------------------------------------------------------
>
>                 Key: YARN-10941
>                 URL: https://issues.apache.org/jira/browse/YARN-10941
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.10.1
>            Reporter: Agam
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
