[ 
https://issues.apache.org/jira/browse/YARN-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Suresh updated YARN-8849:
------------------------------
    Description: 
Traditionally, YARN workload simulation is performed using the SLS (Scheduler 
Load Simulator), which is packaged with YARN. It essentially starts a 
full-fledged *ResourceManager* but runs simulators for the *NodeManager* and 
the *ApplicationMaster* containers. These simulators are lightweight and run in 
a threadpool. The NM simulators do not open any external ports and send 
(in-process) heartbeats to the ResourceManager.

There are a couple of drawbacks with using the SLS:
* It can be difficult to simulate really large clusters without access to a 
very beefy box, since the NMs are launched as tasks in a threadpool and each NM 
has to send periodic heartbeats to the RM.
* Certain features (like YARN-1011) require changes to the NodeManager - 
aspects such as queuing and selectively killing containers have to be 
incorporated into the existing NM simulator, which might make the simulator a 
bit heavyweight - there is a need for locking and synchronization.
* Since the NM and AM are simulations, only the Scheduler is faithfully tested 
- it is not really an end-to-end test of a cluster.

Therefore, drawing inspiration from 
[Dynamometer|https://github.com/linkedin/dynamometer], we propose a framework 
for a YARN-deployable YARN cluster - *DynoYARN* - for testing, with the 
following features:
* The NM already has hooks to plug in a custom *ContainerExecutor* and 
*NodeResourceMonitor*. If we can also plug in a custom *ContainersMonitorImpl* 
monitoring thread (and other modules like the LocalizationService), we can 
probably inject an Executor that does not actually launch containers, and node 
and container resource monitors that report synthetic, pre-specified 
utilization metrics back to the RM.
* Since we are launching fake containers, we cannot run normal AM containers. 
We can therefore use *Unmanaged AMs* to launch synthetic jobs.
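To make the plug-in idea above concrete, here is a minimal sketch of how a 
plugged-in executor might parse a synthetic utilization spec injected through 
the container's launch environment. The variable format ({{pmem=...,vmem=...,cpu=...}}) 
and the class name are assumptions for illustration only, not existing YARN APIs:

```java
// Hypothetical sketch: parsing a synthetic utilization spec that a DynoYARN
// client would inject (e.g. via an environment variable in the
// ContainerLaunchContext) and a no-op Container Executor would read back.
// The "pmem=...,vmem=...,cpu=..." format is an assumption, not a YARN API.
import java.util.HashMap;
import java.util.Map;

public class SyntheticUtilization {
    final long physicalMemMB;   // physical memory to report, in MB
    final long virtualMemMB;    // virtual memory to report, in MB
    final float cpuVcores;      // CPU to report, in vcores

    SyntheticUtilization(long pmem, long vmem, float cpu) {
        this.physicalMemMB = pmem;
        this.virtualMemMB = vmem;
        this.cpuVcores = cpu;
    }

    /** Parses a spec like "pmem=1024,vmem=2048,cpu=0.5". */
    static SyntheticUtilization parse(String spec) {
        Map<String, String> kv = new HashMap<>();
        for (String part : spec.split(",")) {
            String[] pair = part.split("=", 2);
            kv.put(pair[0].trim(), pair[1].trim());
        }
        return new SyntheticUtilization(
            Long.parseLong(kv.get("pmem")),
            Long.parseLong(kv.get("vmem")),
            Float.parseFloat(kv.get("cpu")));
    }
}
```

The monitor thread would then feed these values back to the RM on every 
heartbeat instead of sampling real process trees.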

Essentially, a test workflow would look like this:
* Launch a DynoYARN cluster.
* Use the Unmanaged AM feature to directly negotiate with the DynoYARN Resource 
Manager for container tokens.
* Use the container tokens from the RM to directly ask the DynoYARN Node 
Managers to start fake containers.
* The DynoYARN NodeManagers will start the fake containers and report to the 
DynoYARN Resource Manager synthetically generated resource utilization for the 
containers (which will be injected via the *ContainerLaunchContext* and parsed 
by the plugged-in Container Executor).
* The Scheduler will use the utilization report to schedule containers - we 
will be able to test allocation of *Opportunistic* containers based on resource 
utilization.
* Since the DynoYARN Node Managers run the actual code paths, all preemption 
and queuing logic will be faithfully executed.
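The reporting step in the workflow above could be sketched as follows: a fake 
NM sums the pre-specified per-container utilizations into a node-level figure 
for each heartbeat. The class name, the flat {{{pmemMB, vmemMB, cpuMilliVcores}}} 
encoding, and the aggregation shape are illustrative assumptions, not the 
actual NodeManager heartbeat code:

```java
// Hedged sketch of the heartbeat reporting step: aggregate synthetic
// per-container utilizations into a node-level report for the RM.
// Each container's usage is encoded as {pmemMB, vmemMB, cpuMilliVcores};
// this encoding is an illustrative assumption.
import java.util.List;

public class FakeNodeHeartbeat {
    /** Sums per-container synthetic usage into one node-level report. */
    static long[] aggregate(List<long[]> containerUsage) {
        long[] total = new long[3];
        for (long[] usage : containerUsage) {
            total[0] += usage[0];  // physical memory, MB
            total[1] += usage[1];  // virtual memory, MB
            total[2] += usage[2];  // CPU, milli-vcores
        }
        return total;
    }
}
```

Because the Scheduler only sees the aggregated report, driving these numbers 
up and down over time is enough to exercise Opportunistic-container allocation 
and preemption decisions without running any real workload.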



> DynoYARN: A simulation and testing infrastructure for YARN clusters
> -------------------------------------------------------------------
>
>                 Key: YARN-8849
>                 URL: https://issues.apache.org/jira/browse/YARN-8849
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Arun Suresh
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
