[ 
https://issues.apache.org/jira/browse/HDFS-9723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoyu Yao updated HDFS-9723:
-----------------------------
    Description: 
The HDFS namenode handles RPC requests from DFS clients and internal
processing requests from datanodes. It has been a recurring pain that some bad
jobs overwhelm the namenode and bring the whole cluster down. FCQ (Fair Call
Queue), introduced by HADOOP-9640, is one of the existing efforts, available
since Hadoop 2.4, to address this issue.

In the current FCQ implementation, incoming RPC calls are scheduled based on
the number of recent RPC calls from each user, using a time-decayed scheduler.
This works well when there is a clear mapping between users and the RPC calls
from their jobs. However, it may not work effectively when calls are hard to
trace back to a specific caller in a chain of operations within a workflow
(e.g., Oozie -> Hive -> YARN). It is not feasible for operators/administrators
to throttle all Hive jobs because of one “bad” query.

This JIRA proposes to leverage the RPC caller context information (such as
callerType: caller Id from TEZ-2851) available with HDFS-9184 as an alternative
to the existing UGI-based (or user-name-based, when a delegation token is not
available) Identity Provider, to improve the effectiveness of the Hadoop RPC
Fair Call Queue (HADOOP-9640) for better namenode throttling in multi-tenant
cluster deployments.

  was:
HDFS namenode handles RPC requests from DFS clients and internal processing 
from datanodes. It has been a recurring pain that some bad jobs overwhelm the 
namenode and bring the whole cluster down. FCQ (Fair Call Queue) by HADOOP-9640 
is the one of the existing efforts added since Hadoop 2.4 to address this 
issue. 

In current FCQ implementation, incoming RPC calls are scheduled based on the 
number of recent RPC calls (1000) of different users with a time-decayed 
scheduler. This works well when there is a clear mapping between users and 
their RPC calls from different jobs. However, this may not work effectively 
when it is hard to track calls to a specific caller in a chain of operations 
from the workflow (e.g.Oozie -> Hive -> Yarn). It is not feasible for 
operators/administrators to throttle all the hive jobs because of one “bad” 
query.

This JIRA proposed to leverage RPC caller context information (such as 
callerType: caller Id from TEZ-2851) available with HDFS-9184 as an alternative 
to existing UGI (or user name when delegation token is not available) based 
Identify Provider to improve effectiveness Hadoop RPC Fair Call Queue 
(HADOOP-9640) for better namenode throttling in multi-tenancy cluster 
deployment.  


> Improve Namenode Throttling Against Bad Jobs with FCQ and CallerContext
> -----------------------------------------------------------------------
>
>                 Key: HDFS-9723
>                 URL: https://issues.apache.org/jira/browse/HDFS-9723
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Xiaoyu Yao
>            Assignee: Xiaoyu Yao
>
> The HDFS namenode handles RPC requests from DFS clients and internal
> processing requests from datanodes. It has been a recurring pain that some
> bad jobs overwhelm the namenode and bring the whole cluster down. FCQ (Fair
> Call Queue), introduced by HADOOP-9640, is one of the existing efforts,
> available since Hadoop 2.4, to address this issue.
> In the current FCQ implementation, incoming RPC calls are scheduled based on
> the number of recent RPC calls from each user, using a time-decayed
> scheduler. This works well when there is a clear mapping between users and
> the RPC calls from their jobs. However, it may not work effectively when
> calls are hard to trace back to a specific caller in a chain of operations
> within a workflow (e.g., Oozie -> Hive -> YARN). It is not feasible for
> operators/administrators to throttle all Hive jobs because of one “bad”
> query.
> This JIRA proposes to leverage the RPC caller context information (such as
> callerType: caller Id from TEZ-2851) available with HDFS-9184 as an
> alternative to the existing UGI-based (or user-name-based, when a delegation
> token is not available) Identity Provider, to improve the effectiveness of
> the Hadoop RPC Fair Call Queue (HADOOP-9640) for better namenode throttling
> in multi-tenant cluster deployments.
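
To illustrate the idea in the description, here is a minimal Python sketch of
a time-decayed scheduler whose notion of "caller" is pluggable. It is not
Hadoop code and does not use Hadoop's API: the names Call, DecayScheduler,
user_identity and caller_context_identity, the decay factor, the share
thresholds, and the sample users and context strings are all assumptions made
up for this example. It only shows why keying the call counts on caller
context, rather than on user name, could keep one heavy query from dragging
down every call that runs as the same user.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Call:
    """Hypothetical stand-in for an incoming RPC call."""
    user: str
    caller_context: Optional[str] = None


def user_identity(call: Call) -> str:
    """Identify a call by its user name only."""
    return call.user


def caller_context_identity(call: Call) -> str:
    """Identify a call by caller context, falling back to the user name."""
    return call.caller_context or call.user


class DecayScheduler:
    """Toy scheduler: heavier recent callers get a higher (worse) level."""

    def __init__(self, identity: Callable[[Call], str],
                 levels: int = 4, decay: float = 0.5) -> None:
        self.identity = identity
        self.levels = levels      # level 0 is served first (assumption)
        self.decay = decay        # illustrative decay factor (assumption)
        self.counts: Dict[str, float] = {}

    def record(self, call: Call) -> None:
        key = self.identity(call)
        self.counts[key] = self.counts.get(key, 0.0) + 1.0

    def decay_counts(self) -> None:
        """Age all counts; meant to be called once per decay period."""
        for key in self.counts:
            self.counts[key] *= self.decay

    def priority(self, call: Call) -> int:
        total = sum(self.counts.values())
        if total == 0:
            return 0
        share = self.counts.get(self.identity(call), 0.0) / total
        # Illustrative thresholds: a share of 1/2 or more maps to the last
        # level, 1/4 or more to the one before it, and so on.
        level, threshold = self.levels - 1, 0.5
        while level > 0 and share < threshold:
            level -= 1
            threshold /= 2
        return level


# One "bad" query and one well-behaved query, both running as user "hive".
bad = Call(user="hive", caller_context="query_bad")
good = Call(user="hive", caller_context="query_good")
other = Call(user="alice")

by_user = DecayScheduler(user_identity)
by_context = DecayScheduler(caller_context_identity)
for scheduler in (by_user, by_context):
    for call, n in ((bad, 90), (good, 5), (other, 5)):
        for _ in range(n):
            scheduler.record(call)

print(by_user.priority(good), by_context.priority(good), by_context.priority(bad))
```

In this sketch, keying on user name puts the well-behaved query in the same
deprioritized bucket as the bad one, because both run as "hive". Keying on
caller context deprioritizes only the bad query.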



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
