[jira] [Assigned] (KUDU-3135) Add Client Metadata Tokens

2021-07-13 Thread Grant Henke (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke reassigned KUDU-3135:
-

Assignee: (was: Grant Henke)

> Add Client Metadata Tokens
> --
>
> Key: KUDU-3135
> URL: https://issues.apache.org/jira/browse/KUDU-3135
> Project: Kudu
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 1.12.0
>Reporter: Grant Henke
>Priority: Major
>  Labels: roadmap-candidate, scalability
>
> Currently when a distributed task is done using the Kudu client, the 
> driver/coordinator client needs to open the table to request its current 
> metadata and locations. Then it can distribute the work to tasks/executors on 
> remote nodes. In the case of reading data, often ScanTokens are used to 
> distribute the work, and in the case of writing data perhaps just the table 
> name is required.
> The problem is that each parallel task then also needs to open the table to 
> request the metadata for the table. Using Spark as an example, this happens 
> when deserializing the scan tokens in KuduRDD 
> ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala#L107-L108])
>  or when writing rows using the KuduContext 
> ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala#L466]).
>  This results in a large burst of metadata requests to the leader Kudu master 
> all at once. Given the Kudu master is only a single server and requests can't 
> be served from the follower masters, this effectively limits the amount of 
> parallel tasks that can run in a large Kudu deployment. Even if the follower 
> masters could service the requests, that still limits scalability in very 
> large clusters given most deployments would only have 3-5 masters.
> Adding a metadata token, similar to a scan token, would be a useful way to 
> allow the single driver to fetch all the metadata required for the parallel 
> tasks. The tokens can be serialized and then passed to each task in a similar 
> fashion to scan tokens.
> Of course in a pessimistic case, something may change between generation of 
> the token and the start of the task. In that case a request would need to be 
> sent to get the updated metadata. However, that scenario should be rare and 
> likely would not result in all of the requests happening at the same time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-3135) Add Client Metadata Tokens

2020-06-02 Thread Grant Henke (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke reassigned KUDU-3135:
-

Assignee: Grant Henke

> Add Client Metadata Tokens
> --
>
> Key: KUDU-3135
> URL: https://issues.apache.org/jira/browse/KUDU-3135
> Project: Kudu
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 1.12.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Major
>  Labels: roadmap-candidate, scalability
>
> Currently when a distributed task is done using the Kudu client, the 
> driver/coordinator client needs to open the table to request its current 
> metadata and locations. Then it can distribute the work to tasks/executors on 
> remote nodes. In the case of reading data, often ScanTokens are used to 
> distribute the work, and in the case of writing data perhaps just the table 
> name is required.
> The problem is that each parallel task then also needs to open the table to 
> request the metadata for the table. Using Spark as an example, this happens 
> when deserializing the scan tokens in KuduRDD 
> ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala#L107-L108])
>  or when writing rows using the KuduContext 
> ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala#L466]).
>  This results in a large burst of metadata requests to the leader Kudu master 
> all at once. Given the Kudu master is only a single server and requests can't 
> be served from the follower masters, this effectively limits the amount of 
> parallel tasks that can run in a large Kudu deployment. Even if the follower 
> masters could service the requests, that still limits scalability in very 
> large clusters given most deployments would only have 3-5 masters.
> Adding a metadata token, similar to a scan token, would be a useful way to 
> allow the single driver to fetch all the metadata required for the parallel 
> tasks. The tokens can be serialized and then passed to each task in a similar 
> fashion to scan tokens.
> Of course in a pessimistic case, something may change between generation of 
> the token and the start of the task. In that case a request would need to be 
> sent to get the updated metadata. However, that scenario should be rare and 
> likely would not result in all of the requests happening at the same time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)