[jira] [Assigned] (KUDU-3135) Add Client Metadata Tokens
[ https://issues.apache.org/jira/browse/KUDU-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Henke reassigned KUDU-3135: - Assignee: (was: Grant Henke) > Add Client Metadata Tokens > -- > > Key: KUDU-3135 > URL: https://issues.apache.org/jira/browse/KUDU-3135 > Project: Kudu > Issue Type: Improvement > Components: client >Affects Versions: 1.12.0 >Reporter: Grant Henke >Priority: Major > Labels: roadmap-candidate, scalability > > Currently when a distributed task is done using the Kudu client, the > driver/coordinator client needs to open the table to request its current > metadata and locations. Then it can distribute the work to tasks/executors on > remote nodes. In the case of reading data, often ScanTokens are used to > distribute the work, and in the case of writing data perhaps just the table > name is required. > The problem is that each parallel task then also needs to open the table to > request the metadata for the table. Using Spark as an example, this happens > when deserializing the scan tokens in KuduRDD > ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala#L107-L108]) > or when writing rows using the KuduContext > ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala#L466]). > This results in a large burst of metadata requests to the leader Kudu master > all at once. Given the Kudu master is only a single server and requests can't > be served from the follower masters, this effectively limits the amount of > parallel tasks that can run in a large Kudu deployment. Even if the follower > masters could service the requests, that still limits scalability in very > large clusters given most deployments would only have 3-5 masters. > Adding a metadata token, similar to a scan token, would be a useful way to > allow the single driver to fetch all the metadata required for the parallel > tasks. The tokens can be serialized and then passed to each task in a similar > fashion to scan tokens. > Of course in a pessimistic case, something may change between generation of > the token and the start of the task. In that case a request would need to be > sent to get the updated metadata. However, that scenario should be rare and > likely would not result in all of the requests happening at the same time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-3135) Add Client Metadata Tokens
[ https://issues.apache.org/jira/browse/KUDU-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Henke reassigned KUDU-3135: - Assignee: Grant Henke > Add Client Metadata Tokens > -- > > Key: KUDU-3135 > URL: https://issues.apache.org/jira/browse/KUDU-3135 > Project: Kudu > Issue Type: Improvement > Components: client >Affects Versions: 1.12.0 >Reporter: Grant Henke >Assignee: Grant Henke >Priority: Major > Labels: roadmap-candidate, scalability > > Currently when a distributed task is done using the Kudu client, the > driver/coordinator client needs to open the table to request its current > metadata and locations. Then it can distribute the work to tasks/executors on > remote nodes. In the case of reading data, often ScanTokens are used to > distribute the work, and in the case of writing data perhaps just the table > name is required. > The problem is that each parallel task then also needs to open the table to > request the metadata for the table. Using Spark as an example, this happens > when deserializing the scan tokens in KuduRDD > ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala#L107-L108]) > or when writing rows using the KuduContext > ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala#L466]). > This results in a large burst of metadata requests to the leader Kudu master > all at once. Given the Kudu master is only a single server and requests can't > be served from the follower masters, this effectively limits the amount of > parallel tasks that can run in a large Kudu deployment. Even if the follower > masters could service the requests, that still limits scalability in very > large clusters given most deployments would only have 3-5 masters. > Adding a metadata token, similar to a scan token, would be a useful way to > allow the single driver to fetch all the metadata required for the parallel > tasks. The tokens can be serialized and then passed to each task in a similar > fashion to scan tokens. > Of course in a pessimistic case, something may change between generation of > the token and the start of the task. In that case a request would need to be > sent to get the updated metadata. However, that scenario should be rare and > likely would not result in all of the requests happening at the same time. -- This message was sent by Atlassian Jira (v8.3.4#803005)