rdblue commented on issue #430: Support bucket table for Iceberg
URL:
https://github.com/apache/incubator-iceberg/issues/430#issuecomment-533360026
Thanks @jerryshao! I had a look at the doc and made some comments.
The main thing is that Iceberg already supports bucketing and has solved
many of the challenges you identified, like schema evolution. There are two
remaining problems:
1. Writing requires users to [cluster data into buckets using a
UDF](https://github.com/apache/incubator-iceberg/issues/274).
2. Bucketed joins can't take advantage of Iceberg bucket values.
For #1, we need to allow Iceberg to control the `requiredChildDistribution`
and `requiredChildOrdering` returned by
[`WriteToDataSourceV2Exec`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L318).
Here's [a gist that shows what we use to automatically insert
distribution/ordering
requirements](https://gist.github.com/rdblue/468cff86ffcdd07dcea55520ab9c267c)
that allow automatic bucketed writes. That also depends on #317.
We also need a [FunctionCatalog](https://github.com/apache/spark/pull/24559)
that allows us to return Iceberg transforms as UDFs that Spark can use.
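
To make the write-side requirement concrete, here is a minimal Python sketch of what a required distribution and ordering achieve: rows are grouped by bucket value so each write task sees a single bucket, then sorted within the task. This is not Spark or Iceberg code. The `bucket` function is a stand-in based on CRC32 (Iceberg's real bucket transform uses a different hash), and `NUM_BUCKETS`, `cluster_for_write`, and the row layout are hypothetical names chosen for illustration.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count for illustration


def bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stand-in bucket function (CRC32 mod N); not Iceberg's actual hash."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def cluster_for_write(rows, key):
    """Group rows by bucket (the distribution), then sort each group (the ordering)."""
    tasks = defaultdict(list)
    for row in rows:
        tasks[bucket(row[key])].append(row)
    # Each entry models one write task that only ever sees a single bucket.
    return {b: sorted(group, key=lambda r: r[key]) for b, group in tasks.items()}


rows = [{"id": f"user-{i}"} for i in range(10)]
clustered = cluster_for_write(rows, "id")
```

Today users have to do this clustering themselves with a UDF before writing; the goal is for the table to request it so that Spark inserts the shuffle and sort automatically.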
For #2, we are planning to add support for Spark to be able to use bucket
values to speed up joins. We aren't quite sure how to do this yet, but we know
that Spark will need to recognize that a table is bucketed (using the Table's
partitioning), get the bucket function from the table's catalog (using
FunctionCatalog) and use that function to prepare data for the other side of
the join. If the other side of the join uses the same partition function, then
we can avoid a shuffle for that side of the join as well.
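
The idea behind a bucketed join can be sketched in plain Python as follows. When both sides are bucketed with the same function and bucket count, matching keys always land in the same bucket index, so bucket `i` on the left only needs to be joined with bucket `i` on the right and no cross-bucket data movement is required. This is an illustration under assumptions, not how Spark would implement it: the CRC32-based `bucket` function is a stand-in for the table's real transform, and `bucket_rows` and `bucketed_join` are hypothetical helper names.

```python
import zlib
from collections import defaultdict


def bucket(value: str, num_buckets: int) -> int:
    """Stand-in bucket function (CRC32 mod N); not Iceberg's actual hash."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def bucket_rows(rows, key, num_buckets):
    """Prepare one side of the join by placing each row into its bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket(row[key], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Join bucket i of the left side only with bucket i of the right side."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both sides must use the same number of buckets")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a per-bucket hash index on the right side.
        index = defaultdict(list)
        for row in right:
            index[row[key]].append(row)
        for row in left:
            for match in index.get(row[key], []):
                joined.append({**row, **match})
    return joined


users = [{"id": f"user-{i}", "name": f"name-{i}"} for i in range(8)]
events = [{"id": f"user-{i % 5}", "event": f"click-{i}"} for i in range(12)]

result = bucketed_join(
    bucket_rows(users, "id", 4),
    bucket_rows(events, "id", 4),
    "id",
)
```

If one side is already stored in buckets, only the other side has to be prepared with `bucket_rows`; if both sides are already bucketed the same way, neither side needs that step.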
Hopefully this short write-up and the comments I left on the doc give you an
idea of the current status of bucketed joins. Thanks for working on this!