[GitHub] [incubator-iceberg] rdblue commented on issue #430: Support bucket table for Iceberg

2019-09-24 Thread GitBox
rdblue commented on issue #430: Support bucket table for Iceberg
URL: 
https://github.com/apache/incubator-iceberg/issues/430#issuecomment-534654396
 
 
   @jerryshao, yes that's correct.
   
   That's why we need to expose the transformation functions to Spark via 
FunctionCatalog, and add the ability for DSv2 sources to set distribution and 
ordering requirements with those functions.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #430: Support bucket table for Iceberg

2019-09-19 Thread GitBox
rdblue commented on issue #430: Support bucket table for Iceberg
URL: 
https://github.com/apache/incubator-iceberg/issues/430#issuecomment-533360026
 
 
   Thanks @jerryshao! I had a look at the doc and made some comments.
   
   The main thing is that Iceberg already supports bucketing and has solved 
many of the challenges you identified, like schema evolution. There are two 
remaining problems:
   1. Writing requires users to [cluster data into buckets using a 
UDF](https://github.com/apache/incubator-iceberg/issues/274).
   2. Bucketed joins can't take advantage of Iceberg bucket values.
   
   For #1, we need to allow Iceberg to control the `requiredChildDistribution` 
and `requiredChildOrdering` returned by 
[`WriteToDataSourceV2Exec`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L318).
 Here's [a gist that shows what we use to automatically insert 
distribution/ordering 
requirements](https://gist.github.com/rdblue/468cff86ffcdd07dcea55520ab9c267c) 
that allow automatic bucketed writes. That also depends on #317.
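   
   To illustrate why the write side needs a distribution/ordering requirement, here is a minimal Python sketch (not Spark or Iceberg code). It assumes a stand-in `bucket` function built on `zlib.crc32`; Iceberg's real bucket transform is specified with a 32-bit Murmur3 hash, so the bucket numbers here will not match Iceberg's. `NUM_BUCKETS`, `cluster_for_write`, and `write_files` are hypothetical names. The point is only that clustering rows by bucket value before the write lets a writer produce one file per bucket instead of reopening a bucket every time its value reappears.

```python
import zlib
from itertools import groupby

NUM_BUCKETS = 4  # hypothetical bucket count for this sketch


def bucket(value, num_buckets=NUM_BUCKETS):
    """Stand-in bucket transform: hash, clear the sign bit, take the modulus.

    Assumption: crc32 replaces the Murmur3 hash that Iceberg specifies.
    """
    h = zlib.crc32(str(value).encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_buckets


def cluster_for_write(rows, key):
    """Order rows by bucket so that each bucket's rows are contiguous."""
    return sorted(rows, key=lambda row: bucket(row[key]))


def write_files(rows, key):
    """Pretend writer: starts a new 'file' whenever the bucket value changes."""
    return [
        (b, list(group))
        for b, group in groupby(rows, key=lambda row: bucket(row[key]))
    ]


rows = [{"id": f"user-{i}"} for i in range(20)]

# Without clustering, a bucket may be opened many times.
unclustered_files = write_files(rows, "id")

# With clustering, each bucket is written exactly once.
clustered_files = write_files(cluster_for_write(rows, "id"), "id")
```

   In Spark the equivalent of `cluster_for_write` would be a shuffle and sort that the planner inserts on the source's behalf, which is what the required distribution and ordering are meant to request.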
   
   We also need a [FunctionCatalog](https://github.com/apache/spark/pull/24559) 
that allows us to return Iceberg transforms as UDFs that Spark can use.
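   
   As a rough picture of what a function catalog provides, the Python sketch below keeps a registry of named transforms that an engine can look up and apply to column values. Everything here is hypothetical: `InMemoryFunctionCatalog`, `register`, `load_function`, and the name `bucket_16` are illustration only and are not the Spark `FunctionCatalog` API from the linked pull request. The hash is again a `zlib.crc32` stand-in rather than Iceberg's Murmur3-based transform.

```python
import zlib
from typing import Callable, Dict


def make_bucket(num_buckets: int) -> Callable[[object], int]:
    """Build a stand-in bucket transform for a fixed number of buckets."""

    def bucket(value: object) -> int:
        # Assumption: crc32 is used here in place of Iceberg's Murmur3 hash.
        h = zlib.crc32(str(value).encode("utf-8"))
        return (h & 0x7FFFFFFF) % num_buckets

    return bucket


class InMemoryFunctionCatalog:
    """Hypothetical catalog that hands out a source's transforms by name."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable] = {}

    def register(self, name: str, fn: Callable) -> None:
        self._functions[name] = fn

    def load_function(self, name: str) -> Callable:
        if name not in self._functions:
            raise KeyError(f"no such function: {name}")
        return self._functions[name]


# The source registers its transform once...
catalog = InMemoryFunctionCatalog()
catalog.register("bucket_16", make_bucket(16))

# ...and the engine resolves it by name and applies it like any other UDF.
bucket_fn = catalog.load_function("bucket_16")
bucket_values = [bucket_fn(v) for v in ["a", "b", "c"]]
```

   The design point is that the engine never reimplements the transform: it asks the catalog for the function the table was written with, so both sides agree on bucket values.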
   
   For #2, we are planning to add support for Spark to be able to use bucket 
values to speed up joins. We aren't quite sure how to do this yet, but we know 
that Spark will need to recognize that a table is bucketed (using the Table's 
partitioning), get the bucket function from the table's catalog (using 
FunctionCatalog) and use that function to prepare data for the other side of 
the join. If the other side of the join uses the same partition function, then 
we can avoid a shuffle for that side of the join as well.
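   
   The reasoning above can be sketched in plain Python (not Spark code). The assumption is that both sides are partitioned with the same bucket function and the same bucket count, so rows with equal join keys always land in the same bucket and each bucket pair can be joined locally. The `bucket` function uses `zlib.crc32` as a stand-in for Iceberg's Murmur3-based transform, and `partition_by_bucket`, `bucketed_join`, and `plain_join` are hypothetical helper names.

```python
import zlib
from collections import defaultdict


def bucket(value, num_buckets):
    """Stand-in bucket transform (crc32 instead of Iceberg's Murmur3 hash)."""
    h = zlib.crc32(str(value).encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_buckets


def partition_by_bucket(rows, key, num_buckets):
    """Group rows by the bucket of their join key, as a bucketed table would."""
    parts = defaultdict(list)
    for row in rows:
        parts[bucket(row[key], num_buckets)].append(row)
    return parts


def hash_join(left_rows, right_rows, key):
    """Inner join two row lists on `key` using an in-memory hash index."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[key]].append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


def bucketed_join(left_parts, right_parts, key):
    """Join bucket i of the left side only with bucket i of the right side."""
    out = []
    for b, left_rows in left_parts.items():
        out.extend(hash_join(left_rows, right_parts.get(b, []), key))
    return out


def plain_join(left_rows, right_rows, key):
    """Reference join over all rows, ignoring bucketing."""
    return hash_join(left_rows, right_rows, key)


N = 8
users = [{"id": i, "name": f"user-{i}"} for i in range(50)]
orders = [{"id": i % 60, "total": i * 10} for i in range(0, 120, 3)]

joined = bucketed_join(
    partition_by_bucket(users, "id", N),
    partition_by_bucket(orders, "id", N),
    "id",
)
reference = plain_join(users, orders, "id")
```

   If one side is not already bucketed, the engine would apply the same function to it first; that is the "prepare data for the other side of the join" step, and it is the only shuffle needed.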
   
   Hopefully this short write-up and the comments I left on the doc give you an 
idea of the current status of bucketed joins. Thanks for working on this!





[GitHub] [incubator-iceberg] rdblue commented on issue #430: Support bucket table for Iceberg

2019-09-05 Thread GitBox
rdblue commented on issue #430: Support bucket table for Iceberg
URL: 
https://github.com/apache/incubator-iceberg/issues/430#issuecomment-528642221
 
 
   @jerryshao, thanks for posting this! I'll take a look as soon as I can, but 
I'm going to be at a conference next week so it may not be quick.

