Re: Update Spark 3.3 release window?

2021-10-27 Thread Sean Owen
Seems fine to me - as good a placeholder as anything. Would that be about time to call 2.x end-of-life? On Wed, Oct 27, 2021 at 9:36 PM Hyukjin Kwon wrote: > Hi all, > > Spark 3.2 is out. Shall we update the release window > https://spark.apache.org/versioning-policy.html? > I am thinking of

Update Spark 3.3 release window?

2021-10-27 Thread Hyukjin Kwon
Hi all, Spark 3.2 is out. Shall we update the release window https://spark.apache.org/versioning-policy.html? I am thinking of mid-March 2022 (5 months after the 3.2 release) for the code freeze and onward.

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Ryan Blue
The transform expressions in v2 are logical, not concrete implementations. Even days may have different implementations -- the only expectation is that the partitions are day-sized. For example, you could use a transform that splits days at UTC 00:00, or uses some other day boundary. Because the
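For illustration, a minimal sketch of a v2 table reporting the logical days transform through Spark's public DSv2 expressions API; the object and column names here are hypothetical, and the actual day-boundary computation stays internal to the source:

import org.apache.spark.sql.connector.expressions.{Expressions, Transform}

// Sketch: a v2 Table implementation would override partitioning() to report
// the logical `days` transform; how the source computes the day boundary
// (UTC midnight or otherwise) is not part of the contract.
object DaysTransformSketch {
  def partitioning(): Array[Transform] = Array(Expressions.days("event_ts"))
}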

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Chao Sun
Thanks Wenchen, this is a good question. `BucketTransform` and others currently have no semantic meaning, and I think we should bind them to v2 functions as part of the SPIP. My current proposal is: During query analysis, Spark will try to resolve `XXXTransform`s (in `V2ExpressionUtils`) into
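A rough sketch of what that binding step could look like, assuming the source exposes its transform as a catalog function; the resolveTransform helper below is hypothetical, and the real resolution would live inside Spark's analyzer rather than user code:

import org.apache.spark.sql.connector.catalog.{FunctionCatalog, Identifier}
import org.apache.spark.sql.connector.catalog.functions.BoundFunction
import org.apache.spark.sql.types.StructType

// Hypothetical helper: look up a transform by name in the source's
// FunctionCatalog and bind it to the input type, producing a BoundFunction
// whose semantics Spark can reason about during analysis.
object TransformBindingSketch {
  def resolveTransform(
      catalog: FunctionCatalog,
      transformName: String,
      inputType: StructType): BoundFunction = {
    val unbound = catalog.loadFunction(Identifier.of(Array.empty[String], transformName))
    unbound.bind(inputType)
  }
}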

Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-10-27 Thread L . C . Hsieh
Thanks for the initial feedback. I think the community was previously busy with work related to the Spark 3.2 release. Now that the 3.2 release is done, I'd like to bring this up to the surface again and seek more discussion and feedback. Thanks. On 2021/06/25 15:49:49, huaxin gao wrote: > I

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread L . C . Hsieh
+1 for the SPIP. This is a great improvement and optimization! On 2021/10/26 19:01:03, Erik Krogen wrote: > It's great to see this SPIP going live. Once this is complete, it will > really help Spark to play nicely with a broader data ecosystem (Hive, > Iceberg, Trino, etc.), and it's great to

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
`BucketTransform` is a builtin partition transform in Spark, instead of a UDF from `FunctionCatalog`. Will Iceberg use UDF from `FunctionCatalog` to represent its bucket transform, or use the Spark builtin `BucketTransform`? I'm asking this because other v2 sources may also use the builtin
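For context, the builtin transform in question can be constructed directly through Spark's public DSv2 expressions API; the bucket count and column below are just placeholders:

import org.apache.spark.sql.connector.expressions.{Expressions, Transform}

// The builtin bucket transform Spark already understands, independent of any
// bucket UDF a source might register in its FunctionCatalog.
object BuiltinBucketSketch {
  val bucketTransform: Transform = Expressions.bucket(16, "id")
}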

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Ryan Blue
Two v2 sources may return different bucket IDs for the same value, and this breaks the phase 1 split-wise join. This is why the FunctionCatalog included a canonicalName method (docs
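A hedged sketch of how a source-side bucket function could advertise compatibility via canonicalName; the class name, hash, and naming scheme are made up, and a real source would follow its own spec (e.g. Iceberg's Murmur3 bucketing):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, IntegerType, LongType}

// Two catalogs that bucket values the same way can return the same
// canonicalName, letting Spark treat their partitionings as compatible.
class BucketLongSketch(numBuckets: Int) extends ScalarFunction[Int] {
  override def inputTypes(): Array[DataType] = Array(LongType)
  override def resultType(): DataType = IntegerType
  override def name(): String = "bucket"
  override def canonicalName(): String = s"example.bucket.long($numBuckets)"
  override def produceResult(input: InternalRow): Int = {
    // Placeholder hash only; the canonical name is what matters for matching.
    val v = input.getLong(0)
    (((v % numBuckets) + numBuckets) % numBuckets).toInt
  }
}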

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
IIUC, the general idea is to let each input split report its partition value, so that Spark can perform the join in two phases: 1. join the input splits from the left and right tables according to their partition values and join keys, on the driver side. 2. for each joined input split pair (or a group
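To make the two phases concrete, a minimal sketch with made-up Split and partition-value types; this is illustrative only, not Spark's actual planner code:

// Phase 1 runs on the driver: pair up splits whose partition values match.
// Phase 2 runs on executors: each paired group is joined locally, no shuffle.
object StoragePartitionedJoinSketch {
  case class Split(partitionValue: Seq[Any])

  def planJoin(left: Seq[Split], right: Seq[Split]): Seq[(Seq[Split], Seq[Split])] = {
    val leftGroups = left.groupBy(_.partitionValue)
    val rightGroups = right.groupBy(_.partitionValue)
    leftGroups.keySet.intersect(rightGroups.keySet).toSeq
      .map(k => (leftGroups(k), rightGroups(k)))
  }
}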