[jira] [Updated] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27785:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -
>
> Key: SPARK-27785
> URL: https://issues.apache.org/jira/browse/SPARK-27785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Josh Rosen
>Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables: 
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining 
> on a third inner join requires users to specify a complicated join condition 
> (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
> resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
> even more painful if you want to layer on a fourth join. Using {{.map()}} to 
> flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of 
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
> reasonable number (say, 6). For example:
> {code:java}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   condition: Column
> ): Dataset[(T, T1, T2)]
> Dataset[T].joinWith[T1, T2, T3](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3],
>   condition: Column
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join 
> type for {{joinWith}} when no join type is specified).
> I haven't thought about this too much yet and am not committed to the API 
> proposed above (it's just my initial idea), so I'm open to suggestions for 
> alternative typed APIs for this.
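The nested-schema problem described above can be sketched with plain Scala tuples, no Spark required. This is an illustration only: the tuple values stand in for {{Dataset}} rows, and {{flattenJoin}} mirrors the extra {{.map()}} step the description says carries a performance penalty.

```scala
// Plain-Scala illustration of the schema shapes described above: chaining two
// pairwise joinWith calls yields a doubly-nested pair ((A, B), C), and an
// extra flattening step is needed to reach the flat (A, B, C) shape.
object NestedJoinShape {
  // Mirrors the .map() a user must write today to flatten the nested result.
  def flattenJoin[A, B, C](nested: ((A, B), C)): (A, B, C) =
    (nested._1._1, nested._1._2, nested._2)

  def main(args: Array[String]): Unit =
    // Shape of ds1.joinWith(ds2).joinWith(ds3), then flattened.
    println(flattenJoin(((1, "b"), 2.0)))
}
```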



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2019-05-20 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27785:
---
Description: 
Today it's rather painful to do a typed dataset join of more than two tables: 
{{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining on 
a third inner join requires users to specify a complicated join condition 
(referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
even more painful if you want to layer on a fourth join. Using {{.map()}} to 
flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.

To simplify this use case, I propose to introduce a new set of overloads of 
{{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
reasonable number (say, 6). For example:
{code:java}
Dataset[T].joinWith[T1, T2](
  ds1: Dataset[T1],
  ds2: Dataset[T2]
): Dataset[(T, T1, T2)]

Dataset[T].joinWith[T1, T2, T3](
  ds1: Dataset[T1],
  ds2: Dataset[T2],
  ds3: Dataset[T3]
): Dataset[(T, T1, T2, T3)]{code}
I propose to do this only for inner joins (consistent with the default join 
type for {{joinWith}} when no join type is specified).

I haven't thought about this too much yet and am not committed to the API 
proposed above (it's just my initial idea), so I'm open to suggestions for 
alternative typed APIs for this.

  was:
Today it's rather painful to do a typed dataset join of more than two tables: 
{{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining on 
a third inner join requires users to specify a complicated join condition 
(referencing {{_1}}), resulting in a doubly-nested schema like {{Dataset[((A, B), 
C)]}}. Things become even more painful if you want to layer on a fourth join. 
Using {{.map()}} to flatten the data into {{Dataset[(A, B, C)]}} has a 
performance penalty, too.

To simplify this use case, I propose to introduce a new set of overloads of 
{{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
reasonable number (say, 6). For example:
{code:java}
Dataset[T].joinWith[T1, T2](
  ds1: Dataset[T1],
  ds2: Dataset[T2]
): Dataset[(T, T1, T2)]

Dataset[T].joinWith[T1, T2, T3](
  ds1: Dataset[T1],
  ds2: Dataset[T2],
  ds3: Dataset[T3]
): Dataset[(T, T1, T2, T3)]{code}
I propose to do this only for inner joins (consistent with the default join 
type for {{joinWith}} when no join type is specified).

I haven't thought about this too much yet and am not committed to the API 
proposed above (it's just my initial idea), so I'm open to suggestions for 
alternative typed APIs for this.


> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -
>
> Key: SPARK-27785
> URL: https://issues.apache.org/jira/browse/SPARK-27785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables: 
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining 
> on a third inner join requires users to specify a complicated join condition 
> (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
> resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
> even more painful if you want to layer on a fourth join. Using {{.map()}} to 
> flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of 
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
> reasonable number (say, 6). For example:
> {code:java}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2]
> ): Dataset[(T, T1, T2)]
> Dataset[T].joinWith[T1, T2, T3](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3]
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join 
> type for {{joinWith}} when no join type is specified).
> I haven't thought about this too much yet and am not committed to the API 
> proposed above (it's just my initial idea), so I'm open to suggestions for 
> alternative typed APIs for this.
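The proposed three-way overload can be sketched as a toy over in-memory {{Seq}}s in place of {{Dataset}}s. This overload does not exist in Spark; the name and shape below are hypothetical, and the {{Column}} condition is modeled as a plain predicate over the three row types.

```scala
// Toy sketch of the proposed three-way inner joinWith, over Seqs instead of
// Datasets. It crosses the three inputs and keeps rows satisfying the
// condition, yielding a flat (T, T1, T2) tuple directly with no nesting.
object JoinWith3Sketch {
  def joinWith3[T, T1, T2](left: Seq[T], ds1: Seq[T1], ds2: Seq[T2])
                          (condition: (T, T1, T2) => Boolean): Seq[(T, T1, T2)] =
    for {
      t  <- left
      t1 <- ds1
      t2 <- ds2
      if condition(t, t1, t2)
    } yield (t, t1, t2)

  def main(args: Array[String]): Unit =
    // Equi-join of three inputs on the row value itself.
    println(joinWith3(Seq(1, 2), Seq(1, 2), Seq(1, 2))((a, b, c) => a == b && b == c))
}
```

A real implementation would of course push the condition into Catalyst rather than evaluate a cross product, but the sketch shows the flat result shape the overload would give users.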



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Updated] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2019-05-20 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27785:
---
Description: 
Today it's rather painful to do a typed dataset join of more than two tables: 
{{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining on 
a third inner join requires users to specify a complicated join condition 
(referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
even more painful if you want to layer on a fourth join. Using {{.map()}} to 
flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.

To simplify this use case, I propose to introduce a new set of overloads of 
{{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
reasonable number (say, 6). For example:
{code:java}
Dataset[T].joinWith[T1, T2](
  ds1: Dataset[T1],
  ds2: Dataset[T2],
  condition: Column
): Dataset[(T, T1, T2)]

Dataset[T].joinWith[T1, T2, T3](
  ds1: Dataset[T1],
  ds2: Dataset[T2],
  ds3: Dataset[T3],
  condition: Column
): Dataset[(T, T1, T2, T3)]{code}
I propose to do this only for inner joins (consistent with the default join 
type for {{joinWith}} when no join type is specified).

I haven't thought about this too much yet and am not committed to the API 
proposed above (it's just my initial idea), so I'm open to suggestions for 
alternative typed APIs for this.

  was:
Today it's rather painful to do a typed dataset join of more than two tables: 
{{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining on 
a third inner join requires users to specify a complicated join condition 
(referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
even more painful if you want to layer on a fourth join. Using {{.map()}} to 
flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.

To simplify this use case, I propose to introduce a new set of overloads of 
{{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
reasonable number (say, 6). For example:
{code:java}
Dataset[T].joinWith[T1, T2](
  ds1: Dataset[T1],
  ds2: Dataset[T2]
): Dataset[(T, T1, T2)]

Dataset[T].joinWith[T1, T2, T3](
  ds1: Dataset[T1],
  ds2: Dataset[T2],
  ds3: Dataset[T3]
): Dataset[(T, T1, T2, T3)]{code}
I propose to do this only for inner joins (consistent with the default join 
type for {{joinWith}} when no join type is specified).

I haven't thought about this too much yet and am not committed to the API 
proposed above (it's just my initial idea), so I'm open to suggestions for 
alternative typed APIs for this.


> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -
>
> Key: SPARK-27785
> URL: https://issues.apache.org/jira/browse/SPARK-27785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables: 
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining 
> on a third inner join requires users to specify a complicated join condition 
> (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
> resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
> even more painful if you want to layer on a fourth join. Using {{.map()}} to 
> flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of 
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
> reasonable number (say, 6). For example:
> {code:java}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   condition: Column
> ): Dataset[(T, T1, T2)]
> Dataset[T].joinWith[T1, T2, T3](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3],
>   condition: Column
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join 
> type for {{joinWith}} when no join type is specified).
> I haven't thought about this too much yet and am not committed to the API 
> proposed above (it's just my initial idea), so I'm open to suggestions for 
> alternative typed APIs for this.






[jira] [Updated] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2019-05-20 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27785:
---
Summary: Introduce .joinWith() overloads for typed inner joins of 3 or more 
tables  (was: Introduce .joinWith() overload for inner join of 3 or more tables)

> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -
>
> Key: SPARK-27785
> URL: https://issues.apache.org/jira/browse/SPARK-27785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables: 
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining 
> on a third inner join requires users to specify a complicated join condition 
> (referencing {{_1}}), resulting in a doubly-nested schema like {{Dataset[((A, 
> B), C)]}}. Things become even more painful if you want to layer on a fourth 
> join. Using {{.map()}} to flatten the data into {{Dataset[(A, B, C)]}} has a 
> performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of 
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
> reasonable number (say, 6). For example:
> {code:java}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2]
> ): Dataset[(T, T1, T2)]
> Dataset[T].joinWith[T1, T2, T3](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3]
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join 
> type for {{joinWith}} when no join type is specified).
> I haven't thought about this too much yet and am not committed to the API 
> proposed above (it's just my initial idea), so I'm open to suggestions for 
> alternative typed APIs for this.


