[jira] [Assigned] (SPARK-26739) Standardized Join Types for DataFrames

2019-03-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26739:


Assignee: (was: Apache Spark)

> Standardized Join Types for DataFrames
> --
>
> Key: SPARK-26739
> URL: https://issues.apache.org/jira/browse/SPARK-26739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Skyler Lehan
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> h3. *Q1.* What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Currently, in the join functions on 
> [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
>  the join types are defined via a string parameter called joinType. In order 
> for a developer to know which joins are possible, they must look up the API 
> call for join. While this works fine, it can cause the developer to make a 
> typo resulting in improper joins and/or unexpected errors that aren't evident 
> at compile time. The objective of this improvement would be to allow 
> developers to use a common definition for join types (by enum or constants) 
> called JoinTypes. This would contain the possible joins and remove the 
> possibility of a typo. It would also allow Spark to alter the names of the 
> joins in the future without impacting end-users.
> h3. *Q2.* What problem is this proposal NOT designed to solve?
> The problem this solves is extremely narrow, it would not solve anything 
> other than providing a common definition for join types.
> h3. *Q3.* How is it done today, and what are the limits of current practice?
> Currently, developers must join two DataFrames like so:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> "left_outer")
> {code}
> Where they manually type the join type. As stated above, this:
>  * Requires developers to manually type in the join
>  * Can cause possibility of typos
>  * Restricts renaming of join types as its a literal string
>  * Does not restrict and/or compile check the join type being used, leading 
> to runtime errors
> h3. *Q4.* What is new in your approach and why do you think it will be 
> successful?
> The new approach would use constants or *more preferably an enum*, something 
> like this:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> JoinType.LEFT_OUTER)
> {code}
> This would provide:
>  * In code reference/definitions of the possible join types
>  ** This subsequently allows the addition of scaladoc of what each join type 
> does and how it operates
>  * Removes possibilities of a typo on the join type
>  * Provides compile time checking of the join type (only if an enum is used)
> To clarify, if JoinType is a constant, it would just fill in the joinType 
> string parameter for users. If an enum is used, it would restrict the domain 
> of possible join types to whatever is defined in the future JoinType enum. 
> The enum is preferred, however it would take longer to implement.
> h3. *Q5.* Who cares? If you are successful, what difference will it make?
> Developers using Apache Spark will care. This will make the join function 
> easier to wield and lead to less runtime errors. It will save time by 
> bringing join type validation at compile time. It will also provide in code 
> reference to the join types, which saves the developer time of having to look 
> up and navigate the multiple join functions to find the possible join types. 
> In addition to that, the resulting constants/enum would have documentation on 
> how that join type works.
> h3. *Q6.* What are the risks?
> Users of Apache Spark who currently use strings to define their join types 
> could be impacted if an enum is chosen as the common definition. This risk 
> can be mitigated by using string constants. The string constants would be the 
> exact same string as the string literals used today. For example:
> {code:java}
> JoinType.INNER = "inner"
> {code}
> If an enum is still the preferred way of defining the join types, new join 
> functions could be added that take in these enums and the join calls that 
> contain string parameters for joinType could be deprecated. This would give 
> developers a chance to change over to the new join types.
> h3. *Q7.* How long will it take?
> A few days for a seasoned Spark developer.
> h3. *Q8.* What are the mid-term and final "exams" to check for success?
> Mid-term exam would be the addition of a common definition of the join types 
> and additional join functions that take in the join type enum/constant. The 
> final exam would be working tests written to check the functionality of these 
> new join functions and the join functions that take a string for joinType 

[jira] [Assigned] (SPARK-26739) Standardized Join Types for DataFrames

2019-03-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26739:


Assignee: Apache Spark

> Standardized Join Types for DataFrames
> --
>
> Key: SPARK-26739
> URL: https://issues.apache.org/jira/browse/SPARK-26739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Skyler Lehan
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> h3. *Q1.* What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Currently, in the join functions on 
> [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
>  the join types are defined via a string parameter called joinType. In order 
> for a developer to know which joins are possible, they must look up the API 
> call for join. While this works fine, it can cause the developer to make a 
> typo resulting in improper joins and/or unexpected errors that aren't evident 
> at compile time. The objective of this improvement would be to allow 
> developers to use a common definition for join types (by enum or constants) 
> called JoinTypes. This would contain the possible joins and remove the 
> possibility of a typo. It would also allow Spark to alter the names of the 
> joins in the future without impacting end-users.
> h3. *Q2.* What problem is this proposal NOT designed to solve?
> The problem this solves is extremely narrow, it would not solve anything 
> other than providing a common definition for join types.
> h3. *Q3.* How is it done today, and what are the limits of current practice?
> Currently, developers must join two DataFrames like so:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> "left_outer")
> {code}
> Where they manually type the join type. As stated above, this:
>  * Requires developers to manually type in the join
>  * Can cause possibility of typos
>  * Restricts renaming of join types as its a literal string
>  * Does not restrict and/or compile check the join type being used, leading 
> to runtime errors
> h3. *Q4.* What is new in your approach and why do you think it will be 
> successful?
> The new approach would use constants or *more preferably an enum*, something 
> like this:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> JoinType.LEFT_OUTER)
> {code}
> This would provide:
>  * In code reference/definitions of the possible join types
>  ** This subsequently allows the addition of scaladoc of what each join type 
> does and how it operates
>  * Removes possibilities of a typo on the join type
>  * Provides compile time checking of the join type (only if an enum is used)
> To clarify, if JoinType is a constant, it would just fill in the joinType 
> string parameter for users. If an enum is used, it would restrict the domain 
> of possible join types to whatever is defined in the future JoinType enum. 
> The enum is preferred, however it would take longer to implement.
> h3. *Q5.* Who cares? If you are successful, what difference will it make?
> Developers using Apache Spark will care. This will make the join function 
> easier to wield and lead to less runtime errors. It will save time by 
> bringing join type validation at compile time. It will also provide in code 
> reference to the join types, which saves the developer time of having to look 
> up and navigate the multiple join functions to find the possible join types. 
> In addition to that, the resulting constants/enum would have documentation on 
> how that join type works.
> h3. *Q6.* What are the risks?
> Users of Apache Spark who currently use strings to define their join types 
> could be impacted if an enum is chosen as the common definition. This risk 
> can be mitigated by using string constants. The string constants would be the 
> exact same string as the string literals used today. For example:
> {code:java}
> JoinType.INNER = "inner"
> {code}
> If an enum is still the preferred way of defining the join types, new join 
> functions could be added that take in these enums and the join calls that 
> contain string parameters for joinType could be deprecated. This would give 
> developers a chance to change over to the new join types.
> h3. *Q7.* How long will it take?
> A few days for a seasoned Spark developer.
> h3. *Q8.* What are the mid-term and final "exams" to check for success?
> Mid-term exam would be the addition of a common definition of the join types 
> and additional join functions that take in the join type enum/constant. The 
> final exam would be working tests written to check the functionality of these 
> new join functions and the join functions that