[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198340472 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -109,6 +134,20 @@ class JDBCOptions( s"When reading JDBC data sources, users need to specify all or none for the following " + s"options: '$JDBC_PARTITION_COLUMN', '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', " + s"and '$JDBC_NUM_PARTITIONS'") + + require(!(query.isDefined && partitionColumn.isDefined), +s""" + |Options '$JDBC_QUERY_STRING' and '$JDBC_PARTITION_COLUMN' can not be specified together. + |Please define the query using `$JDBC_TABLE_NAME` option instead and make sure to qualify + |the partition columns using the supplied subquery alias to resolve any ambiguity. + |Example : + |spark.read.format("jdbc") + |.option("dbtable", "(select c1, c2 from t1) as subq") + |.option("partitionColumn", "subq.c1" + |.load() + """.stripMargin + ) --- End diff -- aha, ok. Thanks for the kind explanation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21590 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198315709 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -109,6 +134,20 @@ class JDBCOptions( s"When reading JDBC data sources, users need to specify all or none for the following " + s"options: '$JDBC_PARTITION_COLUMN', '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', " + s"and '$JDBC_NUM_PARTITIONS'") + + require(!(query.isDefined && partitionColumn.isDefined), +s""" + |Options '$JDBC_QUERY_STRING' and '$JDBC_PARTITION_COLUMN' can not be specified together. + |Please define the query using `$JDBC_TABLE_NAME` option instead and make sure to qualify + |the partition columns using the supplied subquery alias to resolve any ambiguity. + |Example : + |spark.read.format("jdbc") + |.option("dbtable", "(select c1, c2 from t1) as subq") + |.option("partitionColumn", "subq.c1" + |.load() + """.stripMargin + ) --- End diff -- @maropu Currently we disallow it to be on the safe side. Lets take your example. When using the query option to pass on the query, we basically expect the users to supply ```SQL select c0 p0, c1 p1, c2 p2 from t where c0 > 1 ``` In spark , we will parentesize the query and add in an alias to confirm to the table subquery syntax. Given the user input the above query, he could decide to qualify the partition column names with the table name. So he could do the following : ``` SQL al df = spark.read .format("jdbc") .option("driver", "org.postgresql.Driver") .option("url", "jdbc:postgresql://localhost:5432/postgres?user=maropu") .option("query", "select c0 p0, c1 p1, c2 p2 from t where c0 > 1") .option("partitionColumn", "t.p2") ==> User qualifies the column names. .option("lowerBound", "1") .option("upperBound", "3") .option("numPartitions", "2") .load() ``` In this case we will end up generating the query of the following form - ``` SQL select * from (select c0 p0, c1 p1, c2 p2 from t where c0 > 1) __SPARK_GEN_ALIAS where t.p2 >= 1 and t.p2 <=3 ``` However this would be an invalid query. In the query option, its possible to specify a complex query involving joins. Thats the reason, we disallow it to be in safe side. In the dbtable option, users are responsible to explicitly specify the alias and would now how to qualify the partition columns. Lets see if we can improve this in future. If you have some ideas, please let us know. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198314687 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -109,6 +134,20 @@ class JDBCOptions( s"When reading JDBC data sources, users need to specify all or none for the following " + s"options: '$JDBC_PARTITION_COLUMN', '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', " + s"and '$JDBC_NUM_PARTITIONS'") + + require(!(query.isDefined && partitionColumn.isDefined), +s""" + |Options '$JDBC_QUERY_STRING' and '$JDBC_PARTITION_COLUMN' can not be specified together. + |Please define the query using `$JDBC_TABLE_NAME` option instead and make sure to qualify + |the partition columns using the supplied subquery alias to resolve any ambiguity. + |Example : + |spark.read.format("jdbc") + |.option("dbtable", "(select c1, c2 from t1) as subq") + |.option("partitionColumn", "subq.c1" + |.load() + """.stripMargin + ) --- End diff -- @dilipbiswal Could you reply @maropu with an example? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198313914 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -86,7 +86,8 @@ class JdbcRelationProvider extends CreatableRelationProvider case SaveMode.ErrorIfExists => throw new AnalysisException( - s"Table or view '${options.table}' already exists. SaveMode: ErrorIfExists.") + s"Table or view '${options.table}' already exists. " + +s"SaveMode: ErrorIfExists.") --- End diff -- No change, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198027023 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -174,3 +209,25 @@ object JDBCOptions { val JDBC_TXN_ISOLATION_LEVEL = newOption("isolationLevel") val JDBC_SESSION_INIT_STATEMENT = newOption("sessionInitStatement") } + +class JdbcOptionsInWrite( --- End diff -- That depends on comitter's decisions: @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198022351 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,32 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + // table name or a table subquery. + val tableOrQuery = (parameters.get(JDBC_TABLE_NAME), parameters.get(JDBC_QUERY_STRING)) match { +case (Some(name), Some(subquery)) => + throw new IllegalArgumentException( +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified at the same time." + ) +case (None, None) => + throw new IllegalArgumentException( +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." --- End diff -- @maropu Sure. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198022370 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,32 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + // table name or a table subquery. + val tableOrQuery = (parameters.get(JDBC_TABLE_NAME), parameters.get(JDBC_QUERY_STRING)) match { +case (Some(name), Some(subquery)) => + throw new IllegalArgumentException( +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified at the same time." + ) +case (None, None) => + throw new IllegalArgumentException( +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) +case (Some(name), None) => + if (name.isEmpty) { +throw new IllegalArgumentException(s"Option '${JDBC_TABLE_NAME}' can not be empty.") --- End diff -- @maropu OK. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198022388 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,32 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + // table name or a table subquery. + val tableOrQuery = (parameters.get(JDBC_TABLE_NAME), parameters.get(JDBC_QUERY_STRING)) match { +case (Some(name), Some(subquery)) => + throw new IllegalArgumentException( +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified at the same time." + ) +case (None, None) => + throw new IllegalArgumentException( +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) +case (Some(name), None) => + if (name.isEmpty) { +throw new IllegalArgumentException(s"Option '${JDBC_TABLE_NAME}' can not be empty.") + } else { +name.trim + } +case (None, Some(subquery)) => + if (subquery.isEmpty) { +throw new IllegalArgumentException(s"Option `${JDBC_QUERY_STRING}` can not be empty.") --- End diff -- @maropu OK. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198022295 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -174,3 +209,25 @@ object JDBCOptions { val JDBC_TXN_ISOLATION_LEVEL = newOption("isolationLevel") val JDBC_SESSION_INIT_STATEMENT = newOption("sessionInitStatement") } + +class JdbcOptionsInWrite( +@transient override val parameters: CaseInsensitiveMap[String]) + extends JDBCOptions(parameters) { + + import JDBCOptions._ + + def this(parameters: Map[String, String]) = this(CaseInsensitiveMap(parameters)) + + def this(url: String, table: String, parameters: Map[String, String]) = { +this(CaseInsensitiveMap(parameters ++ Map( + JDBCOptions.JDBC_URL -> url, + JDBCOptions.JDBC_TABLE_NAME -> table))) + } + + require( +parameters.get(JDBC_TABLE_NAME).isDefined, +s"Option '${JDBCOptions.JDBC_TABLE_NAME}' is required. " + + s"Option '${JDBCOptions.JDBC_QUERY_STRING}' is not applicable while writing.") + + val destinationTable = parameters(JDBC_TABLE_NAME) +} --- End diff -- @maropu I had it as table and refactored it just before i pushed :-). I will change it back. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198022202 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -150,6 +183,7 @@ class JDBCOptions( } object JDBCOptions { --- End diff -- @maropu Sure. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198022168 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,32 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + // table name or a table subquery. + val tableOrQuery = (parameters.get(JDBC_TABLE_NAME), parameters.get(JDBC_QUERY_STRING)) match { +case (Some(name), Some(subquery)) => + throw new IllegalArgumentException( +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified at the same time." + ) +case (None, None) => + throw new IllegalArgumentException( +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) +case (Some(name), None) => + if (name.isEmpty) { +throw new IllegalArgumentException(s"Option '${JDBC_TABLE_NAME}' can not be empty.") + } else { +name.trim + } +case (None, Some(subquery)) => + if (subquery.isEmpty) { +throw new IllegalArgumentException(s"Option `${JDBC_QUERY_STRING}` can not be empty.") + } else { +s"(${subquery}) __SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}" + } + } - // --- End diff -- @maropu OK. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198022141 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -174,3 +209,25 @@ object JDBCOptions { val JDBC_TXN_ISOLATION_LEVEL = newOption("isolationLevel") val JDBC_SESSION_INIT_STATEMENT = newOption("sessionInitStatement") } + +class JdbcOptionsInWrite( --- End diff -- @maropu Can i take this on as a follow-up ? The reason is i am not fully familiar with all the options. I need to study those a bit more before i refactor them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198021171 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,32 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + // table name or a table subquery. + val tableOrQuery = (parameters.get(JDBC_TABLE_NAME), parameters.get(JDBC_QUERY_STRING)) match { +case (Some(name), Some(subquery)) => + throw new IllegalArgumentException( +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified at the same time." + ) +case (None, None) => + throw new IllegalArgumentException( +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) +case (Some(name), None) => + if (name.isEmpty) { +throw new IllegalArgumentException(s"Option '${JDBC_TABLE_NAME}' can not be empty.") --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198021225 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,32 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + // table name or a table subquery. + val tableOrQuery = (parameters.get(JDBC_TABLE_NAME), parameters.get(JDBC_QUERY_STRING)) match { +case (Some(name), Some(subquery)) => + throw new IllegalArgumentException( +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified at the same time." + ) +case (None, None) => + throw new IllegalArgumentException( +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) +case (Some(name), None) => + if (name.isEmpty) { +throw new IllegalArgumentException(s"Option '${JDBC_TABLE_NAME}' can not be empty.") + } else { +name.trim + } +case (None, Some(subquery)) => + if (subquery.isEmpty) { +throw new IllegalArgumentException(s"Option `${JDBC_QUERY_STRING}` can not be empty.") --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198021157 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,32 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + // table name or a table subquery. + val tableOrQuery = (parameters.get(JDBC_TABLE_NAME), parameters.get(JDBC_QUERY_STRING)) match { +case (Some(name), Some(subquery)) => + throw new IllegalArgumentException( +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified at the same time." + ) +case (None, None) => + throw new IllegalArgumentException( +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." --- End diff -- nit: remove braces: '{' and '}' --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198019979 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -174,3 +209,25 @@ object JDBCOptions { val JDBC_TXN_ISOLATION_LEVEL = newOption("isolationLevel") val JDBC_SESSION_INIT_STATEMENT = newOption("sessionInitStatement") } + +class JdbcOptionsInWrite( +@transient override val parameters: CaseInsensitiveMap[String]) + extends JDBCOptions(parameters) { + + import JDBCOptions._ + + def this(parameters: Map[String, String]) = this(CaseInsensitiveMap(parameters)) + + def this(url: String, table: String, parameters: Map[String, String]) = { +this(CaseInsensitiveMap(parameters ++ Map( + JDBCOptions.JDBC_URL -> url, + JDBCOptions.JDBC_TABLE_NAME -> table))) + } + + require( +parameters.get(JDBC_TABLE_NAME).isDefined, +s"Option '${JDBCOptions.JDBC_TABLE_NAME}' is required. " + + s"Option '${JDBCOptions.JDBC_QUERY_STRING}' is not applicable while writing.") + + val destinationTable = parameters(JDBC_TABLE_NAME) +} --- End diff -- Is it bad to change `destinationTable` to `table` for simplicity? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198019710 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -150,6 +183,7 @@ class JDBCOptions( } object JDBCOptions { --- End diff -- Move `JDBCOptions` in the end of this file --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198018682 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,32 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + // table name or a table subquery. + val tableOrQuery = (parameters.get(JDBC_TABLE_NAME), parameters.get(JDBC_QUERY_STRING)) match { +case (Some(name), Some(subquery)) => + throw new IllegalArgumentException( +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified at the same time." + ) +case (None, None) => + throw new IllegalArgumentException( +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) +case (Some(name), None) => + if (name.isEmpty) { +throw new IllegalArgumentException(s"Option '${JDBC_TABLE_NAME}' can not be empty.") + } else { +name.trim + } +case (None, Some(subquery)) => + if (subquery.isEmpty) { +throw new IllegalArgumentException(s"Option `${JDBC_QUERY_STRING}` can not be empty.") + } else { +s"(${subquery}) __SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}" + } + } - // --- End diff -- Don't need to remove this --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r198018360 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -174,3 +209,25 @@ object JDBCOptions { val JDBC_TXN_ISOLATION_LEVEL = newOption("isolationLevel") val JDBC_SESSION_INIT_STATEMENT = newOption("sessionInitStatement") } + +class JdbcOptionsInWrite( --- End diff -- How about adding `JdbcOptionsInRead` for read-only options? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197631699 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -109,6 +134,20 @@ class JDBCOptions( s"When reading JDBC data sources, users need to specify all or none for the following " + s"options: '$JDBC_PARTITION_COLUMN', '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', " + s"and '$JDBC_NUM_PARTITIONS'") + + require(!(query.isDefined && partitionColumn.isDefined), +s""" + |Options '$JDBC_QUERY_STRING' and '$JDBC_PARTITION_COLUMN' can not be specified together. + |Please define the query using `$JDBC_TABLE_NAME` option instead and make sure to qualify + |the partition columns using the supplied subquery alias to resolve any ambiguity. + |Example : + |spark.read.format("jdbc") + |.option("dbtable", "(select c1, c2 from t1) as subq") + |.option("partitionColumn", "subq.c1" + |.load() + """.stripMargin + ) --- End diff -- I was thinking the case below (this can be accepted in `dbtable` and `query` is not?): ``` postgres=# select * from t; c0 | c1 | c2 ++ 1 | 1 | 1 2 | 2 | 2 3 | 3 | 3 (3 rows) scala> :paste val df = spark.read .format("jdbc") .option("driver", "org.postgresql.Driver") .option("url", "jdbc:postgresql://localhost:5432/postgres?user=maropu") .option("dbtable", "(select c0 p0, c1 p1, c2 p2 from t where c0 > 1) t") .option("partitionColumn", "p2") .option("lowerBound", "1") .option("upperBound", "3") .option("numPartitions", "2") .load() scala> df.show +---+---+---+ | p0| p1| p2| +---+---+---+ | 2| 2| 2| | 3| 3| 3| +---+---+---+ ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620329 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) --- End diff -- Another option is to follow what we are doing in another PR: https://github.com/apache/spark/pull/21247 ? We are facing the same issue there. The options are shared by both read and write paths. However, the limitations are different. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620436 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableOrQuery = tableName.map(_.trim).getOrElse { --- End diff -- Using a tuple match here? ``` (tableName, query) match { case ... } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620455 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -109,6 +134,20 @@ class JDBCOptions( s"When reading JDBC data sources, users need to specify all or none for the following " + s"options: '$JDBC_PARTITION_COLUMN', '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', " + s"and '$JDBC_NUM_PARTITIONS'") + + require(!(query.isDefined && partitionColumn.isDefined), +s""" + |Options '$JDBC_QUERY_STRING' and '$JDBC_PARTITION_COLUMN' can not be specified together. + |Please define the query using `$JDBC_TABLE_NAME` option instead and make sure to qualify + |the partition columns using the supplied subquery alias to resolve any ambiguity. + |Example : + |spark.read.format("jdbc") + |.option("dbtable", "(select c1, c2 from t1) as subq") + |.option("partitionColumn", "subq.c1" + |.load() --- End diff -- Great! Please add them to the doc as I mentioned above? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620140 --- Diff: docs/sql-programming-guide.md --- @@ -1302,9 +1302,20 @@ the following case-insensitive options: dbtable - The JDBC table that should be read. Note that anything that is valid in a FROM clause of - a SQL query can be used. For example, instead of a full table you could also use a - subquery in parentheses. + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. Its not + allowed to specify `dbtable` and `query` options at the same time. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign a alias to the subquery clause. + As an example, spark will issue a query of the following form to the datasource. + SELECT columns FROM (user_specified_query) spark_gen_alias + Its not allowed to specify `dbtable` and `query` options at the same time. --- End diff -- Also document the limitation of `query` when having `partitionColumn` and the workaround? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620099 --- Diff: docs/sql-programming-guide.md --- @@ -1302,9 +1302,20 @@ the following case-insensitive options: dbtable - The JDBC table that should be read. Note that anything that is valid in a FROM clause of - a SQL query can be used. For example, instead of a full table you could also use a - subquery in parentheses. + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. Its not + allowed to specify `dbtable` and `query` options at the same time. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign a alias to the subquery clause. + As an example, spark will issue a query of the following form to the datasource. --- End diff -- `datasource` -> `JDBC source` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620021 --- Diff: docs/sql-programming-guide.md --- @@ -1302,9 +1302,20 @@ the following case-insensitive options: dbtable - The JDBC table that should be read. Note that anything that is valid in a FROM clause of - a SQL query can be used. For example, instead of a full table you could also use a - subquery in parentheses. + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. Its not --- End diff -- Nit: `It is` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620115 --- Diff: docs/sql-programming-guide.md --- @@ -1302,9 +1302,20 @@ the following case-insensitive options: dbtable - The JDBC table that should be read. Note that anything that is valid in a FROM clause of - a SQL query can be used. For example, instead of a full table you could also use a - subquery in parentheses. + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. Its not + allowed to specify `dbtable` and `query` options at the same time. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign a alias to the subquery clause. + As an example, spark will issue a query of the following form to the datasource. + SELECT columns FROM (user_specified_query) spark_gen_alias + Its not allowed to specify `dbtable` and `query` options at the same time. --- End diff -- `Its` -> `it is` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620496 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -58,6 +58,10 @@ class JdbcRelationProvider extends CreatableRelationProvider parameters: Map[String, String], df: DataFrame): BaseRelation = { val options = new JDBCOptions(parameters) +require( + options.tableName.isDefined, + s"Option '${JDBCOptions.JDBC_TABLE_NAME}' is required. " + +s"Option '${JDBCOptions.JDBC_QUERY_STRING}' is not applicable while writing.") --- End diff -- Let us create a `JDBCOptionsInWrite`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620483 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -109,6 +134,20 @@ class JDBCOptions( s"When reading JDBC data sources, users need to specify all or none for the following " + s"options: '$JDBC_PARTITION_COLUMN', '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', " + s"and '$JDBC_NUM_PARTITIONS'") + + require(!(query.isDefined && partitionColumn.isDefined), +s""" + |Options '$JDBC_QUERY_STRING' and '$JDBC_PARTITION_COLUMN' can not be specified together. + |Please define the query using `$JDBC_TABLE_NAME` option instead and make sure to qualify + |the partition columns using the supplied subquery alias to resolve any ambiguity. + |Example : + |spark.read.format("jdbc") + |.option("dbtable", "(select c1, c2 from t1) as subq") + |.option("partitionColumn", "subq.c1" + |.load() + """.stripMargin + ) --- End diff -- @maropu The new option `query` is just a syntactic sugar for simplifying the work from many basic JDBC users. We can improve it in the future. For example, parsing the user-specified query and make all the other options work. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620091 --- Diff: docs/sql-programming-guide.md --- @@ -1302,9 +1302,20 @@ the following case-insensitive options: dbtable - The JDBC table that should be read. Note that anything that is valid in a FROM clause of - a SQL query can be used. For example, instead of a full table you could also use a - subquery in parentheses. + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. Its not + allowed to specify `dbtable` and `query` options at the same time. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign a alias to the subquery clause. --- End diff -- `a alias` -> `an alias` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620340 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." --- End diff -- -> `can not be specified at the same time`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197620384 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableOrQuery = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subQuery if subQuery.nonEmpty => s"(${subQuery}) spark_gen_${curId.getAndIncrement()}" --- End diff -- `__SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}__`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197356603 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) --- End diff -- yea, it's okay to depend on xiao's opnion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197355161 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) --- End diff -- @maropu Thank you for taking the time to think about this throughly. A couple of questions/comments. 1) Looks like for read path we give precedence to dbtable over query. I feel its good to explicitly disallow this with a clear message in case of an ambiguity. 2) Usage of lazy here (especially to trigger errors) makes me a little nervous. Like if we want to introduce a debug statement to print the variables in side the QueryOptions class, things will not work any more, right ? Thats the reason, i had opted to check for the "invalid query option in write path" in the write function itself (i.e when i am sure of the calling context). Perhaps that how its used every where in which case it may be okay to follow the same approach here. I am okay with this. Lets get some opinion from @gatorsmile. Once i have the final set of comments, i will make the changes. Thanks again. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197349327 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) --- End diff -- I think, since the `tableName` and `query` variables don't need to be exposed to other classes, can we remove them? Btw, I feel sharing the `tableName` variable int both write/read paths makes code some complicated, so how about splitting the variable into two part: `tableOrQuery` for reading and `outputName` for writing? e.g., https://github.com/apache/spark/commit/d62372ab0e855c359122609f1805ce83661d510e --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r197347130 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableOrQuery = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subQuery if subQuery.nonEmpty => s"(${subQuery}) spark_gen_${curId.getAndIncrement()}" + case subQuery => subQuery +} + } + + require(tableOrQuery.nonEmpty, +s"Empty string is not allowed in either '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' options" + ) + - // --- End diff -- nit: revert this line --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196975167 --- Diff: docs/sql-programming-guide.md --- @@ -1302,9 +1302,18 @@ the following case-insensitive options: dbtable - The JDBC table that should be read. Note that anything that is valid in a FROM clause of - a SQL query can be used. For example, instead of a full table you could also use a - subquery in parentheses. + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign a alias to the subquery clause. + As an example, spark will issue a query of the following form to the datasource. + SELECT columns FROM (user_specified_query) spark_generated_alias --- End diff -- @viirya OK, i will add this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196897983 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" + case subq => subq +} + } + + require(tableExpression.nonEmpty, --- End diff -- @gengliangwang I see your point. Does this read better to you ? ``` require(tableOrQuery.nonEmpty, s"Empty string is not allowed in either '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' options" ) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196895138 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) --- End diff -- @gengliangwang Thanks.. Actually i had tried a couple of different ways. Some how i found this a little hard to follow when i embed the error message. I like to check things upfront along with comments on top easy to follow. But if others find this easy to follow as well, then i will change. ```SQL val tableExpression = if (parameters.isDefinedAt(JDBC_TABLE_NAME)) { require(!parameters.isDefinedAt(JDBC_QUERY_STRING), s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." ) parameters.get(JDBC_TABLE_NAME).get.trim } else { require(parameters.isDefinedAt(JDBC_QUERY_STRING), s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." ) s"(${parameters.get(JDBC_QUERY_STRING)}) ${curId.getAndIncrement()}" } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196847962 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { --- End diff -- @viirya ok. i will change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196848042 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" --- End diff -- @viirya Will change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196847882 --- Diff: docs/sql-programming-guide.md --- @@ -1302,9 +1302,18 @@ the following case-insensitive options: dbtable - The JDBC table that should be read. Note that anything that is valid in a FROM clause of - a SQL query can be used. For example, instead of a full table you could also use a - subquery in parentheses. + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign a alias to the subquery clause. --- End diff -- @viirya I think its better to let users know how we generate the from clause. That way they can choose to qualify the partition columns if needed. However, if you strongly feel otherwise, i will remove from doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196846549 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" --- End diff -- @maropu Yeah. we need an alias. Systems like postgress require a mandatory table subquery alias. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196788131 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" + case subq => subq --- End diff -- Ok. I saw you forbidding it later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196787410 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" --- End diff -- nit: `subq` -> `subQuery`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196787193 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { --- End diff -- `expression` sounds a bit confusing. `tableOrQuery`? `tableSource`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196785859 --- Diff: docs/sql-programming-guide.md --- @@ -1302,9 +1302,18 @@ the following case-insensitive options: dbtable - The JDBC table that should be read. Note that anything that is valid in a FROM clause of - a SQL query can be used. For example, instead of a full table you could also use a - subquery in parentheses. + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign a alias to the subquery clause. + As an example, spark will issue a query of the following form to the datasource. + SELECT columns FROM (user_specified_query) spark_generated_alias --- End diff -- We can mention `dbtable` and `query` can't be specified at the same time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196786172 --- Diff: docs/sql-programming-guide.md --- @@ -1302,9 +1302,18 @@ the following case-insensitive options: dbtable - The JDBC table that should be read. Note that anything that is valid in a FROM clause of - a SQL query can be used. For example, instead of a full table you could also use a - subquery in parentheses. + The JDBC table that should be read from or written into. Note that when using it in the read + path anything that is valid in a FROM clause of a SQL query can be used. + For example, instead of a full table you could also use a subquery in parentheses. + + + +query + + A query that will be used to read data into Spark. The specified query will be parenthesized and used + as a subquery in the FROM clause. Spark will also assign a alias to the subquery clause. --- End diff -- Is it necessary to mention this query alias? Meaningful to end users? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196783966 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" + case subq => subq --- End diff -- hmm, an empty subquery? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196732502 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" --- End diff -- yea, simpler, better. Btw, we essentially need the alias name for a query? When we currently describe a query in `dbtable`, it seems we don't need the name? https://github.com/apache/spark/blob/bc111463a766a5619966a282fbe0fec991088ceb/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L1196 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196658307 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) --- End diff -- Personally I prefer: ``` val tableExpression = if (parameters.isDefinedAt(JDBC_TABLE_NAME)) { require(!parameters.isDefinedAt(JDBC_QUERY_STRING), "...") parameters.get(JDBC_TABLE_NAME).get.trim } else { require(parameters.isDefinedAt(JDBC_QUERY_STRING), "...") s"(${parameters.get(JDBC_QUERY_STRING)}) ${curId.getAndIncrement()}" } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196657947 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" + case subq => subq +} + } + + require(tableExpression.nonEmpty, --- End diff -- The error check and error message here are confusing. It seems telling user that the two options can be both specified. Maybe we should just check the defined one and improve the error message. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196634511 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala --- @@ -1206,4 +1207,92 @@ class JDBCSuite extends SparkFunSuite }.getMessage assert(errMsg.contains("Statement was canceled or the session timed out")) } + + test("query JDBC option - negative tests") { +val query = "SELECT * FROM test.people WHERE theid = 1" +// load path +val e1 = intercept[RuntimeException] { + val df = spark.read.format("jdbc") +.option("Url", urlWithUserAndPass) +.option("query", query) --- End diff -- @HyukjinKwon Thanks.. I will update the doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196623246 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala --- @@ -1206,4 +1207,92 @@ class JDBCSuite extends SparkFunSuite }.getMessage assert(errMsg.contains("Statement was canceled or the session timed out")) } + + test("query JDBC option - negative tests") { +val query = "SELECT * FROM test.people WHERE theid = 1" +// load path +val e1 = intercept[RuntimeException] { + val df = spark.read.format("jdbc") +.option("Url", urlWithUserAndPass) +.option("query", query) --- End diff -- shall we update https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#jdbc-to-other-databases too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196490214 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" --- End diff -- @maropu Don't mind using a constant name ? "spark_gen_alias" ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196489749 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -109,6 +134,20 @@ class JDBCOptions( s"When reading JDBC data sources, users need to specify all or none for the following " + s"options: '$JDBC_PARTITION_COLUMN', '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', " + s"and '$JDBC_NUM_PARTITIONS'") + + require(!(query.isDefined && partitionColumn.isDefined), +s""" + |Options '$JDBC_QUERY_STRING' and '$JDBC_PARTITION_COLUMN' can not be specified together. + |Please define the query using `$JDBC_TABLE_NAME` option instead and make sure to qualify + |the partition columns using the supplied subquery alias to resolve any ambiguity. + |Example : + |spark.read.format("jdbc") + |.option("dbtable", "(select c1, c2 from t1) as subq") + |.option("partitionColumn", "subq.c1" + |.load() + """.stripMargin + ) --- End diff -- @maropu So since the we auto generate a subquery alias here for easy of use, we r disallowing the query option together with partition columns. As users wouldn't know how to qualify the partition columns given the suquery alias is generated implicitly. In this case, we ask them to use the existing dbtable to specify the query where they are in control to specify the alias themselves. Another option i considered is to introduce "queryAlias" as another option. But thought to avoid it for simplicity. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196487627 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) --- End diff -- @maropu These two requires are using tableName and query which is computed in lines before. Thats why i have placed these two requires after. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196407423 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) + + // table name or a table expression. + val tableExpression = tableName.map(_.trim).getOrElse { +// We have ensured in the code above that either dbtable or query is specified. +query.get match { + case subq if subq.nonEmpty => s"(${subq}) spark_gen_${curId.getAndIncrement()}" --- End diff -- We need `curId`? A constant name is bad? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196406329 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -109,6 +134,20 @@ class JDBCOptions( s"When reading JDBC data sources, users need to specify all or none for the following " + s"options: '$JDBC_PARTITION_COLUMN', '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', " + s"and '$JDBC_NUM_PARTITIONS'") + + require(!(query.isDefined && partitionColumn.isDefined), +s""" + |Options '$JDBC_QUERY_STRING' and '$JDBC_PARTITION_COLUMN' can not be specified together. + |Please define the query using `$JDBC_TABLE_NAME` option instead and make sure to qualify + |the partition columns using the supplied subquery alias to resolve any ambiguity. + |Example : + |spark.read.format("jdbc") + |.option("dbtable", "(select c1, c2 from t1) as subq") + |.option("partitionColumn", "subq.c1" + |.load() + """.stripMargin + ) --- End diff -- Can't we define these two parameters simultaneously? How about the case where columns in query output have a partition column? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196403496 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) + + require( +!(tableName.isDefined && query.isDefined), +s"Both '$JDBC_TABLE_NAME' and '$JDBC_QUERY_STRING' can not be specified." + ) --- End diff -- Can you put these `require` in the head of this section (line 68)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21590#discussion_r196402958 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -65,13 +65,38 @@ class JDBCOptions( // Required parameters // require(parameters.isDefinedAt(JDBC_URL), s"Option '$JDBC_URL' is required.") - require(parameters.isDefinedAt(JDBC_TABLE_NAME), s"Option '$JDBC_TABLE_NAME' is required.") + // a JDBC URL val url = parameters(JDBC_URL) - // name of table - val table = parameters(JDBC_TABLE_NAME) + val tableName = parameters.get(JDBC_TABLE_NAME) + val query = parameters.get(JDBC_QUERY_STRING) + // Following two conditions make sure that : + // 1. One of the option (dbtable or query) must be specified. + // 2. Both of them can not be specified at the same time as they are conflicting in nature. + require( +tableName.isDefined || query.isDefined, +s"Option '$JDBC_TABLE_NAME' or '${JDBC_QUERY_STRING}' is required." + ) --- End diff -- Probably, it'd better to put these `require`s in the `Required parameters` section. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21590: [SPARK-24423][SQL] Add a new option for JDBC sour...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/21590 [SPARK-24423][SQL] Add a new option for JDBC sources ## What changes were proposed in this pull request? Here is the description in the JIRA - Currently, our JDBC connector provides the option `dbtable` for users to specify the to-be-loaded JDBC source table. ```SQL val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", "dbName.tableName") .options(jdbcCredentials: Map) .load() ``` Normally, users do not fetch the whole JDBC table due to the poor performance/throughput of JDBC. Thus, they normally just fetch a small set of tables. For advanced users, they can pass a subquery as the option. ```SQL val query = """ (select * from tableName limit 10) as tmp """ val jdbcDf = spark.read .format("jdbc") .option("*dbtable*", query) .options(jdbcCredentials: Map) .load() ``` However, this is straightforward to end users. We should simply allow users to specify the query by a new option `query`. We will handle the complexity for them. ```SQL val query = """select * from tableName limit 10""" val jdbcDf = spark.read .format("jdbc") .option("*{color:#ff}query{color}*", query) .options(jdbcCredentials: Map) .load() ``` ## How was this patch tested? Added tests in JDBCSuite and JDBCWriterSuite. Also tested against MySQL, Postgress, Oracle, DB2 (using docker infrastructure) to make sure there are no syntax issues. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dilipbiswal/spark SPARK-24423 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21590.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21590 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org