[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-07 Thread via GitHub


hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1128154890


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -140,6 +140,11 @@ message Read {
 
 // (Optional) A list of path for file-system backed data sources.
 repeated string paths = 4;
+
+// (Optional) Condition in the where clause for each partition.
+//
+// Only work for JDBC data source.

Review Comment:
   `This is only supported by the JDBC data source.`





[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-06 Thread via GitHub


hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1126368202


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -684,26 +686,43 @@ class SparkConnectPlanner(val session: SparkSession) {
      case proto.Read.ReadTypeCase.DATA_SOURCE =>
        val localMap = CaseInsensitiveMap[String](rel.getDataSource.getOptionsMap.asScala.toMap)
        val reader = session.read
-       if (rel.getDataSource.hasFormat) {
-         reader.format(rel.getDataSource.getFormat)
-       }
        localMap.foreach { case (key, value) => reader.option(key, value) }
-       if (rel.getDataSource.hasSchema && rel.getDataSource.getSchema.nonEmpty) {
-
-         DataType.parseTypeWithFallback(
-           rel.getDataSource.getSchema,
-           StructType.fromDDL,
-           fallbackParser = DataType.fromJson) match {
-           case s: StructType => reader.schema(s)
-           case other => throw InvalidPlanInput(s"Invalid schema $other")
+
+       if (rel.getDataSource.getPredicatesCount == 0) {

Review Comment:
   Please make the logic a bit like this:
   ```scala
   if (format == "jdbc" && rel.getDataSource.getPredicatesCount > 0) {
     // Plan JDBC with predicates
   } else if (rel.getDataSource.getPredicatesCount == 0) {
     // Plan datasource
   } else {
     throw InvalidPlanInput(s"Predicates are not supported for $format data sources.")
   }
   ```
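
For reference, the `// Plan JDBC with predicates` branch could be filled in roughly as below. This is only a sketch, not the final implementation: it assumes the JDBC url and table name arrive as ordinary data source options under the keys `url` and `dbtable` (the names the built-in JDBC source uses), and it reuses the existing `DataFrameReader.jdbc` overload that takes predicates.

```scala
// Sketch only. Relies on the file's existing `asScala` import
// (scala.collection.JavaConverters._).
val predicates = rel.getDataSource.getPredicatesList.asScala.toArray
val url = localMap.getOrElse("url",
  throw InvalidPlanInput("The 'url' option is required for JDBC."))
val table = localMap.getOrElse("dbtable",
  throw InvalidPlanInput("The 'dbtable' option is required for JDBC."))
val properties = new java.util.Properties()
localMap.foreach { case (k, v) => properties.setProperty(k, v) }
// DataFrameReader.jdbc creates one partition of the result per predicate.
session.read.jdbc(url, table, predicates, properties).queryExecution.analyzed
```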






[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-06 Thread via GitHub


hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1126365135


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -250,6 +250,46 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
     jdbc(url, table, connectionProperties)
   }
 
+  /**
+   * Construct a `DataFrame` representing the database table accessible via JDBC URL url named
+   * table using connection properties. The `predicates` parameter gives a list of expressions
+   * suitable for inclusion in WHERE clauses; each one defines one partition of the `DataFrame`.
+   *
+   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
+   * your external database systems.
+   *
+   * You can find the JDBC-specific option and parameter documentation for reading tables via JDBC
+   * in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param predicates
+   *   Condition in the where clause for each partition.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
+   *   a "user" and "password" property should be included. "fetchsize" can be used to control the
+   *   number of rows per fetch.
+   * @since 3.4.0
+   */
+  def jdbc(
+      url: String,
+      table: String,
+      predicates: Array[String],
+      connectionProperties: Properties): DataFrame = {
+    sparkSession.newDataFrame { builder =>

Review Comment:
   Can you please set the format to JDBC? We are now relying on the presence of predicates to figure out that something is a JDBC table, which leans far too heavily on the client doing the right thing. For example, what would happen if you set format = parquet and still defined predicates?
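
A minimal sketch of the client side with the format set explicitly, assuming the generated protobuf builders follow the usual protobuf-java naming (`getReadBuilder`, `getDataSourceBuilder`, `setFormat`, `addAllPredicates`, `putOptions`):

```scala
import scala.collection.JavaConverters._

sparkSession.newDataFrame { builder =>
  val dataSource = builder.getReadBuilder.getDataSourceBuilder
  // State the format explicitly instead of letting the server infer it
  // from the presence of predicates.
  dataSource.setFormat("jdbc")
  dataSource.addAllPredicates(predicates.toSeq.asJava)
  // url, table and the connection properties travel as plain options.
  dataSource.putOptions("url", url)
  dataSource.putOptions("dbtable", table)
  connectionProperties.asScala.foreach { case (k, v) =>
    dataSource.putOptions(k, v)
  }
}
```

With the format pinned to `jdbc`, the server can reject predicates on any other format instead of guessing.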






[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-06 Thread via GitHub


hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1126360940


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -140,6 +140,9 @@ message Read {
 
 // (Optional) A list of path for file-system backed data sources.
 repeated string paths = 4;
+
+// (Optional) Condition in the where clause for each partition.

Review Comment:
   Please add the comment that this currently only works for jdbc.






[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-05 Thread via GitHub


hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1125835789


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -140,6 +141,21 @@ message Read {
 // (Optional) A list of path for file-system backed data sources.
 repeated string paths = 4;
   }
+
+  message PartitionedJDBC {
+// (Required) JDBC URL.
+string url = 1;
+
+// (Required) Name of the table in the external database.
+string table = 2;
+
+// (Optional) Condition in the where clause for each partition.
+repeated string predicates = 3;

Review Comment:
   Can we just put the predicates into the DataSource message?
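
For context, the user-facing call these predicates ultimately come from looks like the following (a hypothetical usage sketch; the URL, table, and credentials are made up):

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "sa")         // hypothetical credentials
props.setProperty("password", "secret")

// Each predicate becomes one partition, so this read yields a
// three-partition DataFrame.
val df = spark.read.jdbc(
  "jdbc:postgresql://db.example.com/sales", // hypothetical URL
  "orders",                                 // hypothetical table
  Array("region = 'EMEA'", "region = 'APAC'", "region = 'AMER'"),
  props)
```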


