[GitHub] [spark] hvanhovell commented on a diff in pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-08 Thread via GitHub


hvanhovell commented on code in PR #40291:
URL: https://github.com/apache/spark/pull/40291#discussion_r1130251210


##
connector/connect/server/pom.xml:
##
@@ -199,6 +199,12 @@
       <version>${tomcat.annotations.api.version}</version>
       <scope>provided</scope>
     </dependency>
+

Review Comment:
   Why the move?






[GitHub] [spark] hvanhovell commented on a diff in pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-07 Thread via GitHub


hvanhovell commented on code in PR #40291:
URL: https://github.com/apache/spark/pull/40291#discussion_r1128152776


##
connector/connect/common/src/main/protobuf/spark/connect/commands.proto:
##
@@ -116,6 +116,7 @@ message WriteOperation {
   TABLE_SAVE_METHOD_UNSPECIFIED = 0;
   TABLE_SAVE_METHOD_SAVE_AS_TABLE = 1;
   TABLE_SAVE_METHOD_INSERT_INTO = 2;
+  TABLE_SAVE_METHOD_SAVE_AS_JDBC_TABLE = 3;

Review Comment:
   It is calling save(), right? Why not call it `TABLE_SAVE_METHOD_SAVE`?
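
   For orientation, these enum values line up with DataFrameWriter entry points. A rough Scala sketch of the client-side mapping follows; the generated proto class path is assumed from the usual protobuf-java conventions and is not the exact PR code:

       import org.apache.spark.connect.proto.WriteOperation.SaveTable.TableSaveMethod

       // Hypothetical helper: which save method a given writer call would encode.
       // TABLE_SAVE_METHOD_SAVE_AS_JDBC_TABLE exists only on the PR branch.
       def saveMethodFor(call: String): TableSaveMethod = call match {
         case "saveAsTable" => TableSaveMethod.TABLE_SAVE_METHOD_SAVE_AS_TABLE
         case "insertInto"  => TableSaveMethod.TABLE_SAVE_METHOD_INSERT_INTO
         // DataFrameWriter.jdbc() ultimately behaves like save(), which is what
         // motivates the suggestion of a plain TABLE_SAVE_METHOD_SAVE instead.
         case "jdbc"        => TableSaveMethod.TABLE_SAVE_METHOD_SAVE_AS_JDBC_TABLE
         case other         => throw new IllegalArgumentException(other)
       }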






[GitHub] [spark] hvanhovell commented on a diff in pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-07 Thread via GitHub


hvanhovell commented on code in PR #40291:
URL: https://github.com/apache/spark/pull/40291#discussion_r1128061371


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:
##
@@ -345,6 +345,48 @@ final class DataFrameWriter[T] private[sql] (ds: Dataset[T]) {
     })
   }
 
+  /**
+   * Saves the content of the `DataFrame` to an external database table via JDBC. In the case the
+   * table already exists in the external database, behavior of this function depends on the save
+   * mode, specified by the `mode` function (default to throwing an exception).
+   *
+   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
+   * your external database systems.
+   *
+   * JDBC-specific option and parameter documentation for storing tables via JDBC in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
+   *   a "user" and "password" property should be included. "batchsize" can be used to control the
+   *   number of rows per insert. "isolationLevel" can be one of "NONE", "READ_COMMITTED",
+   *   "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding to standard
+   *   transaction isolation levels defined by JDBC's Connection object, with default of
+   *   "READ_UNCOMMITTED".
+   * @since 3.4.0
+   */
+  def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
+    // connectionProperties should override settings in extraOptions.
+    this.extraOptions ++= connectionProperties.asScala
+    // explicit url and dbtable should override all
+    this.extraOptions ++= Seq("url" -> url, "dbtable" -> table)
+    format("jdbc").saveAsJdbcTable(table)
+  }
+
+  private def saveAsJdbcTable(tableName: String): Unit = {

Review Comment:
   Why do we need a separate method? There is only one method using it.
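
   For context, a minimal usage sketch of the method under review; the JDBC URL, table name, and credentials below are placeholders, and `df` is assumed to be an existing `DataFrame`:

       import java.util.Properties

       val props = new Properties()
       props.setProperty("user", "sa")                       // placeholder credentials
       props.setProperty("password", "secret")
       props.setProperty("batchsize", "1000")                // rows per INSERT batch
       props.setProperty("isolationLevel", "READ_COMMITTED") // JDBC transaction isolation

       // connectionProperties override options set earlier via option(...),
       // and the explicit url/dbtable arguments override everything else.
       df.write
         .mode("append") // the default SaveMode errors out if the table already exists
         .jdbc("jdbc:postgresql://localhost:5432/testdb", "public.people", props)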






[GitHub] [spark] hvanhovell commented on a diff in pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-06 Thread via GitHub


hvanhovell commented on code in PR #40291:
URL: https://github.com/apache/spark/pull/40291#discussion_r1126358047


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:
##
@@ -345,6 +345,37 @@ final class DataFrameWriter[T] private[sql] (ds: Dataset[T]) {
     })
   }
 
+  /**
+   * Saves the content of the `DataFrame` to an external database table via JDBC. In the case the
+   * table already exists in the external database, behavior of this function depends on the save
+   * mode, specified by the `mode` function (default to throwing an exception).
+   *
+   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
+   * your external database systems.
+   *
+   * JDBC-specific option and parameter documentation for storing tables via JDBC in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
+   *   a "user" and "password" property should be included. "batchsize" can be used to control the
+   *   number of rows per insert. "isolationLevel" can be one of "NONE", "READ_COMMITTED",
+   *   "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding to standard
+   *   transaction isolation levels defined by JDBC's Connection object, with default of
+   *   "READ_UNCOMMITTED".
+   * @since 3.4.0
+   */
+  def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
+    // connectionProperties should override settings in extraOptions.
Review Comment:
   Server side. There are a couple of reasons for this:
   - The server cannot trust the client to implement the verification properly. I am sure we will get it right for Scala and Python, but there are potentially a plethora of other frontends that need to do the same.
   - Keeping the client simple and reducing duplication. If we need to do this for every client, we'll end up with a lot of duplication and increased client complexity.


