[GitHub] spark pull request: [SPARK-11474][SQL]change fetchSize to fetchsiz...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9473 [SPARK-11474][SQL]change fetchSize to fetchsize

In DefaultDataSource.scala there is

override def createRelation(
    sqlContext: SQLContext,
    parameters: Map[String, String]): BaseRelation

The parameters argument is a CaseInsensitiveMap. After the line

parameters.foreach(kv => properties.setProperty(kv._1, kv._2))

properties holds all-lower-case keys, so fetchSize becomes fetchsize. However, the compute method in JDBCRDD has

val fetchSize = properties.getProperty("fetchSize", "0").toInt

so the fetchSize value is always 0 and never gets set correctly.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-11474 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9473.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9473

commit 31e18562643bc8e4e60dd395f5062e5e3d7660a6 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-04T07:28:30Z [SPARK-11474][SQL]change fetchSize to fetchsize

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
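The mismatch can be reproduced in a few lines. This is a standalone sketch, not Spark's actual classes: `caseInsensitive` is a hypothetical stand-in for Spark's CaseInsensitiveMap, and the two lookup helpers mirror the copy in createRelation followed by the lookup in JDBCRDD.compute.

```scala
import java.util.Properties

object FetchSizeBug {
  // Stand-in for Spark's CaseInsensitiveMap: keys are lower-cased on insert.
  def caseInsensitive(m: Map[String, String]): Map[String, String] =
    m.map { case (k, v) => (k.toLowerCase, v) }

  private def toProperties(params: Map[String, String]): Properties = {
    val properties = new Properties()
    caseInsensitive(params).foreach(kv => properties.setProperty(kv._1, kv._2))
    properties
  }

  // The camel-case lookup misses and silently falls back to the default.
  def lookedUpFetchSize(params: Map[String, String]): Int =
    toProperties(params).getProperty("fetchSize", "0").toInt

  // The lower-case key is what actually got stored.
  def storedFetchSize(params: Map[String, String]): Int =
    toProperties(params).getProperty("fetchsize", "0").toInt
}
```

With Map("fetchSize" -> "100"), lookedUpFetchSize returns 0 while storedFetchSize returns 100, which is why the PR lowers the key used in the lookup.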
[GitHub] spark pull request: [SPARK-11474]Options to jdbc load are lower ca...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/9461
[GitHub] spark pull request: [SPARK-11474]Options to jdbc load are lower ca...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9461#issuecomment-153832023 I am closing this pull request as I think I reused my previous branch; I would like to start clean. I will open a new pull request shortly. Sorry for the confusion.
[GitHub] spark pull request: Spark 11474
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9461 Spark 11474

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark_11474 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9461.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9461

commit 75b7f57b42a1f0e88258187f75ab6b36a8706b0f Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-10-09T00:49:50Z add write.mode for insertIntoJDBC when the parm overwrite is false

commit 684f4e9388f94d5102554a02c9497f23c40ef139 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-10-24T06:13:00Z Merge remote branch 'upstream/master' into spark8386

commit 495faae98d4f8d31e68d9736ef6f22aee2c514a6 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-04T05:58:34Z Merge remote branch 'upstream/master' into spark8386

commit becdb71bc0cfed3a71ec100fbcfa922728a40af8 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-04T07:28:30Z change fetchSize to fetchsize because the properties map has lower-case keys
[GitHub] spark pull request: [SPARK-8386] [SQL]add write.mode for insertInt...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9042#issuecomment-148122772 I looked at the error log; it failed at fetching changes from the remote Git repository. Does this mean something is wrong with my pull request and I need to open a new one? Sorry, I am new and not familiar with this yet. Thanks for your help!
[GitHub] spark pull request: [SPARK-8386] [SQL]add write.mode for insertInt...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9042 [SPARK-8386] [SQL]add write.mode for insertIntoJDBC when the parm overwrite is false

The fix is for JIRA https://issues.apache.org/jira/browse/SPARK-8386

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark8386 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9042.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9042

commit 75b7f57b42a1f0e88258187f75ab6b36a8706b0f Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-10-09T00:49:50Z add write.mode for insertIntoJDBC when the parm overwrite is false
[GitHub] spark pull request: [SPARK-8386] [SQL]add write.mode for insertInt...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9042#issuecomment-147592598 Has the test result come back yet?
[GitHub] spark pull request: SPARK-11778:parse table name before it is pass...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9773#issuecomment-157490181 hiveContext.table("db_name.table") works, but hiveContext.read.table("db_name.table") throws an org.apache.spark.sql.catalyst.analysis.NoSuchTableException.

In hiveContext.table("db_name.table"), the call goes through SqlParser.parseTableIdentifier(tableName), so the table name "db_name.table" is resolved to 'db_name'.'table'. Later, when building the qualified table name, the database name resolves to db_name and the table name to table, so the qualified table name is obtained correctly.

In hiveContext.read.table("db_name.table"), the table name does not go through the SQLParser, so "db_name.table" remains as is. Later, when building the qualified table name, the database name resolves to default and the table name to "db_name.table", so the qualified table name cannot be obtained correctly.
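The two code paths described above can be sketched with a toy identifier parser. The names here are hypothetical and much simpler than Spark's real SqlParser.parseTableIdentifier (which also handles quoting and escaping); the point is only the split-versus-no-split difference.

```scala
object TableIdent {
  // What parseTableIdentifier effectively does for a simple dotted name:
  // split into (database, table), defaulting the database when absent.
  def parsed(name: String): (String, String) = name.split('.') match {
    case Array(db, tbl) => (db, tbl)
    case Array(tbl)     => ("default", tbl)
    case _              => sys.error(s"unexpected table name: $name")
  }

  // What the unparsed read.table path effectively does: the whole string
  // is treated as a table name in the default database.
  def unparsed(name: String): (String, String) = ("default", name)
}
```

Looking up ("default", "db_name.table") in the metastore finds nothing, hence the NoSuchTableException.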
[GitHub] spark pull request: SPARK-11778:parse table name before it is pass...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9773 SPARK-11778:parse table name before it is passed to lookupRelation You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-11778 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9773.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9773 commit 02300e6dd64b62ea6e4b1143a2018bbd2698cdc6 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-17T19:42:37Z SPARK-11778:parse table name before it is passed to lookupRelation
[GitHub] spark pull request: [SPARK-11788][SQL] surround timestamp/date val...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9872#issuecomment-160182654 Fixed title and scala style problem. Thanks!
[GitHub] spark pull request: [SPARK-12088][SQL]check connection.isClosed be...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10095 [SPARK-12088][SQL]check connection.isClosed before calling connection…

The Java spec for java.sql.Connection has

boolean getAutoCommit() throws SQLException

Throws: SQLException - if a database access error occurs or this method is called on a closed connection

So if conn.getAutoCommit is called on a closed connection, a SQLException will be thrown. Even though the code catches the SQLException and the program can continue, I think we should check conn.isClosed before calling conn.getAutoCommit to avoid the unnecessary SQLException.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12088 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10095.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10095

commit 0efeb958ed118aba3b6c6d2649c0cec58c43131d Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-12-01T20:41:06Z [SPARK-12088][SQL]check connection.isClosed before calling connection.getAutoCommit in JDBCRDD
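The guard can be demonstrated without a database. This is a self-contained sketch, not the PR's actual diff: FakeConn is an invented stand-in for java.sql.Connection that reproduces the closed-connection behavior the spec describes, and closeQuietly models JDBCRDD's close handler.

```scala
import java.sql.SQLException

object GuardedClose {
  // Minimal stand-in for java.sql.Connection's closed-connection behavior.
  class FakeConn(private var closed: Boolean) {
    def isClosed: Boolean = closed
    def getAutoCommit: Boolean =
      if (closed) throw new SQLException("The object is already closed") else true
    def close(): Unit = closed = true
  }

  // Returns true when no SQLException had to be swallowed while closing.
  def closeQuietly(conn: FakeConn): Boolean =
    try {
      // Guard: only touch getAutoCommit on a connection known to be open.
      if (!conn.isClosed) {
        val autoCommit = conn.getAutoCommit
        if (!autoCommit) { /* the real code would conn.commit() here */ }
        conn.close()
      }
      true
    } catch {
      case _: SQLException => false
    }
}
```

Without the isClosed check, calling closeQuietly on an already-closed connection would hit the SQLException branch; with it, the method is a no-op for closed connections.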
[GitHub] spark pull request: [SPARK-12088][SQL]check connection.isClosed be...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10095#issuecomment-161219510 I guess I don't need to provide a regression test. Currently, JDBCSuite logs many warnings like the following; they will be gone after the fix.

13:11:51.738 WARN org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD: Exception closing connection
org.h2.jdbc.JdbcSQLException: The object is already closed [90007-183]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
    at org.h2.message.DbException.get(DbException.java:179)
    at org.h2.message.DbException.get(DbException.java:155)
    at org.h2.message.DbException.get(DbException.java:144)
    at org.h2.jdbc.JdbcConnection.checkClosed(JdbcConnection.java:1466)
    at org.h2.jdbc.JdbcConnection.checkClosed(JdbcConnection.java:1444)
    at org.h2.jdbc.JdbcConnection.getAutoCommit(JdbcConnection.java:447)
[GitHub] spark pull request: [SPARK-12391][SQL]JDBC OR operator push down
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10347 [SPARK-12391][SQL]JDBC OR operator push down You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12391 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10347.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10347 commit a7c4a1ac41db4f5afa8f6e990ca385f097f540ea Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-17T06:15:44Z [SPARK-12391][SQL]JDBC OR operator push down
[GitHub] spark pull request: [SPARK-12387][SQL]JDBC IN operator push down
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10345 [SPARK-12387][SQL]JDBC IN operator push down

This will push down the SQL IN operator, such as the following, to the JDBC data source:

SELECT column_name(s) FROM table_name WHERE column_name IN (value1, value2, ...)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12387 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10345.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10345

commit bb83a9ca666184e9a6f246f295d20f279bbd8a79 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-12-17T05:32:07Z [SPARK-12387][SQL]JDBC IN operator push down
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10262#issuecomment-164680145 @andrewor14 Thanks a lot for your comment. I will change to what you suggested. In the same method, the DecimalConversion case has similar code. Shall I change that one too?
[GitHub] spark pull request: [SPARK-12387][SQL]JDBC IN operator push down
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/10345#discussion_r47965769

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -269,6 +269,13 @@ private[sql] class JDBCRDD(
     case stringValue: String => s"'${escapeSql(stringValue)}'"
     case timestampValue: Timestamp => "'" + timestampValue + "'"
     case dateValue: Date => "'" + dateValue + "'"
+    case objectValue: Array[Object] => {
+      val str = objectValue.map {
+        case string: String => "'" + string + "'"
--- End diff --

@bomeng Thanks a lot for your comment. I will change.
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10262 [SPARK-12270][SQL]remove empty space after getString from database

{code}
conn.prepareStatement("create table people (name char(32))").executeUpdate()
conn.prepareStatement("insert into people values ('fred')").executeUpdate()
sql(
  s"""
    |CREATE TEMPORARY TABLE foobar
    |USING org.apache.spark.sql.jdbc
    |OPTIONS (url '$url', dbtable 'PEOPLE', user 'testuser', password 'testpassword')
  """.stripMargin.replaceAll("\n", " "))
val df = sqlContext.sql("SELECT * FROM foobar WHERE NAME = 'fred'")
{code}

I expect to see one row with content 'fred' in df. However, no row is returned. If I change the data type to varchar(32) in the create table DDL, I get the row back correctly.

The cause of the problem is that DB2 defines char(num) as a fixed-length character string. So with char(32), when running "SELECT * FROM foobar WHERE NAME = 'fred'", DB2 returns 'fred' padded with 28 spaces; Spark does not treat 'fred' padded with spaces as equal to 'fred', so df has no rows. With varchar(32), DB2 returns just 'fred' for the select statement and df has the right row. In order to make DB2 char(num) work with Spark, I suggest changing the Spark code to trim the trailing spaces after getting the data from the database.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12270 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10262.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10262 commit 63a3d835119ace22a359af03c267451aa68cc570 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-08T07:41:25Z [SPARK-12270][SQL]remove empty space after getString from database
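The padding behavior described in the PR can be reproduced without DB2. The helpers below are illustrative only (asChar32 imitates what a DB2 CHAR(32) column hands back through ResultSet.getString, per the description above; they are not Spark's actual code):

```scala
object CharPadding {
  // A DB2 CHAR(32) value comes back right-padded with spaces to length 32.
  def asChar32(s: String): String = s + " " * (32 - s.length)

  // The proposed fix: trim trailing spaces after reading the value.
  def trimmedRead(s: String): String = asChar32(s).replaceAll("\\s+$", "")
}
```

asChar32("fred") has length 32 and no longer equals "fred", which is why the equality filter matched nothing; trimmedRead("fred") restores the expected equality.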
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10505#issuecomment-168901512 @rxin I am not sure if my approach is OK. Could you please take a quick look when you have time and let me know what you think? Thank you very much for your help!
[GitHub] spark pull request: [SPARK-12391][SQL]JDBC OR operator push down
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10347
[GitHub] spark pull request: [SPARK-12409][SQL]JDBC AND operator push down
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10369
[GitHub] spark pull request: [SPARK-12409][SQL]JDBC AND operator push down
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10369#issuecomment-165850180 Close for now. Will put the filter changes in one PR so it's easier to merge. Sorry for the inconvenience.
[GitHub] spark pull request: [SPARK-12409][SQL]JDBC AND operator push down
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10369#issuecomment-165848876 Sorry for the trouble. I should have put everything in one PR. Will do that now.
[GitHub] spark pull request: [SPARK-12387][SQL]JDBC IN operator push down
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10345#issuecomment-165849528 Close for now. Will put the filter changes in one PR so it's easier to merge.
[GitHub] spark pull request: [SPARK-12387][SQL]JDBC IN operator push down
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10345
[GitHub] spark pull request: [SPARK-12391][SQL]JDBC OR operator push down
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10347#issuecomment-165849800 Close for now. Will put the filter changes in one PR so it's easier to merge. Sorry for the inconvenience.
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10262#issuecomment-166033287 @yhuai JDBCSuite uses the H2 database. It seems that for the char(n) data type, either H2 doesn't pad, or the H2 JDBC driver already trims the trailing spaces in ResultSet.getString, so H2 doesn't have this problem. Showing the problem would require DB2 and the DB2 JDBC driver (I guess Oracle has the same problem too), but I don't think the test system has the DB2 JDBC driver. So I am guessing there may be no need to add the test?
[GitHub] spark pull request: [SPARK-12409][SQL]add filter (IN, AND, OR)
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10386 [SPARK-12409][SQL]add filter (IN, AND, OR)

Push the IN, AND, and OR filters down to the JDBC layer.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark_new_filter Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10386.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10386

commit e0b5e7c80f7b59944309aaa46f86ed8d4e62cd4a Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-18T20:11:36Z [SPARK-12409][SQL]add filter (IN, AND, OR)
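A hedged sketch of what compiling these filters into a WHERE fragment looks like. The Filter ADT below is a simplified, invented stand-in for org.apache.spark.sql.sources.Filter, not the PR's actual code; note the parentheses around both sides of AND/OR, per the review discussion on the earlier OR pull request.

```scala
object FilterCompile {
  sealed trait Filter
  case class EqualTo(attr: String, value: Any) extends Filter
  case class In(attr: String, values: Seq[Any]) extends Filter
  case class And(left: Filter, right: Filter) extends Filter
  case class Or(left: Filter, right: Filter) extends Filter

  private def compileValue(value: Any): String = value match {
    case s: String => "'" + s.replace("'", "''") + "'"  // naive quoting
    case other     => other.toString
  }

  def compile(f: Filter): String = f match {
    case EqualTo(attr, v) => s"$attr = ${compileValue(v)}"
    case In(attr, vs)     => s"$attr IN (${vs.map(compileValue).mkString(", ")})"
    // Parenthesize both sides so operator precedence survives the round trip.
    case And(l, r)        => s"(${compile(l)}) AND (${compile(r)})"
    case Or(l, r)         => s"(${compile(l)}) OR (${compile(r)})"
  }
}
```

For example, the filter tree from the OR discussion, Or(EqualTo(THEID,1), And(EqualTo(NAME,mary), EqualTo(THEID,2))), compiles to (THEID = 1) OR ((NAME = 'mary') AND (THEID = 2)).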
[GitHub] spark pull request: [SPARK-12459][SQL]add ExpressionDescription to...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10460 [SPARK-12459][SQL]add ExpressionDescription to string functions You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12459 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10460.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10460 commit f88386e34a0231ccb1fbfd427b0783e3364ce83e Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-23T22:31:02Z [SPARK-12459][SQL]add ExpressionDescription to string functions
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10505 [SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC layer

For an arithmetic operator in a WHERE clause, such as

select * from table where c1 + c2 > 10

the predicate c1 + c2 > 10 is currently evaluated at the Spark layer. This change pushes it down to the JDBC layer so it is evaluated in the database.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark12506 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10505.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10505

commit b10310898953e8830898c0ac95c861c8f3c88fa5 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-27T20:45:52Z [SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC layer
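The idea can be sketched with a tiny expression tree. The Expr ADT here is invented for illustration (Spark's Catalyst expressions are far richer): render the arithmetic predicate as a SQL string so the database, not Spark, evaluates it.

```scala
object ArithPushdown {
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr

  // Compile a predicate like c1 + c2 > 10 into a WHERE-clause fragment.
  def toSql(e: Expr): String = e match {
    case Col(n)            => n
    case Lit(v)            => v.toString
    case Add(l, r)         => s"(${toSql(l)} + ${toSql(r)})"
    case GreaterThan(l, r) => s"${toSql(l)} > ${toSql(r)}"
  }
}
```

Extending the sketch to -, * and / would just add three more node types with the same rendering pattern, which matches the PR's plan to start with + only.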
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10505#issuecomment-167746195 I only added the + operator for now. If the change is accepted, I will also add -, *, and /.
[GitHub] spark pull request: [SPARK-12391][SQL]JDBC OR operator push down
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/10347#discussion_r47948040 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -288,6 +288,8 @@ private[sql] class JDBCRDD(
 case GreaterThanOrEqual(attr, value) => s"$attr >= ${compileValue(value)}"
 case IsNull(attr) => s"$attr IS NULL"
 case IsNotNull(attr) => s"$attr IS NOT NULL"
+case Or(filter1, filter2) =>
+  compileFilter(filter1) + " OR " + compileFilter(filter2)
--- End diff -- @rxin Thank you very much for your comments. I will change the code to case Or(filter1, filter2) => "(" + compileFilter(filter1) + ") OR (" + compileFilter(filter2) + ")"

The reason I didn't add the AND filter along with OR is that I thought the AND filter was not necessary. When I tried a simple test such as "select * from test where THEID = 1 and NAME = 'fred'", the filter passed to the JDBC layer is an Array of filters with elements EqualTo(THEID,1), EqualTo(Name,fred). This can be handled by the current EqualTo filter, with no need to add an AND filter. But today, after more thinking, I found I still need to add AND. For a query such as "SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2", the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))). I will open another JIRA to add AND.

I also looked at other filters: 1. LIKE: for a string column, Spark uses EqualTo, so "WHERE columnName LIKE xxx" is changed to "WHERE columnName = xxx". I am not sure if Spark intentionally changes LIKE to '='; in SQL, LIKE is not exactly the same as '='. 2. BETWEEN is changed to the two filters GreaterThanOrEqual and LessThanOrEqual, so the current code works OK; no need to add anything. 3. NOT BETWEEN is changed to Or(LessThanOrEqual, GreaterThanOrEqual); no need to add anything. I will look more to see if I need to change other filters.
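The parenthesization point in the review above can be sketched with simplified stand-ins for Spark's source filters (hypothetical types, not the actual JDBCRDD code). Wrapping each side of OR/AND in parentheses preserves the nesting of the original filter tree regardless of SQL operator precedence:

```scala
// Hypothetical, simplified filter tree for illustration only
sealed trait Filter
case class EqualTo(attr: String, value: Any) extends Filter
case class Or(left: Filter, right: Filter) extends Filter
case class And(left: Filter, right: Filter) extends Filter

// Parenthesize both sides so nested AND inside OR keeps its intended grouping
def compileFilter(f: Filter): String = f match {
  case EqualTo(attr, value) => value match {
    case s: String => s"$attr = '$s'"
    case v         => s"$attr = $v"
  }
  case Or(l, r)  => s"(${compileFilter(l)}) OR (${compileFilter(r)})"
  case And(l, r) => s"(${compileFilter(l)}) AND (${compileFilter(r)})"
}
```

So the filter Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))) from the example compiles to `(THEID = 1) OR ((NAME = 'mary') AND (THEID = 2))`, which matches the original query's meaning.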
[GitHub] spark pull request: [SPARK-12409][SQL]JDBC AND operator push down
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10369 [SPARK-12409][SQL]JDBC AND operator push down For a simple AND such as select * from test where THEID = 1 AND NAME = 'fred', the filters pushed down to the JDBC layer are EqualTo(THEID,1), EqualTo(Name,fred). These are handled OK by the current code. For a query such as SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2, the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))), so we need to add an And filter in the JDBC layer. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark_12409 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10369.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10369 commit f5cfd1d038954a4b0fd1f502769fbd3140e7d35b Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-17T17:32:24Z [SPARK-12409][SQL]JDBC AND operator push down
[GitHub] spark pull request: [SPARK-11788][SQL]:surround timestamp/date val...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9872#issuecomment-159375392 @JoshRosen Regression test added. Could you please take a look? Thanks a lot for your help!!
[GitHub] spark pull request: [SPARK-11778] [SQL]:parse table name before it...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9773#issuecomment-157915848 Test case added. Could you please take a look? Thanks a lot!!
[GitHub] spark pull request: [SPARK-11788][SQL]:surround timestamp/date val...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9872 [SPARK-11788][SQL]:surround timestamp/date value with quotes You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-11788 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9872.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9872 commit 648ab61193cfd7e1c385823c1af67e39e40b8349 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-20T22:21:20Z [SPARK-11788][SQL]:surround timestamp/date value with quotes
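The underlying fix here is that string, timestamp, and date literals need quotes when compiled into a WHERE clause, while numeric literals do not. A minimal sketch of such a value compiler (a hypothetical helper for illustration, not necessarily Spark's exact code):

```scala
import java.sql.{Date, Timestamp}

// Quote values whose SQL literal form requires quotes; pass numerics through
def compileValue(value: Any): Any = value match {
  case s: String    => s"'$s'"
  case t: Timestamp => s"'$t'"
  case d: Date      => s"'$d'"
  case other        => other
}
```

With this, a pushed-down filter on a date column renders as, e.g., `d >= '2015-11-20'` rather than an unquoted literal that the database would reject.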
[GitHub] spark pull request: [SPARK-11788][SQL]:surround timestamp/date val...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9872#issuecomment-158545092 Sure. Will add a test.
[GitHub] spark pull request: SPARK-11778:parse table name before it is pass...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9773#issuecomment-157865029 I will add a test case.
[GitHub] spark pull request: [SPARK-11778][SQL]:add regression test
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9890 [SPARK-11778][SQL]:add regression test Add a regression test for SPARK-11778. @marmbrus Could you please take a look? Thank you very much!! You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-11778-regression-test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9890.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9890 commit cb04f55d79560393c454361807e60dbdb94640c4 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-22T03:14:02Z [SPARK-11778][SQL]:add regression test
[GitHub] spark pull request: [SPARK-11778][SQL]:add regression test
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9890#issuecomment-159143868 I couldn't figure out why the tests failed, but it doesn't seem to me that my new test suite caused the failure. Is it OK to have a retest? Thanks a lot!!
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10262#issuecomment-170701187 @yhuai Sorry for the late reply. I waited for my coworker Luciano to come back from vacation today to check with him about his DB2 docker test status. He has a PR to add DB2 docker test. https://github.com/apache/spark/pull/9893. His PR is still pending because of the DB2 jdbc driver dependency. After his PR is merged, I will add a test in his DB2 docker test.
[GitHub] spark issue #13287: [SPARK-15491][SQL]fix assertion failure for JDBC DataFra...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13287 @rxin gentle ping
[GitHub] spark issue #13492: [SPARK-15749][SQL]make the error message more meaningful
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13492 retest please. Thanks!
[GitHub] spark pull request #13492: [SPARK-15749][SQL]make the error message more mea...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/13492 [SPARK-15749][SQL]make the error message more meaningful ## What changes were proposed in this pull request? For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using
```
sqlContext.sql("insert into test1 values ('abc', 'def', 1)")
```
I got the error message
```
Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] JDBCRelation(test1) requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.
```
The error message is a little confusing: my simple insert statement doesn't have a SELECT clause. I will change the error message to a more general one
```
Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] JDBCRelation(test1) requires that the data to be inserted have the same number of columns as the target table.
```
## How was this patch tested? I tested the patch with a simple unit test, but it's a very trivial change and I don't think I need to check in any test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-15749 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13492.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13492 commit 11768e8ec799a8b51bbf90ff9943b2b0a2808e72 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-06-03T06:15:32Z [SPARK-15749][SQL]make the error message more meaningful
[GitHub] spark pull request: [SPARK-15491][SQL]fix assertion failure for JD...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/13287#issuecomment-221617823 @rxin Sorry, I didn't notice your comment until this morning. Reformatted the description. Thanks!
[GitHub] spark issue #13492: [SPARK-15749][SQL]make the error message more meaningful
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13492 @andrewor14 I am not sure why the test failed, but it doesn't seem to be related to my change. Could you please start another test? Thanks a lot!!
[GitHub] spark pull request: [SPARK-15491][SQL]fix assertion failure for JD...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/13287#issuecomment-21356 @rxin Thanks for your comment. When I wrote the test, I found there is one more place I need to change. So there are two problem cases: 1. a TreeNode that has two layers of constructor parameters; 2. a TreeNode with a constructor parameter that itself contains two layers of constructor parameters. I have both cases in my test. Hopefully it is what you want. Thanks!
[GitHub] spark issue #13492: [SPARK-15749][SQL]make the error message more meaningful
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13492 @andrewor14 Thanks a lot for merging in the change.
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10750#issuecomment-171498490 @viirya I changed the code based on your suggestion. Could you please review again? Thanks a lot for your help!!
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10750 [SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC layer For an arithmetic operator in a WHERE clause, such as select * from table where c1 + c2 > 10, the predicate c1 + c2 > 10 is currently evaluated at the Spark layer. This change pushes it down to the JDBC layer so it is evaluated in the database. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark__12506 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10750.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10750 commit 944bee6fe204e1cb12f3c5c57d6e09ad580905bd Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-01-13T14:04:20Z [SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC layer
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10505
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/10505#discussion_r49678184 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -485,6 +486,74 @@ private[sql] object DataSourceStrategy extends Strategy with Logging { }
+  /**
+   * Convert add predicate such as C1 + C2 + C3 to string
+   */
+  private[this] def getAddString(predicate: Expression): String = {
+    predicate match {
+      case expressions.Add(left, right) =>
+        val leftString = left match {
+          case a: Attribute => a.name
+          case add: Add => getAddString(add)
+          case _ => None
+        }
+        val rightString = right match {
+          case a: Attribute => a.name
+          case add: Add => getAddString(add)
+          case _ => None
+        }
+        leftString + " + " + rightString
+    }
+  }
+
+  /**
+   * Tries to translate a Catalyst [[Expression]] into data source [[Filter]].
--- End diff -- @viirya Thank you very much for your comments. I changed the code based on your suggestion. Since the new code is quite different from the original, I will close this PR and submit a new one.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/11104 [SPARK-13186][Streaming]Migrate away from SynchronizedMap The SynchronizedMap trait in package mutable is deprecated: synchronization via traits is deprecated as it is inherently unreliable. Change to java.util.concurrent.ConcurrentHashMap instead. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark_13186 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11104.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11104 commit 24e75ae7070f82ec847c144a5ba4940736d95503 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-02-02T07:13:50Z [SPARK-13186][Streaming]Migrate away from SynchronizedMap
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-180883943 @holdenk Could you please review? Thanks!!
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11104#discussion_r52126950 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala ---
@@ -30,6 +30,9 @@
 import org.apache.spark.{SparkConf, SparkFunSuite}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.{Milliseconds, StreamingContext}
+import java.util.concurrent.ConcurrentHashMap
+import scala.collection.convert.decorateAsScala._
--- End diff -- @holdenk Thanks for your comments. Yes, that's why I have decorateAsScala there. I am working on changing the code to use the Java API for +=, put and getOrElseUpdate. Do we also need a concurrency guarantee for ++ and --?
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-181664339 Sorry for the line length problem. Fixed.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11104#discussion_r52228031 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala ---
@@ -65,12 +67,14 @@ class KafkaStreamSuite extends SparkFunSuite with Eventually with BeforeAndAfter
 val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
   ssc, kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)
-val result = new mutable.HashMap[String, Long]() with mutable.SynchronizedMap[String, Long]
+val result = new ConcurrentHashMap[String, Long].asScala
 stream.map(_._2).countByValue().foreachRDD { r =>
   val ret = r.collect()
   ret.toMap.foreach { kv =>
-    val count = result.getOrElseUpdate(kv._1, 0) + kv._2
-    result.put(kv._1, count)
+    result.synchronized {
+      val count = result.getOrElseUpdate(kv._1, 0) + kv._2
--- End diff -- @holdenk Thanks for your quick reply. I initially changed val count = result.getOrElseUpdate(kv._1, 0) + kv._2 to result.putIfAbsent(kv._1, 0); val count = result.get(kv._1) + kv._2, but the test failed for me. I guess a different thread can come in between the two lines, so concurrency is no longer guaranteed. So I used a synchronized block instead.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-181528520 @holdenk Could you please review one more time? I changed to the Java API except for the getOrElseUpdate in KafkaStreamSuite.scala. I can't find a Java equivalent that can be done in one line, so I used a synchronized block. Thank you very much for your help!!
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-183434809 @srowen Will do. I have my local branch messed up; if I can't figure out how to fix it, I will close this PR and submit a new one. Also, one of the Python streaming tests failed with java.net.BindException. I am still trying to figure out the problem.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-181633831 @zsxwing Thanks for the comments. I didn't see a PR for removing SynchronizedSet. I will work on this.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11104#discussion_r52250939 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala ---
@@ -65,12 +67,14 @@ class KafkaStreamSuite extends SparkFunSuite with Eventually with BeforeAndAfter
 val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
   ssc, kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)
-val result = new mutable.HashMap[String, Long]() with mutable.SynchronizedMap[String, Long]
+val result = new ConcurrentHashMap[String, Long].asScala
 stream.map(_._2).countByValue().foreachRDD { r =>
   val ret = r.collect()
   ret.toMap.foreach { kv =>
-    val count = result.getOrElseUpdate(kv._1, 0) + kv._2
-    result.put(kv._1, count)
+    result.synchronized {
+      val count = result.getOrElseUpdate(kv._1, 0) + kv._2
--- End diff -- @holdenk @zsxwing I tried val count = result.putIfAbsent(kv._1, 0) + kv._2, but the test failed for me. So I will change to mutable.HashMap and put it in a synchronized block. Is it OK to use mutable.HashMap and a synchronized block in this file only, but use java.util.concurrent.ConcurrentHashMap in the other files (StreamingListenerSuite, KinesisStreamTests and FileInputDStream)? Or is it better to use mutable.HashMap and a synchronized block for all the files that have SynchronizedMap?
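For what it's worth, since Java 8 `ConcurrentHashMap.merge` provides a one-line atomic read-modify-write for exactly this counting pattern, avoiding both `getOrElseUpdate` and a synchronized block. A small sketch (an illustration of the API, not the PR's actual code):

```scala
import java.util.concurrent.ConcurrentHashMap

// Per-key counter: merge inserts the value if the key is absent, otherwise
// applies the remapping function, all atomically per key, so no external
// synchronization is needed even with concurrent writers.
val result = new ConcurrentHashMap[String, Long]()

def addCount(key: String, delta: Long): Unit =
  result.merge(key, delta, (a: Long, b: Long) => a + b)
```

Usage: after `addCount("fred", 2L)` and `addCount("fred", 3L)` from any mix of threads, `result.get("fred")` is 5. The failed `putIfAbsent` + `get` attempt quoted above loses updates precisely because it splits the read-modify-write across two non-atomic steps.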
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-181648992 Fixed the problems. Thank you all very much for your help!!
[GitHub] spark pull request: [SPARK-12506][SPARK-12126][SQL]use CatalystSca...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/11005 [SPARK-12506][SPARK-12126][SQL]use CatalystScan for JDBCRelation As suggested here (https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526), I will change JDBCRelation to implement CatalystScan, and then directly access Catalyst expressions in JDBCRDD. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12126 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11005.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11005 commit 0bdfea86fc0671bf636bbed3174baa417de6367e Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-01-31T04:01:12Z [SPARK-12506][SPARK-12126][SQL]use CatalystScan for JDBCRelation
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10750
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10750#issuecomment-178119526 @viirya @HyukjinKwon @rxin Thank you all very much for your comments. I will change JDBCRelation to implement CatalystScan, and then directly access Catalyst expressions in JDBCRDD. I will close this PR and submit a new one.
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11250#issuecomment-187305582 @srowen @holdenk Thank you very much for your help!!
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/11104
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-185604993 @srowen @holdenk I will close this PR and submit a new one. Thanks!
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/11250 [SPARK-13186][Streaming]migrate away from SynchronizedMap trait SynchronizedMap in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Change to java.util.concurrent.ConcurrentHashMap instead. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark__13186 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11250.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11250 commit ff5e52ae100408e81de7f138b876ec8a47aaafa1 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-02-18T04:25:41Z [SPARK-13186][Streaming]migrate away from SynchronizedMap
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11250#issuecomment-185608162 @srowen @holdenk Could you please take a look at this PR? I ran the Python streaming tests cleanly locally before I submitted the PR. Thanks a lot!!
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11250#discussion_r53430913 --- Diff: extras/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala --- @@ -241,13 +243,13 @@ abstract class KinesisStreamTests(aggregateTestData: Boolean) extends KinesisFun kinesisStream.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => { val kRdd = rdd.asInstanceOf[KinesisBackedBlockRDD[Array[Byte]]] val data = rdd.map { bytes => new String(bytes).toInt }.collect().toSeq - collectedData(time) = (kRdd.arrayOfseqNumberRanges, data) + collectedData.put(time, (kRdd.arrayOfseqNumberRanges, data)) }) ssc.remember(Minutes(60)) // remember all the batches so that they are all saved in checkpoint ssc.start() -def numBatchesWithData: Int = collectedData.count(_._2._2.nonEmpty) +def numBatchesWithData: Int = collectedData.asScala.count(_._2._2.nonEmpty) --- End diff -- @srowen Thanks for your comment. For the 5 files I changed, I will remove the usage of Java ConcurrentHashMap, and use mutable.HashMap instead. I will wrap every mutable.HashMap operation in a synchronized block.
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11250#discussion_r53563513 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -163,8 +166,11 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( /** Clear the old time-to-files mappings along with old RDDs */ protected[streaming] override def clearMetadata(time: Time) { super.clearMetadata(time) -val oldFiles = batchTimeToSelectedFiles.filter(_._1 < (time - rememberDuration)) -batchTimeToSelectedFiles --= oldFiles.keys +val oldFiles = batchTimeToSelectedFiles.synchronized +{ + batchTimeToSelectedFiles.filter(_._1 < (time - rememberDuration)) --- End diff -- @srowen Thank you very much for your comments. I changed the code to synchronize the whole block. The extra parentheses at line 153 seem to be required; removing them causes problems.
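[Editor's note] The point about synchronizing the whole block generalizes: filtering out old entries and then removing them is a compound operation, and both steps must run under the same lock or another thread could mutate the map in between. A hedged Java sketch with illustrative names (not the Spark code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SyncCompound {
    public static void main(String[] args) {
        // Stand-in for batchTimeToSelectedFiles: batch time -> selected file names.
        Map<Long, List<String>> batchTimeToSelectedFiles =
            Collections.synchronizedMap(new HashMap<>());
        batchTimeToSelectedFiles.put(100L, Collections.singletonList("a"));
        batchTimeToSelectedFiles.put(200L, Collections.singletonList("b"));

        long cutoff = 150L; // stand-in for (time - rememberDuration)
        // Filter-then-remove must hold the map's lock for the whole sequence;
        // synchronizedMap only makes individual calls atomic, not the pair.
        synchronized (batchTimeToSelectedFiles) {
            List<Long> oldKeys = new ArrayList<>();
            for (Long t : batchTimeToSelectedFiles.keySet()) {
                if (t < cutoff) oldKeys.add(t);
            }
            batchTimeToSelectedFiles.keySet().removeAll(oldKeys);
        }
        System.out.println(batchTimeToSelectedFiles.keySet()); // [200]
    }
}
```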
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10262#issuecomment-214437787 @HyukjinKwon I will continue working on this and finish the work this week.
[GitHub] spark pull request: [SPARK-15491][SQL]fix assertion failure for JD...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/13287 [SPARK-15491][SQL]fix assertion failure for JDBC DataFrame to JSON ## What changes were proposed in this pull request? in TreeNode.scala parseToJson, it has case p: Product => try { val fieldNames = getConstructorParameterNames(p.getClass) val fieldValues = p.productIterator.toSeq assert(fieldNames.length == fieldValues.length) ("product-class" -> JString(p.getClass.getName)) :: fieldNames.zip(fieldValues).map { case (name, value) => name -> parseToJson(value) }.toList For a class that has two levels of constructor parameters, for example, private[sql] case class JDBCRelation( url: String, table: String, parts: Array[Partition], properties: Properties = new Properties())(@transient val sparkSession: SparkSession) val fieldValues = p.productIterator.toSeq will only get the values for the first level of constructor parameters, in this case, 4 parameters, but val fieldNames = getConstructorParameterNames(p.getClass) will get all 5 parameters, so assert(fieldNames.length == fieldValues.length) will fail. I will change the code to make fieldValues include all the parameters. ## How was this patch tested? I have a unit test. Will add it soon. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-15491 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13287.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13287 commit 8307b348b62f0653cccd04cd543cf77281bc4b85 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-05-25T00:11:53Z [SPARK-15491][SQL]fix assertion failure for JDBC DataFrame to JSON
[GitHub] spark pull request #14535: [SPARK-16946][SQL]throw Exception if saveAsTable[...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/14535 [SPARK-16946][SQL]throw Exception if saveAsTable[apend] has different… ## What changes were proposed in this pull request? In HiveContext, if saveAsTable[append] has a different number of columns, Spark will throw an Exception. e.g. ``` test("saveAsTable[append]: too many columns") { withTable("saveAsTable_too_many_columns") { Seq((1, 2)).toDF("i", "j").write.saveAsTable("saveAsTable_too_many_columns") val e = intercept[AnalysisException] { Seq((3, 4, 5)).toDF("i", "j", "k").write.mode("append").saveAsTable("saveAsTable_too_many_columns") } assert(e.getMessage.contains("doesn't match")) } } ``` However, in SparkSession or SQLContext, if you use the above code example, the extra column in the append data will be removed silently without any warning or Exception. The table becomes ``` i j 3 4 1 2 ``` I changed the code to throw an Exception if saveAsTable[append] has a different number of columns ## How was this patch tested? Add unit tests to test the patch. … number of columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-16946 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14535.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14535 commit 52e7b66e1ff335566af92c795c1679c11e0ab8cb Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-08-08T06:06:58Z [SPARK-16946][SQL]throw Exception if saveAsTable[apend] has different number of columns
[GitHub] spark pull request #10262: [SPARK-12270][SQL]remove empty space after getStr...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10262
[GitHub] spark pull request #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/16175#discussion_r91345742 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -1110,6 +1110,16 @@ class DatasetSuite extends QueryTest with SharedSQLContext { } assert(e.getMessage.contains("Cannot create encoder for Option of Product type")) } + + + test ("SPARK-17460: If the sizeInBytes in Statistics exceeds the limit of an Int, " + --- End diff -- Thanks for the comment. Changed.
[GitHub] spark issue #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/16175 @gatorsmile Thanks a lot for reviewing this. Sorry I just saw your last comment after I pushed the change. Will make more changes for other potential overflow issues.
[GitHub] spark issue #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/16175 @gatorsmile Could you please take a look when you have time? Thanks a lot!!
[GitHub] spark pull request #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/16175 [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in canBroadcast ## What changes were proposed in this pull request? 1. In SparkStrategies.canBroadcast, I will add the check plan.statistics.sizeInBytes >= 0 2. In LocalRelations.statistics, when calculating the statistics, I will change the size to BigInt so it won't overflow. ## How was this patch tested? I will add a test case to make sure the statistics.sizeInBytes won't overflow. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-17460 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16175.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16175 commit 9ad6725975e36e513ce89f54582409bd3e9d2cfe Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-12-06T18:22:38Z [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in canBroadcast
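[Editor's note] The overflow this PR guards against is ordinary fixed-width integer wraparound: a product of two large sizes can come out negative, and a negative size would then slip under any `<= autoBroadcastJoinThreshold` check. A small Java illustration with assumed values (not Spark's actual statistics code), contrasting `long` arithmetic with an arbitrary-precision `BigInteger`, the analogue of the `BigInt` used in the fix:

```java
import java.math.BigInteger;

public class SizeOverflow {
    public static void main(String[] args) {
        long rowSize = Long.MAX_VALUE / 4;  // hypothetical per-row size estimate
        long numRows = 8;                   // hypothetical row count
        long overflowed = rowSize * numRows; // wraps around to a negative value
        BigInteger safe = BigInteger.valueOf(rowSize)
                .multiply(BigInteger.valueOf(numRows)); // exact, never wraps
        System.out.println(overflowed < 0); // true
        System.out.println(safe.signum());  // 1 (positive)
    }
}
```

This is why the fix pairs the `sizeInBytes >= 0` guard with moving the computation itself to `BigInt`: the guard catches values that already wrapped, and `BigInt` prevents the wrap in the first place.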
[GitHub] spark pull request #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/16175#discussion_r91765482 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -115,7 +115,8 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { */ private def canBroadcast(plan: LogicalPlan): Boolean = { plan.statistics.isBroadcastable || -plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold +(plan.statistics.sizeInBytes >= 0 && --- End diff -- @cloud-fan @gatorsmile Thank you all for your comments. I will change defaultSizeInBytes based on the above comments. It seems all the other uses of Statistics.sizeInBytes are safe and will not cause overflow.
[GitHub] spark pull request #14535: [SPARK-16946][SQL]throw Exception if saveAsTable[...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/14535
[GitHub] spark issue #14535: [SPARK-16946][SQL]throw Exception if saveAsTable[apend] ...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/14535 @gatorsmile Sorry for the late response. I just came back from China. This is not an issue any more. I will close this PR.
[GitHub] spark issue #13287: [SPARK-15491][SQL]fix assertion failure for JDBC DataFra...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13287 @gatorsmile Sorry for the late response. I just came back from China. For a class that has two levels of constructor parameters, e.g. private[sql] case class JDBCRelation( url: String, table: String, parts: Array[Partition], properties: Properties = new Properties())(@transient val sparkSession: SparkSession) toJson will have a problem. The current code doesn't have this problem any more because in the new method TreeNode.shouldConvertToJson, JDBCRelation returns false and is not allowed to convert to JSON. However, it seems to me that the root problem is not fixed. In the following cases: 1. TreeNode that has two layers of constructor parameters. 2. TreeNode has a constructor parameter that contains two layers of constructor parameters. And if the TreeNode can be converted to JSON, the Exception will still occur.
[GitHub] spark pull request #13287: [SPARK-15491][SQL]fix assertion failure for JDBC ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/13287#discussion_r123368171 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala --- @@ -594,7 +596,9 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { // returns null if the product type doesn't have a primary constructor, e.g. HiveFunctionWrapper case p: Product => try { --- End diff -- Thank you very much for your reply. Shall we also allow the JDBCRelation to be converted to JSON like the following? ``` val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:h2:mem:testdb0;user=testUser;password=testPass", "dbtable" -> "people")).load().queryExecution.logical.toJSON ```
[GitHub] spark pull request #13287: [SPARK-15491][SQL]fix assertion failure for JDBC ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/13287#discussion_r123818099 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala --- @@ -594,7 +596,9 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { // returns null if the product type doesn't have a primary constructor, e.g. HiveFunctionWrapper case p: Product => try { --- End diff -- Thanks. I will close this PR.
[GitHub] spark pull request #13287: [SPARK-15491][SQL]fix assertion failure for JDBC ...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/13287
[GitHub] spark pull request #19256: [SPARK-21338][SQL]implement isCascadingTruncateTa...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/19256 [SPARK-21338][SQL]implement isCascadingTruncateTable() method in AggregatedDialect ## What changes were proposed in this pull request? org.apache.spark.sql.jdbc.JdbcDialect's method: def isCascadingTruncateTable(): Option[Boolean] = None is not overridden in the org.apache.spark.sql.jdbc.AggregatedDialect class. Because of this, when more than one dialect is added, Spark doesn't truncate the table, because isCascadingTruncateTable always returns the default None for the AggregatedDialect. This PR implements isCascadingTruncateTable in the AggregatedDialect class. ## How was this patch tested? In JDBCSuite, inside test("Aggregated dialects"), will add one line to test AggregatedDialect.isCascadingTruncateTable You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-21338 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19256.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19256
[GitHub] spark pull request #19256: [SPARK-21338][SQL]implement isCascadingTruncateTa...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19256#discussion_r139313120 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala --- @@ -41,4 +41,8 @@ private class AggregatedDialect(dialects: List[JdbcDialect]) extends JdbcDialect override def getJDBCType(dt: DataType): Option[JdbcType] = { dialects.flatMap(_.getJDBCType(dt)).headOption } + + override def isCascadingTruncateTable(): Option[Boolean] = { +dialects.flatMap(_.isCascadingTruncateTable).headOption --- End diff -- Both getCatalystType and getJDBCType(dt: DataType) use the first one. Also, in the class header, it has the following: /** * AggregatedDialect can unify multiple dialects into one virtual Dialect. * Dialects are tried in order, and the first dialect that does not return a * neutral element will will. * * @param dialects List of dialects. */ It has a typo, I guess it means "the first dialect that does not return a neutral element will return"
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19256 Thanks @gatorsmile I will change both the implementation and the PR title.
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19256 Thanks @gatorsmile Does the following logic look good to you? ``` if(any dialect's isCascadingTruncateTable returns true) return Some(true) else if (any dialect's isCascadingTruncateTable returns false) return Some(false) else // None of the dialect implements this method, return the superclass default value return None ```
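[Editor's note] The precedence proposed above (any true → Some(true), else any false → Some(false), else None) can be sketched as follows. This is an illustrative Java translation with a hypothetical helper over `Optional<Boolean>` values, not the actual AggregatedDialect code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class CascadePolicy {
    // Hypothetical helper mirroring the logic quoted above: Some(true) if any
    // dialect reports true, else Some(false) if any reports false, else None.
    static Optional<Boolean> isCascadingTruncateTable(List<Optional<Boolean>> dialects) {
        if (dialects.stream().anyMatch(o -> o.orElse(false))) {
            return Optional.of(true);
        }
        if (dialects.stream().anyMatch(o -> o.isPresent() && !o.get())) {
            return Optional.of(false);
        }
        return Optional.empty(); // no dialect expressed an opinion
    }

    public static void main(String[] args) {
        System.out.println(isCascadingTruncateTable(
            Arrays.asList(Optional.empty(), Optional.of(true))));   // Optional[true]
        System.out.println(isCascadingTruncateTable(
            Arrays.asList(Optional.of(false), Optional.empty())));  // Optional[false]
        System.out.println(isCascadingTruncateTable(
            Arrays.asList(Optional.empty(), Optional.empty())));    // Optional.empty
    }
}
```

Note this differs from the original `dialects.flatMap(_.isCascadingTruncateTable).headOption` proposal in the diff: taking the head defers to dialect order, whereas this version treats any reported `true` as decisive regardless of position.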
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19256 @gatorsmile Thanks a lot for your help!
[GitHub] spark pull request #19496: [SPARK-22271][SQL]mean overflows and returns null...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r144732734

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
@@ -2103,4 +2103,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
     testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
     Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
   }
+
+  test("SPARK-22271: mean overflows and returns null for some decimal variables") {
+    val d: BigDecimal = BigDecimal(0.034567890)
--- End diff --

It's not necessary. I will remove. Thanks for your review.
[GitHub] spark pull request #19496: [SPARK-22271][SQL]mean overflows and returns null...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/19496 [SPARK-22271][SQL]mean overflows and returns null for some decimal variables

## What changes were proposed in this pull request?

In Average.scala, it has

```
override lazy val evaluateExpression = child.dataType match {
  case DecimalType.Fixed(p, s) =>
    // increase the precision and scale to prevent precision loss
    val dt = DecimalType.bounded(p + 14, s + 4)
    Cast(Cast(sum, dt) / Cast(count, dt), resultType)
  case _ =>
    Cast(sum, resultType) / Cast(count, resultType)
}

def setChild(newchild: Expression) = {
  child = newchild
}
```

It is possible that Cast(count, dt) will make the precision of the decimal number bigger than 38, and this causes overflow. Since count is an integer and doesn't need a scale, I will cast it using DecimalType.bounded(38, 0).

## How was this patch tested?

In DataFrameSuite, I will add a test case.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/huaxingao/spark spark-22271

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19496.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19496

commit a3437ee4a87d1f51b362adeb20d4fcc264085ba7
Author: Huaxin Gao <huax...@us.ibm.com>
Date: 2017-10-14T04:45:27Z

[SPARK-22271][SQL]mean overflows and returns null for some decimal variables
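Why the widened type overflows can be seen from the precision arithmetic alone. Below is a plain-Scala sketch; `DecimalBounds` is a stand-in for Spark's `DecimalType.bounded`, which (to my understanding) caps precision and scale at Spark's maximum of 38:

```scala
// Stand-in for DecimalType.bounded: precision and scale capped at 38.
object DecimalBounds {
  val MaxPrecision = 38
  def bounded(precision: Int, scale: Int): (Int, Int) =
    (math.min(precision, MaxPrecision), math.min(scale, MaxPrecision))
}

// For an input column of DecimalType(38, 33), the widened divide type is:
val (p, s) = DecimalBounds.bounded(38 + 14, 33 + 4) // capped to (38, 37)

// Only p - s digits remain left of the decimal point, so casting the row
// count (an integral value) to this type overflows once the count needs
// more digits than that. Casting count with bounded(38, 0) instead keeps
// all 38 digits available for the integral part.
val integralDigits = p - s
```

With one integral digit available, a count of 10 rows already cannot be represented, which matches the null mean reported in the JIRA.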
[GitHub] spark pull request #19496: [SPARK-22271][SQL]mean overflows and returns null...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r145182120

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
@@ -2103,4 +2103,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
     testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
     Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
   }
+
+  test("SPARK-22271: mean overflows and returns null for some decimal variables") {
+    val d = 0.034567890
+    val df = Seq(d, d, d, d, d, d, d, d, d, d).toDF("DecimalCol")
+    val result = df.select('DecimalCol cast DecimalType(38, 33))
+      .select(col("DecimalCol")).describe()
+    val mean = result.select("DecimalCol").where($"summary" === "mean")
+    assert(mean.collect.toSet === Set(Row("0.034567890")))
--- End diff --

@gatorsmile Thanks Sean for your review. I will fix the problems.
[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19496 @gatorsmile Thank you very much for your help!
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150450334

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -146,4 +146,172 @@ class QuantileDiscretizerSuite
     val model = discretizer.fit(df)
     assert(model.hasParent)
   }
+
+  test("Multiple Columns: Test observed number of buckets and their sizes match expected values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val datasetSize = 10
+    val numBuckets = 5
+    val data1 = Array.range(1, 11, 1).map(_.toDouble)
+    val data2 = Array.range(1, 20, 2).map(_.toDouble)
+    val data = (0 until 10).map { idx =>
+      (data1(idx), data2(idx))
+    }
+    val df: DataFrame = data.toSeq.toDF("input1", "input2")
+
+    val discretizer = new QuantileDiscretizer()
+      .setInputCols(Array("input1", "input2"))
+      .setOutputCols(Array("result1", "result2"))
+      .setNumBuckets(numBuckets)
+    assert(discretizer.isQuantileDiscretizeMultipleColumns())
+    val result = discretizer.fit(df).transform(df)
+
+    val relativeError = discretizer.getRelativeError
+    val isGoodBucket = udf {
+      (size: Int) => math.abs(size - (datasetSize / numBuckets)) <= (relativeError * datasetSize)
+    }
+
+    for (i <- 1 to 2) {
+      val observedNumBuckets = result.select("result" + i).distinct.count
+      assert(observedNumBuckets === numBuckets,
+        "Observed number of buckets does not equal expected number of buckets.")
+
+      val numGoodBuckets = result.groupBy("result" + i).count.filter(isGoodBucket($"count")).count
+      assert(numGoodBuckets === numBuckets,
+        "Bucket sizes are not within expected relative error tolerance.")
+    }
+  }
+
+  test("Multiple Columns: Test on data with high proportion of duplicated values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val numBuckets = 5
+    val expectedNumBucket = 3
+    val data1 = Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0)
+    val data2 = Array(1.0, 2.0, 3.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 3.0, 1.0, 2.0)
+    val data = (0 until data1.length).map { idx =>
+      (data1(idx), data2(idx))
+    }
--- End diff --

Will change to data1.zip(data2).
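The suggested change can be checked with a tiny self-contained snippet: index-based pairing and `zip` produce the same sequence of tuples.

```scala
// Same arrays as the first test case in the quoted diff.
val data1 = Array.range(1, 11, 1).map(_.toDouble) // 1.0 .. 10.0
val data2 = Array.range(1, 20, 2).map(_.toDouble) // 1.0, 3.0, .. 19.0

// Index-based pairing, as in the original test code.
val byIndex = (0 until data1.length).map(idx => (data1(idx), data2(idx)))

// zip-based pairing, as the reviewer suggested.
val byZip = data1.zip(data2).toSeq
```

Besides being shorter, `zip` stops at the shorter array, so it cannot throw an out-of-range error if the two arrays ever differ in length.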
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150451065

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -146,4 +146,172 @@ class QuantileDiscretizerSuite
     val model = discretizer.fit(df)
     assert(model.hasParent)
   }
+
+  test("Multiple Columns: Test observed number of buckets and their sizes match expected values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val datasetSize = 10
+    val numBuckets = 5
+    val data1 = Array.range(1, 11, 1).map(_.toDouble)
+    val data2 = Array.range(1, 20, 2).map(_.toDouble)
+    val data = (0 until 10).map { idx =>
+      (data1(idx), data2(idx))
+    }
+    val df: DataFrame = data.toSeq.toDF("input1", "input2")
+
+    val discretizer = new QuantileDiscretizer()
+      .setInputCols(Array("input1", "input2"))
+      .setOutputCols(Array("result1", "result2"))
+      .setNumBuckets(numBuckets)
+    assert(discretizer.isQuantileDiscretizeMultipleColumns())
+    val result = discretizer.fit(df).transform(df)
+
+    val relativeError = discretizer.getRelativeError
+    val isGoodBucket = udf {
+      (size: Int) => math.abs(size - (datasetSize / numBuckets)) <= (relativeError * datasetSize)
+    }
+
+    for (i <- 1 to 2) {
+      val observedNumBuckets = result.select("result" + i).distinct.count
+      assert(observedNumBuckets === numBuckets,
+        "Observed number of buckets does not equal expected number of buckets.")
+
+      val numGoodBuckets = result.groupBy("result" + i).count.filter(isGoodBucket($"count")).count
+      assert(numGoodBuckets === numBuckets,
+        "Bucket sizes are not within expected relative error tolerance.")
+    }
+  }
+
+  test("Multiple Columns: Test on data with high proportion of duplicated values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val numBuckets = 5
+    val expectedNumBucket = 3
+    val data1 = Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0)
+    val data2 = Array(1.0, 2.0, 3.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 3.0, 1.0, 2.0)
+    val data = (0 until data1.length).map { idx =>
+      (data1(idx), data2(idx))
+    }
+    val df: DataFrame = data.toSeq.toDF("input1", "input2")
+    val discretizer = new QuantileDiscretizer()
+      .setInputCols(Array("input1", "input2"))
+      .setOutputCols(Array("result1", "result2"))
+      .setNumBuckets(numBuckets)
+    assert(discretizer.isQuantileDiscretizeMultipleColumns())
+    val result = discretizer.fit(df).transform(df)
+    for (i <- 1 to 2) {
+      val observedNumBuckets = result.select("result" + i).distinct.count
+      assert(observedNumBuckets == expectedNumBucket,
+        s"Observed number of buckets are not correct." +
+          s" Expected $expectedNumBucket but found ($observedNumBuckets")
+    }
+  }
+
+  test("Multiple Columns: Test transform on data with NaN value") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val numBuckets = 3
+    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9, Double.NaN, Double.NaN, Double.NaN)
+    val expectedKeep1 = Array(0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0)
+    val validData2 = Array(0.2, -0.1, 0.3, 0.0, 0.1, 0.3, 0.5, 0.8, Double.NaN, Double.NaN)
+    val expectedKeep2 = Array(1.0, 0.0, 2.0, 0.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0)
+
+    val data = (0 until validData1.length).map { idx =>
+      (validData1(idx), validData2(idx), expectedKeep1(idx), expectedKeep2(idx))
+    }
+    val dataFrame: DataFrame = data.toSeq.toDF("input1", "input2", "expected1", "expected2")
+
+    val discretizer = new QuantileDiscretizer()
+      .setInputCols(Array("input1", "input2"))
+      .setOutputCols(Array("result1", "result2"))
+      .setNumBuckets(numBuckets)
+    assert(discretizer.isQuantileDiscretizeMultipleColumns())
+
+    withClue("QuantileDiscretizer with handleInvalid=error should throw exception for NaN values") {
+      intercept[SparkException] {
+        discretizer.fit(dataFrame).transform(dataFrame).collect()
+      }
+    }
+
+    discretizer.setHandleInvalid("keep")
+    discretizer.fit(dataFrame).transform(dataFrame).
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150450151

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -104,7 +126,8 @@ private[feature] trait QuantileDiscretizerBase extends Params
  */
@Since("1.6.0")
final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val uid: String)
-  extends Estimator[Bucketizer] with QuantileDiscretizerBase with DefaultParamsWritable {
+  extends Estimator[Bucketizer] with QuantileDiscretizerBase with DefaultParamsWritable
+    with HasInputCols with HasOutputCols {
--- End diff --

I guess I will leave this as is even though it's a bit weird.
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150450280

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -146,4 +146,172 @@ class QuantileDiscretizerSuite
     val model = discretizer.fit(df)
     assert(model.hasParent)
   }
+
+  test("Multiple Columns: Test observed number of buckets and their sizes match expected values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val datasetSize = 10
+    val numBuckets = 5
+    val data1 = Array.range(1, 11, 1).map(_.toDouble)
+    val data2 = Array.range(1, 20, 2).map(_.toDouble)
+    val data = (0 until 10).map { idx =>
+      (data1(idx), data2(idx))
+    }
--- End diff --

Yes. Will change to data1.zip(data2).
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150450222

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -129,34 +152,95 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("2.1.0")
   def setHandleInvalid(value: String): this.type = set(handleInvalid, value)

+  /** @group setParam */
+  @Since("2.3.0")
+  def setNumBucketsArray(value: Array[Int]): this.type = set(numBucketsArray, value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
+  private[feature] def isQuantileDiscretizeMultipleColumns(): Boolean = {
+    if (isSet(inputCols) && isSet(inputCol)) {
+      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
+        "`QuantileDiscretize` only map one column specified by `inputCol`")
+      false
+    } else if (isSet(inputCols)) {
+      true
+    } else {
+      false
+    }
+  }
+
+  private[feature] def getInOutCols: (Array[String], Array[String]) = {
+    if (!isQuantileDiscretizeMultipleColumns) {
+      (Array($(inputCol)), Array($(outputCol)))
+    } else {
+      require($(inputCols).length == $(outputCols).length,
+        "inputCols number do not match outputCols")
+      ($(inputCols), $(outputCols))
+    }
+  }
+
   @Since("1.6.0")
   override def transformSchema(schema: StructType): StructType = {
-    SchemaUtils.checkNumericType(schema, $(inputCol))
-    val inputFields = schema.fields
-    require(inputFields.forall(_.name != $(outputCol)),
-      s"Output column ${$(outputCol)} already exists.")
-    val attr = NominalAttribute.defaultAttr.withName($(outputCol))
-    val outputFields = inputFields :+ attr.toStructField()
+    val (inputColNames, outputColNames) = getInOutCols
+    val existingFields = schema.fields
+    var outputFields = existingFields
+    inputColNames.zip(outputColNames).map { case (inputColName, outputColName) =>
+      SchemaUtils.checkNumericType(schema, inputColName)
+      require(existingFields.forall(_.name != outputColName),
+        s"Output column ${outputColName} already exists.")
+      val attr = NominalAttribute.defaultAttr.withName(outputColName)
+      outputFields :+= attr.toStructField()
+    }
     StructType(outputFields)
   }

   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): Bucketizer = {
     transformSchema(dataset.schema, logging = true)
-    val splits = dataset.stat.approxQuantile($(inputCol),
-      (0.0 to 1.0 by 1.0/$(numBuckets)).toArray, $(relativeError))
+    val bucketizer = new Bucketizer(uid).setHandleInvalid($(handleInvalid))
+    if (isQuantileDiscretizeMultipleColumns) {
+      var bucketArray = Array.empty[Int]
+      if (isSet(numBucketsArray)) {
+        bucketArray = $(numBucketsArray)
+      }
+      else {
+        bucketArray = Array($(numBuckets))
+      }
+      val probabilityArray = bucketArray.toSeq.flatMap { numOfBucket =>
+        (0.0 to 1.0 by 1.0 / numOfBucket)
+      }
+      val splitsArray = dataset.stat.approxQuantile($(inputCols),
+        probabilityArray.sorted.toArray.distinct, $(relativeError))
+      val distinctSplitsArray = splitsArray.toSeq.map { splits =>
+        getDistinctSplits(splits)
+      }
+      bucketizer.setSplitsArray(distinctSplitsArray.toArray)
+      copyValues(bucketizer.setParent(this))
+    }
+    else {
--- End diff --

Will fix this. And fix the same problem in another place.
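The probability grid that the quoted `fit` feeds to `approxQuantile` is just the evenly spaced quantile probabilities 0, 1/n, ..., 1 for each bucket count, merged across columns. A plain-Scala sketch (the `probabilities` helper is illustrative, not Spark's API; integer-based stepping is used here to avoid floating-point drift from `0.0 to 1.0 by 1.0/n`):

```scala
// Evenly spaced quantile probabilities requested for n buckets.
def probabilities(numBuckets: Int): Array[Double] =
  (0 to numBuckets).map(_.toDouble / numBuckets).toArray

// For numBucketsArray = Array(4, 2), the quoted code flattens the per-column
// grids, then sorts and de-duplicates before a single approxQuantile call.
val merged = Array(4, 2).flatMap(probabilities).sorted.distinct
```

Merging 0/4ths and 0/2nds collapses the shared points 0.0, 0.5, and 1.0 into one entry each.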