[GitHub] spark pull request: [SPARK-11474][SQL]change fetchSize to fetchsiz...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9473 [SPARK-11474][SQL]change fetchSize to fetchsize

In DefaultDataSource.scala there is

override def createRelation(
    sqlContext: SQLContext,
    parameters: Map[String, String]): BaseRelation

The parameters argument is a CaseInsensitiveMap. After the line

parameters.foreach(kv => properties.setProperty(kv._1, kv._2))

properties holds all-lower-case keys, so fetchSize becomes fetchsize. However, the compute method in JDBCRDD has

val fetchSize = properties.getProperty("fetchSize", "0").toInt

so the fetchSize value is always 0 and never gets set correctly.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-11474 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9473.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9473

commit 31e18562643bc8e4e60dd395f5062e5e3d7660a6 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-04T07:28:30Z [SPARK-11474][SQL]change fetchSize to fetchsize

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
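The mismatch can be reproduced in a few lines. This is a standalone sketch, not Spark's actual classes: `caseInsensitive` is a hypothetical stand-in for Spark's CaseInsensitiveMap, and the two lookup helpers mirror the copy in createRelation followed by the lookup in JDBCRDD.compute.

```scala
import java.util.Properties

object FetchSizeBug {
  // Stand-in for Spark's CaseInsensitiveMap: keys are lower-cased on insert.
  def caseInsensitive(m: Map[String, String]): Map[String, String] =
    m.map { case (k, v) => (k.toLowerCase, v) }

  private def toProperties(params: Map[String, String]): Properties = {
    val properties = new Properties()
    caseInsensitive(params).foreach(kv => properties.setProperty(kv._1, kv._2))
    properties
  }

  // The camel-case lookup misses and silently falls back to the default.
  def lookedUpFetchSize(params: Map[String, String]): Int =
    toProperties(params).getProperty("fetchSize", "0").toInt

  // The lower-case key is what actually got stored.
  def storedFetchSize(params: Map[String, String]): Int =
    toProperties(params).getProperty("fetchsize", "0").toInt
}
```

With Map("fetchSize" -> "100"), lookedUpFetchSize returns 0 while storedFetchSize returns 100, which is why the PR lowers the key used in the lookup.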
[GitHub] spark pull request: [SPARK-11474]Options to jdbc load are lower ca...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/9461
[GitHub] spark pull request: [SPARK-11474]Options to jdbc load are lower ca...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9461#issuecomment-153832023 I am closing this pull request as I think I reused my previous branch; I would like to start clean. I will open a new pull request shortly. Sorry for the confusion.
[GitHub] spark pull request: Spark 11474
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9461 Spark 11474

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark_11474 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9461.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9461

commit 75b7f57b42a1f0e88258187f75ab6b36a8706b0f Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-10-09T00:49:50Z add write.mode for insertIntoJDBC when the parm overwrite is false

commit 684f4e9388f94d5102554a02c9497f23c40ef139 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-10-24T06:13:00Z Merge remote branch 'upstream/master' into spark8386

commit 495faae98d4f8d31e68d9736ef6f22aee2c514a6 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-04T05:58:34Z Merge remote branch 'upstream/master' into spark8386

commit becdb71bc0cfed3a71ec100fbcfa922728a40af8 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-04T07:28:30Z change fetchSize to fetchsize because the properties map has lower-case keys
[GitHub] spark pull request: [SPARK-8386] [SQL]add write.mode for insertInt...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9042#issuecomment-148122772 I looked at the error log; it failed at fetching changes from the remote Git repository. Does this mean something is wrong with my pull request and I need to open a new one? Sorry, I am new and not familiar with this yet. Thanks for your help!
[GitHub] spark pull request: [SPARK-8386] [SQL]add write.mode for insertInt...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9042 [SPARK-8386] [SQL]add write.mode for insertIntoJDBC when the parm overwrite is false

The fix is for JIRA https://issues.apache.org/jira/browse/SPARK-8386

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark8386 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9042.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9042

commit 75b7f57b42a1f0e88258187f75ab6b36a8706b0f Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-10-09T00:49:50Z add write.mode for insertIntoJDBC when the parm overwrite is false
[GitHub] spark pull request: [SPARK-8386] [SQL]add write.mode for insertInt...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9042#issuecomment-147592598 Has the test result come back yet?
[GitHub] spark pull request: SPARK-11778:parse table name before it is pass...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9773#issuecomment-157490181 hiveContext.table("db_name.table") works, but hiveContext.read.table("db_name.table") throws an org.apache.spark.sql.catalyst.analysis.NoSuchTableException.

In hiveContext.table("db_name.table"), the call goes through SqlParser.parseTableIdentifier(tableName), so the table name "db_name.table" is resolved to 'db_name'.'table'. Later, when building the qualified table name, the database name resolves to db_name and the table name to table, so the qualified table name is obtained correctly.

In hiveContext.read.table("db_name.table"), the table name does not go through the SQLParser, so "db_name.table" remains as is. Later, when building the qualified table name, the database name resolves to default and the table name to "db_name.table", so the qualified table name cannot be obtained correctly.
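The two code paths described above can be sketched with a toy identifier parser. The names here are hypothetical and much simpler than Spark's real SqlParser.parseTableIdentifier (which also handles quoting and escaping); the point is only the split-versus-no-split difference.

```scala
object TableIdent {
  // What parseTableIdentifier effectively does for a simple dotted name:
  // split into (database, table), defaulting the database when absent.
  def parsed(name: String): (String, String) = name.split('.') match {
    case Array(db, tbl) => (db, tbl)
    case Array(tbl)     => ("default", tbl)
    case _              => sys.error(s"unexpected table name: $name")
  }

  // What the unparsed read.table path effectively does: the whole string
  // is treated as a table name in the default database.
  def unparsed(name: String): (String, String) = ("default", name)
}
```

Looking up ("default", "db_name.table") in the metastore finds nothing, hence the NoSuchTableException.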
[GitHub] spark pull request: SPARK-11778:parse table name before it is pass...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9773 SPARK-11778:parse table name before it is passed to lookupRelation You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-11778 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9773.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9773 commit 02300e6dd64b62ea6e4b1143a2018bbd2698cdc6 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-17T19:42:37Z SPARK-11778:parse table name before it is passed to lookupRelation
[GitHub] spark pull request: [SPARK-11788][SQL] surround timestamp/date val...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9872#issuecomment-160182654 Fixed title and scala style problem. Thanks!
[GitHub] spark pull request: [SPARK-12088][SQL]check connection.isClosed be...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10095 [SPARK-12088][SQL]check connection.isClosed before calling connection…

The Java spec for java.sql.Connection has

boolean getAutoCommit() throws SQLException

Throws: SQLException - if a database access error occurs or this method is called on a closed connection

So if conn.getAutoCommit is called on a closed connection, a SQLException will be thrown. Even though the code catches the SQLException and the program can continue, I think we should check conn.isClosed before calling conn.getAutoCommit to avoid the unnecessary SQLException.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12088 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10095.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10095

commit 0efeb958ed118aba3b6c6d2649c0cec58c43131d Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-12-01T20:41:06Z [SPARK-12088][SQL]check connection.isClosed before calling connection.getAutoCommit in JDBCRDD
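The guard can be demonstrated without a database. This is a self-contained sketch, not the PR's actual diff: FakeConn is an invented stand-in for java.sql.Connection that reproduces the closed-connection behavior the spec describes, and closeQuietly models JDBCRDD's close handler.

```scala
import java.sql.SQLException

object GuardedClose {
  // Minimal stand-in for java.sql.Connection's closed-connection behavior.
  class FakeConn(private var closed: Boolean) {
    def isClosed: Boolean = closed
    def getAutoCommit: Boolean =
      if (closed) throw new SQLException("The object is already closed") else true
    def close(): Unit = closed = true
  }

  // Returns true when no SQLException had to be swallowed while closing.
  def closeQuietly(conn: FakeConn): Boolean =
    try {
      // Guard: only touch getAutoCommit on a connection known to be open.
      if (!conn.isClosed) {
        val autoCommit = conn.getAutoCommit
        if (!autoCommit) { /* the real code would conn.commit() here */ }
        conn.close()
      }
      true
    } catch {
      case _: SQLException => false
    }
}
```

Without the isClosed check, calling closeQuietly on an already-closed connection would hit the SQLException branch; with it, the method is a no-op for closed connections.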
[GitHub] spark pull request: [SPARK-12088][SQL]check connection.isClosed be...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10095#issuecomment-161219510 I guess I don't need to provide a regression test. Currently, JDBCSuite logs many warnings like the following; they will be gone after the fix.

13:11:51.738 WARN org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD: Exception closing connection
org.h2.jdbc.JdbcSQLException: The object is already closed [90007-183]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
    at org.h2.message.DbException.get(DbException.java:179)
    at org.h2.message.DbException.get(DbException.java:155)
    at org.h2.message.DbException.get(DbException.java:144)
    at org.h2.jdbc.JdbcConnection.checkClosed(JdbcConnection.java:1466)
    at org.h2.jdbc.JdbcConnection.checkClosed(JdbcConnection.java:1444)
    at org.h2.jdbc.JdbcConnection.getAutoCommit(JdbcConnection.java:447)
[GitHub] spark pull request: [SPARK-12391][SQL]JDBC OR operator push down
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10347 [SPARK-12391][SQL]JDBC OR operator push down You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12391 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10347.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10347 commit a7c4a1ac41db4f5afa8f6e990ca385f097f540ea Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-17T06:15:44Z [SPARK-12391][SQL]JDBC OR operator push down
[GitHub] spark pull request: [SPARK-12387][SQL]JDBC IN operator push down
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10345 [SPARK-12387][SQL]JDBC IN operator push down

This will push down the SQL IN operator, such as the following, to the JDBC data source:

SELECT column_name(s) FROM table_name WHERE column_name IN (value1, value2, ...)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12387 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10345.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10345

commit bb83a9ca666184e9a6f246f295d20f279bbd8a79 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-12-17T05:32:07Z [SPARK-12387][SQL]JDBC IN operator push down
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10262#issuecomment-164680145 @andrewor14 Thanks a lot for your comment. I will change to what you suggested. In the same method, the DecimalConversion case has similar code. Shall I change that one too?
[GitHub] spark pull request: [SPARK-12387][SQL]JDBC IN operator push down
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/10345#discussion_r47965769

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -269,6 +269,13 @@ private[sql] class JDBCRDD(
     case stringValue: String => s"'${escapeSql(stringValue)}'"
     case timestampValue: Timestamp => "'" + timestampValue + "'"
     case dateValue: Date => "'" + dateValue + "'"
+    case objectValue: Array[Object] => {
+      val str = objectValue.map {
+        case string: String => "'" + string + "'"
--- End diff --

@bomeng Thanks a lot for your comment. I will change.
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10262 [SPARK-12270][SQL]remove empty space after getString from database

{code}
conn.prepareStatement("create table people (name char(32))").executeUpdate()
conn.prepareStatement("insert into people values ('fred')").executeUpdate()
sql(
  s"""
    |CREATE TEMPORARY TABLE foobar
    |USING org.apache.spark.sql.jdbc
    |OPTIONS (url '$url', dbtable 'PEOPLE', user 'testuser', password 'testpassword')
  """.stripMargin.replaceAll("\n", " "))
val df = sqlContext.sql("SELECT * FROM foobar WHERE NAME = 'fred'")
{code}

I expect to see one row with content 'fred' in df. However, no row is returned. If I change the data type to varchar(32) in the create table DDL, I get the row back correctly.

The cause of the problem is that DB2 defines char(num) as a fixed-length character string. So with char(32), when running "SELECT * FROM foobar WHERE NAME = 'fred'", DB2 returns 'fred' padded with 28 spaces; Spark does not treat 'fred' padded with spaces as equal to 'fred', so df has no rows. With varchar(32), DB2 returns just 'fred' for the select statement and df has the right row. In order to make DB2 char(num) work with Spark, I suggest changing the Spark code to trim the trailing spaces after getting the data from the database.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12270 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10262.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10262 commit 63a3d835119ace22a359af03c267451aa68cc570 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-08T07:41:25Z [SPARK-12270][SQL]remove empty space after getString from database
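The padding behavior described in the PR can be reproduced without DB2. The helpers below are illustrative only (asChar32 imitates what a DB2 CHAR(32) column hands back through ResultSet.getString, per the description above; they are not Spark's actual code):

```scala
object CharPadding {
  // A DB2 CHAR(32) value comes back right-padded with spaces to length 32.
  def asChar32(s: String): String = s + " " * (32 - s.length)

  // The proposed fix: trim trailing spaces after reading the value.
  def trimmedRead(s: String): String = asChar32(s).replaceAll("\\s+$", "")
}
```

asChar32("fred") has length 32 and no longer equals "fred", which is why the equality filter matched nothing; trimmedRead("fred") restores the expected equality.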
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10505#issuecomment-168901512 @rxin I am not sure if my approach is OK. Could you please take a quick look when you have time and let me know what you think? Thank you very much for your help!
[GitHub] spark pull request: [SPARK-12391][SQL]JDBC OR operator push down
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10347
[GitHub] spark pull request: [SPARK-12409][SQL]JDBC AND operator push down
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10369
[GitHub] spark pull request: [SPARK-12409][SQL]JDBC AND operator push down
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10369#issuecomment-165850180 Close for now. Will put the filter changes in one PR so it's easier to merge. Sorry for the inconvenience.
[GitHub] spark pull request: [SPARK-12409][SQL]JDBC AND operator push down
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10369#issuecomment-165848876 Sorry for the trouble. I should have put everything in one PR. Will do that now.
[GitHub] spark pull request: [SPARK-12387][SQL]JDBC IN operator push down
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10345#issuecomment-165849528 Close for now. Will put the filter changes in one PR so it's easier to merge.
[GitHub] spark pull request: [SPARK-12387][SQL]JDBC IN operator push down
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10345
[GitHub] spark pull request: [SPARK-12391][SQL]JDBC OR operator push down
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10347#issuecomment-165849800 Close for now. Will put the filter changes in one PR so it's easier to merge. Sorry for the inconvenience.
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10262#issuecomment-166033287 @yhuai JDBCSuite uses the H2 database. It seems that for the char(n) data type, either H2 doesn't pad, or the H2 JDBC driver already trims the trailing spaces in ResultSet.getString, so H2 doesn't have this problem. Showing the problem would require DB2 and the DB2 JDBC driver (I guess Oracle has the same problem too), but I don't think the test system has the DB2 JDBC driver. So I am guessing there may be no need to add the test?
[GitHub] spark pull request: [SPARK-12409][SQL]add filter (IN, AND, OR)
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10386 [SPARK-12409][SQL]add filter (IN, AND, OR)

Push the IN, AND, and OR filters down to the JDBC layer.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark_new_filter Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10386.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10386

commit e0b5e7c80f7b59944309aaa46f86ed8d4e62cd4a Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-18T20:11:36Z [SPARK-12409][SQL]add filter (IN, AND, OR)
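A hedged sketch of what compiling these filters into a WHERE fragment looks like. The Filter ADT below is a simplified, invented stand-in for org.apache.spark.sql.sources.Filter, not the PR's actual code; note the parentheses around both sides of AND/OR, per the review discussion on the earlier OR pull request.

```scala
object FilterCompile {
  sealed trait Filter
  case class EqualTo(attr: String, value: Any) extends Filter
  case class In(attr: String, values: Seq[Any]) extends Filter
  case class And(left: Filter, right: Filter) extends Filter
  case class Or(left: Filter, right: Filter) extends Filter

  private def compileValue(value: Any): String = value match {
    case s: String => "'" + s.replace("'", "''") + "'"  // naive quoting
    case other     => other.toString
  }

  def compile(f: Filter): String = f match {
    case EqualTo(attr, v) => s"$attr = ${compileValue(v)}"
    case In(attr, vs)     => s"$attr IN (${vs.map(compileValue).mkString(", ")})"
    // Parenthesize both sides so operator precedence survives the round trip.
    case And(l, r)        => s"(${compile(l)}) AND (${compile(r)})"
    case Or(l, r)         => s"(${compile(l)}) OR (${compile(r)})"
  }
}
```

For example, the filter tree from the OR discussion, Or(EqualTo(THEID,1), And(EqualTo(NAME,mary), EqualTo(THEID,2))), compiles to (THEID = 1) OR ((NAME = 'mary') AND (THEID = 2)).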
[GitHub] spark pull request: [SPARK-12459][SQL]add ExpressionDescription to...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10460 [SPARK-12459][SQL]add ExpressionDescription to string functions You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12459 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10460.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10460 commit f88386e34a0231ccb1fbfd427b0783e3364ce83e Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-23T22:31:02Z [SPARK-12459][SQL]add ExpressionDescription to string functions
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10505 [SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC layer

For an arithmetic operator in a WHERE clause, such as

select * from table where c1 + c2 > 10

the predicate c1 + c2 > 10 is currently evaluated at the Spark layer. This change pushes it down to the JDBC layer so it is evaluated in the database.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark12506 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10505.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10505

commit b10310898953e8830898c0ac95c861c8f3c88fa5 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-27T20:45:52Z [SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC layer
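The idea can be sketched with a tiny expression tree. The Expr ADT here is invented for illustration (Spark's Catalyst expressions are far richer): render the arithmetic predicate as a SQL string so the database, not Spark, evaluates it.

```scala
object ArithPushdown {
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr

  // Compile a predicate like c1 + c2 > 10 into a WHERE-clause fragment.
  def toSql(e: Expr): String = e match {
    case Col(n)            => n
    case Lit(v)            => v.toString
    case Add(l, r)         => s"(${toSql(l)} + ${toSql(r)})"
    case GreaterThan(l, r) => s"${toSql(l)} > ${toSql(r)}"
  }
}
```

Extending the sketch to -, * and / would just add three more node types with the same rendering pattern, which matches the PR's plan to start with + only.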
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10505#issuecomment-167746195 I only added the + operator for now. If the change is accepted, I will also add -, *, and /.
[GitHub] spark pull request: [SPARK-12391][SQL]JDBC OR operator push down
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/10347#discussion_r47948040 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ---
@@ -288,6 +288,8 @@ private[sql] class JDBCRDD(
 case GreaterThanOrEqual(attr, value) => s"$attr >= ${compileValue(value)}"
 case IsNull(attr) => s"$attr IS NULL"
 case IsNotNull(attr) => s"$attr IS NOT NULL"
+case Or(filter1, filter2) =>
+  compileFilter(filter1) + " OR " + compileFilter(filter2)
--- End diff -- @rxin Thank you very much for your comments. I will change the code to case Or(filter1, filter2) => "(" + compileFilter(filter1) + ") OR (" + compileFilter(filter2) + ")"

The reason I didn't add the AND filter along with OR is that I thought the AND filter was not necessary. When I tried a simple test such as "select * from test where THEID = 1 and NAME = 'fred'", the filter passed to the JDBC layer is an Array of filters with elements EqualTo(THEID,1), EqualTo(Name,fred). This can be handled by the current EqualTo filter, with no need to add an AND filter. But today, after more thinking, I found I still need to add AND. For a query such as "SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2", the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))). I will open another JIRA to add AND.

I also looked at other filters: 1. LIKE: for a string column, Spark uses EqualTo, so "WHERE columnName LIKE xxx" is changed to "WHERE columnName = xxx". I am not sure if Spark intentionally changes LIKE to '='; in SQL, LIKE is not exactly the same as '='. 2. BETWEEN is changed to the two filters GreaterThanOrEqual and LessThanOrEqual, so the current code works OK; no need to add anything. 3. NOT BETWEEN is changed to Or(LessThanOrEqual, GreaterThanOrEqual); no need to add anything. I will look more to see if I need to change other filters.
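The parenthesization point in the review above can be sketched with simplified stand-ins for Spark's source filters (hypothetical types, not the actual JDBCRDD code). Wrapping each side of OR/AND in parentheses preserves the nesting of the original filter tree regardless of SQL operator precedence:

```scala
// Hypothetical, simplified filter tree for illustration only
sealed trait Filter
case class EqualTo(attr: String, value: Any) extends Filter
case class Or(left: Filter, right: Filter) extends Filter
case class And(left: Filter, right: Filter) extends Filter

// Parenthesize both sides so nested AND inside OR keeps its intended grouping
def compileFilter(f: Filter): String = f match {
  case EqualTo(attr, value) => value match {
    case s: String => s"$attr = '$s'"
    case v         => s"$attr = $v"
  }
  case Or(l, r)  => s"(${compileFilter(l)}) OR (${compileFilter(r)})"
  case And(l, r) => s"(${compileFilter(l)}) AND (${compileFilter(r)})"
}
```

So the filter Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))) from the example compiles to `(THEID = 1) OR ((NAME = 'mary') AND (THEID = 2))`, which matches the original query's meaning.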
[GitHub] spark pull request: [SPARK-12409][SQL]JDBC AND operator push down
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10369 [SPARK-12409][SQL]JDBC AND operator push down For a simple AND such as select * from test where THEID = 1 AND NAME = 'fred', the filters pushed down to the JDBC layer are EqualTo(THEID,1), EqualTo(Name,fred). These are handled OK by the current code. For a query such as SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2, the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))), so we need to add an And filter in the JDBC layer. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark_12409 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10369.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10369 commit f5cfd1d038954a4b0fd1f502769fbd3140e7d35b Author: Huaxin Gao <huax...@us.ibm.com> Date: 2015-12-17T17:32:24Z [SPARK-12409][SQL]JDBC AND operator push down
[GitHub] spark pull request: [SPARK-11788][SQL]:surround timestamp/date val...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9872#issuecomment-159375392 @JoshRosen Regression test added. Could you please take a look? Thanks a lot for your help!!
[GitHub] spark pull request: [SPARK-11778] [SQL]:parse table name before it...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9773#issuecomment-157915848 Test case added. Could you please take a look? Thanks a lot!!
[GitHub] spark pull request: [SPARK-11788][SQL]:surround timestamp/date val...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9872 [SPARK-11788][SQL]:surround timestamp/date value with quotes You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-11788 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9872.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9872 commit 648ab61193cfd7e1c385823c1af67e39e40b8349 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-20T22:21:20Z [SPARK-11788][SQL]:surround timestamp/date value with quotes
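The underlying fix here is that string, timestamp, and date literals need quotes when compiled into a WHERE clause, while numeric literals do not. A minimal sketch of such a value compiler (a hypothetical helper for illustration, not necessarily Spark's exact code):

```scala
import java.sql.{Date, Timestamp}

// Quote values whose SQL literal form requires quotes; pass numerics through
def compileValue(value: Any): Any = value match {
  case s: String    => s"'$s'"
  case t: Timestamp => s"'$t'"
  case d: Date      => s"'$d'"
  case other        => other
}
```

With this, a pushed-down filter on a date column renders as, e.g., `d >= '2015-11-20'` rather than an unquoted literal that the database would reject.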
[GitHub] spark pull request: [SPARK-11788][SQL]:surround timestamp/date val...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9872#issuecomment-158545092 Sure. Will add a test.
[GitHub] spark pull request: SPARK-11778:parse table name before it is pass...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9773#issuecomment-157865029 I will add a test case.
[GitHub] spark pull request: [SPARK-11778][SQL]:add regression test
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/9890 [SPARK-11778][SQL]:add regression test Add a regression test for SPARK-11778. @marmbrus Could you please take a look? Thank you very much!! You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-11778-regression-test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9890.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9890 commit cb04f55d79560393c454361807e60dbdb94640c4 Author: Huaxin Gao <huax...@oc0558782468.ibm.com> Date: 2015-11-22T03:14:02Z [SPARK-11778][SQL]:add regression test
[GitHub] spark pull request: [SPARK-11778][SQL]:add regression test
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/9890#issuecomment-159143868 I couldn't figure out why the tests failed, but it doesn't seem to me that my new test suite caused the failure. Is it OK to have a retest? Thanks a lot!!
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10262#issuecomment-170701187 @yhuai Sorry for the late reply. I waited for my coworker Luciano to come back from vacation today to check with him about his DB2 docker test status. He has a PR to add DB2 docker test. https://github.com/apache/spark/pull/9893. His PR is still pending because of the DB2 jdbc driver dependency. After his PR is merged, I will add a test in his DB2 docker test.
[GitHub] spark issue #13287: [SPARK-15491][SQL]fix assertion failure for JDBC DataFra...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13287 @rxin gentle ping
[GitHub] spark issue #13492: [SPARK-15749][SQL]make the error message more meaningful
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13492 retest please. Thanks!
[GitHub] spark pull request #13492: [SPARK-15749][SQL]make the error message more mea...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/13492 [SPARK-15749][SQL]make the error message more meaningful ## What changes were proposed in this pull request? For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using
```
sqlContext.sql("insert into test1 values ('abc', 'def', 1)")
```
I got the error message
```
Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] JDBCRelation(test1) requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.
```
The error message is a little confusing: my simple insert statement doesn't have a SELECT clause. I will change the error message to a more general one
```
Exception in thread "main" java.lang.RuntimeException: Relation[C1#0,C2#1] JDBCRelation(test1) requires that the data to be inserted have the same number of columns as the target table.
```
## How was this patch tested? I tested the patch with a simple unit test, but it's a very trivial change and I don't think I need to check in any test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-15749 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13492.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13492 commit 11768e8ec799a8b51bbf90ff9943b2b0a2808e72 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-06-03T06:15:32Z [SPARK-15749][SQL]make the error message more meaningful
[GitHub] spark pull request: [SPARK-15491][SQL]fix assertion failure for JD...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/13287#issuecomment-221617823 @rxin Sorry, I didn't notice your comment until this morning. Reformatted the description. Thanks!
[GitHub] spark issue #13492: [SPARK-15749][SQL]make the error message more meaningful
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13492 @andrewor14 I am not sure why the test failed, but it doesn't seem to be related to my change. Could you please start another test? Thanks a lot!!
[GitHub] spark pull request: [SPARK-15491][SQL]fix assertion failure for JD...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/13287#issuecomment-21356 @rxin Thanks for your comment. When I wrote the test, I found there is one more place I need to change. So there are two problem cases: 1. a TreeNode that has two layers of constructor parameters; 2. a TreeNode with a constructor parameter that itself contains two layers of constructor parameters. I have both cases in my test. Hopefully it is what you want. Thanks!
[GitHub] spark issue #13492: [SPARK-15749][SQL]make the error message more meaningful
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13492 @andrewor14 Thanks a lot for merging in the change.
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10750#issuecomment-171498490 @viirya I changed the code based on your suggestion. Could you please review again? Thanks a lot for your help!!
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/10750 [SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC layer For an arithmetic operator in a WHERE clause, such as select * from table where c1 + c2 > 10, the predicate c1 + c2 > 10 is currently evaluated at the Spark layer. This change pushes it down to the JDBC layer so it is evaluated in the database. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark__12506 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10750.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10750 commit 944bee6fe204e1cb12f3c5c57d6e09ad580905bd Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-01-13T14:04:20Z [SPARK-12506][SQL]push down WHERE clause arithmetic operator to JDBC layer
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10505
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/10505#discussion_r49678184 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -485,6 +486,74 @@ private[sql] object DataSourceStrategy extends Strategy with Logging { }
+  /**
+   * Convert add predicate such as C1 + C2 + C3 to string
+   */
+  private[this] def getAddString(predicate: Expression): String = {
+    predicate match {
+      case expressions.Add(left, right) =>
+        val leftString = left match {
+          case a: Attribute => a.name
+          case add: Add => getAddString(add)
+          case _ => None
+        }
+        val rightString = right match {
+          case a: Attribute => a.name
+          case add: Add => getAddString(add)
+          case _ => None
+        }
+        leftString + " + " + rightString
+    }
+  }
+
+  /**
+   * Tries to translate a Catalyst [[Expression]] into data source [[Filter]].
--- End diff -- @viirya Thank you very much for your comments. I changed the code based on your suggestion. Since the new code is quite different from the original, I will close this PR and submit a new one.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/11104 [SPARK-13186][Streaming]Migrate away from SynchronizedMap The SynchronizedMap trait in package mutable is deprecated: synchronization via traits is deprecated as it is inherently unreliable. Change to java.util.concurrent.ConcurrentHashMap instead. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark_13186 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11104.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11104 commit 24e75ae7070f82ec847c144a5ba4940736d95503 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-02-02T07:13:50Z [SPARK-13186][Streaming]Migrate away from SynchronizedMap
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-180883943 @holdenk Could you please review? Thanks!!
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11104#discussion_r52126950 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala ---
@@ -30,6 +30,9 @@
 import org.apache.spark.{SparkConf, SparkFunSuite}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.{Milliseconds, StreamingContext}
+import java.util.concurrent.ConcurrentHashMap
+import scala.collection.convert.decorateAsScala._
--- End diff -- @holdenk Thanks for your comments. Yes, that's why I have decorateAsScala there. I am working on changing the code to use the Java API for +=, put and getOrElseUpdate. Do we also need a concurrency guarantee for ++ and --?
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-181664339 Sorry for the line length problem. Fixed.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11104#discussion_r52228031 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala ---
@@ -65,12 +67,14 @@ class KafkaStreamSuite extends SparkFunSuite with Eventually with BeforeAndAfter
 val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
   ssc, kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)
-val result = new mutable.HashMap[String, Long]() with mutable.SynchronizedMap[String, Long]
+val result = new ConcurrentHashMap[String, Long].asScala
 stream.map(_._2).countByValue().foreachRDD { r =>
   val ret = r.collect()
   ret.toMap.foreach { kv =>
-    val count = result.getOrElseUpdate(kv._1, 0) + kv._2
-    result.put(kv._1, count)
+    result.synchronized {
+      val count = result.getOrElseUpdate(kv._1, 0) + kv._2
--- End diff -- @holdenk Thanks for your quick reply. I initially changed val count = result.getOrElseUpdate(kv._1, 0) + kv._2 to result.putIfAbsent(kv._1, 0); val count = result.get(kv._1) + kv._2, but the test failed for me. I guess a different thread can come in between the two lines, so concurrency is no longer guaranteed. So I used a synchronized block instead.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-181528520 @holdenk Could you please review one more time? I changed to the Java API except for the getOrElseUpdate in KafkaStreamSuite.scala. I can't find a Java equivalent that can be done in one line, so I used a synchronized block. Thank you very much for your help!!
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-183434809 @srowen Will do. I have my local branch messed up; if I can't figure out how to fix it, I will close this PR and submit a new one. Also, one of the Python streaming tests failed with java.net.BindException. I am still trying to figure out the problem.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-181633831 @zsxwing Thanks for the comments. I didn't see a PR for removing SynchronizedSet. I will work on this.
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11104#discussion_r52250939 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala ---
@@ -65,12 +67,14 @@ class KafkaStreamSuite extends SparkFunSuite with Eventually with BeforeAndAfter
 val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
   ssc, kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)
-val result = new mutable.HashMap[String, Long]() with mutable.SynchronizedMap[String, Long]
+val result = new ConcurrentHashMap[String, Long].asScala
 stream.map(_._2).countByValue().foreachRDD { r =>
   val ret = r.collect()
   ret.toMap.foreach { kv =>
-    val count = result.getOrElseUpdate(kv._1, 0) + kv._2
-    result.put(kv._1, count)
+    result.synchronized {
+      val count = result.getOrElseUpdate(kv._1, 0) + kv._2
--- End diff -- @holdenk @zsxwing I tried val count = result.putIfAbsent(kv._1, 0) + kv._2, but the test failed for me. So I will change to mutable.HashMap and put it in a synchronized block. Is it OK to use mutable.HashMap and a synchronized block in this file only, but use java.util.concurrent.ConcurrentHashMap in the other files (StreamingListenerSuite, KinesisStreamTests and FileInputDStream)? Or is it better to use mutable.HashMap and a synchronized block for all the files that have SynchronizedMap?
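For what it's worth, since Java 8 `ConcurrentHashMap.merge` provides a one-line atomic read-modify-write for exactly this counting pattern, avoiding both `getOrElseUpdate` and a synchronized block. A small sketch (an illustration of the API, not the PR's actual code):

```scala
import java.util.concurrent.ConcurrentHashMap

// Per-key counter: merge inserts the value if the key is absent, otherwise
// applies the remapping function, all atomically per key, so no external
// synchronization is needed even with concurrent writers.
val result = new ConcurrentHashMap[String, Long]()

def addCount(key: String, delta: Long): Unit =
  result.merge(key, delta, (a: Long, b: Long) => a + b)
```

Usage: after `addCount("fred", 2L)` and `addCount("fred", 3L)` from any mix of threads, `result.get("fred")` is 5. The failed `putIfAbsent` + `get` attempt quoted above loses updates precisely because it splits the read-modify-write across two non-atomic steps.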
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-181648992 Fixed the problems. Thank you all very much for your help!!
[GitHub] spark pull request: [SPARK-12506][SPARK-12126][SQL]use CatalystSca...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/11005 [SPARK-12506][SPARK-12126][SQL]use CatalystScan for JDBCRelation As suggested here (https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526), I will change JDBCRelation to implement CatalystScan, and then directly access Catalyst expressions in JDBCRDD. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12126 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11005.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11005 commit 0bdfea86fc0671bf636bbed3174baa417de6367e Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-01-31T04:01:12Z [SPARK-12506][SPARK-12126][SQL]use CatalystScan for JDBCRelation
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10750
[GitHub] spark pull request: [SPARK-12506][SQL]push down WHERE clause arith...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10750#issuecomment-178119526 @viirya @HyukjinKwon @rxin Thank you all very much for your comments. I will change JDBCRelation to implement CatalystScan, and then directly access Catalyst expressions in JDBCRDD. I will close this PR and submit a new one.
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11250#issuecomment-187305582 @srowen @holdenk Thank you very much for your help!!
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/11104
[GitHub] spark pull request: [SPARK-13186][Streaming]Migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11104#issuecomment-185604993 @srowen @holdenk I will close this PR and submit a new one. Thanks!
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/11250 [SPARK-13186][Streaming]migrate away from SynchronizedMap trait SynchronizedMap in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Change to java.util.concurrent.ConcurrentHashMap instead. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark__13186 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11250.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11250 commit ff5e52ae100408e81de7f138b876ec8a47aaafa1 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-02-18T04:25:41Z [SPARK-13186][Streaming]migrate away from SynchronizedMap
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/11250#issuecomment-185608162 @srowen @holdenk Could you please take a look at this PR? I ran the Python streaming tests cleanly locally before I submitted the PR. Thanks a lot!!
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11250#discussion_r53430913 --- Diff: extras/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala --- @@ -241,13 +243,13 @@ abstract class KinesisStreamTests(aggregateTestData: Boolean) extends KinesisFun kinesisStream.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => { val kRdd = rdd.asInstanceOf[KinesisBackedBlockRDD[Array[Byte]]] val data = rdd.map { bytes => new String(bytes).toInt }.collect().toSeq - collectedData(time) = (kRdd.arrayOfseqNumberRanges, data) + collectedData.put(time, (kRdd.arrayOfseqNumberRanges, data)) }) ssc.remember(Minutes(60)) // remember all the batches so that they are all saved in checkpoint ssc.start() -def numBatchesWithData: Int = collectedData.count(_._2._2.nonEmpty) +def numBatchesWithData: Int = collectedData.asScala.count(_._2._2.nonEmpty) --- End diff -- @srowen Thanks for your comment. For the 5 files I changed, I will remove the usage of Java ConcurrentHashMap, and use mutable.HashMap instead. I will wrap every mutable.HashMap operation in a synchronized block.
[GitHub] spark pull request: [SPARK-13186][Streaming]migrate away from Sync...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/11250#discussion_r53563513 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -163,8 +166,11 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( /** Clear the old time-to-files mappings along with old RDDs */ protected[streaming] override def clearMetadata(time: Time) { super.clearMetadata(time) -val oldFiles = batchTimeToSelectedFiles.filter(_._1 < (time - rememberDuration)) -batchTimeToSelectedFiles --= oldFiles.keys +val oldFiles = batchTimeToSelectedFiles.synchronized +{ + batchTimeToSelectedFiles.filter(_._1 < (time - rememberDuration)) --- End diff -- @srowen Thank you very much for your comments. I changed the code to synchronize the whole block. The extra parentheses at line 153 seem to be required; removing them causes problems.
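[Editor's note] The point about synchronizing the whole block generalizes: filtering out old entries and then removing them is a compound operation, and both steps must run under the same lock or another thread could mutate the map in between. A hedged Java sketch with illustrative names (not the Spark code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SyncCompound {
    public static void main(String[] args) {
        // Stand-in for batchTimeToSelectedFiles: batch time -> selected file names.
        Map<Long, List<String>> batchTimeToSelectedFiles =
            Collections.synchronizedMap(new HashMap<>());
        batchTimeToSelectedFiles.put(100L, Collections.singletonList("a"));
        batchTimeToSelectedFiles.put(200L, Collections.singletonList("b"));

        long cutoff = 150L; // stand-in for (time - rememberDuration)
        // Filter-then-remove must hold the map's lock for the whole sequence;
        // synchronizedMap only makes individual calls atomic, not the pair.
        synchronized (batchTimeToSelectedFiles) {
            List<Long> oldKeys = new ArrayList<>();
            for (Long t : batchTimeToSelectedFiles.keySet()) {
                if (t < cutoff) oldKeys.add(t);
            }
            batchTimeToSelectedFiles.keySet().removeAll(oldKeys);
        }
        System.out.println(batchTimeToSelectedFiles.keySet()); // [200]
    }
}
```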
[GitHub] spark pull request: [SPARK-12270][SQL]remove empty space after get...
Github user huaxingao commented on the pull request: https://github.com/apache/spark/pull/10262#issuecomment-214437787 @HyukjinKwon I will continue working on this and finish the work this week.
[GitHub] spark pull request: [SPARK-15491][SQL]fix assertion failure for JD...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/13287 [SPARK-15491][SQL]fix assertion failure for JDBC DataFrame to JSON ## What changes were proposed in this pull request? in TreeNode.scala parseToJson, it has case p: Product => try { val fieldNames = getConstructorParameterNames(p.getClass) val fieldValues = p.productIterator.toSeq assert(fieldNames.length == fieldValues.length) ("product-class" -> JString(p.getClass.getName)) :: fieldNames.zip(fieldValues).map { case (name, value) => name -> parseToJson(value) }.toList For a class that has two levels of constructor parameters, for example, private[sql] case class JDBCRelation( url: String, table: String, parts: Array[Partition], properties: Properties = new Properties())(@transient val sparkSession: SparkSession) val fieldValues = p.productIterator.toSeq will only get the values for the first level of constructor parameters, in this case, 4 parameters, but val fieldNames = getConstructorParameterNames(p.getClass) will get all 5 parameters, so assert(fieldNames.length == fieldValues.length) will fail. I will change the code to make fieldValues include all the parameters. ## How was this patch tested? I have a unit test. Will add it soon. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-15491 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13287.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13287 commit 8307b348b62f0653cccd04cd543cf77281bc4b85 Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-05-25T00:11:53Z [SPARK-15491][SQL]fix assertion failure for JDBC DataFrame to JSON
[GitHub] spark pull request #14535: [SPARK-16946][SQL]throw Exception if saveAsTable[...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/14535 [SPARK-16946][SQL]throw Exception if saveAsTable[apend] has different… ## What changes were proposed in this pull request? In HiveContext, if saveAsTable[append] has a different number of columns, Spark will throw an Exception. e.g. ``` test("saveAsTable[append]: too many columns") { withTable("saveAsTable_too_many_columns") { Seq((1, 2)).toDF("i", "j").write.saveAsTable("saveAsTable_too_many_columns") val e = intercept[AnalysisException] { Seq((3, 4, 5)).toDF("i", "j", "k").write.mode("append").saveAsTable("saveAsTable_too_many_columns") } assert(e.getMessage.contains("doesn't match")) } } ``` However, in SparkSession or SQLContext, if you use the above code example, the extra column in the append data will be removed silently without any warning or Exception. The table becomes ``` i j 3 4 1 2 ``` I changed the code to throw an Exception if saveAsTable[append] has a different number of columns ## How was this patch tested? Add unit tests to test the patch. … number of columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-16946 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14535.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14535 commit 52e7b66e1ff335566af92c795c1679c11e0ab8cb Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-08-08T06:06:58Z [SPARK-16946][SQL]throw Exception if saveAsTable[apend] has different number of columns
[GitHub] spark pull request #10262: [SPARK-12270][SQL]remove empty space after getStr...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/10262
[GitHub] spark pull request #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/16175#discussion_r91345742 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -1110,6 +1110,16 @@ class DatasetSuite extends QueryTest with SharedSQLContext { } assert(e.getMessage.contains("Cannot create encoder for Option of Product type")) } + + + test ("SPARK-17460: If the sizeInBytes in Statistics exceeds the limit of an Int, " + --- End diff -- Thanks for the comment. Changed.
[GitHub] spark issue #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/16175 @gatorsmile Thanks a lot for reviewing this. Sorry I just saw your last comment after I pushed the change. Will make more changes for other potential overflow issues.
[GitHub] spark issue #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/16175 @gatorsmile Could you please take a look when you have time? Thanks a lot!!
[GitHub] spark pull request #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/16175 [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in canBroadcast ## What changes were proposed in this pull request? 1. In SparkStrategies.canBroadcast, I will add the check plan.statistics.sizeInBytes >= 0 2. In LocalRelations.statistics, when calculating the statistics, I will change the size to BigInt so it won't overflow. ## How was this patch tested? I will add a test case to make sure the statistics.sizeInBytes won't overflow. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-17460 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16175.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16175 commit 9ad6725975e36e513ce89f54582409bd3e9d2cfe Author: Huaxin Gao <huax...@us.ibm.com> Date: 2016-12-06T18:22:38Z [SPARK-17460][SQL]check if statistics.sizeInBytes >=0 in canBroadcast
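[Editor's note] The overflow this PR guards against is ordinary fixed-width integer wraparound: a product of two large sizes can come out negative, and a negative size would then slip under any `<= autoBroadcastJoinThreshold` check. A small Java illustration with assumed values (not Spark's actual statistics code), contrasting `long` arithmetic with an arbitrary-precision `BigInteger`, the analogue of the `BigInt` used in the fix:

```java
import java.math.BigInteger;

public class SizeOverflow {
    public static void main(String[] args) {
        long rowSize = Long.MAX_VALUE / 4;  // hypothetical per-row size estimate
        long numRows = 8;                   // hypothetical row count
        long overflowed = rowSize * numRows; // wraps around to a negative value
        BigInteger safe = BigInteger.valueOf(rowSize)
                .multiply(BigInteger.valueOf(numRows)); // exact, never wraps
        System.out.println(overflowed < 0); // true
        System.out.println(safe.signum());  // 1 (positive)
    }
}
```

This is why the fix pairs the `sizeInBytes >= 0` guard with moving the computation itself to `BigInt`: the guard catches values that already wrapped, and `BigInt` prevents the wrap in the first place.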
[GitHub] spark pull request #16175: [SPARK-17460][SQL]check if statistics.sizeInBytes...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/16175#discussion_r91765482 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -115,7 +115,8 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { */ private def canBroadcast(plan: LogicalPlan): Boolean = { plan.statistics.isBroadcastable || -plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold +(plan.statistics.sizeInBytes >= 0 && --- End diff -- @cloud-fan @gatorsmile Thank you all for your comments. I will change defaultSizeInBytes based on the above comments. It seems all the other uses of Statistics.sizeInBytes are safe and will not cause overflow.
[GitHub] spark pull request #14535: [SPARK-16946][SQL]throw Exception if saveAsTable[...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/14535
[GitHub] spark issue #14535: [SPARK-16946][SQL]throw Exception if saveAsTable[apend] ...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/14535 @gatorsmile Sorry for the late response. I just came back from China. This is not an issue any more. I will close this PR.
[GitHub] spark issue #13287: [SPARK-15491][SQL]fix assertion failure for JDBC DataFra...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/13287 @gatorsmile Sorry for the late response. I just came back from China. For a class that has two levels of constructor parameters, e.g. private[sql] case class JDBCRelation( url: String, table: String, parts: Array[Partition], properties: Properties = new Properties())(@transient val sparkSession: SparkSession) toJson will have a problem. The current code doesn't have this problem any more because in the new method TreeNode.shouldConvertToJson, JDBCRelation returns false and is not allowed to convert to JSON. However, it seems to me that the root problem is not fixed. In the following cases: 1. TreeNode that has two layers of constructor parameters. 2. TreeNode has a constructor parameter that contains two layers of constructor parameters. And if the TreeNode can be converted to JSON, the Exception will still occur.
[GitHub] spark pull request #13287: [SPARK-15491][SQL]fix assertion failure for JDBC ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/13287#discussion_r123368171 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala --- @@ -594,7 +596,9 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { // returns null if the product type doesn't have a primary constructor, e.g. HiveFunctionWrapper case p: Product => try { --- End diff -- Thank you very much for your reply. Shall we also allow the JDBCRelation to be converted to JSON like the following? ``` val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:h2:mem:testdb0;user=testUser;password=testPass", "dbtable" -> "people")).load().queryExecution.logical.toJSON ```
[GitHub] spark pull request #13287: [SPARK-15491][SQL]fix assertion failure for JDBC ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/13287#discussion_r123818099 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala --- @@ -594,7 +596,9 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { // returns null if the product type doesn't have a primary constructor, e.g. HiveFunctionWrapper case p: Product => try { --- End diff -- Thanks. I will close this PR.
[GitHub] spark pull request #13287: [SPARK-15491][SQL]fix assertion failure for JDBC ...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/13287
[GitHub] spark pull request #19256: [SPARK-21338][SQL]implement isCascadingTruncateTa...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/19256 [SPARK-21338][SQL]implement isCascadingTruncateTable() method in AggregatedDialect ## What changes were proposed in this pull request? org.apache.spark.sql.jdbc.JdbcDialect's method: def isCascadingTruncateTable(): Option[Boolean] = None is not overridden in the org.apache.spark.sql.jdbc.AggregatedDialect class. Because of this, when more than one dialect is added, Spark doesn't truncate the table, because isCascadingTruncateTable always returns the default None for the AggregatedDialect. This PR implements isCascadingTruncateTable in the AggregatedDialect class. ## How was this patch tested? In JDBCSuite, inside test("Aggregated dialects"), will add one line to test AggregatedDialect.isCascadingTruncateTable You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-21338 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19256.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19256
[GitHub] spark pull request #19256: [SPARK-21338][SQL]implement isCascadingTruncateTa...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19256#discussion_r139313120 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala --- @@ -41,4 +41,8 @@ private class AggregatedDialect(dialects: List[JdbcDialect]) extends JdbcDialect override def getJDBCType(dt: DataType): Option[JdbcType] = { dialects.flatMap(_.getJDBCType(dt)).headOption } + + override def isCascadingTruncateTable(): Option[Boolean] = { +dialects.flatMap(_.isCascadingTruncateTable).headOption --- End diff -- Both getCatalystType and getJDBCType(dt: DataType) use the first one. Also, in the class header, it has the following: /** * AggregatedDialect can unify multiple dialects into one virtual Dialect. * Dialects are tried in order, and the first dialect that does not return a * neutral element will will. * * @param dialects List of dialects. */ It has a typo, I guess it means "the first dialect that does not return a neutral element will return"
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19256 Thanks @gatorsmile I will change both the implementation and the PR title.
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19256 Thanks @gatorsmile Does the following logic look good to you? ``` if(any dialect's isCascadingTruncateTable returns true) return Some(true) else if (any dialect's isCascadingTruncateTable returns false) return Some(false) else // None of the dialect implements this method, return the superclass default value return None ```
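[Editor's note] The precedence proposed above (any true → Some(true), else any false → Some(false), else None) can be sketched as follows. This is an illustrative Java translation with a hypothetical helper over `Optional<Boolean>` values, not the actual AggregatedDialect code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class CascadePolicy {
    // Hypothetical helper mirroring the logic quoted above: Some(true) if any
    // dialect reports true, else Some(false) if any reports false, else None.
    static Optional<Boolean> isCascadingTruncateTable(List<Optional<Boolean>> dialects) {
        if (dialects.stream().anyMatch(o -> o.orElse(false))) {
            return Optional.of(true);
        }
        if (dialects.stream().anyMatch(o -> o.isPresent() && !o.get())) {
            return Optional.of(false);
        }
        return Optional.empty(); // no dialect expressed an opinion
    }

    public static void main(String[] args) {
        System.out.println(isCascadingTruncateTable(
            Arrays.asList(Optional.empty(), Optional.of(true))));   // Optional[true]
        System.out.println(isCascadingTruncateTable(
            Arrays.asList(Optional.of(false), Optional.empty())));  // Optional[false]
        System.out.println(isCascadingTruncateTable(
            Arrays.asList(Optional.empty(), Optional.empty())));    // Optional.empty
    }
}
```

Note this differs from the original `dialects.flatMap(_.isCascadingTruncateTable).headOption` proposal in the diff: taking the head defers to dialect order, whereas this version treats any reported `true` as decisive regardless of position.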
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19256 @gatorsmile Thanks a lot for your help!
[GitHub] spark pull request #19496: [SPARK-22271][SQL]mean overflows and returns null...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r144732734

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
@@ -2103,4 +2103,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
     testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
     Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
   }
+
+  test("SPARK-22271: mean overflows and returns null for some decimal variables") {
+    val d: BigDecimal = BigDecimal(0.034567890)
--- End diff --

It's not necessary. I will remove. Thanks for your review.
[GitHub] spark pull request #19496: [SPARK-22271][SQL]mean overflows and returns null...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/19496 [SPARK-22271][SQL]mean overflows and returns null for some decimal variables

## What changes were proposed in this pull request?

In Average.scala, it has

```
override lazy val evaluateExpression = child.dataType match {
  case DecimalType.Fixed(p, s) =>
    // increase the precision and scale to prevent precision loss
    val dt = DecimalType.bounded(p + 14, s + 4)
    Cast(Cast(sum, dt) / Cast(count, dt), resultType)
  case _ =>
    Cast(sum, resultType) / Cast(count, resultType)
}

def setChild(newchild: Expression) = {
  child = newchild
}
```

It is possible that Cast(count, dt) will make the precision of the decimal number bigger than 38, and this causes overflow. Since count is an integer and doesn't need a scale, I will cast it using DecimalType.bounded(38, 0).

## How was this patch tested?

In DataFrameSuite, I will add a test case.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/huaxingao/spark spark-22271

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19496.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19496

commit a3437ee4a87d1f51b362adeb20d4fcc264085ba7
Author: Huaxin Gao <huax...@us.ibm.com>
Date: 2017-10-14T04:45:27Z

[SPARK-22271][SQL]mean overflows and returns null for some decimal variables
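Why the widened type overflows can be seen from the precision arithmetic alone. Below is a plain-Scala sketch; `DecimalBounds` is a stand-in for Spark's `DecimalType.bounded`, which (to my understanding) caps precision and scale at Spark's maximum of 38:

```scala
// Stand-in for DecimalType.bounded: precision and scale capped at 38.
object DecimalBounds {
  val MaxPrecision = 38
  def bounded(precision: Int, scale: Int): (Int, Int) =
    (math.min(precision, MaxPrecision), math.min(scale, MaxPrecision))
}

// For an input column of DecimalType(38, 33), the widened divide type is:
val (p, s) = DecimalBounds.bounded(38 + 14, 33 + 4) // capped to (38, 37)

// Only p - s digits remain left of the decimal point, so casting the row
// count (an integral value) to this type overflows once the count needs
// more digits than that. Casting count with bounded(38, 0) instead keeps
// all 38 digits available for the integral part.
val integralDigits = p - s
```

With one integral digit available, a count of 10 rows already cannot be represented, which matches the null mean reported in the JIRA.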
[GitHub] spark pull request #19496: [SPARK-22271][SQL]mean overflows and returns null...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19496#discussion_r145182120

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
@@ -2103,4 +2103,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
     testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
     Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
   }
+
+  test("SPARK-22271: mean overflows and returns null for some decimal variables") {
+    val d = 0.034567890
+    val df = Seq(d, d, d, d, d, d, d, d, d, d).toDF("DecimalCol")
+    val result = df.select('DecimalCol cast DecimalType(38, 33))
+      .select(col("DecimalCol")).describe()
+    val mean = result.select("DecimalCol").where($"summary" === "mean")
+    assert(mean.collect.toSet === Set(Row("0.034567890")))
--- End diff --

@gatorsmile Thanks Sean for your review. I will fix the problems.
[GitHub] spark issue #19496: [SPARK-22271][SQL]mean overflows and returns null for so...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19496 @gatorsmile Thank you very much for your help!
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150450334

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -146,4 +146,172 @@ class QuantileDiscretizerSuite
     val model = discretizer.fit(df)
     assert(model.hasParent)
   }
+
+  test("Multiple Columns: Test observed number of buckets and their sizes match expected values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val datasetSize = 10
+    val numBuckets = 5
+    val data1 = Array.range(1, 11, 1).map(_.toDouble)
+    val data2 = Array.range(1, 20, 2).map(_.toDouble)
+    val data = (0 until 10).map { idx =>
+      (data1(idx), data2(idx))
+    }
+    val df: DataFrame = data.toSeq.toDF("input1", "input2")
+
+    val discretizer = new QuantileDiscretizer()
+      .setInputCols(Array("input1", "input2"))
+      .setOutputCols(Array("result1", "result2"))
+      .setNumBuckets(numBuckets)
+    assert(discretizer.isQuantileDiscretizeMultipleColumns())
+    val result = discretizer.fit(df).transform(df)
+
+    val relativeError = discretizer.getRelativeError
+    val isGoodBucket = udf {
+      (size: Int) => math.abs(size - (datasetSize / numBuckets)) <= (relativeError * datasetSize)
+    }
+
+    for (i <- 1 to 2) {
+      val observedNumBuckets = result.select("result" + i).distinct.count
+      assert(observedNumBuckets === numBuckets,
+        "Observed number of buckets does not equal expected number of buckets.")
+
+      val numGoodBuckets = result.groupBy("result" + i).count.filter(isGoodBucket($"count")).count
+      assert(numGoodBuckets === numBuckets,
+        "Bucket sizes are not within expected relative error tolerance.")
+    }
+  }
+
+  test("Multiple Columns: Test on data with high proportion of duplicated values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val numBuckets = 5
+    val expectedNumBucket = 3
+    val data1 = Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0)
+    val data2 = Array(1.0, 2.0, 3.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 3.0, 1.0, 2.0)
+    val data = (0 until data1.length).map { idx =>
+      (data1(idx), data2(idx))
+    }
--- End diff --

Will change to data1.zip(data2).
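The suggested change can be checked with a tiny self-contained snippet: index-based pairing and `zip` produce the same sequence of tuples.

```scala
// Same arrays as the first test case in the quoted diff.
val data1 = Array.range(1, 11, 1).map(_.toDouble) // 1.0 .. 10.0
val data2 = Array.range(1, 20, 2).map(_.toDouble) // 1.0, 3.0, .. 19.0

// Index-based pairing, as in the original test code.
val byIndex = (0 until data1.length).map(idx => (data1(idx), data2(idx)))

// zip-based pairing, as the reviewer suggested.
val byZip = data1.zip(data2).toSeq
```

Besides being shorter, `zip` stops at the shorter array, so it cannot throw an out-of-range error if the two arrays ever differ in length.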
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150451065

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -146,4 +146,172 @@ class QuantileDiscretizerSuite
     val model = discretizer.fit(df)
     assert(model.hasParent)
   }
+
+  test("Multiple Columns: Test observed number of buckets and their sizes match expected values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val datasetSize = 10
+    val numBuckets = 5
+    val data1 = Array.range(1, 11, 1).map(_.toDouble)
+    val data2 = Array.range(1, 20, 2).map(_.toDouble)
+    val data = (0 until 10).map { idx =>
+      (data1(idx), data2(idx))
+    }
+    val df: DataFrame = data.toSeq.toDF("input1", "input2")
+
+    val discretizer = new QuantileDiscretizer()
+      .setInputCols(Array("input1", "input2"))
+      .setOutputCols(Array("result1", "result2"))
+      .setNumBuckets(numBuckets)
+    assert(discretizer.isQuantileDiscretizeMultipleColumns())
+    val result = discretizer.fit(df).transform(df)
+
+    val relativeError = discretizer.getRelativeError
+    val isGoodBucket = udf {
+      (size: Int) => math.abs(size - (datasetSize / numBuckets)) <= (relativeError * datasetSize)
+    }
+
+    for (i <- 1 to 2) {
+      val observedNumBuckets = result.select("result" + i).distinct.count
+      assert(observedNumBuckets === numBuckets,
+        "Observed number of buckets does not equal expected number of buckets.")
+
+      val numGoodBuckets = result.groupBy("result" + i).count.filter(isGoodBucket($"count")).count
+      assert(numGoodBuckets === numBuckets,
+        "Bucket sizes are not within expected relative error tolerance.")
+    }
+  }
+
+  test("Multiple Columns: Test on data with high proportion of duplicated values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val numBuckets = 5
+    val expectedNumBucket = 3
+    val data1 = Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0)
+    val data2 = Array(1.0, 2.0, 3.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 3.0, 1.0, 2.0)
+    val data = (0 until data1.length).map { idx =>
+      (data1(idx), data2(idx))
+    }
+    val df: DataFrame = data.toSeq.toDF("input1", "input2")
+    val discretizer = new QuantileDiscretizer()
+      .setInputCols(Array("input1", "input2"))
+      .setOutputCols(Array("result1", "result2"))
+      .setNumBuckets(numBuckets)
+    assert(discretizer.isQuantileDiscretizeMultipleColumns())
+    val result = discretizer.fit(df).transform(df)
+    for (i <- 1 to 2) {
+      val observedNumBuckets = result.select("result" + i).distinct.count
+      assert(observedNumBuckets == expectedNumBucket,
+        s"Observed number of buckets are not correct." +
+          s" Expected $expectedNumBucket but found ($observedNumBuckets")
+    }
+  }
+
+  test("Multiple Columns: Test transform on data with NaN value") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val numBuckets = 3
+    val validData1 = Array(-0.9, -0.5, -0.3, 0.0, 0.2, 0.5, 0.9, Double.NaN, Double.NaN, Double.NaN)
+    val expectedKeep1 = Array(0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0)
+    val validData2 = Array(0.2, -0.1, 0.3, 0.0, 0.1, 0.3, 0.5, 0.8, Double.NaN, Double.NaN)
+    val expectedKeep2 = Array(1.0, 0.0, 2.0, 0.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0)
+
+    val data = (0 until validData1.length).map { idx =>
+      (validData1(idx), validData2(idx), expectedKeep1(idx), expectedKeep2(idx))
+    }
+    val dataFrame: DataFrame = data.toSeq.toDF("input1", "input2", "expected1", "expected2")
+
+    val discretizer = new QuantileDiscretizer()
+      .setInputCols(Array("input1", "input2"))
+      .setOutputCols(Array("result1", "result2"))
+      .setNumBuckets(numBuckets)
+    assert(discretizer.isQuantileDiscretizeMultipleColumns())
+
+    withClue("QuantileDiscretizer with handleInvalid=error should throw exception for NaN values") {
+      intercept[SparkException] {
+        discretizer.fit(dataFrame).transform(dataFrame).collect()
+      }
+    }
+
+    discretizer.setHandleInvalid("keep")
+    discretizer.fit(dataFrame).transform(dataFrame).
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150450151

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -104,7 +126,8 @@ private[feature] trait QuantileDiscretizerBase extends Params
  */
@Since("1.6.0")
final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val uid: String)
-  extends Estimator[Bucketizer] with QuantileDiscretizerBase with DefaultParamsWritable {
+  extends Estimator[Bucketizer] with QuantileDiscretizerBase with DefaultParamsWritable
+    with HasInputCols with HasOutputCols {
--- End diff --

I guess I will leave this as is even though it's a bit weird.
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150450280

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala ---
@@ -146,4 +146,172 @@ class QuantileDiscretizerSuite
     val model = discretizer.fit(df)
     assert(model.hasParent)
   }
+
+  test("Multiple Columns: Test observed number of buckets and their sizes match expected values") {
+    val spark = this.spark
+    import spark.implicits._
+
+    val datasetSize = 10
+    val numBuckets = 5
+    val data1 = Array.range(1, 11, 1).map(_.toDouble)
+    val data2 = Array.range(1, 20, 2).map(_.toDouble)
+    val data = (0 until 10).map { idx =>
+      (data1(idx), data2(idx))
+    }
--- End diff --

Yes. Will change to data1.zip(data2).
[GitHub] spark pull request #19715: [SPARK-22397][ML]add multiple columns support to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19715#discussion_r150450222

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -129,34 +152,95 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui
   @Since("2.1.0")
   def setHandleInvalid(value: String): this.type = set(handleInvalid, value)

+  /** @group setParam */
+  @Since("2.3.0")
+  def setNumBucketsArray(value: Array[Int]): this.type = set(numBucketsArray, value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
+  private[feature] def isQuantileDiscretizeMultipleColumns(): Boolean = {
+    if (isSet(inputCols) && isSet(inputCol)) {
+      logWarning("Both `inputCol` and `inputCols` are set, we ignore `inputCols` and this " +
+        "`QuantileDiscretize` only map one column specified by `inputCol`")
+      false
+    } else if (isSet(inputCols)) {
+      true
+    } else {
+      false
+    }
+  }
+
+  private[feature] def getInOutCols: (Array[String], Array[String]) = {
+    if (!isQuantileDiscretizeMultipleColumns) {
+      (Array($(inputCol)), Array($(outputCol)))
+    } else {
+      require($(inputCols).length == $(outputCols).length,
+        "inputCols number do not match outputCols")
+      ($(inputCols), $(outputCols))
+    }
+  }
+
   @Since("1.6.0")
   override def transformSchema(schema: StructType): StructType = {
-    SchemaUtils.checkNumericType(schema, $(inputCol))
-    val inputFields = schema.fields
-    require(inputFields.forall(_.name != $(outputCol)),
-      s"Output column ${$(outputCol)} already exists.")
-    val attr = NominalAttribute.defaultAttr.withName($(outputCol))
-    val outputFields = inputFields :+ attr.toStructField()
+    val (inputColNames, outputColNames) = getInOutCols
+    val existingFields = schema.fields
+    var outputFields = existingFields
+    inputColNames.zip(outputColNames).map { case (inputColName, outputColName) =>
+      SchemaUtils.checkNumericType(schema, inputColName)
+      require(existingFields.forall(_.name != outputColName),
+        s"Output column ${outputColName} already exists.")
+      val attr = NominalAttribute.defaultAttr.withName(outputColName)
+      outputFields :+= attr.toStructField()
+    }
     StructType(outputFields)
   }

   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): Bucketizer = {
     transformSchema(dataset.schema, logging = true)
-    val splits = dataset.stat.approxQuantile($(inputCol),
-      (0.0 to 1.0 by 1.0/$(numBuckets)).toArray, $(relativeError))
+    val bucketizer = new Bucketizer(uid).setHandleInvalid($(handleInvalid))
+    if (isQuantileDiscretizeMultipleColumns) {
+      var bucketArray = Array.empty[Int]
+      if (isSet(numBucketsArray)) {
+        bucketArray = $(numBucketsArray)
+      }
+      else {
+        bucketArray = Array($(numBuckets))
+      }
+      val probabilityArray = bucketArray.toSeq.flatMap { numOfBucket =>
+        (0.0 to 1.0 by 1.0 / numOfBucket)
+      }
+      val splitsArray = dataset.stat.approxQuantile($(inputCols),
+        probabilityArray.sorted.toArray.distinct, $(relativeError))
+      val distinctSplitsArray = splitsArray.toSeq.map { splits =>
+        getDistinctSplits(splits)
+      }
+      bucketizer.setSplitsArray(distinctSplitsArray.toArray)
+      copyValues(bucketizer.setParent(this))
+    }
+    else {
--- End diff --

Will fix this. And fix the same problem in another place.
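The probability grid that the quoted `fit` feeds to `approxQuantile` is just the evenly spaced quantile probabilities 0, 1/n, ..., 1 for each bucket count, merged across columns. A plain-Scala sketch (the `probabilities` helper is illustrative, not Spark's API; integer-based stepping is used here to avoid floating-point drift from `0.0 to 1.0 by 1.0/n`):

```scala
// Evenly spaced quantile probabilities requested for n buckets.
def probabilities(numBuckets: Int): Array[Double] =
  (0 to numBuckets).map(_.toDouble / numBuckets).toArray

// For numBucketsArray = Array(4, 2), the quoted code flattens the per-column
// grids, then sorts and de-duplicates before a single approxQuantile call.
val merged = Array(4, 2).flatMap(probabilities).sorted.distinct
```

Merging 0/4ths and 0/2nds collapses the shared points 0.0, 0.5, and 1.0 into one entry each.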