[GitHub] spark issue #20864: [SPARK-23745][SQL]Remove the directories of the “hive....

2018-03-23 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20864
  
I thought the directory is also created by this line: 
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java#L143.
For this one, we need to think about whether we can remove the creation of all the temp 
directories, because the statements are executed by Spark SQL and have nothing to do 
with the Hive code in the thrift server.

You are right that HiveClientImpl (the Hive client inside Spark SQL) will also 
produce such temp directories. However, it seems like the following line alone 
is sufficient to add the jar to the class loader: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L836.
So I doubt we still need `runSqlHive(s"ADD JAR $path")` to download the 
jar to a temp directory.
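
To illustrate the point, here is a minimal sketch (the jar path and class name are hypothetical, and this is not the actual `IsolatedClientLoader` code): making a jar visible to the JVM is essentially just registering its URL with a class loader, which is what the `addJar` line above already does.

```scala
import java.net.{URL, URLClassLoader}

// Hypothetical jar path; clientLoader.addJar registers such a URL with the shared
// isolated class loader, after which classes packaged in the jar become resolvable.
val jarUrl = new URL("file:/tmp/my-udf.jar")
val loader = new URLClassLoader(Array(jarUrl), Thread.currentThread().getContextClassLoader)
// loader.loadClass("com.example.MyUdf")  // would now resolve, assuming the jar exists
```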

Overall, I think we need a holistic design to remove the Hive legacy from 
both the thrift server and Spark SQL. Adding more temporary fixes will make such a 
design harder.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18666: [SPARK-21449][SQL][Hive]Close HiveClient's SessionState ...

2018-03-21 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/18666
  
Maybe I missed something, but it seems Spark has its own class loader right 
now, which can load classes from a given URL: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L211.
I doubt it needs to download the jar to the Hive-configured temp directory. In 
other words, this line of code is not necessary: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L837.

This can be verified very easily: remove 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L837
then run `ADD JAR` and see whether it still works.
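
For example, the check could look like the following in a Hive-enabled spark-shell session (the jar path and class name are hypothetical):

```scala
// Assuming a running `spark` session with Hive support and a hypothetical UDF jar:
spark.sql("ADD JAR /tmp/my-udf.jar")
spark.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'")
spark.sql("SELECT my_udf(id) FROM range(10)").show()
```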


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18666: [SPARK-21449][SQL][Hive]Close HiveClient's SessionState ...

2018-03-20 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/18666
  
I asked the following question in 
https://github.com/apache/spark/pull/20864: is it necessary to create these 
temp directories when the Hive thrift server starts? It sounds like some legacy from 
Hive, and we could skip creating them in the first place.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20864: [SPARK-23745][SQL]Remove the directories of the “hive....

2018-03-20 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20864
  
@samartinucci @zuotingbing a high-level question: is it necessary to create 
these temp directories when the Hive thrift server starts? It sounds like some 
legacy from Hive, and we could skip creating them in the first place.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20702: [SPARK-23547][SQL]Cleanup the .pipeout file when the Hiv...

2018-03-13 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20702
  
lgtm!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20681: [SPARK-23518][SQL] Avoid metastore access when th...

2018-03-01 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20681#discussion_r171770982
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -67,6 +67,8 @@ sparkSession <- if (windows_with_hadoop()) {
 sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
   }
 sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext", sparkSession)
+# materialize the catalog implementation
+listTables()
--- End diff --

`test_sparkSQL.R` is the only test file that uses 
`newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc, FALSE)` on 
the `ssc`, so the catalog implementation Spark conf is changed there. That is why 
`test_sparkSQL.R` is the only one broken.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-03-01 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-03-01 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
Overall, I think this suite needs a refactoring: split it into an in-memory catalog 
one and a Hive catalog one. The catalog conf should not be manipulated after the 
Spark context is created; anything else is just a hack.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-03-01 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
We can remove the test, but it is not a good practice. You don't know 
exactly why the test was added or which hidden assumption it wants to guarantee, 
right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-03-01 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
My original plan to fix the test does not work, because of this test: 
https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343

The new plan is to run some simple catalog commands immediately after the 
Spark session is created, so the catalog is materialized (like the old 
behavior).
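
A minimal sketch of that plan in Scala (the R tests would do the equivalent through `listTables()`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Any lightweight catalog call right after session creation forces the catalog
// implementation to be instantiated eagerly, mimicking the old behavior.
spark.catalog.listTables()
```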


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-02-28 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
retest this please




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-02-26 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
@felixcheung Can you take a look at the changes in the R tests?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20681: [SPARK-23518][SQL] Completely remove metastore ac...

2018-02-26 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20681

[SPARK-23518][SQL] Completely remove metastore access if the query is not 
using tables

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/18944 added a patch which allowed a 
Spark session to be created when the Hive metastore server is down. However, it 
did not allow running any commands with that Spark session, so users could 
not read or write data frames while the Hive metastore server was down.
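
A hedged sketch of the intended behavior (paths are hypothetical, and this is only an illustration of the scenario, not the patch itself): queries that never touch a table should run even when the metastore is unreachable.

```scala
import org.apache.spark.sql.SparkSession

// The metastore may be down; the session should still come up lazily.
val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// No table is involved, so no metastore access should be triggered:
spark.range(0, 10).selectExpr("id * 2 AS doubled").show()
spark.range(0, 10).write.mode("overwrite").parquet("/tmp/no_metastore_needed")
```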


## How was this patch tested?

Added some unit tests to read and write data frames based on the original 
HiveMetastoreLazyInitializationSuite.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark completely-lazy

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20681.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20681


commit b447beab9e43595d61ef0f51c91a09fd2f72a87e
Author: Feng Liu <fengliu@...>
Date:   2018-02-26T07:08:47Z

lazy




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20557: [SPARK-23364][SQL]'desc table' command in spark-sql add ...

2018-02-12 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20557
  
There may be some Spark JDBC/ODBC drivers that need to parse the returned 
results to get all the columns. We should avoid making changes to the returned 
"schema" on the server side. You can fix the issue on the client side.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20565: SPAR[SPARK-23379][SQL] remove redundant metastore...

2018-02-09 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20565

SPAR[SPARK-23379][SQL] remove redundant metastore access

## What changes were proposed in this pull request?

If the target database name is the same as the current database, we should 
be able to skip one metastore access. 
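
A minimal sketch of the idea (class, field, and parameter names are illustrative, not HiveClientImpl's actual members):

```scala
// Only issue the underlying metastore call when the database actually changes.
class CurrentDbTrackingClient(setOnMetastore: String => Unit) {
  private var currentDb: String = "default"

  def setCurrentDatabase(db: String): Unit = {
    if (db != currentDb) {
      setOnMetastore(db)   // the real metastore round trip would happen here
      currentDb = db
    }
    // if db == currentDb, no metastore access is needed
  }
}
```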

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark remove-redundant

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20565.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20565


commit f1acb5ad9a12df11faf082144c676082df647ff9
Author: Feng Liu <fengliu@...>
Date:   2018-02-10T00:18:44Z

some

commit c29aa3ecb78dc345bb2ad1e5cbaf29d3fdb3a803
Author: Feng Liu <fengliu@...>
Date:   2018-02-10T01:10:53Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20564: [SPARK-23378][SQL] move setCurrentDatabase from H...

2018-02-09 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20564#discussion_r167381054
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1107,11 +1107,6 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   }
 }
 
-// Note: Before altering table partitions in Hive, you *must* set the 
current database
-// to the one that contains the table of interest. Otherwise you will 
end up with the
-// most helpful error message ever: "Unable to alter partition. alter 
is not possible."
-// See HIVE-2742 for more detail.
-client.setCurrentDatabase(db)
--- End diff --

Sorry, I meant `a special case`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20564: [SPARK-23378][SQL] move setCurrentDatabase from H...

2018-02-09 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20564

[SPARK-23378][SQL] move setCurrentDatabase from HiveExternalCatalog to 
HiveClientImpl

## What changes were proposed in this pull request?

This enforces the rule that no call from `HiveExternalCatalog`, except the 
`setCurrentDatabase` method itself, resets the current database in the Hive client. 

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark move

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20564.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20564


commit f1acb5ad9a12df11faf082144c676082df647ff9
Author: Feng Liu <fengliu@...>
Date:   2018-02-10T00:18:44Z

some




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-09 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167352400
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -156,6 +156,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val ALLOW_NESTEDJOIN_FALLBACK = 
buildConf("spark.sql.join.broadcastJoinFallback.enabled")
--- End diff --

nit: the key can be `spark.sql.join.broadcastNestedLoopJoin.enabled`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-09 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167351883
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -262,6 +262,10 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
 joins.CartesianProductExec(planLater(left), planLater(right), 
condition) :: Nil
 
   case logical.Join(left, right, joinType, condition) =>
+if (!SQLConf.get.allowNestedJoinFallback) {
+  throw new AnalysisException("The only JOIN strategy available 
for this plan is " +
+s"BroadcastNestedLoopJoin, but 
`${SQLConf.ALLOW_NESTEDJOIN_FALLBACK}` is `false`.")
--- End diff --

nit: `SQLConf.ALLOW_NESTEDJOIN_FALLBACK.key`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20441: [SPARK-23275][SQL] hive/tests have been failing when run...

2018-02-09 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20441
  
@gatorsmile sorry for the late reply. I think the root cause is in the Hive 
metastore. I created a PR to work around it: 
https://github.com/apache/spark/pull/20562


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20562: [SPARK-23275][SQL] fix the thread leaking in hive...

2018-02-09 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20562

[SPARK-23275][SQL] fix the thread leaking in hive/tests

## What changes were proposed in this pull request?

The two lines can actually trigger this Hive metastore bug: 
https://issues.apache.org/jira/browse/HIVE-16844

The two configs are not in the default `ObjectStore` properties, so any Hive 
command run after these two lines will set the `propsChanged` flag in 
`ObjectStore.setConf` and then cause thread leaks.

I don't think the two lines are very useful, so they can be removed completely.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix-omm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20562.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20562


commit 4e10f34ebb17940a1f0fd54ce26cb16d06320770
Author: Feng Liu <fengliu@...>
Date:   2018-02-09T21:15:36Z

init: remove the two lines




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not ...

2018-02-01 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17886#discussion_r165441473
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java
 ---
@@ -221,6 +227,70 @@ private void configureSession(Map<String, String> 
sessionConfMap) throws HiveSQL
 }
   }
 
+  // Copy from org.apache.hadoop.hive.ql.processors.SetProcessor, only 
change:
+  // setConf(varname, propName, varvalue, true) when 
varname.startsWith(HIVECONF_PREFIX)
+  public static int setVariable(String varname, String varvalue) throws 
Exception {
+SessionState ss = SessionState.get();
+if (varvalue.contains("\n")){
+  ss.err.println("Warning: Value had a \\n character in it.");
+}
+varname = varname.trim();
+if (varname.startsWith(ENV_PREFIX)){
+  ss.err.println("env:* variables can not be set.");
+  return 1;
+} else if (varname.startsWith(SYSTEM_PREFIX)){
+  String propName = varname.substring(SYSTEM_PREFIX.length());
+  System.getProperties().setProperty(propName,
+  new 
VariableSubstitution().substitute(ss.getConf(),varvalue));
+} else if (varname.startsWith(HIVECONF_PREFIX)){
+  String propName = varname.substring(HIVECONF_PREFIX.length());
+  setConf(varname, propName, varvalue, true);
--- End diff --

The hivevar fix is made by this line: 
https://github.com/apache/spark/pull/17886/files#diff-9d2cd65aaeae992250b5f40d8c289287R56.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--...

2018-01-31 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/17886
  
@gatorsmile this is a great patch. The test can be improved, but I think it 
is safe to merge as is.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19219: [SPARK-21993][SQL] Close sessionState when finish

2018-01-31 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/19219
  
The major issue this PR tries to address has been fixed by 
https://github.com/apache/spark/pull/20029, so I think we are good as long as there are 
no calls to `HiveClientImpl.newSession`. We can close this PR as won't-fix.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are passed...

2018-01-31 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20385
  
Actually, one more thing: do you need to consider the case where the UDT is an 
attribute of a struct type? 
https://github.com/apache/spark/pull/20385/files#diff-842e3447fc453de26c706db1cac8f2c4L467


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are passed...

2018-01-31 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20385
  
LGTM!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20425: [WIP] remove the redundant code in HiveExternalCa...

2018-01-29 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20425

[WIP] remove the redundant code in HiveExternalCatlog and HiveClientImpl

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark refact-hive-external

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20425.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20425


commit 41829e4fadd4b9821be68d8b2a0484eca45711a1
Author: Feng Liu <fengliu@...>
Date:   2018-01-29T18:37:43Z

small




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20420: [SPARK-22916][SQL][FOLLOW-UP] Update the Description of ...

2018-01-29 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20420
  
LGTM! Thanks for doing this!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are...

2018-01-24 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20385#discussion_r163648777
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
 ---
@@ -102,6 +102,8 @@ private[hive] class SparkExecuteStatementOperation(
 to += from.getAs[Timestamp](ordinal)
   case BinaryType =>
 to += from.getAs[Array[Byte]](ordinal)
+  case udt: UserDefinedType[_] =>
+to += from.get(ordinal).toString
--- End diff --

It is possible that `from.get(ordinal)` returns `null`, which would then cause a 
null pointer exception. I think a better way to handle this case is via the 
`HiveUtils.toHiveString` method.
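
A small, self-contained illustration of the hazard (not Spark's actual code):

```scala
// Calling toString on a possibly-null field throws, so the conversion must be guarded
// (or routed through a null-aware formatter such as HiveUtils.toHiveString).
def renderField(value: Any): String =
  if (value == null) "NULL" else value.toString

assert(renderField(null) == "NULL")  // safe; value.toString alone would throw an NPE
```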


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2018-01-24 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20025
  
lgtm


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18983: [SPARK-21771][SQL]remove useless hive client in SparkSQL...

2018-01-19 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/18983
  
LGTM! It is only created once, though.

Frankly, we should completely remove the implementation of the `newSession()` 
method.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20025: [SPARK-22837][SQL]Session timeout checker does no...

2018-01-19 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20025#discussion_r162698093
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java
 ---
@@ -80,7 +76,6 @@ public synchronized void init(HiveConf hiveConf) {
 }
 createBackgroundOperationPool();
 addService(operationManager);
-super.init(hiveConf);
--- End diff --

hmm, I think we should revert this and keep this line too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2018-01-18 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20025
  
@gatorsmile @felixcheung I left one comment, otherwise lgtm.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20025: [SPARK-22837][SQL]Session timeout checker does no...

2018-01-18 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20025#discussion_r162426604
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java
 ---
@@ -79,35 +75,19 @@ public synchronized void init(HiveConf hiveConf) {
   initOperationLogRootDir();
 }
 createBackgroundOperationPool();
-addService(operationManager);
-super.init(hiveConf);
-  }
-
-  private void createBackgroundOperationPool() {
-int poolSize = 
hiveConf.getIntVar(ConfVars.HIVE_SERVER2_ASYNC_EXEC_THREADS);
-LOG.info("HiveServer2: Background operation thread pool size: " + 
poolSize);
-int poolQueueSize = 
hiveConf.getIntVar(ConfVars.HIVE_SERVER2_ASYNC_EXEC_WAIT_QUEUE_SIZE);
-LOG.info("HiveServer2: Background operation thread wait queue size: " 
+ poolQueueSize);
-long keepAliveTime = HiveConf.getTimeVar(
-hiveConf, ConfVars.HIVE_SERVER2_ASYNC_EXEC_KEEPALIVE_TIME, 
TimeUnit.SECONDS);
-LOG.info(
-"HiveServer2: Background operation thread keepalive time: " + 
keepAliveTime + " seconds");
-
-// Create a thread pool with #poolSize threads
-// Threads terminate when they are idle for more than the keepAliveTime
-// A bounded blocking queue is used to queue incoming operations, if 
#operations > poolSize
-String threadPoolName = "HiveServer2-Background-Pool";
-backgroundOperationPool = new ThreadPoolExecutor(poolSize, poolSize,
-keepAliveTime, TimeUnit.SECONDS, new 
LinkedBlockingQueue(poolQueueSize),
-new ThreadFactoryWithGarbageCleanup(threadPoolName));
-backgroundOperationPool.allowCoreThreadTimeOut(true);
-
--- End diff --

I think we can keep this file unchanged. It looks like the Hive thread pool 
is more flexible than the Spark one. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-10 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20174
  
@mgaido91 Your proposal and the current approach are both one-line changes. 
Since the issue is actually related to the hash aggregate implementation, I 
think it is reasonable to include it in the `ReplaceDeduplicateWithAggregate` 
transformation. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...

2018-01-10 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160757582
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends 
Rule[LogicalPlan] {
   Alias(new First(attr).toAggregateExpression(), 
attr.name)(attr.exprId)
 }
   }
-  Aggregate(keys, aggCols, child)
+  // SPARK-22951: the implementation of aggregate operator treats the 
cases with and without
+  // grouping keys differently, when there are not input rows. For the 
aggregation after
+  // `dropDuplicates()` on an empty data frame, a grouping key is 
added here to make sure the
+  // aggregate operator can work correctly (returning an empty 
iterator).
--- End diff --

ok, I like this one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20202: [MINOR] fix a typo in BroadcastJoinSuite

2018-01-09 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20202
  
thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...

2018-01-08 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160305419
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -230,6 +236,7 @@ case class HashAggregateExec(
  | private void $doAgg() throws java.io.IOException {
  |   // initialize aggregation buffer
  |   $initBufVar
+ |   $hasInput = false;
--- End diff --

If the result expression is empty and `hasInput` is `false`, it should 
return an empty iterator (this is exactly the corner case we want to fix). If 
the result expression is empty but `hasInput` is `true`, it should append empty 
rows.  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...

2018-01-08 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160217718
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -230,6 +236,7 @@ case class HashAggregateExec(
  | private void $doAgg() throws java.io.IOException {
  |   // initialize aggregation buffer
  |   $initBufVar
+ |   $hasInput = false;
--- End diff --

I actually tried it, but found it was hard because we don't control the 
codegen of line `/* 038 */   if (shouldStop()) return;` in `doAgg`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...

2018-01-08 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160216751
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -102,10 +102,12 @@ case class HashAggregateExec(
 
   val beforeAgg = System.nanoTime()
   val hasInput = iter.hasNext
-  val res = if (!hasInput && groupingExpressions.nonEmpty) {
-// This is a grouped aggregate and the input iterator is empty,
+  val res = if (!hasInput && (groupingExpressions.nonEmpty || 
resultExpressions.isEmpty)) {
+// The input iterator is empty, and this is a grouped aggregate or 
without result columns,
 // so return an empty iterator.
 Iterator.empty
+  } else if (!hasInput && resultExpressions.isEmpty) {
--- End diff --

oops, forgot to remove it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20174: [SPARK-22951][SQL] aggregate should not produce empty ro...

2018-01-07 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20174
  
test this please



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...

2018-01-06 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160042482
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -245,11 +252,15 @@ case class HashAggregateExec(
|   $doAggFuncName();
|   $aggTime.add((System.nanoTime() - $beforeAgg) / 100);
|
-   |   // output the result
-   |   ${genResult.trim}
+   |   if (!$hasInput && ${resultVars.isEmpty}) {
--- End diff --

I think it hurts code readability if the code for the two cases is 
generated separately. For the regular case, the generated code will look like 
`if (false && !hasInput) ... else ...`. This pattern should be easily optimized 
by the JIT, so we don't need to worry too much about the performance.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: S[SPARK-22951][SQL] aggregate should not produce ...

2018-01-06 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20174

S[SPARK-22951][SQL] aggregate should not produce empty rows if data frame 
is empty

## What changes were proposed in this pull request?

 WIP

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix-duplicate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20174.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20174


commit 89d6b7504239211cd64188be50ae32a199f1ddd8
Author: Feng Liu <fengliu@...>
Date:   2018-01-06T07:51:16Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

2018-01-05 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20029
  
lgtm!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

2017-12-29 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20029
  
By [this 
line](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78),
 yes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2017-12-28 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20025
  
My understanding is that the reflection was used because we might be using a 
different version of Hive, so we didn't control what was done inside 
`super.init`. However, now that we have inlined the Hive code, it is safe to call 
the `super.init` method. This is a cleaner way to fix the referenced bug and other 
potential bugs, IMO.
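
A minimal sketch of the suggested direction (class and method names are hypothetical, not Spark's actual thrift server classes):

```scala
class BaseSessionManager {
  def init(conf: Map[String, String]): Unit = {
    // In the real code this would start the background pool, the operation log dir,
    // and the session timeout checker.
  }
}

class SparkSessionManager extends BaseSessionManager {
  override def init(conf: Map[String, String]): Unit = {
    super.init(conf)  // safe to delegate once the parent code is inlined and under our control
    // Spark-specific initialization would go here, without reflection.
  }
}
```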


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2017-12-28 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20025
  
@zuotingbing I think all the code in SparkSQLSessionManager.scala should 
go, because it is just some reflection hacks. It should be possible to call 
`super.init(hiveConf)` instead to get the session timeout checker and everything 
else started, shouldn't it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20109: [SPARK-22891][SQL] Make hive client creation thre...

2017-12-28 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20109

[SPARK-22891][SQL] Make hive client creation thread safe

## What changes were proposed in this pull request?

This is to work around the Hive issue: 
https://issues.apache.org/jira/browse/HIVE-11935

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark synchronized

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20109.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20109


commit 163d3443681af2c5ff246ecc546355934c0f6dbb
Author: Feng Liu <fengliu@...>
Date:   2017-12-29T01:02:16Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

2017-12-28 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20029
  
@zuotingbing I took a close look at the related code and think the issue 
you raised is valid:

1. The hiveClient created for the [resourceLoader](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45) is only used for addJar, which, in turn, adds the jar to the shared [`IsolatedClientLoader`](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L817). So we can just use the shared Hive client for this purpose.

2. Another possible reason to use a new Hive client is to run [this Hive statement](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L818). But I think it is just some leftover from old Spark and should be removed. So overall it is fine to use the shared `client` from `HiveExternalCatalog` without creating a new Hive client.

3. Currently, there is no way to clean up the resources created by a [new session of SQLContext/SparkSession](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78). I couldn't understand the design tradeoff behind [this](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L716) (@srowen). So it is not easy to remove the temp dirs when a session is closed.

4. To what extent does Spark need these scratch dirs? Is it possible to make this step optional if it is not used in all deployment modes?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20099: [SPARK-22916][SQL] shouldn't bias towards build r...

2017-12-27 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20099

[SPARK-22916][SQL] shouldn't bias towards build right if user does not 
specify

## What changes were proposed in this pull request?

When there are no broadcast hints, the current Spark strategies prefer 
to build right, without considering the sizes of the two sides. This patch adds 
logic that considers the sizes of the two tables when choosing the build side. To 
make the logic clear, the build side is determined in the following steps (see 
the sketch after this list):

1. If there are broadcast hints, the build side is determined by 
`broadcastSideByHints`;
2. If there are no broadcast hints, the build side is determined by 
`broadcastSideByConfig`;
3. If broadcast is disabled by the config, it falls back to the next 
cases.
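
A hedged sketch of that decision order (helper names and the string result are illustrative, not the actual SparkStrategies code):

```scala
def chooseBuildSide(
    leftSize: BigInt,
    rightSize: BigInt,
    leftHinted: Boolean,
    rightHinted: Boolean): String = {
  if (leftHinted || rightHinted) {
    // 1. Broadcast hints take precedence; with hints on both sides, prefer the smaller one.
    if (leftHinted && (!rightHinted || leftSize <= rightSize)) "left" else "right"
  } else {
    // 2. No hints: pick the smaller side instead of always building right.
    if (leftSize <= rightSize) "left" else "right"
  }
}
```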

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix-spark-strategies

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20099.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20099


commit e4b63f5fab81b7637d107efe6524b2f41c681a10
Author: Feng Liu <fengliu@...>
Date:   2017-12-27T23:22:30Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs on requ...

2017-12-16 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/18812
  
I actually did not get the motivation of this PR. HiveThriftServer2 can run 
independently or be started with a SQL context: 
https://github.com/apache/spark/pull/18812/files#diff-709404b0d3defeff035ef0c4f5a960e5L57.
For the latter, this OOM should be handled by the upper layer, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs ...

2017-12-16 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18812#discussion_r157361046
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java
 ---
@@ -39,6 +32,8 @@
 import org.apache.hive.service.cli.thrift.ThriftCLIService;
 import org.apache.hive.service.cli.thrift.ThriftHttpCLIService;
 
+import java.util.Properties;
--- End diff --

Java imports should come first?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs ...

2017-12-16 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18812#discussion_r157361025
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java
 ---
@@ -37,21 +30,29 @@
 import org.apache.thrift.transport.TServerSocket;
 import org.apache.thrift.transport.TTransportFactory;
 
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.SynchronousQueue;
+import java.util.concurrent.TimeUnit;
--- End diff --

unnecessary change


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19989: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

2017-12-16 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/19989
  
I think this method can take care of resource cleanup automatically: 
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java#L151

Could you actually take a heap dump and find out why the sessions are not 
cleaned up? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19721: [SPARK-22496][SQL]thrift server adds operation logs

2017-12-08 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/19721
  
lgtm!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHo...

2017-10-09 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/19460

[SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder and add a test

## What changes were proposed in this pull request?

We should not break the assumption that the length of the allocated byte 
array is word rounded:  

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170
So we want to use `Integer.MAX_VALUE - 15` instead of `Integer.MAX_VALUE - 
8` as the upper bound of an allocated byte array. 
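
A quick arithmetic check of the word-rounding claim above (plain Scala):

```scala
// Integer.MAX_VALUE - 8 is not a multiple of 8, so it can break the word-rounded
// length assumption; Integer.MAX_VALUE - 15 is the next multiple of 8 below it.
println((Integer.MAX_VALUE - 8) % 8)   // 7 -> not word rounded
println((Integer.MAX_VALUE - 15) % 8)  // 0 -> word rounded
```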

## How was this patch tested?

Since the Spark unit test JVM has less than 1GB of heap, we run the test 
code as a submitted job here, so it can run on a JVM that has 4GB of memory. 

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix_array_max

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19460.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19460


commit 92a6d2d53aea02042d47888e99df5a4f2167cd1f
Author: Feng Liu <feng...@databricks.com>
Date:   2017-10-09T17:43:39Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19266: [SPARK-22033][CORE] BufferHolder, other size chec...

2017-10-07 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r143346155
  
--- Diff: 
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java
 ---
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is 
also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual 
max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
--- End diff --

@srowen I think we can use `Integer.MAX_VALUE - 7` instead of 
`Integer.MAX_VALUE - 8` to make the size align with words; otherwise, this 
check will fail: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170.

This is the reason why all the size inputs to these methods are rounded; for 
example, 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java#L216.
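
The alignment claim can be checked directly:

```scala
println((Integer.MAX_VALUE - 8) % 8)  // 7 -> not word aligned
println((Integer.MAX_VALUE - 7) % 8)  // 0 -> word aligned
```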


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...

2017-10-02 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19394#discussion_r142290483
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -280,13 +280,20 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] 
with Logging with Serializ
 results.toArray
   }
 
+  private[spark] def executeCollectIterator(): (Long, 
Iterator[InternalRow]) = {
+val countsAndBytes = getByteArrayRdd().collect()
--- End diff --

This still fetches all the compressed rows to the driver before building 
the hashed relation. Ideally, you should fetch the rows from the executors 
incrementally. 
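
One hedged illustration of the incremental idea (not this PR's code) is `RDD.toLocalIterator`, which pulls one partition at a time to the driver instead of collecting everything at once:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("incremental-fetch").getOrCreate()
val rdd = spark.range(0, 1000000).rdd

// collect() would materialize every row on the driver at once; toLocalIterator only
// keeps one partition's rows on the driver at a time.
rdd.toLocalIterator.take(5).foreach(println)
```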


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-17 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19230#discussion_r139329522
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 ---
@@ -16,6 +16,7 @@
  */
 package org.apache.spark.sql.execution.vectorized;
 
+import org.apache.spark.api.java.function.Function;
--- End diff --

oops, reverted it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-17 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19230#discussion_r139329523
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorSuite.scala
 ---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class ColumnVectorSuite extends SparkFunSuite with BeforeAndAfterEach {
+
+  var testVector: WritableColumnVector = _
+
+  private def allocate(capacity: Int, dt: DataType): WritableColumnVector 
= {
+new OnHeapColumnVector(capacity, dt)
+  }
+
+  override def afterEach(): Unit = {
+testVector.close()
+  }
+
+  test("boolean") {
+testVector = allocate(10, BooleanType)
+(0 until 10).foreach { i =>
+  testVector.appendBoolean(i % 2 == 0)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, BooleanType) === (i % 2 == 0))
+}
+  }
+
+  test("byte") {
+testVector = allocate(10, ByteType)
+(0 until 10).foreach { i =>
+  testVector.appendByte(i.toByte)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, ByteType) === (i.toByte))
+}
+  }
+
+  test("short") {
+testVector = allocate(10, ShortType)
+(0 until 10).foreach { i =>
+  testVector.appendShort(i.toShort)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, ShortType) === (i.toShort))
+}
+  }
+
+  test("int") {
+testVector = allocate(10, IntegerType)
+(0 until 10).foreach { i =>
+  testVector.appendInt(i)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, IntegerType) === i)
+}
+  }
+
+  test("long") {
+testVector = allocate(10, LongType)
+(0 until 10).foreach { i =>
+  testVector.appendLong(i)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, LongType) === i)
+}
+  }
+
+  test("float") {
+testVector = allocate(10, FloatType)
+(0 until 10).foreach { i =>
+  testVector.appendFloat(i.toFloat)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, FloatType) === i.toFloat)
+}
+  }
+
+  test("double") {
+testVector = allocate(10, DoubleType)
+(0 until 10).foreach { i =>
+  testVector.appendDouble(i.toDouble)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, DoubleType) === i.toDouble)
+}
+  }
+
+  test("string") {
+testVector = allocate(10, StringType)
+(0 until 10).map { i =>
+  val utf8 = s"str$i".getBytes("utf8")
+  testVector.appendByteArray(utf8, 0, utf8.length)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, StringType) === UTF8String.fromString(s"str$i"))
+}
+  }
+
+  test("binary") {
+testVector = allocate(10, BinaryType)
+(0 until 10).ma

[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...

2017-09-15 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/19230
  
@viirya @cloud-fan  unit test updated. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-15 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19230#discussion_r139072681
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 ---
@@ -99,73 +100,18 @@ public ArrayData copy() {
 @Override
 public Object[] array() {
   DataType dt = data.dataType();
+  Function<Integer, Object> getAtMethod = (Function<Integer, Object>) 
i -> get(i, dt);
   Object[] list = new Object[length];
-
-  if (dt instanceof BooleanType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getBoolean(offset + i);
-  }
-}
-  } else if (dt instanceof ByteType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getByte(offset + i);
-  }
-}
-  } else if (dt instanceof ShortType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getShort(offset + i);
-  }
-}
-  } else if (dt instanceof IntegerType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getInt(offset + i);
-  }
-}
-  } else if (dt instanceof FloatType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getFloat(offset + i);
-  }
-}
-  } else if (dt instanceof DoubleType) {
+  try {
 for (int i = 0; i < length; i++) {
   if (!data.isNullAt(offset + i)) {
-list[i] = data.getDouble(offset + i);
+list[i] = getAtMethod.call(i);
--- End diff --

It should be `get(i, dt)`? I updated it anyway. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-13 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/19230

[SPARK-22003][SQL] support array column in vectorized reader with UDF

## What changes were proposed in this pull request?

The UDF needs to deserialize the `UnsafeRow`. When the column type is 
Array, the `get` method of the `ColumnVector` used by the vectorized reader is 
called, but this method is not implemented. 
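
A hedged repro sketch of the scenario (the path is hypothetical, and this is only an illustration of the code path, not the test added in this PR): reading an array column through the vectorized Parquet reader and passing it to a UDF forces the array to be deserialized.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

val path = "/tmp/array_col_parquet"  // hypothetical path
Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("xs").write.mode("overwrite").parquet(path)

val sumUdf = udf((xs: Seq[Int]) => xs.sum)             // forces deserialization of the array column
spark.read.parquet(path).select(sumUdf($"xs")).show()  // exercises ColumnVector.get on arrays
```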

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix_array_open

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19230.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19230


commit adbaeabf18ee1f96611ecbd6ee627bc0a457289d
Author: Feng Liu <feng...@databricks.com>
Date:   2017-09-12T21:56:55Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18400: [SPARK-21188][CORE] releaseAllLocksForTask should...

2017-06-22 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/18400

[SPARK-21188][CORE] releaseAllLocksForTask should synchronize the whole 
method

## What changes were proposed in this pull request?

Since the objects `readLocksByTask`, `writeLocksByTask` and the `info`s are 
coupled and may be modified by other threads concurrently, all reads and 
writes of them in the method `releaseAllLocksForTask` should be protected 
by a single synchronized block, like other similar methods. 
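
A minimal sketch of the synchronization point (the names are illustrative, not BlockInfoManager's actual fields): reads and writes of coupled mutable state should sit inside one synchronized block rather than be interleaved with unsynchronized code.

```scala
import scala.collection.mutable

class LockRegistry {
  private val readLocksByTask = mutable.Map[Long, Set[String]]()
  private val writeLocksByTask = mutable.Map[Long, Set[String]]()

  def releaseAllLocksForTask(taskId: Long): Unit = synchronized {
    // Both maps are read and cleared under the same lock, so no other thread can
    // observe or mutate them halfway through the release.
    readLocksByTask.remove(taskId)
    writeLocksByTask.remove(taskId)
  }
}
```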

## How was this patch tested?

existing tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark synchronize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18400.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18400


commit 5cdd328ee9a32969377cbdbfea229cc364dbee17
Author: Feng Liu <feng...@databricks.com>
Date:   2017-06-23T05:24:07Z

init




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18208: [SPARK-20991] BROADCAST_TIMEOUT conf should be a ...

2017-06-05 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/18208

[SPARK-20991] BROADCAST_TIMEOUT conf should be a TimeoutConf

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

The construction of the BROADCAST_TIMEOUT conf should take a TimeUnit 
argument, i.e. it should be declared as a timeout conf.

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix_timeout

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18208


commit 8a2d37a10cd6eb36403006b99a33a7d057905e6e
Author: Feng Liu <feng...@databricks.com>
Date:   2017-06-05T20:00:55Z

small




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17397: [SPARK-20070][SQL] Redact DataSourceScanExec treeString

2017-03-23 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/17397
  
lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org