[GitHub] spark issue #20864: [SPARK-23745][SQL]Remove the directories of the “hive....

2018-03-23 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20864
  
I thought the directory is also created by this line: 
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java#L143.
For this one, we need to think about whether we can remove the creation of all the temp 
directories, because the statements are executed by Spark SQL and have nothing to do 
with the Hive code in the thrift server.

You are right that HiveClientImpl (the Hive client inside Spark SQL) will also 
produce such temp directories. However, it seems like the following line alone 
is sufficient to add the jar to the class loader: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L836.
So I doubt we still need `runSqlHive(s"ADD JAR $path")` to download the 
jar to a temp directory.
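
To illustrate the point, here is a minimal sketch (the jar path and class name are hypothetical, and this is not the actual `IsolatedClientLoader` code): making a jar visible to the JVM is essentially just registering its URL with a class loader, which is what the `addJar` line above already does.

```scala
import java.net.{URL, URLClassLoader}

// Hypothetical jar path; clientLoader.addJar registers such a URL with the shared
// isolated class loader, after which classes packaged in the jar become resolvable.
val jarUrl = new URL("file:/tmp/my-udf.jar")
val loader = new URLClassLoader(Array(jarUrl), Thread.currentThread().getContextClassLoader)
// loader.loadClass("com.example.MyUdf")  // would now resolve, assuming the jar exists
```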

Overall, I think we need a holistic design to remove the Hive legacy from 
both the thrift server and Spark SQL. Adding more temporary fixes will make such a 
design harder.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18666: [SPARK-21449][SQL][Hive]Close HiveClient's SessionState ...

2018-03-21 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/18666
  
Maybe I missed something, but it seems Spark has its own class loader right 
now, which can load classes from a given URL: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L211.
I doubt it needs to download the jar to the Hive-configured temp directory. In 
other words, this line of code is not necessary: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L837.

This can be verified very easily: remove 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L837
then run `ADD JAR` and see whether it still works.
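
For example, the check could look like the following in a Hive-enabled spark-shell session (the jar path and class name are hypothetical):

```scala
// Assuming a running `spark` session with Hive support and a hypothetical UDF jar:
spark.sql("ADD JAR /tmp/my-udf.jar")
spark.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'")
spark.sql("SELECT my_udf(id) FROM range(10)").show()
```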


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18666: [SPARK-21449][SQL][Hive]Close HiveClient's SessionState ...

2018-03-20 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/18666
  
I asked the following question in 
https://github.com/apache/spark/pull/20864: is it necessary to create these 
temp directories when the Hive thrift server starts? It sounds like some legacy from 
Hive, and we could skip creating them in the first place.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20864: [SPARK-23745][SQL]Remove the directories of the “hive....

2018-03-20 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20864
  
@samartinucci @zuotingbing a high-level question: is it necessary to create 
these temp directories when the Hive thrift server starts? It sounds like some 
legacy from Hive, and we could skip creating them in the first place.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20702: [SPARK-23547][SQL]Cleanup the .pipeout file when the Hiv...

2018-03-13 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20702
  
lgtm!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20681: [SPARK-23518][SQL] Avoid metastore access when th...

2018-03-01 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20681#discussion_r171770982
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -67,6 +67,8 @@ sparkSession <- if (windows_with_hadoop()) {
 sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
   }
 sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", 
"getJavaSparkContext", sparkSession)
+# materialize the catalog implementation
+listTables()
--- End diff --

`test_sparkSQL.R` is the only test file that uses 
`newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc, FALSE)` on 
the `ssc`, so the catalog implementation Spark conf is changed there. That is why 
`test_sparkSQL.R` is the only one broken.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-03-01 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-03-01 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
Overall, I think this suite needs a refactoring: split it into an in-memory catalog 
one and a Hive catalog one. The catalog conf should not be manipulated after the 
Spark context is created; anything else is just a hack.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-03-01 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
We can remove the test, but it is not a good practice. You don't know 
exactly why the test was added or which hidden assumption it wants to guarantee, 
right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-03-01 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
My original plan to fix the test does not work, because of this test: 
https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343

The new plan is to run some simple catalog commands immediately after the 
Spark session is created, so the catalog is materialized (like the old 
behavior).
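
A minimal sketch of that plan in Scala (the R tests would do the equivalent through `listTables()`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Any lightweight catalog call right after session creation forces the catalog
// implementation to be instantiated eagerly, mimicking the old behavior.
spark.catalog.listTables()
```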


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-02-28 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
retest this please




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...

2018-02-26 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20681
  
@felixcheung Can you take a look at the changes in the R tests?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20681: [SPARK-23518][SQL] Completely remove metastore ac...

2018-02-26 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20681

[SPARK-23518][SQL] Completely remove metastore access if the query is not 
using tables

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/18944 added a patch which allowed a 
Spark session to be created when the Hive metastore server is down. However, it 
did not allow running any commands with that Spark session, so users could 
not read or write data frames while the Hive metastore server was down.
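
A hedged sketch of the intended behavior (paths are hypothetical, and this is only an illustration of the scenario, not the patch itself): queries that never touch a table should run even when the metastore is unreachable.

```scala
import org.apache.spark.sql.SparkSession

// The metastore may be down; the session should still come up lazily.
val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// No table is involved, so no metastore access should be triggered:
spark.range(0, 10).selectExpr("id * 2 AS doubled").show()
spark.range(0, 10).write.mode("overwrite").parquet("/tmp/no_metastore_needed")
```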


## How was this patch tested?

Added some unit tests to read and write data frames based on the original 
HiveMetastoreLazyInitializationSuite.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark completely-lazy

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20681.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20681


commit b447beab9e43595d61ef0f51c91a09fd2f72a87e
Author: Feng Liu <fengliu@...>
Date:   2018-02-26T07:08:47Z

lazy




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20557: [SPARK-23364][SQL]'desc table' command in spark-sql add ...

2018-02-12 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20557
  
There may be some Spark JDBC/ODBC drivers that need to parse the returned 
results to get all the columns. We should avoid making changes to the returned 
"schema" on the server side. You can fix the issue on the client side.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20565: SPAR[SPARK-23379][SQL] remove redundant metastore...

2018-02-09 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20565

SPAR[SPARK-23379][SQL] remove redundant metastore access

## What changes were proposed in this pull request?

If the target database name is the same as the current database, we should 
be able to skip one metastore access. 
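
A minimal sketch of the idea (class, field, and parameter names are illustrative, not HiveClientImpl's actual members):

```scala
// Only issue the underlying metastore call when the database actually changes.
class CurrentDbTrackingClient(setOnMetastore: String => Unit) {
  private var currentDb: String = "default"

  def setCurrentDatabase(db: String): Unit = {
    if (db != currentDb) {
      setOnMetastore(db)   // the real metastore round trip would happen here
      currentDb = db
    }
    // if db == currentDb, no metastore access is needed
  }
}
```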

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark remove-redundant

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20565.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20565


commit f1acb5ad9a12df11faf082144c676082df647ff9
Author: Feng Liu <fengliu@...>
Date:   2018-02-10T00:18:44Z

some

commit c29aa3ecb78dc345bb2ad1e5cbaf29d3fdb3a803
Author: Feng Liu <fengliu@...>
Date:   2018-02-10T01:10:53Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20564: [SPARK-23378][SQL] move setCurrentDatabase from H...

2018-02-09 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20564#discussion_r167381054
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1107,11 +1107,6 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   }
 }
 
-// Note: Before altering table partitions in Hive, you *must* set the 
current database
-// to the one that contains the table of interest. Otherwise you will 
end up with the
-// most helpful error message ever: "Unable to alter partition. alter 
is not possible."
-// See HIVE-2742 for more detail.
-client.setCurrentDatabase(db)
--- End diff --

Sorry, I meant `a special case`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20564: [SPARK-23378][SQL] move setCurrentDatabase from H...

2018-02-09 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20564

[SPARK-23378][SQL] move setCurrentDatabase from HiveExternalCatalog to 
HiveClientImpl

## What changes were proposed in this pull request?

This enforces the rule that no call from `HiveExternalCatalog`, except the 
`setCurrentDatabase` method itself, resets the current database in the Hive client. 

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark move

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20564.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20564


commit f1acb5ad9a12df11faf082144c676082df647ff9
Author: Feng Liu <fengliu@...>
Date:   2018-02-10T00:18:44Z

some




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-09 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167352400
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -156,6 +156,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val ALLOW_NESTEDJOIN_FALLBACK = 
buildConf("spark.sql.join.broadcastJoinFallback.enabled")
--- End diff --

nit: the key can be `spark.sql.join.broadcastNestedLoopJoin.enabled`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...

2018-02-09 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20407#discussion_r167351883
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -262,6 +262,10 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
 joins.CartesianProductExec(planLater(left), planLater(right), 
condition) :: Nil
 
   case logical.Join(left, right, joinType, condition) =>
+if (!SQLConf.get.allowNestedJoinFallback) {
+  throw new AnalysisException("The only JOIN strategy available 
for this plan is " +
+s"BroadcastNestedLoopJoin, but 
`${SQLConf.ALLOW_NESTEDJOIN_FALLBACK}` is `false`.")
--- End diff --

nit: `SQLConf.ALLOW_NESTEDJOIN_FALLBACK.key`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20441: [SPARK-23275][SQL] hive/tests have been failing when run...

2018-02-09 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20441
  
@gatorsmile sorry for the late reply. I think the root cause is in the Hive 
metastore. I created a PR to work around it: 
https://github.com/apache/spark/pull/20562


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20562: [SPARK-23275][SQL] fix the thread leaking in hive...

2018-02-09 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20562

[SPARK-23275][SQL] fix the thread leaking in hive/tests

## What changes were proposed in this pull request?

The two lines can actually trigger this Hive metastore bug: 
https://issues.apache.org/jira/browse/HIVE-16844

The two configs are not in the default `ObjectStore` properties, so any Hive 
command run after these two lines will set the `propsChanged` flag in 
`ObjectStore.setConf` and then cause thread leaks.

I don't think the two lines are very useful, so they can be removed completely.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix-omm

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20562.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20562


commit 4e10f34ebb17940a1f0fd54ce26cb16d06320770
Author: Feng Liu <fengliu@...>
Date:   2018-02-09T21:15:36Z

init: remove the two lines




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not ...

2018-02-01 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/17886#discussion_r165441473
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java
 ---
@@ -221,6 +227,70 @@ private void configureSession(Map<String, String> 
sessionConfMap) throws HiveSQL
 }
   }
 
+  // Copy from org.apache.hadoop.hive.ql.processors.SetProcessor, only 
change:
+  // setConf(varname, propName, varvalue, true) when 
varname.startsWith(HIVECONF_PREFIX)
+  public static int setVariable(String varname, String varvalue) throws 
Exception {
+SessionState ss = SessionState.get();
+if (varvalue.contains("\n")){
+  ss.err.println("Warning: Value had a \\n character in it.");
+}
+varname = varname.trim();
+if (varname.startsWith(ENV_PREFIX)){
+  ss.err.println("env:* variables can not be set.");
+  return 1;
+} else if (varname.startsWith(SYSTEM_PREFIX)){
+  String propName = varname.substring(SYSTEM_PREFIX.length());
+  System.getProperties().setProperty(propName,
+  new 
VariableSubstitution().substitute(ss.getConf(),varvalue));
+} else if (varname.startsWith(HIVECONF_PREFIX)){
+  String propName = varname.substring(HIVECONF_PREFIX.length());
+  setConf(varname, propName, varvalue, true);
--- End diff --

The hivevar fix is made by this line: 
https://github.com/apache/spark/pull/17886/files#diff-9d2cd65aaeae992250b5f40d8c289287R56.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--...

2018-01-31 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/17886
  
@gatorsmile this is a great patch. The test can be improved, but I think it 
is safe to merge as is.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19219: [SPARK-21993][SQL] Close sessionState when finish

2018-01-31 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/19219
  
The major issue this PR tries to address has been fixed by 
https://github.com/apache/spark/pull/20029, so I think we are good as long as there are 
no calls to `HiveClientImpl.newSession`. We can close this PR as won't-fix.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are passed...

2018-01-31 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20385
  
Actually, one more thing: do you need to consider the case where the UDT is an 
attribute of a struct type? 
https://github.com/apache/spark/pull/20385/files#diff-842e3447fc453de26c706db1cac8f2c4L467


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are passed...

2018-01-31 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20385
  
LGTM!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20425: [WIP] remove the redundant code in HiveExternalCa...

2018-01-29 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20425

[WIP] remove the redundant code in HiveExternalCatlog and HiveClientImpl

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark refact-hive-external

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20425.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20425


commit 41829e4fadd4b9821be68d8b2a0484eca45711a1
Author: Feng Liu <fengliu@...>
Date:   2018-01-29T18:37:43Z

small




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20420: [SPARK-22916][SQL][FOLLOW-UP] Update the Description of ...

2018-01-29 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20420
  
LGTM! Thanks for doing this!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are...

2018-01-24 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20385#discussion_r163648777
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
 ---
@@ -102,6 +102,8 @@ private[hive] class SparkExecuteStatementOperation(
 to += from.getAs[Timestamp](ordinal)
   case BinaryType =>
 to += from.getAs[Array[Byte]](ordinal)
+  case udt: UserDefinedType[_] =>
+to += from.get(ordinal).toString
--- End diff --

It is possible that `from.get(ordinal)` returns `null`, which would then cause a 
null pointer exception. I think a better way to handle this case is via the 
`HiveUtils.toHiveString` method.
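
A small, self-contained illustration of the hazard (not Spark's actual code):

```scala
// Calling toString on a possibly-null field throws, so the conversion must be guarded
// (or routed through a null-aware formatter such as HiveUtils.toHiveString).
def renderField(value: Any): String =
  if (value == null) "NULL" else value.toString

assert(renderField(null) == "NULL")  // safe; value.toString alone would throw an NPE
```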


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2018-01-24 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20025
  
lgtm


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18983: [SPARK-21771][SQL]remove useless hive client in SparkSQL...

2018-01-19 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/18983
  
LGTM! It is only created once, though.

Frankly, we should completely remove the implementation of the `newSession()` 
method.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20025: [SPARK-22837][SQL]Session timeout checker does no...

2018-01-19 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20025#discussion_r162698093
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java
 ---
@@ -80,7 +76,6 @@ public synchronized void init(HiveConf hiveConf) {
 }
 createBackgroundOperationPool();
 addService(operationManager);
-super.init(hiveConf);
--- End diff --

hmm, I think we should revert this and keep this line too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2018-01-18 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20025
  
@gatorsmile @felixcheung I left one comment, otherwise lgtm.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20025: [SPARK-22837][SQL]Session timeout checker does no...

2018-01-18 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20025#discussion_r162426604
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java
 ---
@@ -79,35 +75,19 @@ public synchronized void init(HiveConf hiveConf) {
   initOperationLogRootDir();
 }
 createBackgroundOperationPool();
-addService(operationManager);
-super.init(hiveConf);
-  }
-
-  private void createBackgroundOperationPool() {
-int poolSize = 
hiveConf.getIntVar(ConfVars.HIVE_SERVER2_ASYNC_EXEC_THREADS);
-LOG.info("HiveServer2: Background operation thread pool size: " + 
poolSize);
-int poolQueueSize = 
hiveConf.getIntVar(ConfVars.HIVE_SERVER2_ASYNC_EXEC_WAIT_QUEUE_SIZE);
-LOG.info("HiveServer2: Background operation thread wait queue size: " 
+ poolQueueSize);
-long keepAliveTime = HiveConf.getTimeVar(
-hiveConf, ConfVars.HIVE_SERVER2_ASYNC_EXEC_KEEPALIVE_TIME, 
TimeUnit.SECONDS);
-LOG.info(
-"HiveServer2: Background operation thread keepalive time: " + 
keepAliveTime + " seconds");
-
-// Create a thread pool with #poolSize threads
-// Threads terminate when they are idle for more than the keepAliveTime
-// A bounded blocking queue is used to queue incoming operations, if 
#operations > poolSize
-String threadPoolName = "HiveServer2-Background-Pool";
-backgroundOperationPool = new ThreadPoolExecutor(poolSize, poolSize,
-keepAliveTime, TimeUnit.SECONDS, new 
LinkedBlockingQueue(poolQueueSize),
-new ThreadFactoryWithGarbageCleanup(threadPoolName));
-backgroundOperationPool.allowCoreThreadTimeOut(true);
-
--- End diff --

I think we can keep this file unchanged. It looks like the Hive thread pool 
is more flexible than the Spark one. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-10 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20174
  
@mgaido91 Your proposal and the current approach are both one-line changes. 
Since the issue is actually related to the hash aggregate implementation, I 
think it is reasonable to include it in the `ReplaceDeduplicateWithAggregate` 
transformation. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...

2018-01-10 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160757582
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends 
Rule[LogicalPlan] {
   Alias(new First(attr).toAggregateExpression(), 
attr.name)(attr.exprId)
 }
   }
-  Aggregate(keys, aggCols, child)
+  // SPARK-22951: the implementation of aggregate operator treats the 
cases with and without
+  // grouping keys differently, when there are not input rows. For the 
aggregation after
+  // `dropDuplicates()` on an empty data frame, a grouping key is 
added here to make sure the
+  // aggregate operator can work correctly (returning an empty 
iterator).
--- End diff --

ok, I like this one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20202: [MINOR] fix a typo in BroadcastJoinSuite

2018-01-09 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20202
  
thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...

2018-01-08 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160305419
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -230,6 +236,7 @@ case class HashAggregateExec(
  | private void $doAgg() throws java.io.IOException {
  |   // initialize aggregation buffer
  |   $initBufVar
+ |   $hasInput = false;
--- End diff --

If the result expression is empty and `hasInput` is `false`, it should 
return an empty iterator (this is exactly the corner case we want to fix). If 
the result expression is empty but `hasInput` is `true`, it should append empty 
rows.  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...

2018-01-08 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160217718
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -230,6 +236,7 @@ case class HashAggregateExec(
  | private void $doAgg() throws java.io.IOException {
  |   // initialize aggregation buffer
  |   $initBufVar
+ |   $hasInput = false;
--- End diff --

I actually tried it, but found it was hard because we don't control the 
codegen of line `/* 038 */   if (shouldStop()) return;` in `doAgg`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...

2018-01-08 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160216751
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -102,10 +102,12 @@ case class HashAggregateExec(
 
   val beforeAgg = System.nanoTime()
   val hasInput = iter.hasNext
-  val res = if (!hasInput && groupingExpressions.nonEmpty) {
-// This is a grouped aggregate and the input iterator is empty,
+  val res = if (!hasInput && (groupingExpressions.nonEmpty || 
resultExpressions.isEmpty)) {
+// The input iterator is empty, and this is a grouped aggregate or 
without result columns,
 // so return an empty iterator.
 Iterator.empty
+  } else if (!hasInput && resultExpressions.isEmpty) {
--- End diff --

oops, forgot to remove it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20174: [SPARK-22951][SQL] aggregate should not produce empty ro...

2018-01-07 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20174
  
test this please



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...

2018-01-06 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160042482
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -245,11 +252,15 @@ case class HashAggregateExec(
|   $doAggFuncName();
|   $aggTime.add((System.nanoTime() - $beforeAgg) / 100);
|
-   |   // output the result
-   |   ${genResult.trim}
+   |   if (!$hasInput && ${resultVars.isEmpty}) {
--- End diff --

I think it hurts code readability if the code for the two cases is 
generated separately. For the regular case, the generated code will look like 
`if (false && !hasInput) ... else ...`. This pattern should be easily optimized 
by the JIT, so we don't need to worry too much about the performance.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20174: S[SPARK-22951][SQL] aggregate should not produce ...

2018-01-06 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20174

S[SPARK-22951][SQL] aggregate should not produce empty rows if data frame 
is empty

## What changes were proposed in this pull request?

 WIP

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix-duplicate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20174.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20174


commit 89d6b7504239211cd64188be50ae32a199f1ddd8
Author: Feng Liu <fengliu@...>
Date:   2018-01-06T07:51:16Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

2018-01-05 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20029
  
lgtm!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

2017-12-29 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20029
  
By [this 
line](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78),
 yes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2017-12-28 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20025
  
My understanding is that the reflection was used because we might be using a 
different version of Hive, so we didn't control what was done inside 
`super.init`. However, now that we have inlined the Hive code, it is safe to call 
the `super.init` method. This is a cleaner way to fix the referenced bug and other 
potential bugs, IMO.
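
A minimal sketch of the suggested direction (class and method names are hypothetical, not Spark's actual thrift server classes):

```scala
class BaseSessionManager {
  def init(conf: Map[String, String]): Unit = {
    // In the real code this would start the background pool, the operation log dir,
    // and the session timeout checker.
  }
}

class SparkSessionManager extends BaseSessionManager {
  override def init(conf: Map[String, String]): Unit = {
    super.init(conf)  // safe to delegate once the parent code is inlined and under our control
    // Spark-specific initialization would go here, without reflection.
  }
}
```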


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...

2017-12-28 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20025
  
@zuotingbing I think all the code in SparkSQLSessionManager.scala should 
go, because it is just some reflection hacks. It should be possible to call 
`super.init(hiveConf)` instead to get the session timeout checker and everything 
else started, shouldn't it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20109: [SPARK-22891][SQL] Make hive client creation thre...

2017-12-28 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20109

[SPARK-22891][SQL] Make hive client creation thread safe

## What changes were proposed in this pull request?

This is to work around the Hive issue: 
https://issues.apache.org/jira/browse/HIVE-11935

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark synchronized

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20109.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20109


commit 163d3443681af2c5ff246ecc546355934c0f6dbb
Author: Feng Liu <fengliu@...>
Date:   2017-12-29T01:02:16Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

2017-12-28 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/20029
  
@zuotingbing I took a close look at the related code and think the issue 
you raised is valid:

1. The hiveClient created for the [resourceLoader](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45) is only used for addJar, which, in turn, adds the jar to the shared [`IsolatedClientLoader`](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L817). So we can just use the shared Hive client for this purpose.

2. Another possible reason to use a new Hive client is to run [this Hive statement](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L818). But I think it is just some leftover from old Spark and should be removed. So overall it is fine to use the shared `client` from `HiveExternalCatalog` without creating a new Hive client.

3. Currently, there is no way to clean up the resources created by a [new session of SQLContext/SparkSession](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78). I couldn't understand the design tradeoff behind [this](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L716) (@srowen). So it is not easy to remove the temp dirs when a session is closed.

4. To what extent does Spark need these scratch dirs? Is it possible to make this step optional if it is not used in all deployment modes?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20099: [SPARK-22916][SQL] shouldn't bias towards build r...

2017-12-27 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/20099

[SPARK-22916][SQL] shouldn't bias towards build right if user does not 
specify

## What changes were proposed in this pull request?

When there are no broadcast hints, the current Spark strategies prefer 
to build right, without considering the sizes of the two sides. This patch adds 
logic that considers the sizes of the two tables when choosing the build side. To 
make the logic clear, the build side is determined in the following steps (see 
the sketch after this list):

1. If there are broadcast hints, the build side is determined by 
`broadcastSideByHints`;
2. If there are no broadcast hints, the build side is determined by 
`broadcastSideByConfig`;
3. If broadcast is disabled by the config, it falls back to the next 
cases.
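
A hedged sketch of that decision order (helper names and the string result are illustrative, not the actual SparkStrategies code):

```scala
def chooseBuildSide(
    leftSize: BigInt,
    rightSize: BigInt,
    leftHinted: Boolean,
    rightHinted: Boolean): String = {
  if (leftHinted || rightHinted) {
    // 1. Broadcast hints take precedence; with hints on both sides, prefer the smaller one.
    if (leftHinted && (!rightHinted || leftSize <= rightSize)) "left" else "right"
  } else {
    // 2. No hints: pick the smaller side instead of always building right.
    if (leftSize <= rightSize) "left" else "right"
  }
}
```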

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix-spark-strategies

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20099.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20099


commit e4b63f5fab81b7637d107efe6524b2f41c681a10
Author: Feng Liu <fengliu@...>
Date:   2017-12-27T23:22:30Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs on requ...

2017-12-16 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/18812
  
I actually did not get the motivation of this PR. HiveThriftServer2 can run 
independently or be started with a SQL context: 
https://github.com/apache/spark/pull/18812/files#diff-709404b0d3defeff035ef0c4f5a960e5L57.
For the latter, this OOM should be handled by the upper layer, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs ...

2017-12-16 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18812#discussion_r157361046
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java
 ---
@@ -39,6 +32,8 @@
 import org.apache.hive.service.cli.thrift.ThriftCLIService;
 import org.apache.hive.service.cli.thrift.ThriftHttpCLIService;
 
+import java.util.Properties;
--- End diff --

Java imports should come first?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs ...

2017-12-16 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18812#discussion_r157361025
  
--- Diff: 
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java
 ---
@@ -37,21 +30,29 @@
 import org.apache.thrift.transport.TServerSocket;
 import org.apache.thrift.transport.TTransportFactory;
 
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.SynchronousQueue;
+import java.util.concurrent.TimeUnit;
--- End diff --

unnecessary change


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19989: [SPARK-22793][SQL]Memory leak in Spark Thrift Server

2017-12-16 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/19989
  
I think this method can take care of resource cleanup automatically: 
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java#L151

Could you actually take a heap dump and find out why the sessions are not 
cleaned up? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19721: [SPARK-22496][SQL]thrift server adds operation logs

2017-12-08 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/19721
  
lgtm!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHo...

2017-10-09 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/19460

[SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder and add a test

## What changes were proposed in this pull request?

We should not break the assumption that the length of the allocated byte 
array is word rounded:  

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170
So we want to use `Integer.MAX_VALUE - 15` instead of `Integer.MAX_VALUE - 
8` as the upper bound of an allocated byte array. 
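
A quick arithmetic check of the word-rounding claim above (plain Scala):

```scala
// Integer.MAX_VALUE - 8 is not a multiple of 8, so it can break the word-rounded
// length assumption; Integer.MAX_VALUE - 15 is the next multiple of 8 below it.
println((Integer.MAX_VALUE - 8) % 8)   // 7 -> not word rounded
println((Integer.MAX_VALUE - 15) % 8)  // 0 -> word rounded
```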

## How was this patch tested?

Since the Spark unit test JVM has less than 1GB of heap, we run the test 
code as a submitted job here, so it can run on a JVM that has 4GB of memory. 

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix_array_max

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19460.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19460


commit 92a6d2d53aea02042d47888e99df5a4f2167cd1f
Author: Feng Liu <feng...@databricks.com>
Date:   2017-10-09T17:43:39Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19266: [SPARK-22033][CORE] BufferHolder, other size chec...

2017-10-07 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19266#discussion_r143346155
  
--- Diff: 
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java
 ---
@@ -35,6 +35,11 @@
  * if the fields of row are all fixed-length, as the size of result row is 
also fixed.
  */
 public class BufferHolder {
+
+  // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual 
max is somewhat
+  // smaller. Be conservative and lower the cap a little.
+  private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
--- End diff --

@srowen I think we can use `Integer.MAX_VALUE - 7` instead of 
`Integer.MAX_VALUE - 8` to make the size align with words; otherwise, this 
check will fail: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170.

This is the reason why all the size inputs to these methods are rounded; for 
example, 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java#L216.
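
The alignment claim can be checked directly:

```scala
println((Integer.MAX_VALUE - 8) % 8)  // 7 -> not word aligned
println((Integer.MAX_VALUE - 7) % 8)  // 0 -> word aligned
```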


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...

2017-10-02 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19394#discussion_r142290483
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -280,13 +280,20 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] 
with Logging with Serializ
 results.toArray
   }
 
+  private[spark] def executeCollectIterator(): (Long, 
Iterator[InternalRow]) = {
+val countsAndBytes = getByteArrayRdd().collect()
--- End diff --

This still fetches all the compressed rows to the driver before building 
the hashed relation. Ideally, you should fetch the rows from the executors 
incrementally. 
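
One hedged illustration of the incremental idea (not this PR's code) is `RDD.toLocalIterator`, which pulls one partition at a time to the driver instead of collecting everything at once:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("incremental-fetch").getOrCreate()
val rdd = spark.range(0, 1000000).rdd

// collect() would materialize every row on the driver at once; toLocalIterator only
// keeps one partition's rows on the driver at a time.
rdd.toLocalIterator.take(5).foreach(println)
```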


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-17 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19230#discussion_r139329522
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 ---
@@ -16,6 +16,7 @@
  */
 package org.apache.spark.sql.execution.vectorized;
 
+import org.apache.spark.api.java.function.Function;
--- End diff --

oops, reverted it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-17 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19230#discussion_r139329523
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorSuite.scala
 ---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class ColumnVectorSuite extends SparkFunSuite with BeforeAndAfterEach {
+
+  var testVector: WritableColumnVector = _
+
+  private def allocate(capacity: Int, dt: DataType): WritableColumnVector 
= {
+new OnHeapColumnVector(capacity, dt)
+  }
+
+  override def afterEach(): Unit = {
+testVector.close()
+  }
+
+  test("boolean") {
+testVector = allocate(10, BooleanType)
+(0 until 10).foreach { i =>
+  testVector.appendBoolean(i % 2 == 0)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, BooleanType) === (i % 2 == 0))
+}
+  }
+
+  test("byte") {
+testVector = allocate(10, ByteType)
+(0 until 10).foreach { i =>
+  testVector.appendByte(i.toByte)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, ByteType) === (i.toByte))
+}
+  }
+
+  test("short") {
+testVector = allocate(10, ShortType)
+(0 until 10).foreach { i =>
+  testVector.appendShort(i.toShort)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, ShortType) === (i.toShort))
+}
+  }
+
+  test("int") {
+testVector = allocate(10, IntegerType)
+(0 until 10).foreach { i =>
+  testVector.appendInt(i)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, IntegerType) === i)
+}
+  }
+
+  test("long") {
+testVector = allocate(10, LongType)
+(0 until 10).foreach { i =>
+  testVector.appendLong(i)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, LongType) === i)
+}
+  }
+
+  test("float") {
+testVector = allocate(10, FloatType)
+(0 until 10).foreach { i =>
+  testVector.appendFloat(i.toFloat)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, FloatType) === i.toFloat)
+}
+  }
+
+  test("double") {
+testVector = allocate(10, DoubleType)
+(0 until 10).foreach { i =>
+  testVector.appendDouble(i.toDouble)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, DoubleType) === i.toDouble)
+}
+  }
+
+  test("string") {
+testVector = allocate(10, StringType)
+(0 until 10).map { i =>
+  val utf8 = s"str$i".getBytes("utf8")
+  testVector.appendByteArray(utf8, 0, utf8.length)
+}
+
+val array = new ColumnVector.Array(testVector)
+
+(0 until 10).foreach { i =>
+  assert(array.get(i, StringType) === UTF8String.fromString(s"str$i"))
+}
+  }
+
+  test("binary") {
+testVector = allocate(10, BinaryType)
+(0 until 10).ma

[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...

2017-09-15 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/19230
  
@viirya @cloud-fan  unit test updated. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-15 Thread liufengdb
Github user liufengdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/19230#discussion_r139072681
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 ---
@@ -99,73 +100,18 @@ public ArrayData copy() {
 @Override
 public Object[] array() {
   DataType dt = data.dataType();
+  Function<Integer, Object> getAtMethod = (Function<Integer, Object>) 
i -> get(i, dt);
   Object[] list = new Object[length];
-
-  if (dt instanceof BooleanType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getBoolean(offset + i);
-  }
-}
-  } else if (dt instanceof ByteType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getByte(offset + i);
-  }
-}
-  } else if (dt instanceof ShortType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getShort(offset + i);
-  }
-}
-  } else if (dt instanceof IntegerType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getInt(offset + i);
-  }
-}
-  } else if (dt instanceof FloatType) {
-for (int i = 0; i < length; i++) {
-  if (!data.isNullAt(offset + i)) {
-list[i] = data.getFloat(offset + i);
-  }
-}
-  } else if (dt instanceof DoubleType) {
+  try {
 for (int i = 0; i < length; i++) {
   if (!data.isNullAt(offset + i)) {
-list[i] = data.getDouble(offset + i);
+list[i] = getAtMethod.call(i);
--- End diff --

It should be `get(i, dt)`? I updated it anyway. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...

2017-09-13 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/19230

[SPARK-22003][SQL] support array column in vectorized reader with UDF

## What changes were proposed in this pull request?

The UDF needs to deserialize the `UnsafeRow`. When the column type is 
Array, the `get` method of the `ColumnVector` used by the vectorized reader is 
called, but this method is not implemented. 
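
A hedged repro sketch of the scenario (the path is hypothetical, and this is only an illustration of the code path, not the test added in this PR): reading an array column through the vectorized Parquet reader and passing it to a UDF forces the array to be deserialized.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

val path = "/tmp/array_col_parquet"  // hypothetical path
Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("xs").write.mode("overwrite").parquet(path)

val sumUdf = udf((xs: Seq[Int]) => xs.sum)             // forces deserialization of the array column
spark.read.parquet(path).select(sumUdf($"xs")).show()  // exercises ColumnVector.get on arrays
```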

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix_array_open

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19230.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19230


commit adbaeabf18ee1f96611ecbd6ee627bc0a457289d
Author: Feng Liu <feng...@databricks.com>
Date:   2017-09-12T21:56:55Z

init




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18400: [SPARK-21188][CORE] releaseAllLocksForTask should...

2017-06-22 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/18400

[SPARK-21188][CORE] releaseAllLocksForTask should synchronize the whole 
method

## What changes were proposed in this pull request?

Since the objects `readLocksByTask`, `writeLocksByTask` and the `info`s are 
coupled and may be modified by other threads concurrently, all reads and 
writes of them in the method `releaseAllLocksForTask` should be protected 
by a single synchronized block, like other similar methods. 
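
A minimal sketch of the synchronization point (the names are illustrative, not BlockInfoManager's actual fields): reads and writes of coupled mutable state should sit inside one synchronized block rather than be interleaved with unsynchronized code.

```scala
import scala.collection.mutable

class LockRegistry {
  private val readLocksByTask = mutable.Map[Long, Set[String]]()
  private val writeLocksByTask = mutable.Map[Long, Set[String]]()

  def releaseAllLocksForTask(taskId: Long): Unit = synchronized {
    // Both maps are read and cleared under the same lock, so no other thread can
    // observe or mutate them halfway through the release.
    readLocksByTask.remove(taskId)
    writeLocksByTask.remove(taskId)
  }
}
```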

## How was this patch tested?

existing tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark synchronize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18400.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18400


commit 5cdd328ee9a32969377cbdbfea229cc364dbee17
Author: Feng Liu <feng...@databricks.com>
Date:   2017-06-23T05:24:07Z

init




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18208: [SPARK-20991] BROADCAST_TIMEOUT conf should be a ...

2017-06-05 Thread liufengdb
GitHub user liufengdb opened a pull request:

https://github.com/apache/spark/pull/18208

[SPARK-20991] BROADCAST_TIMEOUT conf should be a TimeoutConf

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

The construction of the BROADCAST_TIMEOUT conf should take a TimeUnit 
argument, i.e. it should be declared as a timeout conf.

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liufengdb/spark fix_timeout

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18208


commit 8a2d37a10cd6eb36403006b99a33a7d057905e6e
Author: Feng Liu <feng...@databricks.com>
Date:   2017-06-05T20:00:55Z

small




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17397: [SPARK-20070][SQL] Redact DataSourceScanExec treeString

2017-03-23 Thread liufengdb
Github user liufengdb commented on the issue:

https://github.com/apache/spark/pull/17397
  
lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org