[GitHub] spark issue #20864: [SPARK-23745][SQL]Remove the directories of the “hive....
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20864 I thought the directory is also created from this line: https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java#L143. For this one, we need to think about whether we can remove all the temp directory creation, because the statements are executed by Spark SQL and have nothing to do with the Hive code in the thrift server. You are right that HiveClientImpl (the Hive inside Spark SQL) will also produce such temp directories. However, it seems the following line alone is sufficient to add the jar to the class loader: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L836. So I doubt we still need the `runSqlHive(s"ADD JAR $path")` to download the jar to a temp directory. Overall, I think we need a holistic design to remove the Hive legacy in both the thrift server and Spark SQL. Adding more temporary fixes will make such a design harder. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18666: [SPARK-21449][SQL][Hive]Close HiveClient's SessionState ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/18666 Maybe I missed something, but it seems Spark has its own class loader right now, which can load the class from the given URL: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L211. I doubt it needs to download the jar to the Hive-configured temp directory. In other words, this line of code is not necessary: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L837. This can be verified very easily: remove https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L837, run `add jar`, and see whether it still works.
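The point above is that a JVM class loader can read a jar in place from a URL, with no staging copy into a Hive-configured temp directory. A minimal standalone sketch of that idea (not Spark's actual code; the jar path is hypothetical):

```java
import java.net.URL;
import java.net.URLClassLoader;

// Sketch: a URLClassLoader resolves classes straight from the jar URL it is
// given; nothing requires downloading the jar to a separate temp directory
// first. The file path here is a placeholder for illustration only.
class AddJarSketch {
    static ClassLoader loaderFor(URL jarUrl) {
        // The jar is read lazily, in place, when a class is first requested.
        return new URLClassLoader(new URL[] { jarUrl },
                AddJarSketch.class.getClassLoader());
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL("file:///tmp/example.jar");  // hypothetical jar
        ClassLoader cl = loaderFor(url);
        System.out.println(cl != null);  // loader constructed without staging
    }
}
```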
[GitHub] spark issue #18666: [SPARK-21449][SQL][Hive]Close HiveClient's SessionState ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/18666 I asked the following question in https://github.com/apache/spark/pull/20864: is it necessary to create these temp directories when the hive thrift server starts? It sounds like some legacy from Hive, and we could skip creating them in the first place.
[GitHub] spark issue #20864: [SPARK-23745][SQL]Remove the directories of the “hive....
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20864 @samartinucci @zuotingbing a high-level question: is it necessary to create these temp directories when the hive thrift server starts? It sounds like some legacy from Hive, and we could skip creating them in the first place.
[GitHub] spark issue #20702: [SPARK-23547][SQL]Cleanup the .pipeout file when the Hiv...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20702 lgtm!
[GitHub] spark pull request #20681: [SPARK-23518][SQL] Avoid metastore access when th...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20681#discussion_r171770982 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -67,6 +67,8 @@ sparkSession <- if (windows_with_hadoop()) { sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) } sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getJavaSparkContext", sparkSession) +# materialize the catalog implementation +listTables() --- End diff -- `test_sparkSQL.R` is the only suite that uses `newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc, FALSE)` on the `ssc`, so the catalog implementation Spark conf is changed there, and `test_sparkSQL.R` is the only suite broken.
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20681 retest this please
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20681 Overall, I think this suite needs a refactoring: split it into an in-memory catalog suite and a Hive catalog suite. The catalog conf should not be manipulated after the Spark context is created. The other way is just a hack.
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20681 We can remove the test, but that is not good practice. We don't know exactly why the test was added, or which hidden assumption its author wanted to guarantee, right?
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20681 My original plan to fix the test does not work, because of this test: https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343 The new plan is to run some simple catalog commands immediately after the Spark session is created, so the catalog is materialized (like the old behavior).
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20681 retest this please
[GitHub] spark issue #20681: [SPARK-23518][SQL] Avoid metastore access when the users...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20681 @felixcheung Can you take a look at the changes in the R tests?
[GitHub] spark pull request #20681: [SPARK-23518][SQL] Completely remove metastore ac...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/20681 [SPARK-23518][SQL] Completely remove metastore access if the query is not using tables ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/18944 added one patch, which allowed a Spark session to be created when the Hive metastore server is down. However, it did not allow running any commands with that Spark session, so users could not read or write data frames while the Hive metastore server was down. ## How was this patch tested? Added some unit tests that read and write data frames, based on the original HiveMetastoreLazyInitializationSuite. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark completely-lazy Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20681.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20681 commit b447beab9e43595d61ef0f51c91a09fd2f72a87e Author: Feng Liu <fengliu@...> Date: 2018-02-26T07:08:47Z lazy
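The lazy-initialization idea behind this PR can be illustrated with a small standalone sketch (all names here are hypothetical stand-ins, not Spark's classes): the metastore client is only constructed on first table access, so a query that touches no tables never connects at all.

```java
// Sketch of lazy catalog materialization: "connections" counts how many
// times the (simulated) metastore was contacted. Creating the catalog
// costs nothing; only the first table lookup triggers a connection.
class LazyCatalog {
    static int connections = 0;
    private String client;  // null until first table access

    String getTable(String name) {
        if (client == null) {       // metastore is touched only here
            connections++;
            client = "metastore";
        }
        return client + ":" + name;
    }

    public static void main(String[] args) {
        LazyCatalog catalog = new LazyCatalog();
        System.out.println(connections);  // session exists, no metastore access yet
        catalog.getTable("t1");
        catalog.getTable("t2");
        System.out.println(connections);  // connected exactly once, on first access
    }
}
```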
[GitHub] spark issue #20557: [SPARK-23364][SQL]'desc table' command in spark-sql add ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20557 There may be some Spark JDBC/ODBC drivers that need to parse the returned results to get all the columns. We should avoid making changes to the returned "schema" on the server side. You can fix the issue on the client side.
[GitHub] spark pull request #20565: [SPARK-23379][SQL] remove redundant metastore...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/20565 [SPARK-23379][SQL] remove redundant metastore access ## What changes were proposed in this pull request? If the target database name is the same as the current database, we should be able to skip one metastore access. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark remove-redundant Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20565.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20565 commit f1acb5ad9a12df11faf082144c676082df647ff9 Author: Feng Liu <fengliu@...> Date: 2018-02-10T00:18:44Z some commit c29aa3ecb78dc345bb2ad1e5cbaf29d3fdb3a803 Author: Feng Liu <fengliu@...> Date: 2018-02-10T01:10:53Z init
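The redundancy being removed reduces to a simple guard; a minimal sketch (class and method names are hypothetical, not the real HiveClientImpl code): only issue the metastore call when the target database actually differs from the current one.

```java
// Sketch of skipping a redundant metastore round-trip: switching to the
// database that is already current performs no call at all.
class DbSwitch {
    static int metastoreCalls = 0;
    private String currentDb = "default";

    void setCurrentDatabase(String db) {
        if (!db.equals(currentDb)) {  // skip the RPC when nothing changes
            metastoreCalls++;
            currentDb = db;
        }
    }

    public static void main(String[] args) {
        DbSwitch s = new DbSwitch();
        s.setCurrentDatabase("default");  // same db: no metastore access
        s.setCurrentDatabase("sales");    // different db: one access
        System.out.println(metastoreCalls);
    }
}
```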
[GitHub] spark pull request #20564: [SPARK-23378][SQL] move setCurrentDatabase from H...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20564#discussion_r167381054 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -1107,11 +1107,6 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat } } -// Note: Before altering table partitions in Hive, you *must* set the current database -// to the one that contains the table of interest. Otherwise you will end up with the -// most helpful error message ever: "Unable to alter partition. alter is not possible." -// See HIVE-2742 for more detail. -client.setCurrentDatabase(db) --- End diff -- Sorry, I meant `a special case`.
[GitHub] spark pull request #20564: [SPARK-23378][SQL] move setCurrentDatabase from H...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/20564 [SPARK-23378][SQL] move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl ## What changes were proposed in this pull request? This enforces the rule that no calls from `HiveExternalCatalog` reset the current database in the hive client, except the `setCurrentDatabase` method. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark move Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20564.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20564 commit f1acb5ad9a12df11faf082144c676082df647ff9 Author: Feng Liu <fengliu@...> Date: 2018-02-10T00:18:44Z some
[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20407#discussion_r167352400 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -156,6 +156,15 @@ object SQLConf { .booleanConf .createWithDefault(true) + val ALLOW_NESTEDJOIN_FALLBACK = buildConf("spark.sql.join.broadcastJoinFallback.enabled") --- End diff -- nit: the key can be `spark.sql.join.broadcastNestedLoopJoin.enabled`
[GitHub] spark pull request #20407: [SPARK-23124][SQL] Allow to disable BroadcastNest...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20407#discussion_r167351883 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -262,6 +262,10 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { joins.CartesianProductExec(planLater(left), planLater(right), condition) :: Nil case logical.Join(left, right, joinType, condition) => +if (!SQLConf.get.allowNestedJoinFallback) { + throw new AnalysisException("The only JOIN strategy available for this plan is " + +s"BroadcastNestedLoopJoin, but `${SQLConf.ALLOW_NESTEDJOIN_FALLBACK}` is `false`.") --- End diff -- nit: `SQLConf.ALLOW_NESTEDJOIN_FALLBACK.key`
[GitHub] spark issue #20441: [SPARK-23275][SQL] hive/tests have been failing when run...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20441 @gatorsmile sorry for the late reply. I think the root cause is in the Hive metastore. I created one PR to bypass it: https://github.com/apache/spark/pull/20562
[GitHub] spark pull request #20562: [SPARK-23275][SQL] fix the thread leaking in hive...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/20562 [SPARK-23275][SQL] fix the thread leaking in hive/tests ## What changes were proposed in this pull request? These two lines can actually trigger the Hive metastore bug: https://issues.apache.org/jira/browse/HIVE-16844 The two configs are not in the default `ObjectStore` properties, so any Hive commands run after these two lines will set the `propsChanged` flag in `ObjectStore.setConf` and then cause thread leaks. I don't think the two lines are very useful, so they can be removed completely. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark fix-omm Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20562.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20562 commit 4e10f34ebb17940a1f0fd54ce26cb16d06320770 Author: Feng Liu <fengliu@...> Date: 2018-02-09T21:15:36Z init: remove the two lines
[GitHub] spark pull request #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not ...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/17886#discussion_r165441473 --- Diff: sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java --- @@ -221,6 +227,70 @@ private void configureSession(Map<String, String> sessionConfMap) throws HiveSQL } } + // Copy from org.apache.hadoop.hive.ql.processors.SetProcessor, only change: + // setConf(varname, propName, varvalue, true) when varname.startsWith(HIVECONF_PREFIX) + public static int setVariable(String varname, String varvalue) throws Exception { +SessionState ss = SessionState.get(); +if (varvalue.contains("\n")){ + ss.err.println("Warning: Value had a \\n character in it."); +} +varname = varname.trim(); +if (varname.startsWith(ENV_PREFIX)){ + ss.err.println("env:* variables can not be set."); + return 1; +} else if (varname.startsWith(SYSTEM_PREFIX)){ + String propName = varname.substring(SYSTEM_PREFIX.length()); + System.getProperties().setProperty(propName, + new VariableSubstitution().substitute(ss.getConf(),varvalue)); +} else if (varname.startsWith(HIVECONF_PREFIX)){ + String propName = varname.substring(HIVECONF_PREFIX.length()); + setConf(varname, propName, varvalue, true); --- End diff -- The hivevar fix comes from this line: https://github.com/apache/spark/pull/17886/files#diff-9d2cd65aaeae992250b5f40d8c289287R56.
[GitHub] spark issue #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/17886 @gatorsmile this is a great patch. The test can be improved, but I think it is safe to merge as is.
[GitHub] spark issue #19219: [SPARK-21993][SQL] Close sessionState when finish
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/19219 The major issue this PR tries to cover has been fixed by https://github.com/apache/spark/pull/20029, so I think we are good as long as there are no calls to `HiveClientImpl.newSession`. We can close this PR as not-fixed.
[GitHub] spark issue #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are passed...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20385 Actually, one more thing, do you need to consider the UDT as one attribute of a structured type? https://github.com/apache/spark/pull/20385/files#diff-842e3447fc453de26c706db1cac8f2c4L467
[GitHub] spark issue #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are passed...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20385 LGTM!
[GitHub] spark pull request #20425: [WIP] remove the redundant code in HiveExternalCa...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/20425 [WIP] remove the redundant code in HiveExternalCatalog and HiveClientImpl ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark refact-hive-external Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20425.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20425 commit 41829e4fadd4b9821be68d8b2a0484eca45711a1 Author: Feng Liu <fengliu@...> Date: 2018-01-29T18:37:43Z small
[GitHub] spark issue #20420: [SPARK-22916][SQL][FOLLOW-UP] Update the Description of ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20420 LGTM! Thanks for doing this!
[GitHub] spark pull request #20385: [SPARK-21396][SQL] Fixes MatchError when UDTs are...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20385#discussion_r163648777 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala --- @@ -102,6 +102,8 @@ private[hive] class SparkExecuteStatementOperation( to += from.getAs[Timestamp](ordinal) case BinaryType => to += from.getAs[Array[Byte]](ordinal) + case udt: UserDefinedType[_] => +to += from.get(ordinal).toString --- End diff -- It is possible `from.get(ordinal)` returns `null`, which would then cause a null pointer exception. I think a better way to add this case is via the `HiveUtils.toHiveString` method.
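The null-safety concern raised here is generic; a minimal sketch (a hypothetical helper, not Spark's `HiveUtils.toHiveString`) shows why the conversion must check for null before calling `toString()`:

```java
// Sketch: converting a possibly-null column value to its display string.
// Calling value.toString() directly would throw NullPointerException on
// null; the guard turns it into an explicit marker instead.
class NullSafeString {
    static String toDisplayString(Object value) {
        return value == null ? "NULL" : value.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDisplayString(null));  // no NPE
        System.out.println(toDisplayString(42));
    }
}
```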
[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20025 lgtm
[GitHub] spark issue #18983: [SPARK-21771][SQL]remove useless hive client in SparkSQL...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/18983 LGTM! It is only created once, though. Frankly, we should completely remove the implementation of the `newSession()` method.
[GitHub] spark pull request #20025: [SPARK-22837][SQL]Session timeout checker does no...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20025#discussion_r162698093 --- Diff: sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java --- @@ -80,7 +76,6 @@ public synchronized void init(HiveConf hiveConf) { } createBackgroundOperationPool(); addService(operationManager); -super.init(hiveConf); --- End diff -- hmm, I think we should keep this line too.
[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20025 @gatorsmile @felixcheung I left one comment, otherwise lgtm.
[GitHub] spark pull request #20025: [SPARK-22837][SQL]Session timeout checker does no...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20025#discussion_r162426604 --- Diff: sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java --- @@ -79,35 +75,19 @@ public synchronized void init(HiveConf hiveConf) { initOperationLogRootDir(); } createBackgroundOperationPool(); -addService(operationManager); -super.init(hiveConf); - } - - private void createBackgroundOperationPool() { -int poolSize = hiveConf.getIntVar(ConfVars.HIVE_SERVER2_ASYNC_EXEC_THREADS); -LOG.info("HiveServer2: Background operation thread pool size: " + poolSize); -int poolQueueSize = hiveConf.getIntVar(ConfVars.HIVE_SERVER2_ASYNC_EXEC_WAIT_QUEUE_SIZE); -LOG.info("HiveServer2: Background operation thread wait queue size: " + poolQueueSize); -long keepAliveTime = HiveConf.getTimeVar( -hiveConf, ConfVars.HIVE_SERVER2_ASYNC_EXEC_KEEPALIVE_TIME, TimeUnit.SECONDS); -LOG.info( -"HiveServer2: Background operation thread keepalive time: " + keepAliveTime + " seconds"); - -// Create a thread pool with #poolSize threads -// Threads terminate when they are idle for more than the keepAliveTime -// A bounded blocking queue is used to queue incoming operations, if #operations > poolSize -String threadPoolName = "HiveServer2-Background-Pool"; -backgroundOperationPool = new ThreadPoolExecutor(poolSize, poolSize, -keepAliveTime, TimeUnit.SECONDS, new LinkedBlockingQueue(poolQueueSize), -new ThreadFactoryWithGarbageCleanup(threadPoolName)); -backgroundOperationPool.allowCoreThreadTimeOut(true); - --- End diff -- I think we can keep this file unchanged. Looks like the hive thread pool is more flexible than the spark one.
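The Hive code being discussed wires up a fixed-size background pool with a bounded queue and idle-timeout on core threads. A self-contained sketch of that same pattern (the sizes here are arbitrary stand-ins for the HiveConf-driven values, and the Hive-specific thread factory is omitted):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of the bounded background-operation pool: poolSize worker threads,
// a bounded queue absorbing bursts beyond poolSize, and core threads
// reclaimed after keepAliveSeconds of idleness.
class BackgroundPoolSketch {
    public static void main(String[] args) throws Exception {
        int poolSize = 4;            // stand-in for HIVE_SERVER2_ASYNC_EXEC_THREADS
        int poolQueueSize = 16;      // stand-in for ...ASYNC_EXEC_WAIT_QUEUE_SIZE
        long keepAliveSeconds = 10;  // stand-in for ...ASYNC_EXEC_KEEPALIVE_TIME

        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            poolSize, poolSize, keepAliveSeconds, TimeUnit.SECONDS,
            new LinkedBlockingQueue<Runnable>(poolQueueSize));
        pool.allowCoreThreadTimeOut(true);  // idle core threads terminate

        pool.submit(() -> System.out.println("operation ran"));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```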
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20174 @mgaido91 Your proposal and the current approach are both one-line changes. Since the issue is actually related to the hash aggregate implementation, I think it is reasonable to include it in the `ReplaceDeduplicateWithAggregate` transformation.
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160757582 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] { Alias(new First(attr).toAggregateExpression(), attr.name)(attr.exprId) } } - Aggregate(keys, aggCols, child) + // SPARK-22951: the implementation of aggregate operator treats the cases with and without + // grouping keys differently, when there are no input rows. For the aggregation after + // `dropDuplicates()` on an empty data frame, a grouping key is added here to make sure the + // aggregate operator can work correctly (returning an empty iterator). --- End diff -- ok, I like this one.
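The semantic difference driving this fix can be simulated in a few lines (a toy model, not Spark's aggregate operator): on empty input, a global aggregate with no grouping keys yields one row, while a grouped aggregate yields zero rows — which is why adding a grouping key makes the rewritten deduplicate return no rows for an empty data frame.

```java
import java.util.Collections;
import java.util.List;

// Toy model of SQL aggregate output cardinality on empty input:
// a global aggregate (no grouping keys) produces exactly one row
// (e.g. count() == 0), while a grouped aggregate produces none.
class AggOnEmpty {
    static int rowsOut(List<Integer> input, boolean grouped) {
        if (input.isEmpty()) {
            return grouped ? 0 : 1;  // the asymmetry the fix relies on
        }
        return grouped ? (int) input.stream().distinct().count() : 1;
    }

    public static void main(String[] args) {
        List<Integer> empty = Collections.emptyList();
        System.out.println(rowsOut(empty, false));  // global: one row
        System.out.println(rowsOut(empty, true));   // grouped: zero rows
    }
}
```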
[GitHub] spark issue #20202: [MINOR] fix a typo in BroadcastJoinSuite
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20202 thanks!
[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160305419 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -230,6 +236,7 @@ case class HashAggregateExec( | private void $doAgg() throws java.io.IOException { | // initialize aggregation buffer | $initBufVar + | $hasInput = false; --- End diff -- If the result expression is empty and `hasInput` is `false`, it should return an empty iterator (this is exactly the corner case we want to fix). If the result expression is empty but `hasInput` is `true`, it should append empty rows.
[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160217718 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -230,6 +236,7 @@ case class HashAggregateExec( | private void $doAgg() throws java.io.IOException { | // initialize aggregation buffer | $initBufVar + | $hasInput = false; --- End diff -- I actually tried it, but found it was hard because we don't control the codegen of line `/* 038 */ if (shouldStop()) return;` in `doAgg`.
[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160216751 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -102,10 +102,12 @@ case class HashAggregateExec( val beforeAgg = System.nanoTime() val hasInput = iter.hasNext - val res = if (!hasInput && groupingExpressions.nonEmpty) { -// This is a grouped aggregate and the input iterator is empty, + val res = if (!hasInput && (groupingExpressions.nonEmpty || resultExpressions.isEmpty)) { +// The input iterator is empty, and this is a grouped aggregate or without result columns, // so return an empty iterator. Iterator.empty + } else if (!hasInput && resultExpressions.isEmpty) { --- End diff -- oops, forgot to remove it.
[GitHub] spark issue #20174: [SPARK-22951][SQL] aggregate should not produce empty ro...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20174 test this please
[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce e...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160042482 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -245,11 +252,15 @@ case class HashAggregateExec( | $doAggFuncName(); | $aggTime.add((System.nanoTime() - $beforeAgg) / 100); | - | // output the result - | ${genResult.trim} + | if (!$hasInput && ${resultVars.isEmpty}) { --- End diff -- I think it hurts code readability if the code for the two cases is defined separately. For the regular case, the generated code will look like `if (false && !hasInput) ... else ...`. This pattern should be easily optimized by the JIT, so we don't need to worry too much about performance.
[GitHub] spark pull request #20174: [SPARK-22951][SQL] aggregate should not produce ...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/20174 [SPARK-22951][SQL] aggregate should not produce empty rows if data frame is empty ## What changes were proposed in this pull request? WIP ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark fix-duplicate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20174.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20174 commit 89d6b7504239211cd64188be50ae32a199f1ddd8 Author: Feng Liu <fengliu@...> Date: 2018-01-06T07:51:16Z init
[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20029 lgtm!
[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20029 By [this line](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78), yes.
[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20025 My understanding is that the reflection was used because we might use a different version of Hive, so we had no control over what was done inside `super.init`. However, now that we have inlined the Hive code, it is safe to call the `super.init` method. This is a cleaner way to fix the referenced bug and other potential bugs, IMO.
[GitHub] spark issue #20025: [SPARK-22837][SQL]Session timeout checker does not work ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20025 @zuotingbing I think all the code in SparkSQLSessionManager.scala should go away, because it is just reflection hacks. It should be possible to call `super.init(hiveConf)` instead to start the session timeout checker and everything else, shouldn't it?
[GitHub] spark pull request #20109: [SPARK-22891][SQL] Make hive client creation thre...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/20109 [SPARK-22891][SQL] Make hive client creation thread safe ## What changes were proposed in this pull request? This is to work around the Hive issue: https://issues.apache.org/jira/browse/HIVE-11935 ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark synchronized Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20109.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20109 commit 163d3443681af2c5ff246ecc546355934c0f6dbb Author: Feng Liu <fengliu@...> Date: 2017-12-29T01:02:16Z init
[GitHub] spark issue #20029: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/20029 @zuotingbing I took a close look at the related code and think the issue you raised is valid: 1. The hiveClient created for the [resourceLoader](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45) is only used for addJar, which, in turn, adds the jar to the shared [`IsolatedClientLoader`](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L817). So we can just use the shared hive client for this purpose. 2. Another possible reason to use a new hive client is to run [this hive statement](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L818). But I think it is just a leftover from old Spark and should be removed. So overall it is fine to use the shared `client` from `HiveExternalCatalog` without creating a new hive client. 3. Currently, there is no way to clean up the resources created by a [new session of SQLContext/SparkSession](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L78). I couldn't understand the design tradeoff behind [this](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L716) (@srowen). So it is not easy to remove the temp dirs when a session is closed. 4. To what extent does Spark need these scratch dirs? Is it possible to make this step optional if it is not used for all the deployment modes?
[GitHub] spark pull request #20099: [SPARK-22916][SQL] shouldn't bias towards build r...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/20099 [SPARK-22916][SQL] shouldn't bias towards build right if user does not specify ## What changes were proposed in this pull request? When there are no broadcast hints, the current Spark strategies prefer to build right, without considering the sizes of the two sides. This patch adds logic that considers the sizes of the two tables when choosing the build side. To make the logic clear, the build side is determined in three steps: 1. If there are broadcast hints, the build side is determined by `broadcastSideByHints`; 2. If there are no broadcast hints, the build side is determined by `broadcastSideByConfig`; 3. If broadcasting is disabled by the config, it falls back to the next cases. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark fix-spark-strategies Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20099.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20099 commit e4b63f5fab81b7637d107efe6524b2f41c681a10 Author: Feng Liu <fengliu@...> Date: 2017-12-27T23:22:30Z init
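The selection logic the PR describes can be sketched as follows (a hypothetical, heavily simplified standalone sketch — names like `chooseBuildSide` are illustrative and do not match Spark's actual `SparkStrategies` code): an explicit hint wins; with no hints, pick the smaller side by estimated size rather than always defaulting to build-right.

```java
public class BuildSideSketch {
    enum BuildSide { BUILD_LEFT, BUILD_RIGHT }

    // Hypothetical simplification of the three-step selection above:
    // step 1: honor an unambiguous broadcast hint;
    // step 2: with no hints, compare estimated sizes of the two sides.
    static BuildSide chooseBuildSide(boolean leftHint, boolean rightHint,
                                     long leftSizeBytes, long rightSizeBytes) {
        if (leftHint && !rightHint) return BuildSide.BUILD_LEFT;
        if (rightHint && !leftHint) return BuildSide.BUILD_RIGHT;
        // No (or ambiguous) hints: build the smaller side, not always the right one.
        return leftSizeBytes < rightSizeBytes ? BuildSide.BUILD_LEFT : BuildSide.BUILD_RIGHT;
    }

    public static void main(String[] args) {
        // Without hints, a small left table is now preferred as the build side.
        System.out.println(chooseBuildSide(false, false, 10L, 1000L));
    }
}
```

(Step 3 — falling back to a non-broadcast join when broadcasting is disabled — is omitted from the sketch for brevity.)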
[GitHub] spark issue #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs on requ...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/18812 I actually did not get the motivation of this PR. HiveThriftServer2 can run independently or be started with a SQL context: https://github.com/apache/spark/pull/18812/files#diff-709404b0d3defeff035ef0c4f5a960e5L57. For the latter, this OOM should be handled by the upper layer, right?
[GitHub] spark pull request #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs ...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18812#discussion_r157361046 --- Diff: sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java --- @@ -39,6 +32,8 @@ import org.apache.hive.service.cli.thrift.ThriftCLIService; import org.apache.hive.service.cli.thrift.ThriftHttpCLIService; +import java.util.Properties; --- End diff -- Java imports should come first?
[GitHub] spark pull request #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs ...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/18812#discussion_r157361025 --- Diff: sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java --- @@ -37,21 +30,29 @@ import org.apache.thrift.transport.TServerSocket; import org.apache.thrift.transport.TTransportFactory; +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.SynchronousQueue; +import java.util.concurrent.TimeUnit; --- End diff -- unnecessary change
[GitHub] spark issue #19989: [SPARK-22793][SQL]Memory leak in Spark Thrift Server
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/19989 I think this method can take care of resource cleanup automatically: https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/SessionManager.java#L151 Could you make a heap dump and find out why the sessions are not cleaned up?
[GitHub] spark issue #19721: [SPARK-22496][SQL]thrift server adds operation logs
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/19721 lgtm!
[GitHub] spark pull request #19460: [SPARK-22222][core] Fix the ARRAY_MAX in BufferHo...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/19460 [SPARK-22222][core] Fix the ARRAY_MAX in BufferHolder and add a test ## What changes were proposed in this pull request? We should not break the assumption that the length of the allocated byte array is word rounded: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170 So we want to use `Integer.MAX_VALUE - 15` instead of `Integer.MAX_VALUE - 8` as the upper bound of an allocated byte array. ## How was this patch tested? Since the Spark unit test JVM has less than 1GB heap, we run the test code as a submit job, so it can run on a JVM with 4GB of memory. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark fix_array_max Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19460.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19460 commit 92a6d2d53aea02042d47888e99df5a4f2167cd1f Author: Feng Liu <feng...@databricks.com> Date: 2017-10-09T17:43:39Z init
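The word-rounding argument can be checked with plain arithmetic (a standalone sketch; `roundUpTo8` mirrors the round-up-to-word-size idiom described in the PR, not Spark's actual BufferHolder code): `Integer.MAX_VALUE - 8` is not a multiple of 8, so rounding a request at that cap up to a whole word overshoots the cap, while `Integer.MAX_VALUE - 15` is 8-byte aligned and stays safe.

```java
public class ArrayMaxAlignment {
    // Round a byte count up to the next multiple of 8 (the word size),
    // the same rounding applied before sizing the row buffer.
    static long roundUpTo8(long numBytes) {
        return (numBytes + 7) / 8 * 8;
    }

    public static void main(String[] args) {
        long badCap = Integer.MAX_VALUE - 8;   // 2147483639, not 8-byte aligned
        long goodCap = Integer.MAX_VALUE - 15; // 2147483632, 8-byte aligned
        System.out.println(badCap % 8);   // 7
        System.out.println(goodCap % 8);  // 0
        // Rounding a request at the bad cap overshoots the cap itself:
        System.out.println(roundUpTo8(badCap) > badCap); // true
        System.out.println(roundUpTo8(goodCap) == goodCap); // true
    }
}
```

This is why a byte-array cap that is itself word-rounded keeps the alignment assertion in UnsafeRow from failing at the boundary.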
[GitHub] spark pull request #19266: [SPARK-22033][CORE] BufferHolder, other size chec...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19266#discussion_r143346155 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java --- @@ -35,6 +35,11 @@ * if the fields of row are all fixed-length, as the size of result row is also fixed. */ public class BufferHolder { + + // Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat + // smaller. Be conservative and lower the cap a little. + private static final int ARRAY_MAX = Integer.MAX_VALUE - 8; --- End diff -- @srowen I think we can use `Integer.MAX_VALUE - 7` instead of `Integer.MAX_VALUE - 8` to make the size align with words; otherwise, this check will fail: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170. This is the reason why all the size inputs to the methods are rounded, for example, https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java#L216.
[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19394#discussion_r142290483 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala --- @@ -280,13 +280,20 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ results.toArray } + private[spark] def executeCollectIterator(): (Long, Iterator[InternalRow]) = { +val countsAndBytes = getByteArrayRdd().collect() --- End diff -- This still fetches all the compressed rows to the driver before building the hashed relation. Ideally, you would fetch the rows from the executors incrementally. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19230#discussion_r139329522 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java --- @@ -16,6 +16,7 @@ */ package org.apache.spark.sql.execution.vectorized; +import org.apache.spark.api.java.function.Function; --- End diff -- oops, reverted it.
[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19230#discussion_r139329523 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorSuite.scala --- @@ -0,0 +1,201 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.vectorized + +import org.scalatest.BeforeAndAfterEach + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.util.ArrayData +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +class ColumnVectorSuite extends SparkFunSuite with BeforeAndAfterEach { + + var testVector: WritableColumnVector = _ + + private def allocate(capacity: Int, dt: DataType): WritableColumnVector = { +new OnHeapColumnVector(capacity, dt) + } + + override def afterEach(): Unit = { +testVector.close() + } + + test("boolean") { +testVector = allocate(10, BooleanType) +(0 until 10).foreach { i => + testVector.appendBoolean(i % 2 == 0) +} + +val array = new ColumnVector.Array(testVector) + +(0 until 10).foreach { i => + assert(array.get(i, BooleanType) === (i % 2 == 0)) +} + } + + test("byte") { +testVector = allocate(10, ByteType) +(0 until 10).foreach { i => + testVector.appendByte(i.toByte) +} + +val array = new ColumnVector.Array(testVector) + +(0 until 10).foreach { i => + assert(array.get(i, ByteType) === (i.toByte)) +} + } + + test("short") { +testVector = allocate(10, ShortType) +(0 until 10).foreach { i => + testVector.appendShort(i.toShort) +} + +val array = new ColumnVector.Array(testVector) + +(0 until 10).foreach { i => + assert(array.get(i, ShortType) === (i.toShort)) +} + } + + test("int") { +testVector = allocate(10, IntegerType) +(0 until 10).foreach { i => + testVector.appendInt(i) +} + +val array = new ColumnVector.Array(testVector) + +(0 until 10).foreach { i => + assert(array.get(i, IntegerType) === i) +} + } + + test("long") { +testVector = allocate(10, LongType) +(0 until 10).foreach { i => + testVector.appendLong(i) +} + +val array = new ColumnVector.Array(testVector) + +(0 until 10).foreach { i => + assert(array.get(i, LongType) === i) +} + } + + test("float") { +testVector = allocate(10, FloatType) +(0 until 10).foreach { i => + testVector.appendFloat(i.toFloat) +} + 
+val array = new ColumnVector.Array(testVector) + +(0 until 10).foreach { i => + assert(array.get(i, FloatType) === i.toFloat) +} + } + + test("double") { +testVector = allocate(10, DoubleType) +(0 until 10).foreach { i => + testVector.appendDouble(i.toDouble) +} + +val array = new ColumnVector.Array(testVector) + +(0 until 10).foreach { i => + assert(array.get(i, DoubleType) === i.toDouble) +} + } + + test("string") { +testVector = allocate(10, StringType) +(0 until 10).map { i => + val utf8 = s"str$i".getBytes("utf8") + testVector.appendByteArray(utf8, 0, utf8.length) +} + +val array = new ColumnVector.Array(testVector) + +(0 until 10).foreach { i => + assert(array.get(i, StringType) === UTF8String.fromString(s"str$i")) +} + } + + test("binary") { +testVector = allocate(10, BinaryType) +(0 until 10).ma
[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/19230 @viirya @cloud-fan unit test updated.
[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19230#discussion_r139072681 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java --- @@ -99,73 +100,18 @@ public ArrayData copy() { @Override public Object[] array() { DataType dt = data.dataType(); + Function<Integer, Object> getAtMethod = (Function<Integer, Object>) i -> get(i, dt); Object[] list = new Object[length]; - - if (dt instanceof BooleanType) { -for (int i = 0; i < length; i++) { - if (!data.isNullAt(offset + i)) { -list[i] = data.getBoolean(offset + i); - } -} - } else if (dt instanceof ByteType) { -for (int i = 0; i < length; i++) { - if (!data.isNullAt(offset + i)) { -list[i] = data.getByte(offset + i); - } -} - } else if (dt instanceof ShortType) { -for (int i = 0; i < length; i++) { - if (!data.isNullAt(offset + i)) { -list[i] = data.getShort(offset + i); - } -} - } else if (dt instanceof IntegerType) { -for (int i = 0; i < length; i++) { - if (!data.isNullAt(offset + i)) { -list[i] = data.getInt(offset + i); - } -} - } else if (dt instanceof FloatType) { -for (int i = 0; i < length; i++) { - if (!data.isNullAt(offset + i)) { -list[i] = data.getFloat(offset + i); - } -} - } else if (dt instanceof DoubleType) { + try { for (int i = 0; i < length; i++) { if (!data.isNullAt(offset + i)) { -list[i] = data.getDouble(offset + i); +list[i] = getAtMethod.call(i); --- End diff -- It should be `get(i, dt)`? I updated it anyway. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/19230 [SPARK-22003][SQL] support array column in vectorized reader with UDF ## What changes were proposed in this pull request? The UDF needs to deserialize the `UnsafeRow`. When the column type is Array, the `get` method from the `ColumnVector`, which is used by the vectorized reader, is called, but this method is not implemented. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark fix_array_open Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19230.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19230 commit adbaeabf18ee1f96611ecbd6ee627bc0a457289d Author: Feng Liu <feng...@databricks.com> Date: 2017-09-12T21:56:55Z init
[GitHub] spark pull request #18400: [SPARK-21188][CORE] releaseAllLocksForTask should...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/18400 [SPARK-21188][CORE] releaseAllLocksForTask should synchronize the whole method ## What changes were proposed in this pull request? Since the objects `readLocksByTask`, `writeLocksByTask` and `info`s are coupled and supposed to be modified by other threads concurrently, all the reads and writes of them in the method `releaseAllLocksForTask` should be protected by a single synchronized block, like other similar methods. ## How was this patch tested? existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark synchronize Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18400.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18400 commit 5cdd328ee9a32969377cbdbfea229cc364dbee17 Author: Feng Liu <feng...@databricks.com> Date: 2017-06-23T05:24:07Z init --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
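The invariant the PR is protecting can be sketched in plain Java (a hypothetical standalone sketch — `LockRegistrySketch` and its fields only mirror the shape of Spark's BlockInfoManager, not its actual code): when two maps keyed by task are coupled, every method that touches both must hold one lock for its whole duration, so no other thread can observe one map updated and the other not.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LockRegistrySketch {
    // Coupled state: both maps must always be updated together.
    private final Map<Long, Set<String>> readLocksByTask = new HashMap<>();
    private final Map<Long, Set<String>> writeLocksByTask = new HashMap<>();

    public synchronized void acquire(long taskId, String block, boolean forWrite) {
        Map<Long, Set<String>> m = forWrite ? writeLocksByTask : readLocksByTask;
        m.computeIfAbsent(taskId, t -> new HashSet<>()).add(block);
    }

    // Mirrors the fix: the whole method is synchronized, so reads and writes
    // of both maps happen atomically with respect to concurrent acquirers.
    public synchronized int releaseAllLocksForTask(long taskId) {
        int released = 0;
        Set<String> reads = readLocksByTask.remove(taskId);
        if (reads != null) released += reads.size();
        Set<String> writes = writeLocksByTask.remove(taskId);
        if (writes != null) released += writes.size();
        return released;
    }
}
```

Splitting the release into several small synchronized sections would leave a window between them in which another thread could see a task's read locks gone but its write locks still registered.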
[GitHub] spark pull request #18208: [SPARK-20991] BROADCAST_TIMEOUT conf should be a ...
GitHub user liufengdb opened a pull request: https://github.com/apache/spark/pull/18208 [SPARK-20991] BROADCAST_TIMEOUT conf should be a TimeoutConf ## What changes were proposed in this pull request? The construction of BROADCAST_TIMEOUT conf should take the TimeUnit argument as a TimeoutConf. (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liufengdb/spark fix_timeout Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18208.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18208 commit 8a2d37a10cd6eb36403006b99a33a7d057905e6e Author: Feng Liu <feng...@databricks.com> Date: 2017-06-05T20:00:55Z small
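The motivation for carrying a TimeUnit in a timeout conf can be sketched as follows (a hypothetical standalone sketch — `TimeoutConfSketch` is an illustrative name, not Spark's actual `ConfigBuilder` API): a raw `long` is ambiguous ("300" of what?), while a (value, unit) pair converts unambiguously at the use site.

```java
import java.util.concurrent.TimeUnit;

public class TimeoutConfSketch {
    private final long value;
    private final TimeUnit unit;

    public TimeoutConfSketch(long value, TimeUnit unit) {
        this.value = value;
        this.unit = unit;
    }

    // Conversion happens through the declared unit, so the caller
    // never has to guess what the raw number means.
    public long toSeconds() {
        return unit.toSeconds(value);
    }

    public static void main(String[] args) {
        // e.g. a 5-minute broadcast timeout, declared with an explicit unit
        TimeoutConfSketch broadcastTimeout = new TimeoutConfSketch(5, TimeUnit.MINUTES);
        System.out.println(broadcastTimeout.toSeconds()); // 300
    }
}
```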
[GitHub] spark issue #17397: [SPARK-20070][SQL] Redact DataSourceScanExec treeString
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/17397 lgtm