[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user eatoncys commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132368646

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -572,6 +572,14 @@ object SQLConf {
           "disable logging or -1 to apply no limit.")
         .createWithDefault(1000)

    +  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = buildConf("spark.sql.codegen.maxLinesPerFunction")
    +    .internal()
    +    .doc("The maximum lines of a single Java function generated by whole-stage codegen. " +
    +      "When the generated function exceeds this threshold, " +
    +      "the whole-stage codegen is deactivated for this subtree of the current query plan.")
    +    .intConf
    +    .createWithDefault(1500)
    --- End diff --

    When I modified it to 1600, the result is:

    max function length of wholestagecodegen:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
    ----------------------------------------------------------------------------------------------
    codegen = F                                  467 /  507         1.4        712.7        1.0X
    codegen = T maxLinesPerFunction = 1600      3191 / 3238         0.2       4868.7        0.1X
    codegen = T maxLinesPerFunction = 1500       449 /  482         1.5        685.2        1.0X

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user eatoncys commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132368484

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
    @@ -370,6 +370,14 @@ case class WholeStageCodegenExec(child: SparkPlan) extends UnaryExecNode with Co

       override def doExecute(): RDD[InternalRow] = {
         val (ctx, cleanedSource) = doCodeGen()
    +    if (ctx.isTooLongGeneratedFunction) {
    +      logWarning("Found too long generated codes and JIT optimization might not work, " +
    +        "Whole-stage codegen disabled for this plan, " +
    +        "You can change the config spark.sql.codegen.MaxFunctionLength " +
    +        "to adjust the function length limit:\n " + s"$treeString")
    +      return child.execute()
    +    }
    --- End diff --

    I think it can be tested by the "max function length of wholestagecodegen" benchmark added in AggregateBenchmark.scala, thanks.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18810

    Btw, can you change `[sql]` to `[SQL]` in the title?
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132367400

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -572,6 +572,14 @@ object SQLConf {
           "disable logging or -1 to apply no limit.")
         .createWithDefault(1000)

    +  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = buildConf("spark.sql.codegen.maxLinesPerFunction")
    +    .internal()
    +    .doc("The maximum lines of a single Java function generated by whole-stage codegen. " +
    +      "When the generated function exceeds this threshold, " +
    +      "the whole-stage codegen is deactivated for this subtree of the current query plan.")
    +    .intConf
    +    .createWithDefault(1500)
    --- End diff --

    I tend not to change the current behavior of whole-stage codegen. This might unintentionally stop some user queries from running with whole-stage codegen. Shall we make `-1` the default and skip the function length check when this config is negative?
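The `-1`-as-disabled convention proposed here can be sketched as follows. This is a hypothetical helper, not Spark code: a negative threshold skips the length check entirely, which preserves the existing codegen behavior by default.

```python
# Sketch of the suggested config semantics: -1 (or any negative value)
# means "no limit", so the function-length check is a no-op by default.
DEFAULT_MAX_LINES_PER_FUNCTION = -1  # hypothetical default


def exceeds_function_length(line_count, max_lines=DEFAULT_MAX_LINES_PER_FUNCTION):
    """Return True only when a positive limit is configured and exceeded."""
    if max_lines < 0:
        return False  # negative config value: skip the check entirely
    return line_count > max_lines
```

With the default, even a 5000-line function would not disable whole-stage codegen; only an explicitly configured positive limit changes behavior.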
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132367041

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -572,6 +572,14 @@ object SQLConf {
           "disable logging or -1 to apply no limit.")
         .createWithDefault(1000)

    +  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = buildConf("spark.sql.codegen.maxLinesPerFunction")
    +    .internal()
    +    .doc("The maximum lines of a single Java function generated by whole-stage codegen. " +
    +      "When the generated function exceeds this threshold, " +
    +      "the whole-stage codegen is deactivated for this subtree of the current query plan.")
    +    .intConf
    +    .createWithDefault(1500)
    --- End diff --

    I'm not confident about this default value. Is it too small?
[GitHub] spark pull request #17995: [SPARK-20762][ML]Make String Params Case-Insensit...
Github user zhengruifeng closed the pull request at:

    https://github.com/apache/spark/pull/17995
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132366896

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala ---
    @@ -89,6 +89,14 @@ object CodeFormatter {
         }
         new CodeAndComment(code.result().trim(), map)
       }

    +  def stripExtraNewLinesAndComments(input: String): String = {
    +    val commentReg =
    +      ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +  // strip /*comment*/
    +       """([ |\t]*?\/\/[\s\S]*?\n)""").r             // strip //comment
    --- End diff --

    nit: align `// strip //comment` with the `// strip /*comment*/` above.
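For readers following the pattern, here is a simplified Python analogue of the comment-stripping regex. It is an illustrative sketch only: the character classes are cleaned up relative to the Scala pattern under review (which uses `[ |\t]`, matching a literal `|` as well), but the idea is the same: remove `/* ... */` block comments and `// ...` line comments before counting lines.

```python
import re

# Alternation of two comment forms, applied with re.sub:
#   branch 1 strips /* ... */ block comments (lazy, so it stops at the first */)
#   branch 2 strips // line comments together with their trailing newline
COMMENT_RE = re.compile(
    r"([ \t]*?/\*[\s\S]*?\*/[ \t]*?)"   # strip /*comment*/
    r"|([ \t]*?//[^\n]*\n)"             # strip //comment
)


def strip_comments(code: str) -> str:
    """Remove block and line comments from generated-code text."""
    return COMMENT_RE.sub("", code)
```

Note that stripping a `//` comment also removes its newline, so a trailing comment merges the line with the next one; that is fine when the result is only used for counting newlines.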
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132366187

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
    @@ -370,6 +370,14 @@ case class WholeStageCodegenExec(child: SparkPlan) extends UnaryExecNode with Co

       override def doExecute(): RDD[InternalRow] = {
         val (ctx, cleanedSource) = doCodeGen()
    +    if (ctx.isTooLongGeneratedFunction) {
    +      logWarning("Found too long generated codes and JIT optimization might not work, " +
    +        "Whole-stage codegen disabled for this plan, " +
    +        "You can change the config spark.sql.codegen.MaxFunctionLength " +
    +        "to adjust the function length limit:\n " + s"$treeString")
    +      return child.execute()
    +    }
    --- End diff --

    We need to add a test in which we create a query with a long generated function and check whether whole-stage codegen is disabled for it.
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user eatoncys commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132365359

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -572,6 +572,13 @@ object SQLConf {
           "disable logging or -1 to apply no limit.")
         .createWithDefault(1000)

    +  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = buildConf("spark.sql.codegen.maxLinesPerFunction")
    +    .internal()
    +    .doc("The maximum lines of a function that will be supported before" +
    +      " deactivating whole-stage codegen.")
    --- End diff --

    Ok, updated, thanks.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18810

    **[Test build #80475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80475/testReport)** for PR 18810 at commit [`ce544a5`](https://github.com/apache/spark/commit/ce544a56dbeaa9fecb66706f3d2bad97280835bd).
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user eatoncys commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132365401

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -356,6 +356,19 @@ class CodegenContext {
       private val placeHolderToComments = new mutable.HashMap[String, String]

    +  /**
    +   * Returns if there is a codegen function the lines of which is greater than maxLinesPerFunction
    +   * It will count the lines of every codegen function, if there is a function of length
    +   * greater than spark.sql.codegen.maxLinesPerFunction, it will return true.
    +   */
    +  def existTooLongFunction(): Boolean = {
    --- End diff --

    Ok, updated, thanks.
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user eatoncys commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132365436

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -356,6 +356,19 @@ class CodegenContext {
       private val placeHolderToComments = new mutable.HashMap[String, String]

    +  def existTooLongFunction(): Boolean = {
    +    classFunctions.exists { case (className, functions) =>
    +      functions.exists{ case (name, code) =>
    +        val codeWithoutComments = CodeFormatter.stripExtraNewLinesAndComments(code)
    +        codeWithoutComments.count(_ == '\n') > SQLConf.get.maxLinesPerFunction
    +      }
    +    }
    +  }
    +
    +  /**
    --- End diff --

    Ok, added, thanks.
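The check quoted above boils down to scanning every generated function for an oversized body. A minimal Python analogue, illustrative only: `class_functions` here mirrors the shape of Spark's `classFunctions` map (class name to a map of function name to code), and the comment-stripping step is omitted for brevity.

```python
# Flag the plan when any generated function body exceeds the configured
# line threshold, scanning across all generated classes.
def exists_too_long_function(class_functions, max_lines):
    return any(
        code.count("\n") > max_lines
        for functions in class_functions.values()
        for code in functions.values()
    )
```

Because `any` short-circuits, the scan stops at the first offending function rather than counting lines in every one.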
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18865#discussion_r132364612

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
         }

         (file: PartitionedFile) => {
    -      val parser = new JacksonParser(actualSchema, parsedOptions)
    +      // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
    --- End diff --

    Btw, some strange behaviors might occur:

        scala> dfFromFile.filter($"_corrupt_record".isNotNull).show
        +-----+---------------+
        |field|_corrupt_record|
        +-----+---------------+
        | null| {"field": "3"}|
        +-----+---------------+

        scala> dfFromFile.filter($"_corrupt_record".isNotNull).select("_corrupt_record").show
        +---------------+
        |_corrupt_record|
        +---------------+
        +---------------+
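The surprising empty result above can be mimicked outside Spark. The sketch below is a toy model in plain Python, with a hypothetical `parse` helper rather than Spark's `JacksonParser`: when column pruning leaves no data fields to parse, every row trivially "succeeds", so the corrupt record is never captured.

```python
import json

# Toy model: the parser only attempts the *required* data fields.
# With no required data fields (pruned down to _corrupt_record alone),
# nothing can fail, so _corrupt_record stays null for every row.
def parse(lines, required_data_fields):
    out = []
    for line in lines:
        if not required_data_fields:
            out.append({"_corrupt_record": None})  # nothing parsed, nothing corrupt
            continue
        try:
            rec = json.loads(line)
            row = {f: rec.get(f) for f in required_data_fields}
            row["_corrupt_record"] = None
        except ValueError:
            row = {f: None for f in required_data_fields}
            row["_corrupt_record"] = line  # captured only when parsing fails
        out.append(row)
    return out
```

With a data field required, the malformed line shows up in `_corrupt_record`; prune to the corrupt-record column alone and it silently disappears, matching the `select("_corrupt_record")` output shown above.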
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132363994

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -356,6 +356,19 @@ class CodegenContext {
       private val placeHolderToComments = new mutable.HashMap[String, String]

    +  def existTooLongFunction(): Boolean = {
    +    classFunctions.exists { case (className, functions) =>
    +      functions.exists{ case (name, code) =>
    +        val codeWithoutComments = CodeFormatter.stripExtraNewLinesAndComments(code)
    +        codeWithoutComments.count(_ == '\n') > SQLConf.get.maxLinesPerFunction
    +      }
    +    }
    +  }
    +
    +  /**
    --- End diff --

    Add one more space
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18865#discussion_r132363687

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
         }

         (file: PartitionedFile) => {
    -      val parser = new JacksonParser(actualSchema, parsedOptions)
    +      // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
    --- End diff --

    Oh. Got it. One issue with this behavior is that we can't easily retrieve only the corrupt records with a query like `dfFromFile.select("_corrupt_record")`. This behavior is also inconsistent with RDD-based manipulation.
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18865#discussion_r132363283

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
         }

         (file: PartitionedFile) => {
    -      val parser = new JacksonParser(actualSchema, parsedOptions)
    +      // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
    --- End diff --

    Ah, I mean they produced 0 and 3 for each, as described in the PR description. I just double-checked.
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18865#discussion_r132361425

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
    @@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
         }

         (file: PartitionedFile) => {
    -      val parser = new JacksonParser(actualSchema, parsedOptions)
    +      // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
    --- End diff --

    I've not tried 1.6.3 or 1.5.2. So @HyukjinKwon, do you mean the above code returns 1 for isNotNull and 2 for isNull?
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18810

    Merged build finished. Test FAILed.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18810

    Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80472/ Test FAILed.
[GitHub] spark pull request #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Pa...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17849#discussion_r132361043

    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1572,7 +1588,8 @@ def test_java_params(self):
             for name, cls in inspect.getmembers(module, inspect.isclass):
                 if not name.endswith('Model') and issubclass(cls, JavaParams)\
                         and not inspect.isabstract(cls):
    -                self.check_params(cls())
    +                # NOTE: disable check_params_exist until there is parity with Scala API
    +                ParamTests.check_params(self, cls(), check_params_exist=False)
    --- End diff --

    This skips the param test for Model. Should we do a similar check for all models?
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18810

    **[Test build #80472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80472/testReport)** for PR 18810 at commit [`d44a2f8`](https://github.com/apache/spark/commit/d44a2f8499b4f7b9235fd138349005a4e3c960a5).

    * This patch **fails SparkR unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18900

    Can one of the admins verify this patch?
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132360895

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -356,6 +356,19 @@ class CodegenContext {
       private val placeHolderToComments = new mutable.HashMap[String, String]

    +  /**
    +   * Returns if there is a codegen function the lines of which is greater than maxLinesPerFunction
    +   * It will count the lines of every codegen function, if there is a function of length
    +   * greater than spark.sql.codegen.maxLinesPerFunction, it will return true.
    +   */
    +  def existTooLongFunction(): Boolean = {
    --- End diff --

    > isTooLongGeneratedFunction

    Nit: remove `()`
[GitHub] spark issue #18900: [SPARK-21687][SQL] Spark SQL should set createTime for H...
Github user debugger87 commented on the issue:

    https://github.com/apache/spark/pull/18900

    @cloud-fan could you please help review this PR?
[GitHub] spark pull request #18900: [SPARK-21687][SQL] Spark SQL should set createTim...
GitHub user debugger87 opened a pull request:

    https://github.com/apache/spark/pull/18900

    [SPARK-21687][SQL] Spark SQL should set createTime for Hive partition

    ## What changes were proposed in this pull request?

    Set createTime for every Hive partition created in Spark SQL, which could be used to manage the data lifecycle in a Hive warehouse.

    ## How was this patch tested?

    No tests.

    Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/debugger87/spark fix/set-create-time-for-hive-partition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18900.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18900

commit 71a660ac8dad869d9ba3b4e206b74f5c44660ee6
Author: debugger87
Date: 2017-08-10T04:17:00Z

    [SPARK-21687][SQL] Spark SQL should set createTime for Hive partition
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18810#discussion_r132360710

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -572,6 +572,13 @@ object SQLConf {
           "disable logging or -1 to apply no limit.")
         .createWithDefault(1000)

    +  val WHOLESTAGE_MAX_LINES_PER_FUNCTION = buildConf("spark.sql.codegen.maxLinesPerFunction")
    +    .internal()
    +    .doc("The maximum lines of a function that will be supported before" +
    +      " deactivating whole-stage codegen.")
    --- End diff --

    > The maximum lines of a single Java function generated by whole-stage codegen. When the generated function exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the current query plan.

    Could you also update the code comments in the other places based on my above update?
[GitHub] spark pull request #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Pa...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17849#discussion_r132360643

    --- Diff: python/pyspark/ml/classification.py ---
    @@ -1325,7 +1325,7 @@ def __init__(self, featuresCol="features", labelCol="label", predictionCol="pred
             super(MultilayerPerceptronClassifier, self).__init__()
             self._java_obj = self._new_java_obj(
                 "org.apache.spark.ml.classification.MultilayerPerceptronClassifier", self.uid)
    -        self._setDefault(maxIter=100, tol=1E-4, blockSize=128, stepSize=0.03, solver="l-bfgs")
    +        self._setDefault(maxIter=100, tol=1E-6, blockSize=128, stepSize=0.03, solver="l-bfgs")
    --- End diff --

    Looks like 1e-6 is the correct default value.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18810

    Merged build finished. Test FAILed.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18810

    Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80471/ Test FAILed.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18810 **[Test build #80471 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80471/testReport)** for PR 18810 at commit [`d3238e9`](https://github.com/apache/spark/commit/d3238e9800f73b39b55e47419c5409b8111ea080). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Pa...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17849#discussion_r132360069 --- Diff: python/pyspark/ml/tests.py --- @@ -417,6 +417,54 @@ def test_logistic_regression_check_thresholds(self): LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5] ) +@staticmethod +def check_params(test_self, py_stage, check_params_exist=True): +""" +Checks common requirements for Params.params: + - set of params exist in Java and Python and are ordered by names + - param parent has the same UID as the object's UID + - default param value from Java matches value in Python + - optionally check if all params from Java also exist in Python +""" +py_stage_str = "%s %s" % (type(py_stage), py_stage) +if not hasattr(py_stage, "_to_java"): +return +java_stage = py_stage._to_java() +if java_stage is None: +return +test_self.assertEqual(py_stage.uid, java_stage.uid(), msg=py_stage_str) +if check_params_exist: +param_names = [p.name for p in py_stage.params] +java_params = list(java_stage.params()) +java_param_names = [jp.name() for jp in java_params] +test_self.assertEqual( +param_names, sorted(java_param_names), +"Param list in Python does not match Java for %s:\nJava = %s\nPython = %s" +% (py_stage_str, java_param_names, param_names)) --- End diff -- Line 436-443 is the only change to `check_params`?
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18810#discussion_r132359678 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -370,6 +370,15 @@ case class WholeStageCodegenExec(child: SparkPlan) extends UnaryExecNode with Co override def doExecute(): RDD[InternalRow] = { val (ctx, cleanedSource) = doCodeGen() +val existLongFunction = ctx.existTooLongFunction +if (existLongFunction) { + logWarning(s"Found too long generated codes and JIT optimization might not work, " + +s"Whole-stage codegen disabled for this plan, " + +s"You can change the config spark.sql.codegen.MaxFunctionLength " + +s"to adjust the function length limit:\n " --- End diff -- Please remove the useless `s`
[GitHub] spark pull request #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Pa...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17849#discussion_r132359369 --- Diff: python/pyspark/ml/wrapper.py --- @@ -144,7 +158,9 @@ def _transfer_params_from_java(self): if self._java_obj.hasParam(param.name): java_param = self._java_obj.getParam(param.name) # SPARK-14931: Only check set params back to avoid default params mismatch. -if self._java_obj.isSet(java_param): +if self._java_obj.isSet(java_param) or ( +# SPARK-10931: Temporary fix for params that have a default in Java +self._java_obj.hasDefault(java_param) and not self.isDefined(param)): --- End diff -- This change will make a default value for a param on the Java side appear as a user-provided param value on the Python side.
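The condition under discussion (copy a Java param when it is explicitly set, or when it has a Java-side default that Python has not defined) can be sketched in plain Python. The dict-based stand-ins for the Java and Python param stores below are hypothetical, not the PySpark API:

```python
def params_to_copy(py_defined, java_set, java_defaults):
    """Return the param names that would be transferred from Java to Python,
    mirroring: isSet(p) or (hasDefault(p) and not isDefined(p))."""
    names = set(java_set) | set(java_defaults)
    return sorted(n for n in names
                  if n in java_set or (n in java_defaults and n not in py_defined))

# 'a' is defined on the Python side, so its Java default is skipped;
# 'b' is explicitly set in Java; 'c' only has a Java-side default.
print(params_to_copy(py_defined={"a"}, java_set={"b"}, java_defaults={"a", "c"}))
```

This also illustrates the concern raised in the comment: `c` comes back exactly like the explicitly set `b`, so a Java-side default becomes indistinguishable from a user-provided value on the Python side.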
[GitHub] spark issue #17972: [SPARK-20723][ML]Add intermediate storage level to tree ...
Github user phatak-dev commented on the issue: https://github.com/apache/spark/pull/17972 @MLnick Any updates on this?
[GitHub] spark pull request #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Pa...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17849#discussion_r132358656 --- Diff: python/pyspark/ml/wrapper.py --- @@ -263,7 +284,8 @@ def _fit_java(self, dataset): def _fit(self, dataset): java_model = self._fit_java(dataset) -return self._create_model(java_model) +model = self._create_model(java_model) +return self._copyValues(model) --- End diff -- Here I think it is going to copy values from the estimator to the created model. So I think we assume that the params in estimator and model are the same?
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18810 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80470/ Test PASSed.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18810 Merged build finished. Test PASSed.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18810 **[Test build #80470 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80470/testReport)** for PR 18810 at commit [`d0c753a`](https://github.com/apache/spark/commit/d0c753a5d3f5fbb5e14da0eebbd5e9bd3778126c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Val...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17849 Sorry, let me try and take a look tomorrow.
[GitHub] spark pull request #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Pa...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17849#discussion_r132357684 --- Diff: python/pyspark/ml/wrapper.py --- @@ -135,6 +135,20 @@ def _transfer_param_map_to_java(self, pyParamMap): paramMap.put([pair]) return paramMap +def _create_params_from_java(self): +""" +SPARK-10931: Temporary fix to create params that are defined in the Java obj but not here +""" +java_params = list(self._java_obj.params()) +from pyspark.ml.param import Param +for java_param in java_params: +java_param_name = java_param.name() +if not hasattr(self, java_param_name): --- End diff -- If self contains an attribute with the same name which is not a `Param`, should we handle it, e.g., by throwing an exception?
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18544 **[Test build #80474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80474/testReport)** for PR 18544 at commit [`c41475e`](https://github.com/apache/spark/commit/c41475e3c5a217e5778bbddcd1b4a4210ce5d180).
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18865#discussion_r132357070 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala --- @@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister { } (file: PartitionedFile) => { - val parser = new JacksonParser(actualSchema, parsedOptions) + // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`, --- End diff -- I am actually rather -0 on this change. Both the current way and the previous way sound not quite compelling to me, but the current way at least only does arguably unnecessary parsing attempts, and we have had this behaviour for a long time (at least I tried this in 1.6.3 and 1.5.2):

```scala
import org.apache.spark.sql.types._
val schema = new StructType().add("field", ByteType).add("_corrupt_record", StringType)
val file = "/tmp/sample.json"
val dfFromFile = sqlContext.read.schema(schema).json(file)
dfFromFile.filter($"_corrupt_record".isNotNull).count()
dfFromFile.filter($"_corrupt_record".isNull).count()
```
[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimize Vector compress
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18899 Merged build finished. Test PASSed.
[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimize Vector compress
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18899 **[Test build #80473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80473/testReport)** for PR 18899 at commit [`5dc5c89`](https://github.com/apache/spark/commit/5dc5c89242a0c2a5ac6a693c3703eef8ee160616). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimize Vector compress
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18899 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80473/ Test PASSed.
[GitHub] spark pull request #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17849#discussion_r132355421 --- Diff: python/pyspark/ml/wrapper.py --- @@ -135,6 +135,20 @@ def _transfer_param_map_to_java(self, pyParamMap): paramMap.put([pair]) return paramMap +def _create_params_from_java(self): +""" +SPARK-10931: Temporary fix to create params that are defined in the Java obj but not here +""" +java_params = list(self._java_obj.params()) +from pyspark.ml.param import Param +for java_param in java_params: +java_param_name = java_param.name() +if not hasattr(self, java_param_name): +param = Param(self, java_param_name, java_param.doc()) +setattr(param, "created_from_java_param", True) --- End diff -- BTW, would you mind if I ask where `created_from_java_param` is used?
[GitHub] spark issue #17342: [SPARK-12868][SQL] Allow adding jars from hdfs
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/17342 @steveloughran Thanks Steve.
[GitHub] spark issue #18893: [SPARK-21675][WebUI]Add a navigation bar at the bottom o...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/18893 Since they're both small and this is already open I'd say leave it, unless someone ends up having issues with one of the fixes
[GitHub] spark pull request #18865: [SPARK-21610][SQL] Corrupt records are not handle...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18865#discussion_r132352189 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala --- @@ -114,7 +114,16 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister { } (file: PartitionedFile) => { - val parser = new JacksonParser(actualSchema, parsedOptions) + // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`, --- End diff -- What do you think? @cloud-fan @HyukjinKwon
[GitHub] spark issue #18756: [SPARK-21548][SQL] "Support insert into serial columns o...
Github user lvdongr commented on the issue: https://github.com/apache/spark/pull/18756 Do you mean we can provide different default values for different types, like 0 for int and "" for string? Or that we set the default values when defining the table? @gatorsmile @maropu I set the default to null because the "insert into ..." statement in Hive is handled this way, and I want to stay consistent with Hive.
[GitHub] spark issue #18895: [SPARK-21658][SQL][PYSPARK] Add default None for value i...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18895 @byakuinss Please add a doc test in `DataFrame.replace`. There is an example `df4.na.replace('Alice', None).show()`. We want to make sure it works with the default value. Thanks.
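A minimal pure-Python sketch of the semantics under discussion (not the PySpark implementation; the function name and the list-of-values shape are made up for illustration): with `value` defaulting to `None`, calling replace with only `to_replace` maps matches to null.

```python
def na_replace(values, to_replace, value=None):
    """Toy model of DataFrame.na.replace: build a mapping and apply it.
    With the proposed default, omitting `value` replaces matches with None (null)."""
    mapping = to_replace if isinstance(to_replace, dict) else {to_replace: value}
    return [mapping.get(v, v) for v in values]

print(na_replace(["Alice", "Bob"], "Alice"))  # -> [None, 'Bob']
```

The doc test being requested would exercise exactly this default path, `df4.na.replace('Alice', None)` versus `df4.na.replace('Alice')`.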
[GitHub] spark issue #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Val...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17849 Oh, wait, this looks like it doesn't require much ML knowledge. Will try to give it a pass.
[GitHub] spark issue #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Val...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17849 I am rather a backend developer and work together with data scientists, so my ML knowledge is limited (am studying hard :)). Will leave a few comments if there are some nits once someone starts to review, so that they can be addressed together. cc @viirya, who I believe knows ML a bit, and @zero323, who I believe should be able to review this (but is inactive now): are you maybe able to make a pass over this one?
[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimize Vector compress
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18899 **[Test build #80473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80473/testReport)** for PR 18899 at commit [`5dc5c89`](https://github.com/apache/spark/commit/5dc5c89242a0c2a5ac6a693c3703eef8ee160616).
[GitHub] spark pull request #18899: [SPARK-21680][ML][MLLIB]optimize Vector compress
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/18899 [SPARK-21680][ML][MLLIB]optimize Vector compress ## What changes were proposed in this pull request? When using Vector.compressed to change a Vector to a SparseVector, the performance is very low compared with Vector.toSparse. This is because Vector.compressed has to scan the values three times, while Vector.toSparse only needs two scans. When the vector is long, there is a significant performance difference between these two methods. ## How was this patch tested? The existing UT. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mpjlu/spark optVectorCompress Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18899.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18899 commit 5dc5c89242a0c2a5ac6a693c3703eef8ee160616 Author: Peng Meng, Date: 2017-08-10T01:59:17Z, "optimzie Vector coompress"
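The pass-count argument in the PR description can be sketched in plain Python. This is illustrative only: MLlib's actual implementation is in Scala, and the `1.5 * (nnz + 1.0)` size heuristic is an assumption about how `Vector.compressed` estimates sparse versus dense storage cost.

```python
def to_sparse(values):
    """Two scans of the data: one to find the non-zero positions, one to copy them."""
    indices = [i for i, v in enumerate(values) if v != 0.0]   # scan 1
    vals = [values[i] for i in indices]                        # scan 2
    return indices, vals

def compressed(values):
    """Three scans: one to count non-zeros for the format decision,
    then toSparse's two scans redo that work."""
    nnz = sum(1 for v in values if v != 0.0)                   # scan 1
    if 1.5 * (nnz + 1.0) < len(values):                        # sparse is smaller
        return to_sparse(values)                               # scans 2 and 3
    return values                                              # stay dense
```

Counting and collecting the non-zero indices in a single pass (or reusing the indices found while counting) removes the redundant scan, which is what makes the difference noticeable on long vectors.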
[GitHub] spark issue #18648: [SPARK-21428] Turn IsolatedClientLoader off while using ...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/18648 ping @jiangxb1987 @cloud-fan any more suggestions?
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18630 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80468/ Test PASSed.
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18630 Merged build finished. Test PASSed.
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18630 **[Test build #80468 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80468/testReport)** for PR 18630 at commit [`c0b0a7d`](https://github.com/apache/spark/commit/c0b0a7d79ca27bbcf91245b3d80070d5d4188174). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18893: [SPARK-21675][WebUI]Add a navigation bar at the bottom o...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/18893 @ajbozarth Do we need another PR to separate these? If necessary, I will do that.
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user eatoncys commented on a diff in the pull request: https://github.com/apache/spark/pull/18810#discussion_r132347436 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -356,6 +356,18 @@ class CodegenContext { private val placeHolderToComments = new mutable.HashMap[String, String] /** + * Returns if the length of codegen function is too long or not --- End diff -- Ok, I have modified it, thanks
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user eatoncys commented on a diff in the pull request: https://github.com/apache/spark/pull/18810#discussion_r132347148 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -356,6 +356,18 @@ class CodegenContext { private val placeHolderToComments = new mutable.HashMap[String, String] /** + * Returns if the length of codegen function is too long or not + * It will count the lines of every codegen function, if there is a function of length + * greater than spark.sql.codegen.MaxFunctionLength, it will return true. + */ + def existTooLongFunction(): Boolean = { +classFunctions.exists { case (className, functions) => + functions.exists{ case (name, code) => +CodeFormatter.stripExtraNewLines(code).count(_ == '\n') > SQLConf.get.maxFunctionLength --- End diff -- Ok, I have modified it to count lines without comments and extra new lines
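The counting being discussed (lines of a generated function after stripping comments and extra newlines, compared against a threshold) can be sketched like this. It is a simplified stand-in: Spark's `CodeFormatter` stripping rules are more involved, and the function name here is made up for illustration:

```python
def is_too_long_function(code, max_lines=1500):
    """Count effective source lines of a generated function body:
    drop blank lines and //-style comment lines, then compare to the limit."""
    effective = [
        line for line in (l.strip() for l in code.splitlines())
        if line and not line.startswith("//")
    ]
    return len(effective) > max_lines

snippet = "int a = 0;\n// comment\n\nint b = 1;"
print(is_too_long_function(snippet, max_lines=1))  # -> True (2 effective lines)
```

Counting after stripping matters for the benchmark above: a threshold compared against raw line counts would be inflated by the comments and blank lines codegen emits.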
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user eatoncys commented on a diff in the pull request: https://github.com/apache/spark/pull/18810#discussion_r132347198

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -572,6 +572,13 @@ object SQLConf {
       "disable logging or -1 to apply no limit.")
     .createWithDefault(1000)

+  val WHOLESTAGE_MAX_FUNCTION_LEN = buildConf("spark.sql.codegen.MaxFunctionLength")

--- End diff --

Ok, I have modified it, thanks.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18810 **[Test build #80472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80472/testReport)** for PR 18810 at commit [`d44a2f8`](https://github.com/apache/spark/commit/d44a2f8499b4f7b9235fd138349005a4e3c960a5).
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
Github user eatoncys commented on a diff in the pull request: https://github.com/apache/spark/pull/18810#discussion_r132347018

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala ---
@@ -301,6 +301,61 @@ class AggregateBenchmark extends BenchmarkBase {
     */
   }

+  ignore("max function length of wholestagecodegen") {
+    val N = 20 << 15
+
+    val benchmark = new Benchmark("max function length of wholestagecodegen", N)
+    def f(): Unit = sparkSession.range(N)
+      .selectExpr(
+        "id",
+        "(id & 1023) as k1",
+        "cast(id & 1023 as double) as k2",
+        "cast(id & 1023 as int) as k3",
+        "case when id > 100 and id <= 200 then 1 else 0 end as v1",
+        "case when id > 200 and id <= 300 then 1 else 0 end as v2",
+        "case when id > 300 and id <= 400 then 1 else 0 end as v3",
+        "case when id > 400 and id <= 500 then 1 else 0 end as v4",
+        "case when id > 500 and id <= 600 then 1 else 0 end as v5",
+        "case when id > 600 and id <= 700 then 1 else 0 end as v6",
+        "case when id > 700 and id <= 800 then 1 else 0 end as v7",
+        "case when id > 800 and id <= 900 then 1 else 0 end as v8",
+        "case when id > 900 and id <= 1000 then 1 else 0 end as v9",
+        "case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
+        "case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
+        "case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
+        "case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
+        "case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
+        "case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
+        "case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
+        "case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
+        "case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
+      .groupBy("k1", "k2", "k3")
+      .sum()
+      .collect()
+
+    benchmark.addCase(s"codegen = F") { iter =>
+      sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
+      f()
+    }
+
+    benchmark.addCase(s"codegen = T") { iter =>
+      sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
+      sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "1")

--- End diff --

Ok, I have added a test using the default number 1500, thanks.
[GitHub] spark issue #18893: [SPARK-21675][WebUI]Add a navigation bar at the bottom o...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/18893 test this please
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18810 **[Test build #80471 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80471/testReport)** for PR 18810 at commit [`d3238e9`](https://github.com/apache/spark/commit/d3238e9800f73b39b55e47419c5409b8111ea080).
[GitHub] spark issue #18895: [SPARK-21658][SQL][PYSPARK] Add default None for value i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18895 Merged build finished. Test PASSed.
[GitHub] spark issue #18895: [SPARK-21658][SQL][PYSPARK] Add default None for value i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18895 **[Test build #80469 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80469/testReport)** for PR 18895 at commit [`8af1e15`](https://github.com/apache/spark/commit/8af1e15f37c750dda53542b5a854f832ff006773).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #18895: [SPARK-21658][SQL][PYSPARK] Add default None for value i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18895 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80469/ Test PASSed.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18810 **[Test build #80470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80470/testReport)** for PR 18810 at commit [`d0c753a`](https://github.com/apache/spark/commit/d0c753a5d3f5fbb5e14da0eebbd5e9bd3778126c).
[GitHub] spark issue #18756: [SPARK-21548][SQL] "Support insert into serial columns o...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18756 In most cases of `SELECT` statements, `default_value` is `NULL` by default, so I first thought non-specified columns were filled with `NULL`. Anyway, do we still have a chance to implement the concept of `DEFAULT`, too?
```
postgresql doc:
DEFAULT default_expr ...
The default expression will be used in any insert operation that does not specify a value
for the column. If there is no default for a column, then the default is null.
```
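The `DEFAULT` semantics quoted from the PostgreSQL docs can be illustrated with a tiny sketch. This is my own toy model of the behavior being discussed (not Spark's insert path): each column not specified in the insert is filled from a per-column default, falling back to NULL (`None`) when no default exists.

```python
# Toy illustration of INSERT with per-column defaults, as described in
# the quoted PostgreSQL documentation. Names here are illustrative.

def apply_defaults(row: dict, columns: list, defaults: dict) -> dict:
    # For each table column: use the supplied value if present,
    # else the column's default, else NULL (None).
    return {c: row.get(c, defaults.get(c)) for c in columns}
```

For example, inserting `{"id": 1}` into a table with columns `id, status, note` and a default of `"new"` for `status` yields `status = "new"` and `note = NULL`.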
[GitHub] spark issue #18895: [SPARK-21658][SQL][PYSPARK] Add default None for value i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18895 **[Test build #80469 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80469/testReport)** for PR 18895 at commit [`8af1e15`](https://github.com/apache/spark/commit/8af1e15f37c750dda53542b5a854f832ff006773).
[GitHub] spark issue #18895: [SPARK-21658][SQL][PYSPARK] Add default None for value i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18895 ok to test
[GitHub] spark issue #18895: [SPARK-21658][SQL][PYSPARK] Add default None for value i...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18895 Could we add the example in the doctest (under 1362L) so that this can be tested and shown in the documentation?
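The change under review (SPARK-21658) makes `value` default to `None` in `DataFrame.replace`, so a replacement dict can be passed on its own. As a hedged, pure-Python mimic of that intended behavior (this is a sketch of the semantics, not PySpark's implementation):

```python
# Pure-Python sketch of what `value=None` as a default enables in
# DataFrame.replace: a dict-only call maps each matching cell through
# the dict; otherwise (to_replace, value) pairs are built as before.

def replace(rows, to_replace, value=None):
    if isinstance(to_replace, dict) and value is None:
        mapping = to_replace
    else:
        keys = to_replace if isinstance(to_replace, list) else [to_replace]
        mapping = {k: value for k in keys}
    return [[mapping.get(cell, cell) for cell in row] for row in rows]
```

A doctest-style usage example of the dict-only form would look like `replace(rows, {'Alice': 'Bob'})`, with no second argument needed.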
[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18882 Any activity for cost-based inference? Anyway, thanks! I'll close this for now.
[GitHub] spark pull request #18882: [SPARK-21652][SQL] Filter out meaningless constra...
Github user maropu closed the pull request at: https://github.com/apache/spark/pull/18882
[GitHub] spark issue #18882: [SPARK-21652][SQL] Filter out meaningless constraints in...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18882 Thanks for working on it, but the inferred one is not useless. The removal has to be cost based.
[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18820
[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18820 Thanks! Merging to master.
[GitHub] spark issue #18898: [SPARK-21245][ML] Resolve code duplication for classific...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18898 Can one of the admins verify this patch?
[GitHub] spark pull request #18898: [SPARK-21245][ML] Resolve code duplication for cl...
GitHub user bravo-zhang opened a pull request: https://github.com/apache/spark/pull/18898

[SPARK-21245][ML] Resolve code duplication for classification/regression summarizers

## Why the change?

In several places (LogReg, LinReg, SVC) in Spark ML, we collect summary information about training data using `MultivariateOnlineSummarizer` and `MulticlassSummarizer`. We have the same code appearing in several places (including test suites). We can eliminate this by creating a common implementation.

## What changes were proposed in this pull request?

1. A new class `ml.stat.Summarizers.scala` with `def getRegressionSummarizers` and `def getClassificationSummarizers` that provides a pair of feature and label summarizers. This centralizes the duplicated code in `LinearRegression`, `LinearSVC`, `LogisticRegression`, and `DifferentiableLossAggregatorSuite`.
2. Moves `MultiClassSummarizer.scala` (and its test suite) out of `LogisticRegression.scala` to a new file `ml.stat.MultiClassSummarizer.scala`, because it is also used by `LinearSVC` and can be generalized.

## How was this patch tested?

`ml.stat.SummarizersSuite.scala`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bravo-zhang/spark spark-21245

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18898.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18898

commit 1f5209f7e40c520e1c6b6b5943ef87fde7d5b254
Author: bravo-zhang
Date: 2017-08-09T16:05:23Z

    Resolve code duplication for classification/regression summarizers
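The deduplication idea in the PR above — one shared helper that builds the (feature summarizer, label summarizer) pair several estimators previously constructed by hand — can be sketched as follows. Class and function names here are illustrative stand-ins, not the actual `ml.stat` API:

```python
# Sketch of centralizing duplicated summarizer construction: instead of
# each estimator building its own pair of online summarizers, a single
# helper returns them. Purely illustrative; not Spark ML code.

class OnlineSummarizer:
    """Tiny running-statistics accumulator (stand-in for
    MultivariateOnlineSummarizer)."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, x):
        self.count += 1
        self.total += x
        return self  # allow chaining, as the Scala API does

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

def get_regression_summarizers():
    # One summarizer for features, one for labels, built in one place
    # so every estimator shares the same construction logic.
    return OnlineSummarizer(), OnlineSummarizer()
```

Each estimator would then call the helper instead of duplicating the setup, which is the whole point of the refactor.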
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18630 Ok, thanks for checking. It doesn't look like it's coming from your changes, so I'm sure it's just me.
[GitHub] spark issue #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpickle aga...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/18734 huzzah! I'm in the middle of getting some code working for a talk tomorrow so I'll circle back on this on Friday. If @davies has any opinions though it would be great to hear them.
[GitHub] spark pull request #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpic...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18734#discussion_r132334455

--- Diff: python/pyspark/cloudpickle.py ---
@@ -397,42 +625,7 @@ def save_global(self, obj, name=None, pack=struct.pack):
         typ = type(obj)
         if typ is not obj and isinstance(obj, (type, types.ClassType)):
-            d = dict(obj.__dict__)  # copy dict proxy to a dict
-            if not isinstance(d.get('__dict__', None), property):
-                # don't extract dict that are properties
-                d.pop('__dict__', None)
-                d.pop('__weakref__', None)
-
-            # hack as __new__ is stored differently in the __dict__
-            new_override = d.get('__new__', None)
-            if new_override:
-                d['__new__'] = obj.__new__
-
-            # workaround for namedtuple (hijacked by PySpark)
-            if getattr(obj, '_is_namedtuple_', False):
-                self.save_reduce(_load_namedtuple, (obj.__name__, obj._fields))
-                return
-
-            self.save(_load_class)
-            self.save_reduce(typ, (obj.__name__, obj.__bases__, {"__doc__": obj.__doc__}), obj=obj)
-            d.pop('__doc__', None)
-            # handle property and staticmethod
-            dd = {}
-            for k, v in d.items():

--- End diff --

Gentle re-ping to @davies - do you have an opinion on this?
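The branch being removed in the diff above copies a class's `__dict__` and strips the `__dict__`/`__weakref__` entries before pickling. A minimal illustration of why that stripping matters (this is a sketch of the idea, not cloudpickle itself): those entries are descriptors that Python recreates automatically when the class is rebuilt, so they must not be carried along as state.

```python
# Sketch of extracting a picklable class dict, mirroring the removed
# cloudpickle branch: copy the mappingproxy to a real dict, then drop
# the auto-generated '__dict__' and '__weakref__' descriptors.

def extract_class_dict(cls):
    d = dict(vars(cls))  # copy dict proxy to a plain dict
    if not isinstance(d.get('__dict__', None), property):
        # don't strip when '__dict__' is a user-defined property
        d.pop('__dict__', None)
        d.pop('__weakref__', None)
    return d

class Point:
    x = 1

d = extract_class_dict(Point)
# the class can be reconstructed from the cleaned dict
Rebuilt = type('Point', (object,), d)
```

Serializing the raw descriptors instead would either fail to pickle or clash with the ones `type()` generates on reconstruction.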
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/18630 This is how I build things:

    ./build/mvn -Pmesos -Phadoop-2.7 -Dhadoop.version=2.7.0 -DskipTests clean package
    export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre/
    ./dev/make-distribution.sh --name 18630 --tgz -Phadoop-2.7 -Pmesos
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/18630 @BryanCutler sure check here, it works: https://gist.github.com/skonto/dc2070d1529c97ec5de32e99983a834f
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18630 Maybe it was just something with my env - but I was running it locally, can you just verify that works too? Just don't specify the `--master` conf and run out of your spark home dir
[GitHub] spark issue #16158: [SPARK-18724][ML] Add TuningSummary for TrainValidationS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16158 Merged build finished. Test PASSed.
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/18630

> spark-2.3.0-SNAPSHOT-bin-18630/bin$ ./spark-shell --verbose --master spark://ip-10-10-1-79:7077

    Using properties file: null
    Parsed arguments:
      master                  spark://ip-10-10-1-79:7077
      deployMode              null
      executorMemory          null
      executorCores           null
      totalExecutorCores      null
      propertiesFile          null
      driverMemory            null
      driverCores             null
      driverExtraClassPath    null
      driverExtraLibraryPath  null
      driverExtraJavaOptions  null
      supervise               false
      queue                   null
      numExecutors            null
      files                   null
      pyFiles                 null
      archives                null
      mainClass               org.apache.spark.repl.Main
      primaryResource         spark-shell
      name                    Spark shell
      childArgs               []
      jars                    null
      packages                null
      packagesExclusions      null
      repositories            null
      verbose                 true

    Spark properties used, including those specified through --conf and those from the properties file null:

    Main class:
    org.apache.spark.repl.Main
    Arguments:

    System properties:
    (SPARK_SUBMIT,true)
    (spark.app.name,Spark shell)
    (spark.jars,)
    (spark.submit.deployMode,client)
    (spark.master,spark://ip-10-10-1-79:7077)
    Classpath elements:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    17/08/09 23:28:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Spark context Web UI available at http://10.10.1.79:4040
    Spark context available as 'sc' (master = spark://ip-10-10-1-79:7077, app id = app-20170809232804-0003).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
          /_/

    Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala>
[GitHub] spark issue #16158: [SPARK-18724][ML] Add TuningSummary for TrainValidationS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16158 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80467/ Test PASSed.
[GitHub] spark issue #16158: [SPARK-18724][ML] Add TuningSummary for TrainValidationS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16158 **[Test build #80467 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80467/testReport)** for PR 16158 at commit [`72aea62`](https://github.com/apache/spark/commit/72aea626bb1fef4a2834e1054bac99451f04c0e2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18630 Yeah, just by running `bin/spark-shell` it failed immediately with that error. I double-checked by rebuilding and got the same thing, but I'm not sure if it was something from your changes or not. Are you able to start up the shell?
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/18630 @BryanCutler you just started spark shell and it failed? How can I reproduce it?
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/18630 Sure, python support could be added at a later point; I was just thinking it might be only a small addition to what's already here, but no problem. Btw, after checking out this PR I tried spark-shell and got the error below. Not sure if it was my environment, but after switching back to master it worked fine.
```
bin/spark-shell
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.hadoop.conf.Configuration.(Configuration.java:178)
    at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:324)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:155)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
```
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/18630

I wasn't really expecting python support to be added here. I wonder if there's a bug open for that.
[GitHub] spark pull request #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Pa...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/17849#discussion_r132330891

--- Diff: python/pyspark/ml/wrapper.py ---
@@ -263,7 +284,8 @@ def _fit_java(self, dataset):
     def _fit(self, dataset):
         java_model = self._fit_java(dataset)
-        return self._create_model(java_model)
+        model = self._create_model(java_model)
+        return self._copyValues(model)
--- End diff --

This is the crucial line being added in this PR. Without this, if a Python model defines a param (matching one from Scala), then when the model is fit in Scala that param value will never be sent back to Python.
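As a rough standalone sketch of the pattern being discussed (plain Python with a minimal stand-in for pyspark.ml's `Params` base class; the names `_set`, `_copyValues`, and `_fit` mirror PySpark conventions but this is an illustration, not PySpark's actual implementation): the fix copies every explicitly set param from the estimator onto the freshly created model, since the model's own param map starts empty.

```python
class Params:
    """Minimal stand-in for a pyspark.ml-style params mixin."""

    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        self._paramMap.update(kwargs)
        return self

    def getOrDefault(self, name):
        return self._paramMap[name]

    def _copyValues(self, to):
        # Copy every explicitly set param from self onto `to`.
        for name, value in self._paramMap.items():
            to._set(**{name: value})
        return to


class Model(Params):
    pass


class Estimator(Params):
    def _fit(self, dataset):
        model = Model()  # fresh model: its param map starts empty
        # Without the _copyValues call, params set on the estimator
        # would never reach the returned model.
        return self._copyValues(model)


est = Estimator()._set(threshold=0.5)
model = est._fit(dataset=None)
assert model.getOrDefault("threshold") == 0.5
```

In PySpark the real model is wrapped around a fitted JVM object, which is why a param set only on the Python side would otherwise be lost: the JVM round trip never sees it.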
[GitHub] spark issue #18630: [SPARK-12559][SPARK SUBMIT] fix --packages for stand-alo...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/18630

@BryanCutler @vanzin from a quick look I took, DriverWrapper needs refactoring to make things testable. Py files are resolved in client mode; let's fix it in another PR (I could do it). The docs (https://spark.apache.org/docs/latest/submitting-applications.html) state: "Currently, standalone mode does not support cluster mode for Python applications." So is the file distribution the only thing to do? I haven't scoped the work needed to support python apps.
[GitHub] spark issue #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpickle aga...
Github user rgbkrk commented on the issue: https://github.com/apache/spark/pull/18734

Just a note that we just shipped the fixes from @HyukjinKwon within cloudpickle (as v0.4.0), so we're at least roughly in sync now.
[GitHub] spark issue #17849: [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Val...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/17849

ping @holdenk , also @HyukjinKwon if you are able to take a look