[GitHub] spark issue #19511: [SPARK-22293][SQL] Avoid unnecessary traversal in Resolv...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19511 OK, close it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19511: [SPARK-22293][SQL] Avoid unnecessary traversal in Resolv...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19511 @ConeyLiu I just saw that extra code/logic was added. Maybe close it first?
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19480 @kiszk Does this look good to you?
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82872/
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19269 Merged build finished. Test PASSed.
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82872/testReport)** for PR 19269 at commit [`9e12d9f`](https://github.com/apache/spark/commit/9e12d9ffc4e2368e709b018b6117cbeef743d6f3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19480 cc @rednaxelafx Do you have bandwidth to review this?
[GitHub] spark issue #18747: [SPARK-20822][SQL] Generate code to directly get value f...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18747 ping @cloud-fan
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19521 `SPARK-15474` is the zero-row case. The case above is zero columns. Are they the same issue?
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19524 Merged build finished. Test PASSed.
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19524 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82871/
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82871/testReport)** for PR 19524 at commit [`26eeae1`](https://github.com/apache/spark/commit/26eeae140093896155d0b7196cd2de7dd4c6bda6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19522 **[Test build #82873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82873/testReport)** for PR 19522 at commit [`e95bc7b`](https://github.com/apache/spark/commit/e95bc7b395e027aa3d1e719d987b4f5a4461c34b).
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Thank you for review, @gatorsmile .
1. The test case was added at #15898 (SPARK-18457). I guess Parquet returns `null`, but we had better have explicit test cases. I will try to extend that test case for Parquet next time.
2. Thanks for bringing that up. Yes, we can resolve that empty ORC file issue, SPARK-15474 (ORC data source fails to write and read back empty dataframe), with the new ORC source by creating an empty file with the correct schema, not `struct<>`. BTW, I've linked all related ORC issues into [SPARK-20901](https://issues.apache.org/jira/browse/SPARK-20901) and am working on it. You can monitor ORC progress there.
[GitHub] spark pull request #19523: [SPARK-22301][SQL] Add rule to Optimizer for In w...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19523#discussion_r145318964
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -102,7 +102,8 @@ case class InMemoryTableScanExec(
      case IsNull(a: Attribute) => statsFor(a).nullCount > 0
      case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
-     case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
+     // We rely on the optimizations in org.apache.spark.sql.catalyst.optimizer.OptimizeIn
+     // to be sure that the list cannot be empty
--- End diff --
We can remove this line after we merge this PR https://github.com/apache/spark/pull/19522
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19522 ok to test
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19272 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82870/
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19272 Build finished. Test PASSed.
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19272 **[Test build #82870 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82870/testReport)** for PR 19272 at commit [`e522150`](https://github.com/apache/spark/commit/e522150c32ffca9f116197f80530554d3cb68b6c).
* This patch passes all tests.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19521 We can save an empty DataFrame as an ORC table, but we are unable to fetch it from the table.
```Scala
val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
dfNoCols.write.format("orc").saveAsTable("t")
spark.sql("select 1 from t").show()
```
This is not related to this upgrade, but you might be interested in this.
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19521 Also LGTM. Regarding the test case you posted, does Parquet return `null` or an empty string?
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user shaneknapp commented on the issue: https://github.com/apache/spark/pull/19524 yeah, this makes sense. i don't think we've officially supported 2.6 in a while, esp for tests and i'd be ok w/removing the backports. this makes for a much clearer exit case. @rxin @sameeragarwal
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Thank you for review, @cloud-fan !
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19521 looks good, no new dependencies introduced, just upgrading. cc @srowen to double check. Thanks!
[GitHub] spark pull request #19523: [SPARK-22301][SQL] Add rule to Optimizer for In w...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19523#discussion_r145306880
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
@@ -204,6 +204,7 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate {
   override def children: Seq[Expression] = value +: list
   lazy val inSetConvertible = list.forall(_.isInstanceOf[Literal])
+  lazy val isListEmpty = list.isEmpty
--- End diff --
Do we really need this val?
[GitHub] spark pull request #19523: [SPARK-22301][SQL] Add rule to Optimizer for In w...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19523#discussion_r145307826
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -102,7 +102,8 @@ case class InMemoryTableScanExec(
      case IsNull(a: Attribute) => statsFor(a).nullCount > 0
      case IsNotNull(a: Attribute) => statsFor(a).count - statsFor(a).nullCount > 0
-     case In(_: AttributeReference, list: Seq[Expression]) if list.isEmpty => Literal.FalseLiteral
+     // We rely on the optimizations in org.apache.spark.sql.catalyst.optimizer.OptimizeIn
+     // to be sure that the list cannot be empty
--- End diff --
IMHO this comment is not accurate, since in the optimizer we only deal with the case where the attribute is not nullable.
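wzhfy's point that the empty-list shortcut is only safe for non-nullable values can be illustrated with a toy model of the rewrite. This is a hypothetical Python sketch for illustration, not Catalyst's actual rule:

```python
def optimize_empty_in(value_nullable, in_list):
    """Toy model of the empty-list case of an OptimizeIn-style rewrite.

    `x IN ()` can only be constant-folded to false when x is known to be
    non-nullable: if x is NULL, the SQL result is NULL rather than false,
    so the rewrite must not fire for nullable attributes.
    """
    if not in_list and not value_nullable:
        return False  # fold to a literal false
    return None       # rule does not apply; leave the expression unchanged
```

The `None` return models "no rewrite": a nullable attribute with an empty list must keep its three-valued NULL semantics.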
[GitHub] spark pull request #19495: [SPARK-22278][SS] Expose current event time water...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19495
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Merged build finished. Test PASSed.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82869/
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19495 **[Test build #82869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82869/testReport)** for PR 19495 at commit [`ed9d3a2`](https://github.com/apache/spark/commit/ed9d3a2e4fd4b1a0ec0e0ceb34419ede4e2303b8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82872/testReport)** for PR 19269 at commit [`9e12d9f`](https://github.com/apache/spark/commit/9e12d9ffc4e2368e709b018b6117cbeef743d6f3).
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Merged build finished. Test PASSed.
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82866/
[GitHub] spark issue #19515: [SPARK-22287][MESOS] SPARK_DAEMON_MEMORY not honored by ...
Github user pmackles commented on the issue: https://github.com/apache/spark/pull/19515 Hi @ArtRand - Based on my testing and interpretation of the code, `SPARK_DRIVER_MEMORY` has no effect on MesosClusterDispatcher. The heap size always winds up being set to `-Xmx1G` (the default). I went with `SPARK_DAEMON_MEMORY` because that's what is used for the Master/Worker in a standalone cluster, and I think the Dispatcher is pretty similar in nature to those daemons. That said, I don't have a strong opinion, and `SPARK_DISPATCHER_MEMORY` works fine for my use case, so I am happy to switch it in the name of getting this PR accepted.
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19459 **[Test build #82866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82866/testReport)** for PR 19459 at commit [`81ddfa9`](https://github.com/apache/spark/commit/81ddfa9afa03531edd9ac7b805a09be2be96d88c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Merged build finished. Test PASSed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82865/
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82865/testReport)** for PR 19488 at commit [`ece2062`](https://github.com/apache/spark/commit/ece206223822bde6e36d654a0d34593c405f8ffd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82871/testReport)** for PR 19524 at commit [`26eeae1`](https://github.com/apache/spark/commit/26eeae140093896155d0b7196cd2de7dd4c6bda6).
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19524 cc @JoshRosen, @holdenk and @shaneknapp, could you take a look and see if it makes sense, please?
[GitHub] spark pull request #19524: [SPARK-22302][INFRA] Remove manual backports for ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/19524 [SPARK-22302][INFRA] Remove manual backports for subprocess and print explicit message for < Python 2.7

## What changes were proposed in this pull request?

Seems there is a mistake from a missing import for `subprocess.call`, which should be used by the Python 2.6 backports for some missing functions in `subprocess`. Reproduction:
```
cd dev && python2.6
```
```
>>> from sparktestsupport import shellutils
>>> shellutils.subprocess_check_call("ls")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call
    retcode = call(*popenargs, **kwargs)
NameError: global name 'call' is not defined
```
For Jenkins logs, please see https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console

Since we dropped Python 2.6.x support, it looks better to remove those workarounds and print an explicit error message instead, to avoid duplicated effort in tracking down the root cause of such failures.

## How was this patch tested?

Manually tested:
```
cd dev && python2.6
```
```
>>> from sparktestsupport import shellutils
[error] Python versions prior to 2.7 are not supported.
```
```
cd dev && python2.7
```
```
>>> from sparktestsupport import shellutils
>>> shellutils.subprocess_check_output("ls")
'README.md\nappveyor-guide.md\nappveyor-install-dependencies.ps1\nchange-scala-version.sh\ncheck-license\ncheckstyle-suppressions.xml\ncheckstyle.xml\ncreate-release\ndeps\ngithub_jira_sync.py\ngithub_jira_sync.pyc\nlint-java\nlint-python\nlint-r\nlint-r-report.log\nlint-r.R\nlint-scala\nmake-distribution.sh\nmerge_spark_pr.py\nmerge_spark_pr.pyc\nmerge_spark_pr_jira.py\nmerge_spark_pr_jira.pyc\nmima\npep8-1.7.0.py\npep8-1.7.0.pyc\npip-sanity-check.py\npip-sanity-check.pyc\nrequirements.txt\nrun-pip-tests\nrun-tests\nrun-tests-jenkins\nrun-tests-jenkins.py\nrun-tests-jenkins.pyc\nrun-tests.py\nrun-tests.pyc\nscalastyle\nsparktestsupport\ntest-dependencies.sh\ntests\ntox.ini\n'
>>> shellutils.subprocess_check_call("ls")
(directory listing of dev/ printed in columns)
0
```
```
cd dev && python3.6
```
```
>>> from sparktestsupport import shellutils
>>> shellutils.subprocess_check_output("ls")
b'README.md\nappveyor-guide.md\nappveyor-install-dependencies.ps1\nchange-scala-version.sh\ncheck-license\ncheckstyle-suppressions.xml\ncheckstyle.xml\ncreate-release\ndeps\ngithub_jira_sync.py\ngithub_jira_sync.pyc\nlint-java\nlint-python\nlint-r\nlint-r-report.log\nlint-r.R\nlint-scala\nmake-distribution.sh\nmerge_spark_pr.py\nmerge_spark_pr.pyc\nmerge_spark_pr_jira.py\nmerge_spark_pr_jira.pyc\nmima\npep8-1.7.0.py\npep8-1.7.0.pyc\npip-sanity-check.py\npip-sanity-check.pyc\nrequirements.txt\nrun-pip-tests\nrun-tests\nrun-tests-jenkins\nrun-tests-jenkins.py\nrun-tests-jenkins.pyc\nrun-tests.py\nrun-tests.pyc\nscalastyle\nsparktestsupport\ntest-dependencies.sh\ntests\ntox.ini\n'
>>> shellutils.subprocess_check_call("ls")
(directory listing of dev/ printed in columns)
```
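The fail-fast behavior exercised above (the `[error] Python versions prior to 2.7 are not supported.` message printed on import) can be sketched as a small version guard. This is an illustrative sketch only; the function name is hypothetical and not the PR's actual code:

```python
import sys

def require_python_27():
    # Print an explicit error instead of letting a missing 2.6-era backport
    # surface later as a confusing NameError deep inside subprocess helpers.
    if sys.version_info[:2] < (2, 7):
        print("[error] Python versions prior to 2.7 are not supported.")
        sys.exit(-1)

# Running the guard at module import time mirrors the behavior shown above.
require_python_27()
```

Running such a guard at the top of the support module turns an obscure runtime failure into an immediate, actionable message.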
[GitHub] spark pull request #19272: [Spark-21842][Mesos] Support Kerberos ticket rene...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19272#discussion_r145301221
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
@@ -194,6 +198,27 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
      sc.conf.getOption("spark.mesos.driver.frameworkId").map(_ + suffix)
    )
+    // check that the credentials are defined, even though it's likely that auth would have failed
+    // already if you've made it this far
+    if (principal != null && hadoopDelegationCreds.isDefined) {
+      logDebug(s"Principal found ($principal) starting token renewer")
+      val credentialRenewerThread = new Thread {
+        setName("MesosCredentialRenewer")
+        override def run(): Unit = {
+          val rt = MesosCredentialRenewer.getTokenRenewalTime(hadoopDelegationCreds.get, conf)
+          val credentialRenewer =
+            new MesosCredentialRenewer(
+              conf,
+              hadoopDelegationTokenManager.get,
+              MesosCredentialRenewer.getNextRenewalTime(rt),
+              driverEndpoint)
+          credentialRenewer.scheduleTokenRenewal()
+        }
+      }
+
+      credentialRenewerThread.start()
+      credentialRenewerThread.join()
--- End diff --
Yes, I believe so. If you look here (https://github.com/apache/spark/blob/6735433cde44b3430fb44edfff58ef8149c66c13/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L271) it's the same pattern. The thread needs to run as long as the application driver, correct?
[GitHub] spark pull request #19272: [Spark-21842][Mesos] Support Kerberos ticket rene...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19272#discussion_r145300847
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCredentialRenewer.scala ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.security.PrivilegedExceptionAction
+import java.util.concurrent.{Executors, TimeUnit}
+
+import scala.collection.JavaConverters._
+import scala.util.Try
+
+import org.apache.hadoop.security.UserGroupInformation
+
+import org.apache.spark.SparkConf
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.deploy.security.HadoopDelegationTokenManager
+import org.apache.spark.internal.Logging
+import org.apache.spark.rpc.RpcEndpointRef
+import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.UpdateDelegationTokens
+import org.apache.spark.util.ThreadUtils
+
+
+class MesosCredentialRenewer(
+    conf: SparkConf,
+    tokenManager: HadoopDelegationTokenManager,
+    nextRenewal: Long,
+    de: RpcEndpointRef) extends Logging {
+  private val credentialRenewerThread =
+    Executors.newSingleThreadScheduledExecutor(
--- End diff --
I also changed `AMCredentialRenewer` to the same.
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19272 **[Test build #82870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82870/testReport)** for PR 19272 at commit [`e522150`](https://github.com/apache/spark/commit/e522150c32ffca9f116197f80530554d3cb68b6c).
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19519 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82863/
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19519 Merged build finished. Test PASSed.
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19519 **[Test build #82863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82863/testReport)** for PR 19519 at commit [`d4466f2`](https://github.com/apache/spark/commit/d4466f269451ef23d9c4d45fcd5b1c03988ac0e7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19272: [Spark-21842][Mesos] Support Kerberos ticket rene...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19272#discussion_r145298313
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCredentialRenewer.scala ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.security.PrivilegedExceptionAction
+import java.util.concurrent.{Executors, TimeUnit}
+
+import scala.collection.JavaConverters._
+import scala.util.Try
+
+import org.apache.hadoop.security.UserGroupInformation
+
+import org.apache.spark.SparkConf
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.deploy.security.HadoopDelegationTokenManager
+import org.apache.spark.internal.Logging
+import org.apache.spark.rpc.RpcEndpointRef
+import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.UpdateDelegationTokens
+import org.apache.spark.util.ThreadUtils
+
+
+class MesosCredentialRenewer(
+    conf: SparkConf,
+    tokenManager: HadoopDelegationTokenManager,
+    nextRenewal: Long,
+    de: RpcEndpointRef) extends Logging {
--- End diff --
fixed.
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r145298003 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -380,7 +389,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( } else { declineOffer( driver, - offer) + offer, --- End diff -- Yeah for now that's fine. I was thinking more specific messages but we can address that on a later ticket.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Merged build finished. Test PASSed.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82864/ Test PASSed.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19495 **[Test build #82864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82864/testReport)** for PR 19495 at commit [`0d788fe`](https://github.com/apache/spark/commit/0d788fe1d9f5cf430ac548bbda3947b3d879505b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145296142 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Command.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.{SparkException, TaskContext} +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.expressions.UnsafeRow +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.execution.command.RunnableCommand +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + +case class WriteToDataSourceV2Command(writer: DataSourceV2Writer, query: LogicalPlan) + extends RunnableCommand { --- End diff -- I've also observed this issue where the explain output of commands behaves differently than from logical plans, and have a repro at https://issues.apache.org/jira/browse/SPARK-22204
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19437#discussion_r145294156 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala --- @@ -173,6 +178,90 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging { containerInfo } + private def getSecrets(conf: SparkConf, secretConfig: MesosSecretConfig): Seq[Secret] = { +def createValueSecret(data: String): Secret = { + Secret.newBuilder() +.setType(Secret.Type.VALUE) + .setValue(Secret.Value.newBuilder().setData(ByteString.copyFrom(data.getBytes))) +.build() +} + +def createReferenceSecret(name: String): Secret = { + Secret.newBuilder() +.setReference(Secret.Reference.newBuilder().setName(name)) +.setType(Secret.Type.REFERENCE) +.build() +} + +val referenceSecrets: Seq[Secret] = + conf.get(secretConfig.SECRET_NAMES).getOrElse(Nil).map(s => createReferenceSecret(s)) + +val valueSecrets: Seq[Secret] = { + conf.get(secretConfig.SECRET_VALUES).getOrElse(Nil).map(s => createValueSecret(s)) +} + +if (valueSecrets.nonEmpty && referenceSecrets.nonEmpty) { + throw new SparkException("Cannot specify VALUE type secrets and REFERENCE types ones") +} + +if (referenceSecrets.nonEmpty) referenceSecrets else valueSecrets + } + + private def illegalSecretInput(dest: Seq[String], secrets: Seq[Secret]): Boolean = { +if (dest.nonEmpty) { + // make sure there is a one-to-one correspondence between destinations and secrets + if (dest.length != secrets.length) { +return true + } +} +false + } + + def getSecretVolume(conf: SparkConf, secretConfig: MesosSecretConfig): List[Volume] = { +val secrets = getSecrets(conf, secretConfig) +val secretPaths: Seq[String] = + conf.get(secretConfig.SECRET_FILENAMES).getOrElse(Nil) + +if (illegalSecretInput(secretPaths, secrets)) { + throw new SparkException( +s"Need to give equal numbers of secrets and file paths for file-based " + + s"reference secrets got secrets $secrets, and 
paths $secretPaths") +} + +secrets.zip(secretPaths).map { + case (s, p) => +val source = Volume.Source.newBuilder() + .setType(Volume.Source.Type.SECRET) + .setSecret(s) +Volume.newBuilder() + .setContainerPath(p) + .setSource(source) + .setMode(Volume.Mode.RO) + .build +}.toList + } + + def getSecretEnvVar(conf: SparkConf, secretConfig: MesosSecretConfig): + List[Variable] = { --- End diff -- indentation?
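The validation discussed in the diff above enforces two invariants: reference (name-based) and value-based secrets are mutually exclusive, and file-based secrets require exactly one destination path per secret before they are zipped into volumes. A minimal Python sketch of those checks, for illustration only (the real implementation is the Scala `getSecrets`/`getSecretVolume` shown above; the function name here is hypothetical):

```python
def pair_secrets_with_paths(reference_names, literal_values, paths):
    """Illustrative restatement of the checks in the Scala diff above."""
    if reference_names and literal_values:
        # mirrors: "Cannot specify VALUE type secrets and REFERENCE type ones"
        raise ValueError("Cannot mix REFERENCE and VALUE type secrets")
    secrets = reference_names if reference_names else literal_values
    if paths and len(paths) != len(secrets):
        # mirrors illegalSecretInput: destinations and secrets must pair 1:1
        raise ValueError("Need equal numbers of secrets and file paths")
    # one (secret, container path) pair per file-based secret volume
    return list(zip(secrets, paths))
```

As in the Scala code, the paired list is then mapped one-to-one into mount descriptions, so a length mismatch has to fail fast rather than silently drop secrets.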
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19437#discussion_r145294352 --- Diff: docs/running-on-mesos.md --- @@ -501,23 +503,74 @@ See the [configuration page](configuration.html) for information on Spark config spark.mesos.driver.secret.names or spark.mesos.driver.secret.values will be written to the provided file. Paths are relative to the container's work directory. Absolute paths must already exist. Consult the Mesos Secret -protobuf for more information. +protobuf for more information. Example: --- End diff -- I'm not sure what the policy should be on this, but IIUC file-based secrets do require a backend module. Should we mention that here?
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19437#discussion_r145294074 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala --- @@ -17,10 +17,14 @@ package org.apache.spark.scheduler.cluster.mesos --- End diff -- Ah ok. One of these days let's make a "clean up" JIRA and harmonize all of this code. The naming is also all over the place..
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19437#discussion_r145294610 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala --- @@ -122,7 +126,8 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging { .toList } - def containerInfo(conf: SparkConf): ContainerInfo.Builder = { + def buildContainerInfo(conf: SparkConf): + ContainerInfo.Builder = { --- End diff -- indentation?
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18664 Merged build finished. Test PASSed.
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18664 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82862/ Test PASSed.
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18664 **[Test build #82862 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82862/testReport)** for PR 18664 at commit [`e428cbe`](https://github.com/apache/spark/commit/e428cbe2f8e626510e6660b38faff7f03b229e92). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19480 Merged build finished. Test PASSed.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19480 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82861/ Test PASSed.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19480 **[Test build #82861 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82861/testReport)** for PR 19480 at commit [`bce3616`](https://github.com/apache/spark/commit/bce3616cccfb5c1b1dc7fca32d17764fda142e7d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Merged build finished. Test PASSed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82860/ Test PASSed.
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145294433 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,43 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, df, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing the into partitions, converting +to Arrow data, then reading into the JVM to parallelsize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +import os +from tempfile import NamedTemporaryFile +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(df) // self.sparkContext.defaultParallelism) # round int up +df_slices = (df[start:start + step] for start in xrange(0, len(df), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(df_slice, schema=arrow_schema, preserve_index=False) + for df_slice in df_slices] + +# write batches to temp file, read by JVM (borrowed from context.parallelize) +tempFile = NamedTemporaryFile(delete=False, dir=self._sc._temp_dir) +try: +serializer = ArrowSerializer() +serializer.dump_stream(batches, tempFile) +tempFile.close() +readRDDFromFile = self._jvm.PythonRDD.readRDDFromFile +jrdd = readRDDFromFile(self._jsc, tempFile.name, len(batches)) +finally: +# readRDDFromFile eagerily reads the file so we can delete right after. +os.unlink(tempFile.name) + +# Create the Spark DataFrame, there will be at least 1 batch +schema = from_arrow_schema(batches[0].schema) --- End diff -- H.. still get the same exception with pyarrow 0.4.1 .. 
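The `_createFromPandasWithArrow` code under review slices the pandas DataFrame into one batch per partition, using `-(-len(df) // n)` as an integer ceiling division so the last, possibly shorter, slice is still produced. A small self-contained sketch of just that slicing idiom (plain Python lists stand in for the DataFrame; `slice_into_batches` is an illustrative name, not a Spark API):

```python
def slice_into_batches(data, parallelism):
    # -(-a // b) is integer ceiling division: rounds len(data)/parallelism up,
    # so even a 1-element input yields at least one batch.
    step = -(-len(data) // parallelism)
    return [data[start:start + step] for start in range(0, len(data), step)]

batches = slice_into_batches(list(range(10)), 4)
# 10 rows at parallelism 4 -> step 3 -> batch sizes 3, 3, 3, 1
```

The "there will be at least 1 batch" comment in the diff follows from the round-up: for any non-empty input, `step >= 1` and the first slice is non-empty.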
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82860/testReport)** for PR 19488 at commit [`9c33a0c`](https://github.com/apache/spark/commit/9c33a0cf82aff058f117fd309643006118ea3b08). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19317#discussion_r145294297 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala --- @@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)]) * as in scala.TraversableOnce. The former operation is used for merging values within a * partition, and the latter is used for merging values between partitions. To avoid memory * allocation, both of these functions are allowed to modify and return their first argument + * instead of creating a new U. This method is different from the ordinary "aggregateByKey" + * method, it directly returns a map to the driver, rather than a rdd. This will also perform + * the merging locally on each mapper before sending results to a reducer, similarly to a --- End diff -- thanks for the advice, I'll close it and try `mapPartitions(...).collect` in `NaiveBayes`.
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
Github user ConeyLiu closed the pull request at: https://github.com/apache/spark/pull/19317
[GitHub] spark issue #19511: [SPARK-22293][SQL] Avoid unnecessary traversal in Resolv...
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19511 Hi @gatorsmile, if we can combine the two traversals, this should simplify the code, not complicate it. However, it doesn't bring a big performance improvement. I can close it if this change is unnecessary.
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19459 BTW, https://github.com/apache/spark/pull/19459#discussion_r145034007 looks missed :).
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145293860 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] --- End diff -- However, if we go 1. way, I think we should avoid creating whole batches first. I think falling back might make sense if its cost is cheap.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19480 Merged build finished. Test PASSed.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19480 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82858/ Test PASSed.
[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19480 **[Test build #82858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82858/testReport)** for PR 19480 at commit [`95b0ad8`](https://github.com/apache/spark/commit/95b0ad82a68828055c45dbd729c753fae2007a5c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145293209 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] + +# Verify schema, there will be at least 1 batch from pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) +if schema is not None and schema != schema_from_arrow: +raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " + + "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow))) --- End diff -- I'd prefer 1. if anything becomes too complicated to match the results for now. I guess 2. could be failed for any unexpected reason comparing `createDataFrame` without Arrow so I thought 1. is required as a guarantee and 2. is good to do. I wonder what @ueshin thinks about this.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82868/ Test PASSed.
[GitHub] spark issue #19515: [SPARK-22287][MESOS] SPARK_DAEMON_MEMORY not honored by ...
Github user ArtRand commented on the issue: https://github.com/apache/spark/pull/19515 Hello @pmackles, thanks for this. It would be set to the value of `SPARK_DRIVER_MEMORY` by default, correct? What do you think about introducing a new envvar (`SPARK_DISPATCHER_MEMORY`) so that it can be individually configured? Only motivated by the fact that the Dispatcher is a pretty unique beast compared to the other Spark daemons.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Merged build finished. Test PASSed.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19437 **[Test build #82868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82868/testReport)** for PR 19437 at commit [`b2a3675`](https://github.com/apache/spark/commit/b2a36753b41820ec5e2d85a6a29fd8677bc0029a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Merged build finished. Test PASSed.
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19495 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82859/ Test PASSed.
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145292822 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] + +# Verify schema, there will be at least 1 batch from pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) +if schema is not None and schema != schema_from_arrow: +raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " + + "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow))) --- End diff -- Oh, actually we do have for warning in Python: ```python import warnings warnings.warn("...") ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
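The alternative being discussed above is to emit a warning instead of raising when the supplied schema differs from the one Arrow infers. A minimal sketch of that pattern, assuming a hypothetical `check_schema` helper (not the actual PySpark implementation, which raises a `ValueError` in the diff):

```python
import warnings

def check_schema(expected, actual):
    # Warn-instead-of-raise option discussed above: flag a schema
    # mismatch without failing the conversion. Returns whether the
    # schemas matched (or no expected schema was supplied).
    if expected is not None and expected != actual:
        warnings.warn("Supplied schema does not match result from Arrow: "
                      "%s != %s" % (expected, actual))
        return False
    return True
```

Callers could then fall back to the non-Arrow `createDataFrame` path when this returns `False`, which is the "fall back if its cost is cheap" idea from the earlier comment.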
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19488 LGTM except one comment, thanks for working on it!
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19495 **[Test build #82859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82859/testReport)** for PR 19495 at commit [`037e036`](https://github.com/apache/spark/commit/037e03658b2102cb1025019b2a22535a5ecb3f50). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19488: [SPARK-22266][SQL] The same aggregate function wa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19488#discussion_r145292665 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2677,4 +2678,29 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { checkAnswer(df, Row(1, 1, 1)) } } + + test("SRARK-22266: the same aggregate function was calculated multiple times") { +val query = "SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a" +val df = sql(query) +val physical = df.queryExecution.sparkPlan +val aggregates = physical.collect { + case agg : HashAggregateExec => agg +} +aggregates.foreach { agg => --- End diff -- nit: ``` assert(aggregates.length == 1) assert(aggregates.head.aggregateExpressions.size == 1) ``` If the implementation changed and we execute the query with other aggregate implementations, we should make sure this test is not ignored.
[GitHub] spark pull request #19488: [SPARK-22266][SQL] The same aggregate function wa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19488#discussion_r145292679 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2677,4 +2678,29 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { checkAnswer(df, Row(1, 1, 1)) } } + + test("SRARK-22266: the same aggregate function was calculated multiple times") { +val query = "SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a" +val df = sql(query) +val physical = df.queryExecution.sparkPlan +val aggregates = physical.collect { + case agg : HashAggregateExec => agg +} +aggregates.foreach { agg => + assert (agg.aggregateExpressions.size == 1) +} +checkAnswer(df, Row(1, 3, 4) :: Row(2, 3, 4) :: Row(3, 3, 4) :: Nil) + } + + test("Non-deterministic aggregate functions should not be deduplicated") { +val query = "SELECT a, first_value(b), first_value(b) + 1 FROM testData2 GROUP BY a" +val df = sql(query) +val physical = df.queryExecution.sparkPlan +val aggregates = physical.collect { + case agg : HashAggregateExec => agg +} +aggregates.foreach { agg => --- End diff -- ditto
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Merged build finished. Test PASSed.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19437 **[Test build #82867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82867/testReport)** for PR 19437 at commit [`770d307`](https://github.com/apache/spark/commit/770d307bd67796b93f20d7dba105905059af7a0e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82867/ Test PASSed.
[GitHub] spark pull request #19374: [SPARK-22145][MESOS] fix supervise with checkpoin...
Github user ArtRand commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19374#discussion_r145292254

    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
    @@ -135,22 +135,24 @@ private[spark] class MesosClusterScheduler(
       private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false)
       private val schedulerState = engineFactory.createEngine("scheduler")
       private val stateLock = new Object()
    +  // Keyed by submission id
       private val finishedDrivers =
         new mutable.ArrayBuffer[MesosClusterSubmissionState](retainedDrivers)
       private var frameworkId: String = null
    -  // Holds all the launched drivers and current launch state, keyed by driver id.
    +  // Holds all the launched drivers and current launch state, keyed by submission id.
       private val launchedDrivers = new mutable.HashMap[String, MesosClusterSubmissionState]()
       // Holds a map of driver id to expected slave id that is passed to Mesos for reconciliation.
       // All drivers that are loaded after failover are added here, as we need get the latest
    -  // state of the tasks from Mesos.
    +  // state of the tasks from Mesos. Keyed by task Id.
       private val pendingRecover = new mutable.HashMap[String, SlaveID]()
    -  // Stores all the submitted drivers that hasn't been launched.
    +  // Stores all the submitted drivers that hasn't been launched, keyed by submission id
       private val queuedDrivers = new ArrayBuffer[MesosDriverDescription]()
    -  // All supervised drivers that are waiting to retry after termination.
    +  // All supervised drivers that are waiting to retry after termination, keyed by submission id
       private val pendingRetryDrivers = new ArrayBuffer[MesosDriverDescription]()
       private val queuedDriversState = engineFactory.createEngine("driverQueue")
       private val launchedDriversState = engineFactory.createEngine("launchedDrivers")
       private val pendingRetryDriversState = engineFactory.createEngine("retryList")
    +  private final val RETRY_ID = "-retry-"
    --- End diff --

    Sorry, super nit, maybe `RETRY_SEP`. Feel free to ignore.
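For context on the `RETRY_ID` vs. `RETRY_SEP` naming nit: the constant is a separator embedded in task ids so the scheduler can relate a retried task back to its submission id. A hedged Python sketch of that idea (helper names and the id format are illustrative, not the actual Spark implementation):

```python
RETRY_SEP = "-retry-"  # the reviewer's suggested name for the separator constant

def retry_task_id(submission_id: str, attempt: int) -> str:
    """Build a task id that encodes the retry attempt after the separator."""
    return f"{submission_id}{RETRY_SEP}{attempt}"

def submission_id_of(task_id: str) -> str:
    """Recover the submission id by splitting on the separator; ids without
    a retry suffix come back unchanged."""
    return task_id.split(RETRY_SEP, 1)[0]
```

The suggestion `RETRY_SEP` reads better precisely because the constant is a separator inside an id, not an id itself.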
[GitHub] spark issue #19374: [SPARK-22145][MESOS] fix supervise with checkpointing on...
Github user ArtRand commented on the issue: https://github.com/apache/spark/pull/19374 LGTM
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19495 **[Test build #82869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82869/testReport)** for PR 19495 at commit [`ed9d3a2`](https://github.com/apache/spark/commit/ed9d3a2e4fd4b1a0ec0e0ceb34419ede4e2303b8).
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19459#discussion_r145291702

    --- Diff: python/pyspark/sql/session.py ---
    @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema):
                 data = [schema.toInternal(row) for row in data]
             return self._sc.parallelize(data), schema
    +
    +    def _createFromPandasWithArrow(self, pdf, schema):
    +        """
    +        Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting
    +        to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the
    +        data types will be used to coerce the data in Pandas to Arrow conversion.
    +        """
    +        from pyspark.serializers import ArrowSerializer
    +        from pyspark.sql.types import from_arrow_schema, to_arrow_schema
    +        import pyarrow as pa
    +
    +        # Slice the DataFrame into batches
    +        step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
    +        pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step))
    +        arrow_schema = to_arrow_schema(schema) if schema is not None else None
    +        batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False)
    +                   for pdf_slice in pdf_slices]
    +
    +        # Verify schema, there will be at least 1 batch from pandas.DataFrame
    +        schema_from_arrow = from_arrow_schema(batches[0].schema)
    +        if schema is not None and schema != schema_from_arrow:
    +            raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " +
    +                             "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow)))
    --- End diff --

    @ueshin and @HyukjinKwon after thinking about what to do when the schema is not equal, I have some concerns:

    1. Fall back to `createDataFrame` without Arrow - I implemented this and it works fine, but there is no logging in Python (afaik), so my concern is that the fallback happens silently, causing bad performance that the user will not know the reason for.

    2. Cast types using `astype`, similar to `ArrowPandasSerializer.dump_stream` - The issue I see with that is when there are null values and ints have been promoted to floats. This works fine in `dump_stream` because we are working with a pd.Series, and pyarrow allows us to pass a validity mask, which ignores the filled values. There is no option to pass a mask for a pd.DataFrame, so I believe pyarrow would try to interpret whatever fill values are there and raise an error. I can look into this more, though.
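The slicing step quoted in the diff above relies on the `-(-n // d)` idiom for ceiling division. A standalone sketch of that batching logic, with plain lists standing in for the pandas.DataFrame (an illustration of the idiom, not the actual `_createFromPandasWithArrow` code):

```python
def slice_into_batches(rows, parallelism):
    """Split rows into at most `parallelism` contiguous slices.

    The step size is ceil(len(rows) / parallelism), computed with the
    negate-floor-negate trick, so no math.ceil import is needed and the
    last batch absorbs any remainder.
    """
    step = -(-len(rows) // parallelism)  # equivalent to ceil(len(rows) / parallelism)
    return [rows[start:start + step] for start in range(0, len(rows), step)]
```

Because the step rounds up, 10 rows at parallelism 3 yield batches of sizes 4, 4, and 2 rather than leaving a fourth tiny batch.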
[GitHub] spark issue #19495: [SPARK-22278][SS] Expose current event time watermark an...
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/19495 LGTM!
[GitHub] spark issue #19523: [SPARK-22301][SQL] Add rule to Optimizer for In with emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19523 Can one of the admins verify this patch?
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19437 **[Test build #82868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82868/testReport)** for PR 19437 at commit [`b2a3675`](https://github.com/apache/spark/commit/b2a36753b41820ec5e2d85a6a29fd8677bc0029a).
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19437#discussion_r145287767

    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala ---
    @@ -170,9 +175,119 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging {
           containerInfo.addNetworkInfos(info)
         }
    +
    +    getSecretVolume(conf, secretConfig).foreach { volume =>
    +      if (volume.getSource.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${volume.getSource.getSecret.getReference.getName}" +
    +          s"on file ${volume.getContainerPath}")
    +      } else {
    +        logInfo(s"Setting secret on file name=${volume.getContainerPath}")
    +      }
    +      containerInfo.addVolumes(volume)
    +    }
    +
         containerInfo
       }
    +
    +  def addSecretEnvVar(
    --- End diff --

    I've removed this method. To be more consistent, I've moved this code back into MesosClusterScheduler. There's a little duplication, because MesosCoarseGrainedSchedulerBackend now has a similar code snippet, but it does avoid the mutation.
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19437#discussion_r145290625

    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala ---
    @@ -170,9 +175,119 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging {
           containerInfo.addNetworkInfos(info)
         }
    +
    +    getSecretVolume(conf, secretConfig).foreach { volume =>
    +      if (volume.getSource.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${volume.getSource.getSecret.getReference.getName}" +
    +          s"on file ${volume.getContainerPath}")
    +      } else {
    +        logInfo(s"Setting secret on file name=${volume.getContainerPath}")
    +      }
    +      containerInfo.addVolumes(volume)
    +    }
    +
         containerInfo
       }
    +
    +  def addSecretEnvVar(
    +      envBuilder: Environment.Builder,
    +      conf: SparkConf,
    +      secretConfig: MesosSecretConfig): Unit = {
    +    getSecretEnvVar(conf, secretConfig).foreach { variable =>
    +      if (variable.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${variable.getSecret.getReference.getName}" +
    --- End diff --

    Fixed
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19437#discussion_r145290603

    --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala ---
    @@ -170,9 +175,119 @@ private[mesos] object MesosSchedulerBackendUtil extends Logging {
           containerInfo.addNetworkInfos(info)
         }
    +
    +    getSecretVolume(conf, secretConfig).foreach { volume =>
    +      if (volume.getSource.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${volume.getSource.getSecret.getReference.getName}" +
    +          s"on file ${volume.getContainerPath}")
    +      } else {
    +        logInfo(s"Setting secret on file name=${volume.getContainerPath}")
    +      }
    +      containerInfo.addVolumes(volume)
    +    }
    +
         containerInfo
       }
    +
    +  def addSecretEnvVar(
    +      envBuilder: Environment.Builder,
    +      conf: SparkConf,
    +      secretConfig: MesosSecretConfig): Unit = {
    +    getSecretEnvVar(conf, secretConfig).foreach { variable =>
    +      if (variable.getSecret.getReference.isInitialized) {
    +        logInfo(s"Setting reference secret ${variable.getSecret.getReference.getName}" +
    +          s"on file ${variable.getName}")
    +      } else {
    +        logInfo(s"Setting secret on environment variable name=${variable.getName}")
    +      }
    +      envBuilder.addVariables(variable)
    +    }
    +  }
    +
    +  private def getSecrets(conf: SparkConf, secretConfig: MesosSecretConfig):
    +      Seq[Secret] = {
    --- End diff --

    Fixed