[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21758 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21758 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93446/ Test FAILed.
[GitHub] spark pull request #21823: [SPARK-24870][SQL]Cache can't work normally if th...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21823#discussion_r204482993

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala ---
@@ -58,4 +61,16 @@ class SameResultSuite extends QueryTest with SharedSQLContext {
     val df4 = spark.range(10).agg(sumDistinct($"id"))
     assert(df3.queryExecution.executedPlan.sameResult(df4.queryExecution.executedPlan))
   }
+
+  test("Canonicalized result is not case-insensitive") {
--- End diff --
`Canonicalized result is not case-insensitive` -> `Canonicalized result is case-insensitive`
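The suggested rename reflects what canonicalization is meant to guarantee: identifier case is normalized away, so two plans that differ only in the case of attribute names compare as equal. A minimal pure-Python sketch of the idea (the names `canonicalize` and `same_result` are hypothetical stand-ins, not Spark's API):

```python
def canonicalize(columns):
    # Normalize identifier case (and ordering) so that comparison is
    # case-insensitive, loosely mirroring how Catalyst canonicalizes
    # plans before sameResult() comparisons.
    return tuple(sorted(c.lower() for c in columns))

def same_result(plan_a, plan_b):
    # Two plans are considered equivalent iff their canonical forms match.
    return canonicalize(plan_a) == canonicalize(plan_b)

print(same_result(["ID", "Value"], ["id", "value"]))  # True
print(same_result(["id"], ["name"]))                  # False
```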
[GitHub] spark pull request #21650: [SPARK-24624][SQL][PYTHON] Support mixture of Pyt...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/21650#discussion_r204482591

--- Diff: python/pyspark/sql/tests.py ---
@@ -5060,6 +5049,144 @@ def test_type_annotation(self):
         df = self.spark.range(1).select(pandas_udf(f=_locals['noop'], returnType='bigint')('id'))
         self.assertEqual(df.first()[0], 0)
 
+    def test_mixed_udf(self):
+        import pandas as pd
+        from pyspark.sql.functions import udf, pandas_udf
+
+        df = self.spark.range(0, 1).toDF('v')
+
+        # Test mixture of multiple UDFs and Pandas UDFs
+
+        @udf('int')
+        def f1(x):
+            assert type(x) == int
+            return x + 1
+
+        @pandas_udf('int')
+        def f2(x):
+            assert type(x) == pd.Series
+            return x + 10
+
+        @udf('int')
+        def f3(x):
+            assert type(x) == int
+            return x + 100
+
+        @pandas_udf('int')
+        def f4(x):
+            assert type(x) == pd.Series
+            return x + 1000
+
+        # Test mixed udfs in a single projection
+        df1 = df.withColumn('f1', f1(df['v']))
+        df1 = df1.withColumn('f2', f2(df1['v']))
+        df1 = df1.withColumn('f3', f3(df1['v']))
+        df1 = df1.withColumn('f4', f4(df1['v']))
+        df1 = df1.withColumn('f2_f1', f2(df1['f1']))
+        df1 = df1.withColumn('f3_f1', f3(df1['f1']))
+        df1 = df1.withColumn('f4_f1', f4(df1['f1']))
+        df1 = df1.withColumn('f3_f2', f3(df1['f2']))
+        df1 = df1.withColumn('f4_f2', f4(df1['f2']))
+        df1 = df1.withColumn('f4_f3', f4(df1['f3']))
+        df1 = df1.withColumn('f3_f2_f1', f3(df1['f2_f1']))
+        df1 = df1.withColumn('f4_f2_f1', f4(df1['f2_f1']))
+        df1 = df1.withColumn('f4_f3_f1', f4(df1['f3_f1']))
+        df1 = df1.withColumn('f4_f3_f2', f4(df1['f3_f2']))
+        df1 = df1.withColumn('f4_f3_f2_f1', f4(df1['f3_f2_f1']))
+
+        # Test mixed udfs in a single expression
+        df2 = df.withColumn('f1', f1(df['v']))
+        df2 = df2.withColumn('f2', f2(df['v']))
+        df2 = df2.withColumn('f3', f3(df['v']))
+        df2 = df2.withColumn('f4', f4(df['v']))
+        df2 = df2.withColumn('f2_f1', f2(f1(df['v'])))
+        df2 = df2.withColumn('f3_f1', f3(f1(df['v'])))
+        df2 = df2.withColumn('f4_f1', f4(f1(df['v'])))
+        df2 = df2.withColumn('f3_f2', f3(f2(df['v'])))
+        df2 = df2.withColumn('f4_f2', f4(f2(df['v'])))
+        df2 = df2.withColumn('f4_f3', f4(f3(df['v'])))
+        df2 = df2.withColumn('f3_f2_f1', f3(f2(f1(df['v']))))
+        df2 = df2.withColumn('f4_f2_f1', f4(f2(f1(df['v']))))
+        df2 = df2.withColumn('f4_f3_f1', f4(f3(f1(df['v']))))
+        df2 = df2.withColumn('f4_f3_f2', f4(f3(f2(df['v']))))
+        df2 = df2.withColumn('f4_f3_f2_f1', f4(f3(f2(f1(df['v'])))))
+
+        # expected result
+        df3 = df.withColumn('f1', df['v'] + 1)
+        df3 = df3.withColumn('f2', df['v'] + 10)
+        df3 = df3.withColumn('f3', df['v'] + 100)
+        df3 = df3.withColumn('f4', df['v'] + 1000)
+        df3 = df3.withColumn('f2_f1', df['v'] + 11)
+        df3 = df3.withColumn('f3_f1', df['v'] + 101)
+        df3 = df3.withColumn('f4_f1', df['v'] + 1001)
+        df3 = df3.withColumn('f3_f2', df['v'] + 110)
+        df3 = df3.withColumn('f4_f2', df['v'] + 1010)
+        df3 = df3.withColumn('f4_f3', df['v'] + 1100)
+        df3 = df3.withColumn('f3_f2_f1', df['v'] + 111)
+        df3 = df3.withColumn('f4_f2_f1', df['v'] + 1011)
+        df3 = df3.withColumn('f4_f3_f1', df['v'] + 1101)
+        df3 = df3.withColumn('f4_f3_f2', df['v'] + 1110)
+        df3 = df3.withColumn('f4_f3_f2_f1', df['v'] + 1111)
+
+        self.assertEquals(df3.collect(), df1.collect())
+        self.assertEquals(df3.collect(), df2.collect())
+
+    def test_mixed_udf_and_sql(self):
+        import pandas as pd
+        from pyspark.sql.functions import udf, pandas_udf
+
+        df = self.spark.range(0, 1).toDF('v')
+
+        # Test mixture of UDFs, Pandas UDFs and SQL expression.
+
+        @udf('int')
+        def f1(x):
+            assert type(x) == int
+            return x + 1
+
+        def f2(x):
+            return x + 10
+
+        @pandas_udf('int')
+        def f3(x):
+            assert type(x) == pd.Series
+            return x + 100
+
+        df1 = df.withColumn('f1', f1(df['v']))
+        df1 = df1.withColumn('f2', f2(df['v']))
+        df1 = df1.withColumn('f3', f3(df['v']))
+        df1 = df1.withColumn('f1_f2', f1(f2(df['v'])))
+        df1 = df1.wi
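The expected-result columns in `df3` follow from simple composition arithmetic: each test UDF adds a fixed offset (f1: +1, f2: +10, f3: +100, f4: +1000), so a chain such as `f4(f3(f2(f1(v))))` adds the sum of the offsets. The same bookkeeping in plain Python, with no Spark involved:

```python
# Offsets mirror the four test UDFs above: f1 adds 1, f2 adds 10,
# f3 adds 100, f4 adds 1000.
offsets = {'f1': 1, 'f2': 10, 'f3': 100, 'f4': 1000}

def chained_offset(*names):
    # Composing the UDFs simply sums their individual offsets,
    # which is how each df3 expected value is derived.
    return sum(offsets[name] for name in names)

print(chained_offset('f2', 'f1'))                    # 11
print(chained_offset('f3', 'f2', 'f1'))              # 111
print(chained_offset('f4', 'f3', 'f2', 'f1'))        # 1111
```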
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21758 **[Test build #93446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93446/testReport)** for PR 21758 at commit [`c16a47f`](https://github.com/apache/spark/commit/c16a47f0d15998133b9d61d8df5310f1f66b11b0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21758 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93445/ Test FAILed.
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21758 Merged build finished. Test FAILed.
[GitHub] spark issue #21650: [SPARK-24624][SQL][PYTHON] Support mixture of Python UDF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21650 **[Test build #93451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93451/testReport)** for PR 21650 at commit [`4c9c007`](https://github.com/apache/spark/commit/4c9c007858aef65c2c190b35673404dd61279369).
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21758 **[Test build #93445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93445/testReport)** for PR 21758 at commit [`9ae56d1`](https://github.com/apache/spark/commit/9ae56d12b580f0a3cecb90dfe275d067e8f3a7f3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
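For readers following the CI noise around PR 21758: barrier execution mode (SPARK-24795) launches all tasks of a stage together and lets them synchronize at a barrier before proceeding. A conceptual, stdlib-only sketch of the synchronization pattern using `threading.Barrier` (this illustrates the idea only; it is not Spark's `BarrierTaskContext` API):

```python
import threading

NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)
results = []

def task(task_id):
    # Phase 1: each task does some local work...
    results.append(task_id)
    # ...then blocks until ALL tasks reach this point, analogous to
    # calling barrier() inside a barrier-mode task in Spark.
    barrier.wait()
    # Phase 2 only begins once every task has finished phase 1.

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 2, 3]
```

The key property is all-or-nothing scheduling: no task passes the barrier until every task has reached it, which is what distinguishes barrier mode from ordinary independent task scheduling.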
[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 @cloud-fan, any update on merging this?
[GitHub] spark issue #21845: [SPARK-24886][INFRA] Fix the testing script to increase ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21845 Of course I am, as usual. I have actually already been taking care of it. The thing is, tests keep being added even when they duplicate something that already exists, which feels a bit excessive so far. In general, I don't think there are particular tests that take especially long, IMHO. What we should do is put some effort into deduplicating the tests.
[GitHub] spark pull request #21805: [SPARK-24850][SQL] fix str representation of Cach...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21805
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r204479024

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -2095,9 +2095,11 @@ def toPandas(self):
                 _check_dataframe_localize_timestamps
             import pyarrow
-            tables = self._collectAsArrow()
-            if tables:
-                table = pyarrow.concat_tables(tables)
+            # Collect un-ordered list of batches, and list of correct order indices
+            batches, batch_order = self._collectAsArrow()
+            if batches:
--- End diff --
Sure, I was playing around with this being an iterator, but I will change it since it is a list now.
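The change under review collects Arrow record batches in completion order, together with a list of indices describing the correct order. The reassembly step can be sketched with plain lists; this is one plausible reading of the two lists in the diff, with Arrow itself left out:

```python
def reorder_batches(batches, batch_order):
    # `batches` arrive in whatever order executors finished;
    # batch_order[i] names the incoming batch that belongs at
    # position i of the final, correctly ordered result.
    return [batches[i] for i in batch_order]

# Batches for partitions 0..2 happened to arrive as 2, 0, 1:
batches = ['batch2', 'batch0', 'batch1']
batch_order = [1, 2, 0]  # position 0 -> incoming index 1, etc.
print(reorder_batches(batches, batch_order))  # ['batch0', 'batch1', 'batch2']
```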
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21805 LGTM. Thanks! Merged to master.
[GitHub] spark pull request #21840: [WIP] New copy() method for Column of StructType
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/21840#discussion_r204476440

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -3858,3 +3858,29 @@ object ArrayUnion {
     new GenericArrayData(arrayBuffer)
   }
 }
+
+case class StructCopy(
+    struct: Expression,
+    fieldName: String,
+    fieldValue: Expression) extends Expression with CodegenFallback {
+
+  override def children: Seq[Expression] = Seq(struct, fieldValue)
+  override def nullable: Boolean = struct.nullable
+
+  lazy val fieldIndex = struct.dataType.asInstanceOf[StructType].fieldIndex(fieldName)
+
+  override def dataType: DataType = {
+    val structType = struct.dataType.asInstanceOf[StructType]
+    val field = structType.fields(fieldIndex).copy(dataType = fieldValue.dataType)
+
+    structType.copy(fields = structType.fields.updated(fieldIndex, field))
+  }
+
+  override def eval(input: InternalRow): Any = {
+    val newFieldValue = fieldValue.eval(input)
+    val structValue = struct.eval(input).asInstanceOf[GenericInternalRow]
--- End diff --
You cannot assume this, and you also cannot update the row in place. You will need to copy the row, I am afraid.
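The review point, that the evaluated row must be copied rather than updated in place, can be illustrated with a small immutable-update sketch in plain Python. Tuples stand in for `InternalRow`, and `struct_copy` is a hypothetical helper, not the PR's code:

```python
def struct_copy(row, field_index, new_value):
    # Never mutate the incoming row: other operators may still hold a
    # reference to it. Build a fresh copy with exactly one field replaced.
    values = list(row)
    values[field_index] = new_value
    return tuple(values)

original = ('alice', 30)
updated = struct_copy(original, 1, 31)
print(updated)   # ('alice', 31)
print(original)  # ('alice', 30) -- the source row is untouched
```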
[GitHub] spark issue #21826: [SPARK-24872] Remove the symbol "||" of the "OR"...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21826 No, we can't, because you can still use string concat in filters, e.g. `colA || colB == "ab"`. What is "||" here?
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user maryannxue commented on the issue: https://github.com/apache/spark/pull/21821 I just ran a test with the once-strategy check and found that a few batches/rules do not stop, e.g. AggregatePushDown, "Convert to Spark client exec", PartitionPruning. I believe most of them are edge rules, and none of them are analyzer rules. Still, let's keep this fix, just to be on the safe side.
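A "once-strategy check" of the kind mentioned here verifies that a rule intended to run in a once-only batch is idempotent: applying it a second time must not change the plan further. A minimal sketch of such a check over a toy rewrite rule (the tuple-based plan representation is invented for illustration; it is not Catalyst):

```python
def remove_double_negation(plan):
    # Toy rewrite rule: NOT(NOT(x)) -> x, applied recursively.
    if isinstance(plan, tuple) and plan[0] == 'NOT':
        inner = remove_double_negation(plan[1])
        if isinstance(inner, tuple) and inner[0] == 'NOT':
            return inner[1]
        return ('NOT', inner)
    return plan

def check_once_strategy(rule, plan):
    # The rule is safe for a once-only batch iff a second application
    # is a no-op on the already-rewritten plan.
    once = rule(plan)
    twice = rule(once)
    return once == twice

plan = ('NOT', ('NOT', ('NOT', 'x')))
print(remove_double_negation(plan))                        # ('NOT', 'x')
print(check_once_strategy(remove_double_negation, plan))   # True
```

A rule that fails this check, like the batches named in the comment, keeps producing new plans on re-application and therefore cannot safely be scheduled as run-once.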
[GitHub] spark issue #21845: [SPARK-24886][INFRA] Fix the testing script to increase ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21845 This helps, but it is not sustainable to keep increasing the threshold. What we need to do is to look at test time distribution and figure out what test suites are unnecessarily long and actually cut down the time there. @HyukjinKwon Would you be interested in doing that?
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21805 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93444/ Test PASSed.
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21805 Merged build finished. Test PASSed.
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21805 **[Test build #93444 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93444/testReport)** for PR 21805 at commit [`2a21c80`](https://github.com/apache/spark/commit/2a21c80f0b751277e8ac975279474378c765af6c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21847: [SPARK-24855][SQL][EXTERNAL][WIP]: Built-in AVRO support...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21847 Can one of the admins verify this patch?
[GitHub] spark pull request #21847: [SPARK-24855][SQL][EXTERNAL][WIP]: Built-in AVRO ...
GitHub user lindblombr opened a pull request: https://github.com/apache/spark/pull/21847 [SPARK-24855][SQL][EXTERNAL][WIP]: Built-in AVRO support should support specified schema on write

## What changes were proposed in this pull request?

Allows the `avroSchema` option to be specified on write, letting a user supply a schema in cases where this is required. A trivial use case is reading in an Avro dataset, making some small adjustments to a column or columns, and writing it out using the same schema. Implicit schema creation from a SQL struct results in a schema that, while functionally similar for the most part, is not necessarily compatible. This also allows the `fixed` field type to be used for records written with a specified `avroSchema`.

## How was this patch tested?

Unit tests in AvroSuite are extended to write out the included sample datasets using the new `avroSchema` option and verify the readability of records. Ideally, tests should be added for each `Schema.Type` to ensure records are written with full fidelity.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lindblombr/spark specify_schema_on_write

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21847.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21847

commit 30fc1ae98a2c29824b83e956c2cbe87e9bbb3749
Author: Brian Lindblom
Date: 2018-07-21T20:43:26Z

    SPARK-24855: Built-in AVRO support should support specified schema on write
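For context, an Avro schema is plain JSON, and the `fixed` type mentioned above has no direct counterpart in a SQL struct, which is why an explicit `avroSchema` is needed on write to preserve it. A hypothetical example schema using a `fixed` field (illustrative only, not taken from this PR):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}}
  ]
}
```

Writing this record through a schema inferred from the SQL struct would instead emit `checksum` as `bytes`, which is readable but not byte-for-byte schema-compatible with the original `fixed` definition.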
[GitHub] spark issue #21650: [SPARK-24624][SQL][PYTHON] Support mixture of Python UDF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21650 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1238/ Test PASSed.
[GitHub] spark issue #21650: [SPARK-24624][SQL][PYTHON] Support mixture of Python UDF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21650 Merged build finished. Test PASSed.
[GitHub] spark issue #21650: [SPARK-24624][SQL][PYTHON] Support mixture of Python UDF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21650 **[Test build #93450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93450/testReport)** for PR 21650 at commit [`78f2ebf`](https://github.com/apache/spark/commit/78f2ebf3b11fe8849fe0d41300f74319ca174d42).
[GitHub] spark issue #17174: [SPARK-19145][SQL] Timestamp to String casting is slowin...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17174 @tanejagagan Can you update?
[GitHub] spark issue #19434: [SPARK-21785][SQL]Support create table from a parquet fi...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19434 @CrazyJacky Can you close this for now, since it hasn't been active for a long time?
[GitHub] spark issue #21839: [SPARK-24339][SQL] Prunes the unused columns from child ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21839 LGTM
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21805 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93443/ Test PASSed.
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21805 Merged build finished. Test PASSed.
[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/19773 I'll resolve the conflicts today; thanks for pinging me.
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21805 **[Test build #93443 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93443/testReport)** for PR 21805 at commit [`de3f63e`](https://github.com/apache/spark/commit/de3f63e61fb45b61bc6544b6b5487e7f276f7ac7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19745: [SPARK-2926][Core][Follow Up] Sort shuffle reader...
Github user xuanyuanking closed the pull request at: https://github.com/apache/spark/pull/19745
[GitHub] spark issue #19745: [SPARK-2926][Core][Follow Up] Sort shuffle reader for Sp...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/19745 No problem.
[GitHub] spark issue #20854: [SPARK-23712][SQL] Interpreted UnsafeRowJoiner [WIP]
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20854 @hvanhovell What's the status of this pr?
[GitHub] spark issue #21839: [SPARK-24339][SQL] Prunes the unused columns from child ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21839 Merged build finished. Test PASSed.
[GitHub] spark issue #21839: [SPARK-24339][SQL] Prunes the unused columns from child ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21839 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1237/ Test PASSed.
[GitHub] spark pull request #21846: [SPARK-24887][SQL]Avro: use SerializableConfigura...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21846
[GitHub] spark issue #21839: [SPARK-24339][SQL] Prunes the unused columns from child ...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/21839 @gatorsmile Thanks for your advice; added a unit test in ScriptTransformationSuite.
[GitHub] spark issue #20699: [SPARK-23544][SQL]Remove redundancy ShuffleExchange in t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20699 **[Test build #93449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93449/testReport)** for PR 20699 at commit [`be96e39`](https://github.com/apache/spark/commit/be96e390c87ecf1550a4297a92a68497caaedca4).
[GitHub] spark pull request #21650: [SPARK-24624][SQL][PYTHON] Support mixture of Pyt...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/21650#discussion_r204447950

--- Diff: python/pyspark/sql/tests.py ---
[GitHub] spark pull request #21839: [SPARK-24339][SQL] Prunes the unused columns from...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21839#discussion_r204447671

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -450,13 +450,16 @@ object ColumnPruning extends Rule[LogicalPlan] {
     case d @ DeserializeToObject(_, _, child) if (child.outputSet -- d.references).nonEmpty =>
       d.copy(child = prunedChild(child, d.references))
 
-    // Prunes the unused columns from child of Aggregate/Expand/Generate
+    // Prunes the unused columns from child of Aggregate/Expand/Generate/ScriptTransformation
     case a @ Aggregate(_, _, child) if (child.outputSet -- a.references).nonEmpty =>
       a.copy(child = prunedChild(child, a.references))
     case f @ FlatMapGroupsInPandas(_, _, _, child) if (child.outputSet -- f.references).nonEmpty =>
       f.copy(child = prunedChild(child, f.references))
     case e @ Expand(_, _, child) if (child.outputSet -- e.references).nonEmpty =>
       e.copy(child = prunedChild(child, e.references))
+    case s @ ScriptTransformation(_, _, _, child, _)
+      if (child.outputSet -- s.references).nonEmpty =>
--- End diff --
Thanks, fixed in 2cf131f.
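The rule being extended here, ColumnPruning, inserts a projection that keeps only the columns the parent operator actually references. The underlying bookkeeping is plain set arithmetic, sketched below in Python (names like `pruned_child` mirror the diff but the implementation is illustrative, not Catalyst's):

```python
def pruned_child(child_output, parent_references):
    # Mirrors the guard `(child.outputSet -- s.references).nonEmpty`:
    # if the child produces columns the parent never references,
    # project them away, preserving the child's column order.
    unused = set(child_output) - set(parent_references)
    if unused:
        return [c for c in child_output if c in parent_references]
    return list(child_output)

child_output = ['a', 'b', 'c', 'd']
references = {'a', 'c'}
print(pruned_child(child_output, references))  # ['a', 'c']
```

For a ScriptTransformation node this means the external script only receives the input columns it actually consumes, instead of the child's full output.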
[GitHub] spark issue #21839: [SPARK-24339][SQL] Prunes the unused columns from child ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21839 **[Test build #93448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93448/testReport)** for PR 21839 at commit [`2cf131f`](https://github.com/apache/spark/commit/2cf131fe1b6af368a10964ddb6067b1b52898420).
[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19773 @xuanyuanking Any update?
[GitHub] spark issue #20699: [SPARK-23544][SQL]Remove redundancy ShuffleExchange in t...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20699 retest this please
[GitHub] spark pull request #21764: [SPARK-24802][SQL] Add a new config for Optimizat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21764
[GitHub] spark issue #19745: [SPARK-2926][Core][Follow Up] Sort shuffle reader for Sp...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19745 @xuanyuanking Can you close this for now because it's not active for a long time?
[GitHub] spark issue #21764: [SPARK-24802][SQL] Add a new config for Optimization Rul...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21764 Thanks! Merged to master.
[GitHub] spark issue #15970: [SPARK-18134][SQL] Comparable MapTypes [POC]
Github user maropu commented on the issue: https://github.com/apache/spark/pull/15970 @hvanhovell Do we still need to keep this pr open? Either way, we need rework based on this pr. If so, can you close this for now?
[GitHub] spark issue #21821: [SPARK-24867] [SQL] Add AnalysisBarrier to DataFrameWrit...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21821 @hvanhovell The question is whether `HandleNullInputsForUDF` is the only rule that caused the issue. If not, we still need to add an AnalysisBarrier.
[GitHub] spark issue #18697: [SPARK-16683][SQL] Repeated joins to same table can leak...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18697 @aray Can you close this for now because it's not active for a long time? (I'm not sure the current master still has this issue..., so you should check it first)
[GitHub] spark issue #15334: [SPARK-10367][SQL] Support Parquet logical type INTERVAL
Github user maropu commented on the issue: https://github.com/apache/spark/pull/15334 oh, I noticed the jira ticket has already been closed as "Later", so can you close this? @dilipbiswal
[GitHub] spark issue #15334: [SPARK-10367][SQL] Support Parquet logical type INTERVAL
Github user maropu commented on the issue: https://github.com/apache/spark/pull/15334 IIUC we have no plan to expose Interval types now, so can we close this for now? cc: @gatorsmile
[GitHub] spark issue #18954: [SPARK-17654] [SQL] Enable populating hive bucketed tabl...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18954 @tejasapatil Can you close this for now because it's not active for a long time?
[GitHub] spark issue #15071: [SPARK-17517][SQL]Improve generated Code for BroadcastHa...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/15071 @yaooqinn Can you close this because it's not been active for a long time?
[GitHub] spark issue #21811: [SPARK-24801][CORE] Avoid memory waste by empty byte[] a...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21811 > Does it make sense to release byteChannel at deallocate()? you could, just to let GC kick in a *bit* earlier, but I don't think it's going to make a big difference. (Netty's ByteBufs must be released since they may be owned by netty's internal pools and never gc'ed)
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21805 LGTM
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21102 I want to hear the opinions of others about the order of the result. cc @ueshin
[GitHub] spark issue #21811: [SPARK-24801][CORE] Avoid memory waste by empty byte[] a...
Github user squito commented on the issue: https://github.com/apache/spark/pull/21811 > I like it, but this will still create the byte channel right? is there a way to reuse it? we could create a pool, though management becomes a bit more complex. would you ever shrink the pool, or would it always stay the same size? I guess it shouldn't grow more than 1 byte array per io thread. I'd rather get this simple fix in first before doing anything more complicated ...
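The pool floated in the comment above is straightforward to sketch. The following is a hypothetical Python model (not Spark or Netty code; the class name and cap are illustrative): buffers are reused instead of reallocated, and the pool is bounded at roughly one buffer per I/O thread so excess buffers can be garbage-collected.

```python
# Hypothetical bounded buffer pool: reuse released byte arrays up to a cap,
# drop (and let GC reclaim) anything beyond the cap.
from collections import deque

class BufferPool:
    def __init__(self, buf_size, max_buffers):
        self.buf_size = buf_size        # size of each pooled buffer
        self.max_buffers = max_buffers  # e.g. one per I/O thread
        self._free = deque()

    def acquire(self):
        # Reuse a pooled buffer if available; otherwise allocate a fresh one.
        return self._free.popleft() if self._free else bytearray(self.buf_size)

    def release(self, buf):
        # Keep at most max_buffers around; extras are simply dropped.
        if len(self._free) < self.max_buffers:
            self._free.append(buf)

pool = BufferPool(buf_size=4096, max_buffers=8)
b = pool.acquire()
pool.release(b)
assert pool.acquire() is b  # the released buffer is reused, not reallocated
```

This also makes the reviewer's concern concrete: the policy questions (shrinking, sizing) live entirely in `release`, which is why a pool is more to maintain than the simple fix.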
[GitHub] spark issue #21828: Update regression.py
Github user woodthom2 commented on the issue: https://github.com/apache/spark/pull/21828 OK thank you. I will close
[GitHub] spark pull request #21828: Update regression.py
Github user woodthom2 closed the pull request at: https://github.com/apache/spark/pull/21828
[GitHub] spark pull request #21650: [SPARK-24624][SQL][PYTHON] Support mixture of Pyt...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/21650#discussion_r204429892 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala --- @@ -94,36 +95,59 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] { */ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper { - private def hasPythonUDF(e: Expression): Boolean = { + private def hasScalarPythonUDF(e: Expression): Boolean = { e.find(PythonUDF.isScalarPythonUDF).isDefined } - private def canEvaluateInPython(e: PythonUDF): Boolean = { -e.children match { - // single PythonUDF child could be chained and evaluated in Python - case Seq(u: PythonUDF) => canEvaluateInPython(u) - // Python UDF can't be evaluated directly in JVM - case children => !children.exists(hasPythonUDF) + private def canEvaluateInPython(e: PythonUDF, evalType: Int): Boolean = { +if (e.evalType != evalType) { + false +} else { + e.children match { +// single PythonUDF child could be chained and evaluated in Python +case Seq(u: PythonUDF) => canEvaluateInPython(u, evalType) +// Python UDF can't be evaluated directly in JVM +case children => !children.exists(hasScalarPythonUDF) + } } } - private def collectEvaluatableUDF(expr: Expression): Seq[PythonUDF] = expr match { -case udf: PythonUDF if PythonUDF.isScalarPythonUDF(udf) && canEvaluateInPython(udf) => Seq(udf) -case e => e.children.flatMap(collectEvaluatableUDF) + private def collectEvaluableUDF(expr: Expression, evalType: Int): Seq[PythonUDF] = expr match { +case udf: PythonUDF if PythonUDF.isScalarPythonUDF(udf) && canEvaluateInPython(udf, evalType) => + Seq(udf) +case e => e.children.flatMap(collectEvaluableUDF(_, evalType)) + } + + /** + * Collect evaluable UDFs from the current node. + * + * This function collects Python UDFs or Scalar Python UDFs from expressions of the input node, + * and returns a list of UDFs of the same eval type. 
+ * + * If expressions contain both UDFs eval types, this function will only return Python UDFs. + * + * The caller should call this function multiple times until all evaluable UDFs are collected. + */ + private def collectEvaluableUDFs(plan: SparkPlan): Seq[PythonUDF] = { +val pythonUDFs = + plan.expressions.flatMap(collectEvaluableUDF(_, PythonEvalType.SQL_BATCHED_UDF)) + +if (pythonUDFs.isEmpty) { + plan.expressions.flatMap(collectEvaluableUDF(_, PythonEvalType.SQL_SCALAR_PANDAS_UDF)) +} else { + pythonUDFs --- End diff -- What you said makes sense, and that was actually my first attempt, but it ended up being pretty complicated. The issue is that it is hard to find the UDFs in a single traversal of the expression tree, because we need to pass the evalType down to every subtree, and the result of one subtree can affect the result of another (i.e., if we find one type of UDF in one subtree, we need to pass that type to all other subtrees because they must agree on evalType); this makes the code more complicated... Another way is to do two traversals, where the first traversal looks for the eval type and the second looks for UDFs of that eval type, but this isn't much different from what I have now in terms of efficiency. I tried these approaches and found the current way the easiest to implement and least likely to have bugs. WDYT?
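The two-pass strategy under discussion can be illustrated with a small stand-alone sketch. Everything here is illustrative rather than pyspark internals: the expression "tree" is nested dicts, the eval-type constants are stand-ins, and the chaining rules of `canEvaluateInPython` are omitted. The point it shows is that plain Python UDFs are collected first, and Pandas UDFs only when no plain ones exist, so each call returns UDFs of a single eval type.

```python
# Illustrative model of "collect one eval type per pass".
SQL_BATCHED_UDF = 100          # plain Python UDF
SQL_SCALAR_PANDAS_UDF = 200    # Pandas (vectorized) UDF

def collect(expr, eval_type):
    """Collect UDF nodes of one eval type from a single expression tree."""
    if expr.get("eval_type") == eval_type:
        return [expr]
    return [u for c in expr.get("children", []) for u in collect(c, eval_type)]

def collect_evaluable_udfs(expressions):
    """First pass: plain Python UDFs; fall back to Pandas UDFs only if none."""
    batched = [u for e in expressions for u in collect(e, SQL_BATCHED_UDF)]
    if batched:
        return batched
    return [u for e in expressions for u in collect(e, SQL_SCALAR_PANDAS_UDF)]

# A projection mixing both eval types yields only the plain Python UDFs;
# a later call over the remaining expressions would pick up the Pandas ones.
f1 = {"name": "f1", "eval_type": SQL_BATCHED_UDF, "children": []}
f2 = {"name": "f2", "eval_type": SQL_SCALAR_PANDAS_UDF, "children": []}
udfs = collect_evaluable_udfs([f1, f2])
assert [u["name"] for u in udfs] == ["f1"]
```

This mirrors why the caller must invoke the function repeatedly: each invocation extracts one homogeneous batch, and mixed projections need several rounds.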
[GitHub] spark issue #21635: [SPARK-24594][YARN] Introducing metrics for YARN
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21635 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93439/ Test PASSed.
[GitHub] spark issue #21635: [SPARK-24594][YARN] Introducing metrics for YARN
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21635 Merged build finished. Test PASSed.
[GitHub] spark issue #21635: [SPARK-24594][YARN] Introducing metrics for YARN
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21635 **[Test build #93439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93439/testReport)** for PR 21635 at commit [`0b86788`](https://github.com/apache/spark/commit/0b86788e7ec7b367c779cb5517f9dd294f99dd4b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21811: [SPARK-24801][CORE] Avoid memory waste by empty byte[] a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21811 **[Test build #4221 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4221/testReport)** for PR 21811 at commit [`8b46534`](https://github.com/apache/spark/commit/8b465341314b1cbbef726950d589d5fb49db341b).
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21653 **[Test build #93447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93447/testReport)** for PR 21653 at commit [`b6585da`](https://github.com/apache/spark/commit/b6585da0f137d3d3675925368c4668c884de900c).
[GitHub] spark issue #21653: [SPARK-13343] speculative tasks that didn't commit shoul...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21653 test this please
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21758 **[Test build #93446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93446/testReport)** for PR 21758 at commit [`c16a47f`](https://github.com/apache/spark/commit/c16a47f0d15998133b9d61d8df5310f1f66b11b0).
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21758 Merged build finished. Test PASSed.
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21758 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1236/ Test PASSed.
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21758 **[Test build #93445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93445/testReport)** for PR 21758 at commit [`9ae56d1`](https://github.com/apache/spark/commit/9ae56d12b580f0a3cecb90dfe275d067e8f3a7f3).
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21758 Merged build finished. Test PASSed.
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21758 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1235/ Test PASSed.
[GitHub] spark pull request #21758: [SPARK-24795][CORE] Implement barrier execution m...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21758#discussion_r204394561 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1411,6 +1420,76 @@ class DAGScheduler( } } + case failure: TaskFailedReason if task.isBarrier => +// Also handle the task failed reasons here. +failure match { + case Resubmitted => +handleResubmittedFailure(task, stage) + + case _ => // Do nothing. +} + +// Always fail the current stage and retry all the tasks when a barrier task fail. +val failedStage = stageIdToStage(task.stageId) +logInfo(s"Marking $failedStage (${failedStage.name}) as failed due to a barrier task " + + "failed.") +val message = s"Stage failed because barrier task $task finished unsuccessfully. " + + s"${failure.toErrorString}" --- End diff -- sure, it can be just `failure.toErrorString`, no need to wrap it into `s"..."`
[GitHub] spark pull request #21758: [SPARK-24795][CORE] Implement barrier execution m...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/21758#discussion_r204391134 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -1411,6 +1420,76 @@ class DAGScheduler( } } + case failure: TaskFailedReason if task.isBarrier => +// Also handle the task failed reasons here. +failure match { + case Resubmitted => +handleResubmittedFailure(task, stage) + + case _ => // Do nothing. +} + +// Always fail the current stage and retry all the tasks when a barrier task fail. +val failedStage = stageIdToStage(task.stageId) +logInfo(s"Marking $failedStage (${failedStage.name}) as failed due to a barrier task " + + "failed.") +val message = s"Stage failed because barrier task $task finished unsuccessfully. " + + s"${failure.toErrorString}" --- End diff -- Why is it not needed? Could you expand more on this?
[GitHub] spark pull request #21758: [SPARK-24795][CORE] Implement barrier execution m...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/21758#discussion_r204390597 --- Diff: core/src/main/scala/org/apache/spark/BarrierTaskContext.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import org.apache.spark.annotation.{Experimental, Since} + +/** A [[TaskContext]] with extra info and tooling for a barrier stage. */ +trait BarrierTaskContext extends TaskContext { + + /** + * :: Experimental :: + * Sets a global barrier and waits until all tasks in this stage hit this barrier. Similar to + * MPI_Barrier function in MPI, the barrier() function call blocks until all tasks in the same + * stage have reached this routine. + */ + @Experimental + @Since("2.4.0") + def barrier(): Unit + + /** + * :: Experimental :: + * Returns the all task infos in this barrier stage, the task infos are ordered by partitionId. --- End diff -- The major reason is that each tasks within the same barrier stage may need to communicate with each other, we order the task infos by partitionId so a task can find its peer tasks by index. 
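The `barrier()` contract described above (block until every task in the stage has arrived, as with MPI_Barrier) has a close analogue in ordinary threading libraries. The sketch below uses Python's `threading.Barrier` purely as an illustration; Spark's actual implementation is RPC-based and internal, and the names here are made up for the example.

```python
# Rough analogue of BarrierTaskContext.barrier() using threading.Barrier:
# every "task" blocks at the barrier until all NUM_TASKS tasks reach it.
import threading

NUM_TASKS = 4                       # tasks in the hypothetical barrier stage
barrier = threading.Barrier(NUM_TASKS)
arrived, seen_at_release = [], []

def task(partition_id):
    arrived.append(partition_id)    # register arrival, like calling barrier()
    barrier.wait()                  # block until all NUM_TASKS tasks arrive
    # Past the barrier, every task must observe all arrivals.
    seen_at_release.append(len(arrived))

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each task saw the full stage arrive before it was released.
assert seen_at_release == [NUM_TASKS] * NUM_TASKS
```

Ordering the task infos by partitionId, as the comment explains, then gives each released task a stable index for addressing its peers.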
[GitHub] spark pull request #16374: [SPARK-18925][STREAMING] Reduce memory usage of m...
Github user vpchelko closed the pull request at: https://github.com/apache/spark/pull/16374
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21805 **[Test build #93444 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93444/testReport)** for PR 21805 at commit [`2a21c80`](https://github.com/apache/spark/commit/2a21c80f0b751277e8ac975279474378c765af6c).
[GitHub] spark issue #21789: [SPARK-24829][STS]In Spark Thrift Server, CAST AS FLOAT ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21789 **[Test build #93440 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93440/testReport)** for PR 21789 at commit [`05cbdc8`](https://github.com/apache/spark/commit/05cbdc8cf25b50b832febe73e0963885431679a6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21789: [SPARK-24829][STS]In Spark Thrift Server, CAST AS FLOAT ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21789 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93440/ Test FAILed.
[GitHub] spark issue #21789: [SPARK-24829][STS]In Spark Thrift Server, CAST AS FLOAT ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21789 Merged build finished. Test FAILed.
[GitHub] spark issue #21789: [SPARK-24829][STS]In Spark Thrift Server, CAST AS FLOAT ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21789 Merged build finished. Test FAILed.
[GitHub] spark issue #21789: [SPARK-24829][STS]In Spark Thrift Server, CAST AS FLOAT ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21789 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93442/ Test FAILed.
[GitHub] spark issue #21789: [SPARK-24829][STS]In Spark Thrift Server, CAST AS FLOAT ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21789 **[Test build #93442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93442/testReport)** for PR 21789 at commit [`05cbdc8`](https://github.com/apache/spark/commit/05cbdc8cf25b50b832febe73e0963885431679a6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21846: [SPARK-24887][SQL]Avro: use SerializableConfiguration in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21846 Merged build finished. Test PASSed.
[GitHub] spark issue #21846: [SPARK-24887][SQL]Avro: use SerializableConfiguration in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21846 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93441/ Test PASSed.
[GitHub] spark issue #21846: [SPARK-24887][SQL]Avro: use SerializableConfiguration in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21846 **[Test build #93441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93441/testReport)** for PR 21846 at commit [`a8dd96d`](https://github.com/apache/spark/commit/a8dd96df3eef66bab29f83b141a67386676603fa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21805: [SPARK-24850][SQL] fix str representation of Cach...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21805#discussion_r204379093 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala --- @@ -207,4 +207,7 @@ case class InMemoryRelation( } override protected def otherCopyArgs: Seq[AnyRef] = Seq(statsOfPlanToCache) + + override def simpleString: String = s"InMemoryRelation(${output}, ${cacheBuilder.storageLevel})" + --- End diff -- nit: remove the blank line
[GitHub] spark pull request #21805: [SPARK-24850][SQL] fix str representation of Cach...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21805#discussion_r204378903 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetCacheSuite.scala --- @@ -206,4 +206,20 @@ class DatasetCacheSuite extends QueryTest with SharedSQLContext with TimeLimits // first time use, load cache checkDataset(df5, Row(10)) } + + test("SPARK-24850 InMemoryRelation string representation does not include cached plan") { +val dummyQueryExecution = spark.range(0, 1).toDF().queryExecution +val inMemoryRelation = InMemoryRelation( + true, + 1000, + StorageLevel.MEMORY_ONLY, + dummyQueryExecution.sparkPlan, + Some("test-relation"), + dummyQueryExecution.logical) + + assert(!inMemoryRelation.simpleString.contains(dummyQueryExecution.sparkPlan.toString)) +assert(inMemoryRelation.simpleString == + s"InMemoryRelation(${inMemoryRelation.output}," + + " StorageLevel(memory, deserialized, 1 replicas))") + } --- End diff -- How about just comparing explain results? ``` val df = Seq((1, 2)).toDF("a", "b").cache val outputStream = new java.io.ByteArrayOutputStream() Console.withOut(outputStream) { df.explain(false) } assert(outputStream.toString.replaceAll("#\\d+", "#x").contains( "InMemoryRelation [a#x, b#x], StorageLevel(disk, memory, deserialized, 1 replicas)")) ```
[GitHub] spark pull request #21805: [SPARK-24850][SQL] fix str representation of Cach...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21805#discussion_r204378696 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala --- @@ -207,4 +207,7 @@ case class InMemoryRelation( } override protected def otherCopyArgs: Seq[AnyRef] = Seq(statsOfPlanToCache) + + override def simpleString: String = s"InMemoryRelation(${output}, ${cacheBuilder.storageLevel})" --- End diff -- How about `s"InMemoryRelation [${Utils.truncatedString(output, ", ")}], ${cacheBuilder.storageLevel}"`?
[GitHub] spark issue #21805: [SPARK-24850][SQL] fix str representation of CachedRDDBu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21805 **[Test build #93443 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93443/testReport)** for PR 21805 at commit [`de3f63e`](https://github.com/apache/spark/commit/de3f63e61fb45b61bc6544b6b5487e7f276f7ac7).
[GitHub] spark issue #21846: [SPARK-24887][SQL]Avro: use SerializableConfiguration in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21846 **[Test build #93441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93441/testReport)** for PR 21846 at commit [`a8dd96d`](https://github.com/apache/spark/commit/a8dd96df3eef66bab29f83b141a67386676603fa).
[GitHub] spark issue #21846: [SPARK-24887][SQL]Avro: use SerializableConfiguration in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21846 Merged build finished. Test PASSed.
[GitHub] spark issue #21789: [SPARK-24829][STS]In Spark Thrift Server, CAST AS FLOAT ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21789 **[Test build #93440 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93440/testReport)** for PR 21789 at commit [`05cbdc8`](https://github.com/apache/spark/commit/05cbdc8cf25b50b832febe73e0963885431679a6).
[GitHub] spark issue #21846: [SPARK-24887][SQL]Avro: use SerializableConfiguration in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21846 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1234/ Test PASSed.
[GitHub] spark issue #21789: [SPARK-24829][STS]In Spark Thrift Server, CAST AS FLOAT ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21789 **[Test build #93442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93442/testReport)** for PR 21789 at commit [`05cbdc8`](https://github.com/apache/spark/commit/05cbdc8cf25b50b832febe73e0963885431679a6).