[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156874375 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -154,6 +154,44 @@ class CodegenContext { val mutableStates: mutable.ArrayBuffer[(String, String, String)] = mutable.ArrayBuffer.empty[(String, String, String)] + // An map keyed by mutable states' types holds the status of mutableStateArray + var mutableStateArrayMap: mutable.Map[String, MutableStateArrays] = +mutable.Map.empty[String, MutableStateArrays] + + // An map keyed by mutable states' types holds the current name of the mutableStateArray + // into which state of the given key will be compacted + var mutableStateArrayCurrentNames: mutable.Map[String, String] = +mutable.Map.empty[String, String] + + // An array holds the code that will initialize each element of the mutableStateArray + var mutableStateArrayInitCodes: mutable.ArrayBuffer[String] = +mutable.ArrayBuffer.empty[String] + + // Holding names and current index of mutableStateArrays for a certain type + class MutableStateArrays { +val arrayNames = mutable.ListBuffer.empty[String] +createNewArray() + +private[this] var currentIndex = 0 + +private def createNewArray() = arrayNames.append(freshName("mutableStateArray")) + +def getCurrentIndex: Int = { currentIndex } --- End diff -- nit `def getCurrentIndex: Int = currentIndex` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19892: [SPARK-20542][FollowUp][PySpark] Bucketizer support mult...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19892 **[Test build #84898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84898/testReport)** for PR 19892 at commit [`5efc94e`](https://github.com/apache/spark/commit/5efc94e8c5a28ab908c61140fdfd00e6e242a8db).
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156874146 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -154,6 +154,44 @@ class CodegenContext { val mutableStates: mutable.ArrayBuffer[(String, String, String)] = mutable.ArrayBuffer.empty[(String, String, String)] + // An map keyed by mutable states' types holds the status of mutableStateArray + var mutableStateArrayMap: mutable.Map[String, MutableStateArrays] = +mutable.Map.empty[String, MutableStateArrays] + + // An map keyed by mutable states' types holds the current name of the mutableStateArray + // into which state of the given key will be compacted + var mutableStateArrayCurrentNames: mutable.Map[String, String] = --- End diff -- it's not needed
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156874184 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -154,6 +154,44 @@ class CodegenContext { val mutableStates: mutable.ArrayBuffer[(String, String, String)] = mutable.ArrayBuffer.empty[(String, String, String)] + // An map keyed by mutable states' types holds the status of mutableStateArray + var mutableStateArrayMap: mutable.Map[String, MutableStateArrays] = +mutable.Map.empty[String, MutableStateArrays] + + // An map keyed by mutable states' types holds the current name of the mutableStateArray + // into which state of the given key will be compacted + var mutableStateArrayCurrentNames: mutable.Map[String, String] = +mutable.Map.empty[String, String] + + // An array holds the code that will initialize each element of the mutableStateArray + var mutableStateArrayInitCodes: mutable.ArrayBuffer[String] = --- End diff -- why var?
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156874096 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -154,6 +154,44 @@ class CodegenContext { val mutableStates: mutable.ArrayBuffer[(String, String, String)] = mutable.ArrayBuffer.empty[(String, String, String)] + // An map keyed by mutable states' types holds the status of mutableStateArray + var mutableStateArrayMap: mutable.Map[String, MutableStateArrays] = --- End diff -- why var?
[GitHub] spark issue #19892: [SPARK-20542][FollowUp][PySpark] Bucketizer support mult...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19892 retest this please
[GitHub] spark issue #19925: [SPARK-22732] Add Structured Streaming APIs to DataSourc...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19925 shall we move these new interfaces to `org.apache.spark.sql.sources.v2.reader/write.streaming` package?
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84897/testReport)** for PR 19811 at commit [`15e967e`](https://github.com/apache/spark/commit/15e967ec278abdae190bd3f0d3e937a54f674e19).
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156872558 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/ContinuousWriter.java --- @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import org.apache.spark.annotation.InterfaceStability; + +/** + * A {@link DataSourceV2Writer} for use with continuous stream processing. --- End diff -- `A variation of ...`?
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156872002 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/MicroBatchReader.java --- @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import org.apache.spark.sql.sources.v2.reader.Offset; +import org.apache.spark.sql.execution.streaming.BaseStreamingSource; + +import java.util.Optional; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this + * interface to indicate they allow micro-batch streaming reads. + */ +public interface MicroBatchReader extends DataSourceV2Reader, BaseStreamingSource {
/** + * Set the desired offset range for read tasks created from this reader. Read tasks will --- End diff -- do you mean this method must be called before `createReadTasks`?
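The ordering question raised here can be illustrated with a small sketch. This is not Spark's actual implementation: the class and the one-task-per-offset logic are hypothetical, and only the idea that the offset range is reader state consumed later by task creation comes from the review.

```java
import java.util.Optional;

// Illustrative sketch: the reader holds the offset range as mutable state,
// so setOffsetRange(...) must run before createReadTasks() for the tasks
// to see a defined range.
final class SketchMicroBatchReader {
    private Optional<Long> start = Optional.empty();
    private Optional<Long> end = Optional.empty();

    void setOffsetRange(Optional<Long> newStart, Optional<Long> newEnd) {
        this.start = newStart;
        this.end = newEnd;
    }

    long createReadTasks() {
        if (!start.isPresent() || !end.isPresent()) {
            throw new IllegalStateException(
                "setOffsetRange must be called before createReadTasks");
        }
        // One task per offset in [start, end) -- illustrative only.
        return end.get() - start.get();
    }
}
```

If such a call-order contract exists, documenting it in the javadoc (as the reviewer hints) spares implementers from discovering it through an exception.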
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156871819 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/MicroBatchReader.java --- @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import org.apache.spark.sql.sources.v2.reader.Offset; +import org.apache.spark.sql.execution.streaming.BaseStreamingSource; + +import java.util.Optional; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this --- End diff -- ditto
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156871677 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousReader.java --- @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import org.apache.spark.sql.sources.v2.reader.PartitionOffset; +import org.apache.spark.sql.execution.streaming.BaseStreamingSource; + +import java.util.Optional; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this --- End diff -- It's not a mix-in interface but a variation on `DataSourceV2Reader`, right?
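The mix-in/variation distinction the reviewer draws can be sketched in a few lines. The names below are simplified stand-ins, not the real API: a mix-in interface adds an optional, orthogonal capability alongside the base reader type, while a variation is a subtype that redefines the reading contract for a particular mode.

```java
// Base reader type (simplified stand-in for the real interface).
interface DataSourceV2Reader { }

// Mix-in style: an orthogonal capability implemented *alongside* the base type.
interface SupportsSomeCapability { }

// Variation style: a subtype for a specific mode, which is what the comment
// says ContinuousReader really is.
interface ContinuousReader extends DataSourceV2Reader { }

// A concrete reader can be a variation and still pick up mix-ins.
class MyContinuousReader implements ContinuousReader, SupportsSomeCapability { }
```

The javadoc wording matters because "mix-in" suggests the interface is optional sugar, whereas a variation is the primary contract for that mode.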
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156871374 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousDataReader.java --- @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import org.apache.spark.sql.sources.v2.reader.PartitionOffset; + +import java.io.IOException; + +/** + * A variation on {@link DataReader} for use with streaming in continuous processing mode. + */ +public interface ContinuousDataReader extends DataReader {
/** + * Get the offset of the current record, or the start offset if no records have been read. + * + * The execution engine will call this method along with get() to keep track of the current --- End diff -- better to use a real java doc link, e.g. `{@link DataReader#get}` --- 
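The suggested fix is a one-word javadoc change: replace the plain-text `get()` with a `{@link}` tag so the javadoc tool resolves it to the real method. A minimal self-contained sketch (the `IntReader` implementation is hypothetical, added only so the interfaces compile and run):

```java
import java.io.IOException;

interface DataReader<T> {
    T get();
}

interface ContinuousDataReader<T> extends DataReader<T> {
    /**
     * Get the offset of the current record, or the start offset if no records
     * have been read.
     *
     * <p>The execution engine will call this method along with
     * {@link DataReader#get} to keep track of the current read position.
     */
    Object getOffset() throws IOException;
}

// Hypothetical implementation, for illustration only.
class IntReader implements ContinuousDataReader<Integer> {
    private int current = 0;

    public Integer get() {
        return current++; // returns the current value, then advances
    }

    public Object getOffset() {
        return current;
    }
}
```

With `{@link DataReader#get}`, IDEs and generated javadoc cross-link to the inherited method, and the reference breaks the build if the method is later renamed, which a plain `get()` in prose would not.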
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156870726 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/MicroBatchWriteSupport.java --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; + +import java.util.Optional; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.execution.streaming.BaseStreamingSink; +import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer; +import org.apache.spark.sql.streaming.OutputMode; +import org.apache.spark.sql.types.StructType; + +/** + * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to + * provide data writing ability and save the data from a microbatch to the data source. + */ +@InterfaceStability.Evolving +public interface MicroBatchWriteSupport extends BaseStreamingSink { + + /** + * Creates an optional {@link DataSourceV2Writer} to save the data to this data source. Data + * sources can return None if there is no writing needed to be done. + * + * @param queryId A unique string for the writing query. It's possible that there are many writing + *queries running at the same time, and the returned {@link DataSourceV2Writer} + *can use this id to distinguish itself from others. + * @param epochId The uniquenumeric ID of the batch within this writing query. This is an --- End diff -- typo: `unique numberic`
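The javadoc being reviewed explains why a writer receives both `queryId` and `epochId`. One common use of that pair, sketched below, is as an idempotence key so a re-executed micro-batch commits at most once. The class is purely illustrative and is not part of the API under review.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a sink combines queryId and epochId into a commit key,
// making retried micro-batches idempotent.
final class IdempotentCommitTracker {
    private final Set<String> committed = new HashSet<>();

    /** Returns true only the first time a given (queryId, epochId) commits. */
    boolean tryCommit(String queryId, long epochId) {
        return committed.add(queryId + "#" + epochId);
    }
}
```

A real sink would persist the committed set (e.g. in a transaction log) rather than keep it in memory, but the key structure is the same.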
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Merged build finished. Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84895/ Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84895/testReport)** for PR 19811 at commit [`49119a9`](https://github.com/apache/spark/commit/49119a9de53cd7bb2bf91df6b691358da22e1b00). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19975 Merged build finished. Test PASSed.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19975 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84893/ Test PASSed.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19975 **[Test build #84893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84893/testReport)** for PR 19975 at commit [`1ab78ed`](https://github.com/apache/spark/commit/1ab78ed24c81903ae33d48f34d2e89f0a4eaa24e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19972: [SPARK-22778][Kubernetes] Added the missing service meta...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19972 LGTM
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156868316 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -217,6 +217,18 @@ class CodegenContext { splitExpressions(expressions = initCodes, funcName = "init", arguments = Nil) } + /** + * Return true if a given variable has been described as a global variable + */ + def isDeclaredMutableState(varName: String): Boolean = { --- End diff -- hmm, even only in `assert`, it still can fail compilation, doesn't it?
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19970 retest this please
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19975 Merged build finished. Test PASSed.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19975 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84892/ Test PASSed.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19975 **[Test build #84892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84892/testReport)** for PR 19975 at commit [`66475b7`](https://github.com/apache/spark/commit/66475b737e2db96cb2bf491cf22d133bdcfe9b02). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156866988 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -842,7 +856,10 @@ class CodegenContext { blocks.head } else { val func = freshName(funcName) - val argString = arguments.map { case (t, name) => s"$t $name" }.mkString(", ") + val argString = arguments.map { case (t, name) => +assert(!isDeclaredMutableState(name), --- End diff -- Yeah, this problem occurs not only in split methods but also in normal methods. Let us check this in `addNewFunction`
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156866827 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -217,6 +217,18 @@ class CodegenContext { splitExpressions(expressions = initCodes, funcName = "init", arguments = Nil) } + /** + * Return true if a given variable has been described as a global variable + */ + def isDeclaredMutableState(varName: String): Boolean = { +val j = varName.indexOf("[") --- End diff -- I see.
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156866697 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -217,6 +217,18 @@ class CodegenContext { splitExpressions(expressions = initCodes, funcName = "init", arguments = Nil) } + /** + * Return true if a given variable has been described as a global variable + */ + def isDeclaredMutableState(varName: String): Boolean = { --- End diff -- Of course, this PR uses the method only in `assert`.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 > Probably, we'd better to move this discussion to the jira? I'll summarize it and also post the design doc link to the jira.
[GitHub] spark pull request #19974: [SPARK-22779][sql] Resolve default values for fal...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19974
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19974 Thanks! Merged to master.
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19974 LGTM
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 > can we have a google doc so that we can leave comments inline? thanks! Sure. The google doc is at https://docs.google.com/document/d/1By_V-A2sxCWbP7dZ5EzHIuMSe8K0fQL9lqovGWXnsfs/edit?usp=sharing
[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14151 **[Test build #84896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84896/testReport)** for PR 14151 at commit [`021039b`](https://github.com/apache/spark/commit/021039bd1382392282faaec1e1f5c0d39e650a93).
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19925
[GitHub] spark pull request #19974: [SPARK-22779][sql] Resolve default values for fal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19974#discussion_r156863889 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala --- @@ -280,4 +280,34 @@ class SQLConfSuite extends QueryTest with SharedSQLContext { spark.sessionState.conf.clear() } + + test("SPARK-22779: correctly compute default value for fallback configs") { +val fallback = SQLConf.buildConf("spark.sql.__test__.spark_22779") + .fallbackConf(SQLConf.PARQUET_COMPRESSION) --- End diff -- In SQLConf, all our confs are `TypedConfigBuilder`, instead of `ConfigBuilder`. The type-safe ConfigBuilder is unable to call `fallbackConf`, right?
[GitHub] spark issue #19925: [SPARK-22732] Add Structured Streaming APIs to DataSourc...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19925 Thanks! Merging to master to unblock #19926. If there are more comments, we can address them in #19926.
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156863124 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -217,6 +217,18 @@ class CodegenContext { splitExpressions(expressions = initCodes, funcName = "init", arguments = Nil) } + /** + * Return true if a given variable has been described as a global variable + */ + def isDeclaredMutableState(varName: String): Boolean = { +val j = varName.indexOf("[") --- End diff -- we don't need to deal with `[]` here; using them in a parameter name will fail to compile.
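For context on the `indexOf("[")` being discussed: the check normalizes a name like `mutableStateArray[3]` to its base identifier before looking it up among declared mutable states. A minimal sketch of that idea (simplified, in Java rather than the Scala under review; the class and set-based lookup are illustrative, not Spark's code):

```java
import java.util.Set;

// Illustrative version of the discussed check: strip an array subscript,
// if any, then test membership in the set of declared global names.
// Per the review comment, a plain parameter name can never contain "[]"
// (it would not compile), so stripping only matters when an array-element
// reference is passed through as an argument string.
final class GlobalVarCheck {
    static boolean isDeclaredMutableState(String varName, Set<String> declaredNames) {
        int j = varName.indexOf('[');
        String base = (j < 0) ? varName : varName.substring(0, j);
        return declaredNames.contains(base);
    }
}
```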
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19974 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84890/ Test PASSed.
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19974 **[Test build #84890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84890/testReport)** for PR 19974 at commit [`4402fc7`](https://github.com/apache/spark/commit/4402fc7ae33cf3bd1d5c2c8f2dcf380a90615f7c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19974 Merged build finished. Test PASSed.
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156862608
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `isDeclaredMutableState` hunk as quoted above)
+  def isDeclaredMutableState(varName: String): Boolean = {
--- End diff --
let's only enable this check in the test environment, in case it has bugs and breaks production jobs.
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19970 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84891/ Test FAILed.
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19970 Merged build finished. Test FAILed.
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19970 **[Test build #84891 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84891/testReport)** for PR 19970 at commit [`c38f58e`](https://github.com/apache/spark/commit/c38f58e80ef5e8c4c3b732695ea94681f22d85b4).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19971: [SPARK-22774][SQL][Test] Add compilation check in...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19971#discussion_r156861190
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala ---
@@ -348,13 +351,38 @@ class TPCDSQuerySuite extends QueryTest with SharedSQLContext with BeforeAndAfte
     "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
     "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99")

+  private def checkGeneratedCode(plan: SparkPlan): Unit = {
+    val codegenSubtrees = new collection.mutable.HashSet[WholeStageCodegenExec]()
+    plan foreach {
+      case s: WholeStageCodegenExec =>
+        codegenSubtrees += s
+      case s => s
+    }
+    codegenSubtrees.toSeq.map { subtree =>
+      val code = subtree.doCodeGen()._2
+      try {
+        // Just check the generated code can be properly compiled
+        CodeGenerator.compile(code)
+      } catch {
+        case e: Exception =>
+          logError(s"failed to compile: $e", e)
+          val msg =
+            s"Subtree:\n$subtree\n" +
+            s"Generated code:\n${CodeFormatter.format(code)}\n"
+          logDebug(msg)
--- End diff --
it's more useful to include the error message in the exception, e.g. `throw new Exception(msg, e)`
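The reviewer's suggestion can be sketched in plain Java (class and method names here are hypothetical stand-ins, not Spark's API): rather than only logging the failing subtree at debug level, attach the subtree and generated code to a rethrown exception so the test failure itself shows what failed to compile.

```java
// Hedged sketch of "throw new Exception(msg, e)": rethrow with full context
// instead of logDebug. `compile` is a stand-in for CodeGenerator.compile.
class CompileCheck {
    // Stand-in compiler: rejects any "generated code" containing BAD.
    static void compile(String code) {
        if (code.contains("BAD")) {
            throw new IllegalStateException("failed to compile");
        }
    }

    static void checkGeneratedCode(String subtree, String code) {
        try {
            compile(code);
        } catch (Exception e) {
            String msg = "Subtree:\n" + subtree + "\nGenerated code:\n" + code + "\n";
            // Per the review comment: include the context in the exception,
            // preserving the original failure as the cause.
            throw new RuntimeException(msg, e);
        }
    }
}
```

The original cause is kept via the exception's `cause` chain, so no stack-trace information is lost by wrapping.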
[GitHub] spark pull request #19971: [SPARK-22774][SQL][Test] Add compilation check in...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19971#discussion_r156860978
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala --- (same `checkGeneratedCode` hunk as quoted above)
+    codegenSubtrees.toSeq.map { subtree =>
--- End diff --
nit: `map` -> `foreach`
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19594#discussion_r156861642
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
@@ -191,8 +191,16 @@ case class JoinEstimation(join: Join) extends Logging {
       val rInterval = ValueInterval(rightKeyStat.min, rightKeyStat.max, rightKey.dataType)
       if (ValueInterval.isIntersected(lInterval, rInterval)) {
         val (newMin, newMax) = ValueInterval.intersect(lInterval, rInterval, leftKey.dataType)
-        val (card, joinStat) = computeByNdv(leftKey, rightKey, newMin, newMax)
-        keyStatsAfterJoin += (leftKey -> joinStat, rightKey -> joinStat)
+        val (card, joinStat) = (leftKeyStat.histogram, rightKeyStat.histogram) match {
+          case (Some(l: Histogram), Some(r: Histogram)) =>
+            computeByEquiHeightHistogram(leftKey, rightKey, l, r, newMin, newMax)
+          case _ =>
+            computeByNdv(leftKey, rightKey, newMin, newMax)
+        }
+        keyStatsAfterJoin += (
+          leftKey -> joinStat.copy(histogram = leftKeyStat.histogram),
+          rightKey -> joinStat.copy(histogram = rightKeyStat.histogram)
--- End diff --
Actually, keeping it unchanged is more memory efficient. We just pass around pointers, but updating the histogram means creating a new one. Let's keep it, and add some comments to explain it.
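The memory argument above can be illustrated with minimal stand-in classes (these are NOT Spark's `ColumnStat`/`Histogram`, just hypothetical sketches): copying the per-key statistics while keeping the existing histogram only copies a reference, whereas "updating" the histogram would force allocating a new bin array.

```java
// Minimal stand-ins illustrating pointer reuse when copying statistics.
class Histogram {
    final double[] binHeights;
    Histogram(double[] binHeights) { this.binHeights = binHeights; }
}

class KeyStat {
    final long distinctCount;
    final Histogram histogram;
    KeyStat(long distinctCount, Histogram histogram) {
        this.distinctCount = distinctCount;
        this.histogram = histogram;
    }
    // Analog of joinStat.copy(histogram = leftKeyStat.histogram):
    // new counts after the join, but the SAME histogram object.
    KeyStat withDistinctCount(long newCount) {
        return new KeyStat(newCount, this.histogram);
    }
}
```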
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156860058
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -179,11 +209,64 @@ class CodegenContext {
    * the list of default imports available.
    * Also, generic type arguments are accepted but ignored.
    * @param variableName Name of the field.
-   * @param initCode The statement(s) to put into the init() method to initialize this field.
-   *                 If left blank, the field will be default-initialized.
+   * @param initFunc Function includes statement(s) to put into the init() method to initialize
+   *                 this field. An argument is the name of the mutable state variable.
+   *                 If left blank, the field will be default-initialized.
+   * @param inline whether the declaration and initialization code may be inlined rather than
+   *               compacted.
+   * @param useFreshName If false and inline is true, the name is not changed
+   * @return the name of the mutable state variable, which is either the original name if the
+   *         variable is inlined to the outer class, or an array access if the variable is to be
+   *         stored in an array of variables of the same type and initialization.
+   *         There are two use cases. One is to use the original name for a global variable instead
+   *         of a fresh name. The second is to use the original initialization statement since it is
+   *         complex (e.g. allocates a multi-dimensional array, or the object constructor has variables).
+   *         Primitive type variables will be inlined into the outer class when the total number of
+   *         mutable variables is less than `CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD`;
+   *         the max size of an array for compaction is given by
+   *         `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT`.
    */
-  def addMutableState(javaType: String, variableName: String, initCode: String = ""): Unit = {
-    mutableStates += ((javaType, variableName, initCode))
+  def addMutableState(
+      javaType: String,
+      variableName: String,
+      initFunc: String => String = _ => "",
+      inline: Boolean = false,
+      useFreshName: Boolean = true): String = {
+    val varName = if (useFreshName) freshName(variableName) else variableName
+
+    if (inline ||
+        // want to put a primitive type variable at outerClass for performance
+        isPrimitiveType(javaType) &&
+          (mutableStates.length < CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD) ||
+        // type is multi-dimensional array
--- End diff --
We are changing a declared type for compaction (i.e. adding one dimension at the front) at `declareMutableStates`. The logic at `declareMutableStates` assumes that a given variable is a scalar or a one-dimensional array. We could support multi-dimensional arrays, but it requires slightly more complicated string operations. Should we support multi-dimensional arrays or keep it simple? @cloud-fan WDYT?
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19813 Probably, we'd better move this discussion to the JIRA?
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19813 can we have a google doc so that we can leave comments inline? thanks!
[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19971 **[Test build #84894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84894/testReport)** for PR 19971 at commit [`1480ae3`](https://github.com/apache/spark/commit/1480ae31ea28ddf0d5ea3beffb99ae9995932be2).
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84895/testReport)** for PR 19811 at commit [`49119a9`](https://github.com/apache/spark/commit/49119a9de53cd7bb2bf91df6b691358da22e1b00).
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19811 Just resolved conflicts. Next commit will address review comments.
[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...
Github user gczsjdy commented on the issue: https://github.com/apache/spark/pull/19862 cc @cloud-fan @hvanhovell @viirya
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 I wrote a design doc at https://paper.dropbox.com/doc/Split-deeply-nested-expressions-under-wholestage-codegen-WXkQ9iIlN3zkGdn8MHgiW The design is based on what is done in this PR. Please give me feedback if you have time to go through it. Thank you. cc @cloud-fan @kiszk @mgaido91 @maropu @gatorsmile
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/19955 @ericvandenbergfb - In the testing section, let's put the benchmark numbers for large shuffle heavy jobs we ran internally.
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19955 and also what's the behavior if the memory is used up but `spark.shuffle.async.num.sorter` is not hit yet?
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19955 shall we do it for all the external sorters?
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19955 cc @JoshRosen @jiangxb1987
[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17619 Can one of the admins verify this patch?
[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19020
[GitHub] spark issue #19020: [SPARK-3181] [ML] Implement huber loss for LinearRegress...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/19020 Merged into master, thanks for all your reviews.
[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r156856635
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -480,10 +640,14 @@ object LinearRegression extends DefaultParamsReadable[LinearRegression] {
 class LinearRegressionModel private[ml] (
     @Since("1.4.0") override val uid: String,
     @Since("2.0.0") val coefficients: Vector,
-    @Since("1.3.0") val intercept: Double)
+    @Since("1.3.0") val intercept: Double,
+    @Since("2.3.0") val scale: Double)
   extends RegressionModel[Vector, LinearRegressionModel]
   with LinearRegressionParams with MLWritable {

+  def this(uid: String, coefficients: Vector, intercept: Double) =
+    this(uid, coefficients, intercept, 1.0)
--- End diff --
```scale``` denotes that ```|y - X'w - c|``` is scaled down; I think it makes sense to set it to 1.0 for least squares regression.
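For context on why a `scale` of 1.0 is harmless for least squares: the Huber loss that this PR adds is quadratic near zero and linear in the tails. The sketch below shows the plain Huber loss for a single residual `r` with threshold `delta` — it deliberately omits the joint scale (σ) estimation the PR actually performs, and the names are illustrative, not Spark's API.

```java
// Plain Huber loss sketch: quadratic region for |r| <= delta,
// linear region beyond. NOT Spark's implementation (which also
// estimates a scale parameter sigma jointly with the coefficients).
class HuberLoss {
    static double huber(double r, double delta) {
        double a = Math.abs(r);
        if (a <= delta) {
            return 0.5 * r * r;               // quadratic near zero
        } else {
            return delta * (a - 0.5 * delta); // linear in the tails
        }
    }
}
```

Note the two branches agree at `|r| == delta` (both give `delta^2 / 2`), which is what makes the loss continuously differentiable.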
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156855675
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+        javaType.contains("[][]")) {
+      val initCode = initFunc(varName)
+      mutableStates += ((javaType, varName, initCode))
+      varName
+    } else {
+      // mutableStateArray has not been declared yet for the given type and name. Create a new name
+      // for the array, The mutableStateArray for the given type and name has been declared,
+      // update the max index of the array. Then, add an entry to keep track of current array name
+      // for type and init code is stored for code generation. Finally, return an array element
+      val (arrayName, newIdx) = {
+        val compactArrayName = "mutableStateArray"
--- End diff --
how about
```
class MutableStateArrays {
  val arrayNames = mutable.ListBuffer.empty[String]
  createNewArray()

  private[this] var currentIndex = 0

  def getNextSlot(): String = {
    if (currentIndex < CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT) {
      val res = s"${arrayNames.last}[$currentIndex]"
      currentIndex += 1
      res
    } else {
      createNewArray()
      currentIndex = 1
      s"${arrayNames.last}[0]"
    }
  }

  private def createNewArray() = arrayNames.append(freshName("mutableStateArray"))
}

// type name -> MutableStateArrays
val mutableStateArrayMap: Map[String, MutableStateArrays] = ...
```
and here the logic can be very simple
```
val arrays = mutableStateArrayMap.getOrElseUpdate(javaType, new MutableStateArrays)
arrays.getNextSlot()
```
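The slot-allocation scheme the reviewer proposes can be sketched as runnable Java (the size limit and the `mutableStateArrayN` naming are stand-ins for `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT` and `freshName`, not Spark's actual values): slots are handed out from the current backing array until the limit is reached, then a fresh array name is started.

```java
import java.util.ArrayList;
import java.util.List;

// Java analog of the proposed MutableStateArrays helper. Each call to
// getNextSlot() returns an array-access expression like "mutableStateArray0[2]";
// when the current array is full, a new array name is created and indexing
// restarts at 0.
class MutableStateArraysSketch {
    static final int SIZE_LIMIT = 3; // stand-in for MUTABLESTATEARRAY_SIZE_LIMIT
    final List<String> arrayNames = new ArrayList<>();
    private int currentIndex = 0;
    private int freshId = 0; // stand-in for freshName's uniquifier

    MutableStateArraysSketch() { createNewArray(); }

    String getNextSlot() {
        if (currentIndex < SIZE_LIMIT) {
            String res = last() + "[" + currentIndex + "]";
            currentIndex++;
            return res;
        } else {
            createNewArray();  // current array is full: start a new one
            currentIndex = 1;
            return last() + "[0]";
        }
    }

    private String last() { return arrayNames.get(arrayNames.size() - 1); }
    private void createNewArray() { arrayNames.add("mutableStateArray" + freshId++); }
}
```

Keeping one such allocator per Java type (the `type name -> MutableStateArrays` map above) is what lets `addMutableState` reduce to a `getOrElseUpdate` plus `getNextSlot()`.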
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156853851
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -205,18 +287,35 @@ class CodegenContext {
   def declareMutableStates(): String = {
     // It's possible that we add same mutable state twice, e.g. the `mergeExpressions` in
     // `TypedAggregateExpression`, we should call `distinct` here to remove the duplicated ones.
-    mutableStates.distinct.map { case (javaType, variableName, _) =>
+    val inlinedStates = mutableStates.distinct.map { case (javaType, variableName, _) =>
       s"private $javaType $variableName;"
-    }.mkString("\n")
+    }
+
+    val arrayStates = mutableStateArrayIdx.keys.map { case (javaType, arrayName) =>
+      val length = mutableStateArrayIdx((javaType, arrayName)) + 1
+      if (javaType.matches("^.*\\[\\]$")) {
--- End diff --
shall we just check `javaType.contains("[]")`?
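The two checks agree on ordinary array types and differ only when `[]` appears somewhere other than the end of the type string — for instance inside a generic type argument, which the surrounding doc says is "accepted but ignored". A quick illustration in plain Java (the type strings are hypothetical examples):

```java
// Compares the anchored regex "^.*\[\]$" (type ENDS with []) against the
// simpler contains("[]") check suggested in the review.
class TypeCheck {
    static boolean endsWithBrackets(String javaType) {
        return javaType.matches("^.*\\[\\]$");
    }

    static boolean containsBrackets(String javaType) {
        return javaType.contains("[]");
    }
}
```

For `"int[]"` and `"byte[][]"` both checks return true; for a generic like `"scala.Tuple2<int[], String>"` only `contains` matches, since the string does not end with `[]`. The `contains` form is also cheaper, as it avoids regex compilation on every call.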
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19975 **[Test build #84893 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84893/testReport)** for PR 19975 at commit [`1ab78ed`](https://github.com/apache/spark/commit/1ab78ed24c81903ae33d48f34d2e89f0a4eaa24e).
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156853203
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+        // type is multi-dimensional array
--- End diff --
Good catch. Now we can support it; this limitation came from creating a loop for initialization.
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156852077
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+    } else {
+      // mutableStateArray has not been declared yet for the given type and name. Create a new name
--- End diff --
nit: `If ...`
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156851936
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+        // type is multi-dimensional array
--- End diff --
why can't we support a `multi-dimensional array`?
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156851771
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+   * @param initFunc Function includes statement(s) to put into the init() method to initialize
+   *                 this field. An argument is the name of the mutable state variable.
--- End diff --
nit: `The Argument...`
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19975 **[Test build #84892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84892/testReport)** for PR 19975 at commit [`66475b7`](https://github.com/apache/spark/commit/66475b737e2db96cb2bf491cf22d133bdcfe9b02).
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19811#discussion_r156851696

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -179,11 +209,64 @@ class CodegenContext {
        * the list of default imports available.
        * Also, generic type arguments are accepted but ignored.
        * @param variableName Name of the field.
    -   * @param initCode The statement(s) to put into the init() method to initialize this field.
    -   *                 If left blank, the field will be default-initialized.
    +   * @param initFunc Function includes statement(s) to put into the init() method to initialize
    +   *this field. An argument is the name of the mutable state variable.

    --- End diff --

    nit: indentation
[GitHub] spark pull request #19975: [SPARK-22781][SS] Support creating streaming data...
GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/19975

    [SPARK-22781][SS] Support creating streaming dataset with ORC files

    ## What changes were proposed in this pull request?

    This PR supports creating streaming dataset with ORC file format.

    ## How was this patch tested?

    Pass the newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-22781

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19975.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19975

commit 66475b737e2db96cb2bf491cf22d133bdcfe9b02
Author: Dongjoon Hyun
Date: 2017-12-14T04:14:53Z

    [SPARK-22781][SS] Support creating streaming dataset with ORC file format
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19970 **[Test build #84891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84891/testReport)** for PR 19970 at commit [`c38f58e`](https://github.com/apache/spark/commit/c38f58e80ef5e8c4c3b732695ea94681f22d85b4).
[GitHub] spark issue #19972: [SPARK-22778][Kubernetes] Added the missing service meta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19972 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84888/ Test PASSed.
[GitHub] spark issue #19972: [SPARK-22778][Kubernetes] Added the missing service meta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19972 Merged build finished. Test PASSed.
[GitHub] spark issue #19972: [SPARK-22778][Kubernetes] Added the missing service meta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19972 **[Test build #84888 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84888/testReport)** for PR 19972 at commit [`c91c9a6`](https://github.com/apache/spark/commit/c91c9a606bb00ea57c50fe27081273e0d5a8bcee).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19813 I'm not sure; it depends on how many places we're already doing it and what the drawback would be if we forbid it. Let's have a design doc and gather more feedback. Thanks for your understanding! Both PRs are reverted.
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19864#discussion_r156847927

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -80,6 +80,14 @@ class CacheManager extends Logging {
         cachedData.isEmpty
       }
     
    +  private def extractStatsOfPlanForCache(plan: LogicalPlan): Option[Statistics] = {
    +    if (plan.stats.rowCount.isDefined) {

    --- End diff --

    The expected value should be `rowCount * avgRowSize`. Without CBO, I think the file size is the best we can get, although it may not be correct. That is to say, without CBO, a parquet relation may have an underestimated size and cause OOM; users need to turn on CBO to fix it. So the same thing should happen in table cache. We could fix this by defining `sizeInBytes` as `file size * some factor`, but that belongs in another PR.
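The sizing rule cloud-fan describes can be sketched as follows. This is an illustrative model, not the actual `CacheManager` code — `PlanStats` and `expectedCacheSize` are hypothetical names: with CBO statistics available, the expected value is `rowCount * avgRowSize`; otherwise the on-disk file size is the best available fallback.

```scala
// Sketch of the cache-size estimate discussed above (illustrative only).
case class PlanStats(
    fileSizeInBytes: BigInt,
    rowCount: Option[BigInt],
    avgRowSize: Option[BigInt])

def expectedCacheSize(stats: PlanStats): BigInt =
  (stats.rowCount, stats.avgRowSize) match {
    // With CBO: the expected value is rowCount * avgRowSize.
    case (Some(rows), Some(avg)) => rows * avg
    // Without CBO: file size is the best we can get, though it may
    // underestimate the in-memory footprint (e.g. compressed parquet).
    case _ => stats.fileSizeInBytes
  }
```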
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19594#discussion_r156847785

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -191,8 +191,16 @@ case class JoinEstimation(join: Join) extends Logging {
               val rInterval = ValueInterval(rightKeyStat.min, rightKeyStat.max, rightKey.dataType)
               if (ValueInterval.isIntersected(lInterval, rInterval)) {
                 val (newMin, newMax) = ValueInterval.intersect(lInterval, rInterval, leftKey.dataType)
    -            val (card, joinStat) = computeByNdv(leftKey, rightKey, newMin, newMax)
    -            keyStatsAfterJoin += (leftKey -> joinStat, rightKey -> joinStat)
    +            val (card, joinStat) = (leftKeyStat.histogram, rightKeyStat.histogram) match {
    +              case (Some(l: Histogram), Some(r: Histogram)) =>
    +                computeByEquiHeightHistogram(leftKey, rightKey, l, r, newMin, newMax)
    +              case _ =>
    +                computeByNdv(leftKey, rightKey, newMin, newMax)
    +            }
    +            keyStatsAfterJoin += (
    +              leftKey -> joinStat.copy(histogram = leftKeyStat.histogram),
    +              rightKey -> joinStat.copy(histogram = rightKeyStat.histogram)

    --- End diff --

    Currently we don't update histograms, since min/max can help us to know which bins are valid. It doesn't affect correctness. But updating histograms helps to reduce memory usage for histogram propagation. We can do this in both filter and join estimation in follow-up PRs.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 @cloud-fan Do you think we should still allow something like `a + 1` as the output of codegen?
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19594#discussion_r156847046

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +115,183 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns overlapped ranges between two histograms, in the given value range [newMin, newMax].
    +   */
    +  def getOverlappedRanges(
    +      leftHistogram: Histogram,
    +      rightHistogram: Histogram,
    +      newMin: Double,
    +      newMax: Double): Seq[OverlappedRange] = {
    +    val overlappedRanges = new ArrayBuffer[OverlappedRange]()
    +    // Only bins whose range intersect [newMin, newMax] have join possibility.
    +    val leftBins = leftHistogram.bins
    +      .filter(b => b.lo <= newMax && b.hi >= newMin)
    +    val rightBins = rightHistogram.bins
    +      .filter(b => b.lo <= newMax && b.hi >= newMin)
    +
    +    leftBins.foreach { lb =>
    +      rightBins.foreach { rb =>
    +        val (left, leftHeight) = trimBin(lb, leftHistogram.height, newMin, newMax)
    +        val (right, rightHeight) = trimBin(rb, rightHistogram.height, newMin, newMax)
    +        // Only collect overlapped ranges.
    +        if (left.lo <= right.hi && left.hi >= right.lo) {
    +          // Collect overlapped ranges.
    +          val range = if (left.lo == left.hi) {
    +            // Case1: the left bin has only one value
    +            OverlappedRange(
    +              lo = left.lo,
    +              hi = left.lo,
    +              leftNdv = 1,
    +              rightNdv = 1,
    +              leftNumRows = leftHeight,
    +              rightNumRows = rightHeight / right.ndv
    +            )
    +          } else if (right.lo == right.hi) {
    +            // Case2: the right bin has only one value
    +            OverlappedRange(
    +              lo = right.lo,
    +              hi = right.lo,
    +              leftNdv = 1,
    +              rightNdv = 1,
    +              leftNumRows = leftHeight / left.ndv,
    +              rightNumRows = rightHeight
    +            )
    +          } else if (right.lo >= left.lo && right.hi >= left.hi) {
    +            // Case3: the left bin is "smaller" than the right bin
    +            //      left.lo     right.lo     left.hi     right.hi
    +            // --------+------------+------------+------------+------->
    +            val leftRatio = (left.hi - right.lo) / (left.hi - left.lo)
    +            val rightRatio = (left.hi - right.lo) / (right.hi - right.lo)
    +            if (leftRatio == 0) {
    +              // The overlapped range has only one value.
    +              OverlappedRange(
    +                lo = right.lo,
    +                hi = right.lo,
    +                leftNdv = 1,
    +                rightNdv = 1,
    +                leftNumRows = leftHeight / left.ndv,
    +                rightNumRows = rightHeight / right.ndv
    +              )
    +            } else {
    +              OverlappedRange(
    +                lo = right.lo,
    +                hi = left.hi,
    +                leftNdv = left.ndv * leftRatio,
    +                rightNdv = right.ndv * rightRatio,
    +                leftNumRows = leftHeight * leftRatio,
    +                rightNumRows = rightHeight * rightRatio
    +              )
    +            }
    +          } else if (right.lo <= left.lo && right.hi <= left.hi) {
    +            // Case4: the left bin is "larger" than the right bin
    +            //      right.lo     left.lo     right.hi     left.hi
    +            // --------+------------+------------+------------+------->
    +            val leftRatio = (right.hi - left.lo) / (left.hi - left.lo)
    +            val rightRatio = (right.hi - left.lo) / (right.hi - right.lo)
    +            if (leftRatio == 0) {
    +              // The overlapped range has only one value.
    +              OverlappedRange(
    +                lo = right.hi,
    +                hi = right.hi,
    +                leftNdv = 1,
    +                rightNdv = 1,
    +                leftNumRows = leftHeight / left.ndv,
    +                rightNumRows = rightHeight / right.ndv
    +              )
    +            } else {
    +              OverlappedRange(
    +                lo = left.lo,
    +                hi = right.hi,
    +                leftNdv = left.ndv * leftRatio,
    +                rightNdv = right.ndv * rightRatio,
    +                leftNumRows = leftHeight * leftRatio,
    +                rightNumRows = rightHeight * rightRatio
    +              )
    +            }
    +          } else if (right.lo >= left.lo && right.hi <= left.hi) {
    +            // Case5: the left bin contains the right bin
    +            //      left.lo     right.lo
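For readers following the five cases in the quoted diff: the core operation is intersecting two histogram bins and scaling each side's ndv and row count by the fraction of its bin covered by the overlap. A simplified, self-contained sketch of that fraction computation — `Bin`, `coveredFraction`, and `overlappedNdv` are illustrative names, not the actual Spark code:

```scala
// Simplified model of a histogram bin: value range [lo, hi] with ndv
// distinct values.
case class Bin(lo: Double, hi: Double, ndv: Double)

// Fraction of `b` covered by the overlap with [lo, hi]; a single-value bin
// counts as fully covered (the degenerate Case1/Case2 in the diff).
def coveredFraction(b: Bin, lo: Double, hi: Double): Double = {
  if (b.hi == b.lo) 1.0
  else math.max(0.0, math.min(b.hi, hi) - math.max(b.lo, lo)) / (b.hi - b.lo)
}

// Scaled ndv contribution of bin `b` inside its overlap with `other`,
// mirroring `left.ndv * leftRatio` in the diff.
def overlappedNdv(b: Bin, other: Bin): Double =
  b.ndv * coveredFraction(b, other.lo, other.hi)
```

The same fraction scales the bin's row count (`height * ratio` in the diff); the special `ratio == 0` branches handle the boundary case where the overlap collapses to a single value.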
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19594#discussion_r156846872

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +115,183 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns overlapped ranges between two histograms, in the given value range [newMin, newMax].
    +   */
    +  def getOverlappedRanges(
    +      leftHistogram: Histogram,
    +      rightHistogram: Histogram,
    +      newMin: Double,
    +      newMax: Double): Seq[OverlappedRange] = {

    --- End diff --

    yea I think `upperBound/lowerBound` is better.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 Also needs to revert #19969 which is based on this.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 @cloud-fan Ok. Thanks.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19813 After some more thought, this PR makes a new contract that Spark didn't promise before: `Expression.genCode` must output something that can be used as a parameter name or literal. I do remember that in some places we just output statements like `a + 1` for codegen, but I could be wrong. At least we need to check all the places and document this new contract before merging this PR. Another solution is to not make this contract. From a quick look this seems hard to do, because by that point the code (method body) is already generated and we don't know how to replace a statement like `a + 1` with the generated parameter name inside the method body. We may need to do this fix earlier in the codegen procedure. I'm going to revert it; let's have a proper design doc and resubmit this. Sorry for the inconvenience!
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19955 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84889/ Test FAILed.
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19955 Merged build finished. Test FAILed.
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19955 **[Test build #84889 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84889/testReport)** for PR 19955 at commit [`a3ea93c`](https://github.com/apache/spark/commit/a3ea93cd829debc13de6aac3fc94e49c3a03dabc).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19970: [SPARK-22775][SQL] move dictionary related APIs f...
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19970#discussion_r156843445

    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
    @@ -145,39 +135,39 @@
       public abstract double[] getDoubles(int rowId, int count);
     
       /**
    -   * Returns the length of the array at rowid.
    +   * Returns the length of the array for rowId.
        */
       public abstract int getArrayLength(int rowId);
     
       /**
    -   * Returns the offset of the array at rowid.
    +   * Returns the offset of the array for rowId.
        */
       public abstract int getArrayOffset(int rowId);
     
       /**
    -   * Returns a utility object to get structs.
    +   * Returns the struct for rowId.
        */
       public final ColumnarRow getStruct(int rowId) { return new ColumnarRow(this, rowId); }
     
       /**
    -   * Returns a utility object to get structs.
    -   * provided to keep API compatibility with InternalRow for code generation
    +   * A special version of {@link #getShort(int)}, which is only used as an adapter for Spark codegen

    --- End diff --

    `getStruct(int)` instead of `getShort(int)`?
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19974 **[Test build #84890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84890/testReport)** for PR 19974 at commit [`4402fc7`](https://github.com/apache/spark/commit/4402fc7ae33cf3bd1d5c2c8f2dcf380a90615f7c).
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19751 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84885/ Test PASSed.
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19751 Merged build finished. Test PASSed.
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19751 **[Test build #84885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84885/testReport)** for PR 19751 at commit [`b02ea2c`](https://github.com/apache/spark/commit/b02ea2c4a96e27918c02f75c19d6639daae7cb2d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19974 @rxin I believe this is what you were trying to do?
[GitHub] spark pull request #19974: [SPARK-22779][sql] Resolve default values for fal...
GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/19974

    [SPARK-22779][sql] Resolve default values for fallback configs.

    SQLConf allows some callers to define a custom default value for configs, and that complicates a little bit the handling of fallback config entries, since most of the default value resolution is hidden by the config code.

    This change peeks into the internals of these fallback configs to figure out the correct default value, and also returns the current human-readable default when showing the default value (e.g. through "set -v").

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-22779

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19974.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19974

commit 4402fc7ae33cf3bd1d5c2c8f2dcf380a90615f7c
Author: Marcelo Vanzin
Date: 2017-12-14T02:49:28Z

    [SPARK-22779][sql] Resolve default values for fallback configs.

    SQLConf allows some callers to define a custom default value for configs, and that complicates a little bit the handling of fallback config entries, since most of the default value resolution is hidden by the config code.

    This change peeks into the internals of these fallback configs to figure out the correct default value, and also returns the current human-readable default when showing the default value (e.g. through "set -v").
[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19971 Merged build finished. Test PASSed.
[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19971 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84886/ Test PASSed.