[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156874375 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -154,6 +154,44 @@ class CodegenContext { val mutableStates: mutable.ArrayBuffer[(String, String, String)] = mutable.ArrayBuffer.empty[(String, String, String)] + // An map keyed by mutable states' types holds the status of mutableStateArray + var mutableStateArrayMap: mutable.Map[String, MutableStateArrays] = +mutable.Map.empty[String, MutableStateArrays] + + // An map keyed by mutable states' types holds the current name of the mutableStateArray + // into which state of the given key will be compacted + var mutableStateArrayCurrentNames: mutable.Map[String, String] = +mutable.Map.empty[String, String] + + // An array holds the code that will initialize each element of the mutableStateArray + var mutableStateArrayInitCodes: mutable.ArrayBuffer[String] = +mutable.ArrayBuffer.empty[String] + + // Holding names and current index of mutableStateArrays for a certain type + class MutableStateArrays { +val arrayNames = mutable.ListBuffer.empty[String] +createNewArray() + +private[this] var currentIndex = 0 + +private def createNewArray() = arrayNames.append(freshName("mutableStateArray")) + +def getCurrentIndex: Int = { currentIndex } --- End diff -- nit `def getCurrentIndex: Int = currentIndex` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19892: [SPARK-20542][FollowUp][PySpark] Bucketizer support mult...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19892 **[Test build #84898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84898/testReport)** for PR 19892 at commit [`5efc94e`](https://github.com/apache/spark/commit/5efc94e8c5a28ab908c61140fdfd00e6e242a8db).
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156874146 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -154,6 +154,44 @@ class CodegenContext { val mutableStates: mutable.ArrayBuffer[(String, String, String)] = mutable.ArrayBuffer.empty[(String, String, String)] + // An map keyed by mutable states' types holds the status of mutableStateArray + var mutableStateArrayMap: mutable.Map[String, MutableStateArrays] = +mutable.Map.empty[String, MutableStateArrays] + + // An map keyed by mutable states' types holds the current name of the mutableStateArray + // into which state of the given key will be compacted + var mutableStateArrayCurrentNames: mutable.Map[String, String] = --- End diff -- it's not needed
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156874184 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -154,6 +154,44 @@ class CodegenContext { val mutableStates: mutable.ArrayBuffer[(String, String, String)] = mutable.ArrayBuffer.empty[(String, String, String)] + // An map keyed by mutable states' types holds the status of mutableStateArray + var mutableStateArrayMap: mutable.Map[String, MutableStateArrays] = +mutable.Map.empty[String, MutableStateArrays] + + // An map keyed by mutable states' types holds the current name of the mutableStateArray + // into which state of the given key will be compacted + var mutableStateArrayCurrentNames: mutable.Map[String, String] = +mutable.Map.empty[String, String] + + // An array holds the code that will initialize each element of the mutableStateArray + var mutableStateArrayInitCodes: mutable.ArrayBuffer[String] = --- End diff -- why var?
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156874096 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -154,6 +154,44 @@ class CodegenContext { val mutableStates: mutable.ArrayBuffer[(String, String, String)] = mutable.ArrayBuffer.empty[(String, String, String)] + // An map keyed by mutable states' types holds the status of mutableStateArray + var mutableStateArrayMap: mutable.Map[String, MutableStateArrays] = --- End diff -- why var?
[GitHub] spark issue #19892: [SPARK-20542][FollowUp][PySpark] Bucketizer support mult...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19892 retest this please
[GitHub] spark issue #19925: [SPARK-22732] Add Structured Streaming APIs to DataSourc...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19925 shall we move these new interfaces to `org.apache.spark.sql.sources.v2.reader/write.streaming` package?
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84897/testReport)** for PR 19811 at commit [`15e967e`](https://github.com/apache/spark/commit/15e967ec278abdae190bd3f0d3e937a54f674e19).
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156872558 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/ContinuousWriter.java --- @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import org.apache.spark.annotation.InterfaceStability; + +/** + * A {@link DataSourceV2Writer} for use with continuous stream processing. --- End diff -- `A variation of ...`?
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156872002 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/MicroBatchReader.java --- @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import org.apache.spark.sql.sources.v2.reader.Offset; +import org.apache.spark.sql.execution.streaming.BaseStreamingSource; + +import java.util.Optional; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this + * interface to indicate they allow micro-batch streaming reads. + */ +public interface MicroBatchReader extends DataSourceV2Reader, BaseStreamingSource {
/** + * Set the desired offset range for read tasks created from this reader. Read tasks will --- End diff -- do you mean this method must be called before `createReadTasks`?
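The ordering question raised here can be illustrated with a small sketch. This is not Spark's actual implementation: the class and the one-task-per-offset logic are hypothetical, and only the idea that the offset range is reader state consumed later by task creation comes from the review.

```java
import java.util.Optional;

// Illustrative sketch: the reader holds the offset range as mutable state,
// so setOffsetRange(...) must run before createReadTasks() for the tasks
// to see a defined range.
final class SketchMicroBatchReader {
    private Optional<Long> start = Optional.empty();
    private Optional<Long> end = Optional.empty();

    void setOffsetRange(Optional<Long> newStart, Optional<Long> newEnd) {
        this.start = newStart;
        this.end = newEnd;
    }

    long createReadTasks() {
        if (!start.isPresent() || !end.isPresent()) {
            throw new IllegalStateException(
                "setOffsetRange must be called before createReadTasks");
        }
        // One task per offset in [start, end) -- illustrative only.
        return end.get() - start.get();
    }
}
```

If such a call-order contract exists, documenting it in the javadoc (as the reviewer hints) spares implementers from discovering it through an exception.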
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156871819 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/MicroBatchReader.java --- @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import org.apache.spark.sql.sources.v2.reader.Offset; +import org.apache.spark.sql.execution.streaming.BaseStreamingSource; + +import java.util.Optional; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this --- End diff -- ditto
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156871677 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousReader.java --- @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import org.apache.spark.sql.sources.v2.reader.PartitionOffset; +import org.apache.spark.sql.execution.streaming.BaseStreamingSource; + +import java.util.Optional; + +/** + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this --- End diff -- It's not a mix-in interface but a variation on `DataSourceV2Reader`, right?
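The mix-in/variation distinction the reviewer draws can be sketched in a few lines. The names below are simplified stand-ins, not the real API: a mix-in interface adds an optional, orthogonal capability alongside the base reader type, while a variation is a subtype that redefines the reading contract for a particular mode.

```java
// Base reader type (simplified stand-in for the real interface).
interface DataSourceV2Reader { }

// Mix-in style: an orthogonal capability implemented *alongside* the base type.
interface SupportsSomeCapability { }

// Variation style: a subtype for a specific mode, which is what the comment
// says ContinuousReader really is.
interface ContinuousReader extends DataSourceV2Reader { }

// A concrete reader can be a variation and still pick up mix-ins.
class MyContinuousReader implements ContinuousReader, SupportsSomeCapability { }
```

The javadoc wording matters because "mix-in" suggests the interface is optional sugar, whereas a variation is the primary contract for that mode.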
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156871374 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ContinuousDataReader.java --- @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import org.apache.spark.sql.sources.v2.reader.PartitionOffset; + +import java.io.IOException; + +/** + * A variation on {@link DataReader} for use with streaming in continuous processing mode. + */ +public interface ContinuousDataReader extends DataReader {
/** + * Get the offset of the current record, or the start offset if no records have been read. + * + * The execution engine will call this method along with get() to keep track of the current --- End diff -- better to use a real java doc link, e.g. `{@link DataReader#get}` --- 
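The suggested fix is a one-word javadoc change: replace the plain-text `get()` with a `{@link}` tag so the javadoc tool resolves it to the real method. A minimal self-contained sketch (the `IntReader` implementation is hypothetical, added only so the interfaces compile and run):

```java
import java.io.IOException;

interface DataReader<T> {
    T get();
}

interface ContinuousDataReader<T> extends DataReader<T> {
    /**
     * Get the offset of the current record, or the start offset if no records
     * have been read.
     *
     * <p>The execution engine will call this method along with
     * {@link DataReader#get} to keep track of the current read position.
     */
    Object getOffset() throws IOException;
}

// Hypothetical implementation, for illustration only.
class IntReader implements ContinuousDataReader<Integer> {
    private int current = 0;

    public Integer get() {
        return current++; // returns the current value, then advances
    }

    public Object getOffset() {
        return current;
    }
}
```

With `{@link DataReader#get}`, IDEs and generated javadoc cross-link to the inherited method, and the reference breaks the build if the method is later renamed, which a plain `get()` in prose would not.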
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19925#discussion_r156870726 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/MicroBatchWriteSupport.java --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; + +import java.util.Optional; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.execution.streaming.BaseStreamingSink; +import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer; +import org.apache.spark.sql.streaming.OutputMode; +import org.apache.spark.sql.types.StructType; + +/** + * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to + * provide data writing ability and save the data from a microbatch to the data source. + */ +@InterfaceStability.Evolving +public interface MicroBatchWriteSupport extends BaseStreamingSink { + + /** + * Creates an optional {@link DataSourceV2Writer} to save the data to this data source. Data + * sources can return None if there is no writing needed to be done. + * + * @param queryId A unique string for the writing query. It's possible that there are many writing + *queries running at the same time, and the returned {@link DataSourceV2Writer} + *can use this id to distinguish itself from others. + * @param epochId The uniquenumeric ID of the batch within this writing query. This is an --- End diff -- typo: `unique numberic`
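The javadoc being reviewed explains why a writer receives both `queryId` and `epochId`. One common use of that pair, sketched below, is as an idempotence key so a re-executed micro-batch commits at most once. The class is purely illustrative and is not part of the API under review.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a sink combines queryId and epochId into a commit key,
// making retried micro-batches idempotent.
final class IdempotentCommitTracker {
    private final Set<String> committed = new HashSet<>();

    /** Returns true only the first time a given (queryId, epochId) commits. */
    boolean tryCommit(String queryId, long epochId) {
        return committed.add(queryId + "#" + epochId);
    }
}
```

A real sink would persist the committed set (e.g. in a transaction log) rather than keep it in memory, but the key structure is the same.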
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Merged build finished. Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84895/ Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84895/testReport)** for PR 19811 at commit [`49119a9`](https://github.com/apache/spark/commit/49119a9de53cd7bb2bf91df6b691358da22e1b00). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19975 Merged build finished. Test PASSed.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19975 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84893/ Test PASSed.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19975 **[Test build #84893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84893/testReport)** for PR 19975 at commit [`1ab78ed`](https://github.com/apache/spark/commit/1ab78ed24c81903ae33d48f34d2e89f0a4eaa24e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19972: [SPARK-22778][Kubernetes] Added the missing service meta...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19972 LGTM
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156868316 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -217,6 +217,18 @@ class CodegenContext { splitExpressions(expressions = initCodes, funcName = "init", arguments = Nil) } + /** + * Return true if a given variable has been described as a global variable + */ + def isDeclaredMutableState(varName: String): Boolean = { --- End diff -- hmm, even only in `assert`, it still can fail compilation, doesn't it?
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19970 retest this please
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19975 Merged build finished. Test PASSed.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19975 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84892/ Test PASSed.
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19975 **[Test build #84892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84892/testReport)** for PR 19975 at commit [`66475b7`](https://github.com/apache/spark/commit/66475b737e2db96cb2bf491cf22d133bdcfe9b02). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156866988 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -842,7 +856,10 @@ class CodegenContext { blocks.head } else { val func = freshName(funcName) - val argString = arguments.map { case (t, name) => s"$t $name" }.mkString(", ") + val argString = arguments.map { case (t, name) => +assert(!isDeclaredMutableState(name), --- End diff -- Yeah, this problem occurs not only in split methods but also in normal methods. Let us check this in `addNewFunction`
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156866827 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -217,6 +217,18 @@ class CodegenContext { splitExpressions(expressions = initCodes, funcName = "init", arguments = Nil) } + /** + * Return true if a given variable has been described as a global variable + */ + def isDeclaredMutableState(varName: String): Boolean = { +val j = varName.indexOf("[") --- End diff -- I see.
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156866697 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -217,6 +217,18 @@ class CodegenContext { splitExpressions(expressions = initCodes, funcName = "init", arguments = Nil) } + /** + * Return true if a given variable has been described as a global variable + */ + def isDeclaredMutableState(varName: String): Boolean = { --- End diff -- Of course, this PR uses the method only in `assert`.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 > Probably, we'd better to move this discussion to the jira? I'll summarize it and also post the design doc link to the jira.
[GitHub] spark pull request #19974: [SPARK-22779][sql] Resolve default values for fal...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19974
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19974 Thanks! Merged to master.
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19974 LGTM
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 > can we have a google doc so that we can leave comments inline? thanks! Sure. The google doc is at https://docs.google.com/document/d/1By_V-A2sxCWbP7dZ5EzHIuMSe8K0fQL9lqovGWXnsfs/edit?usp=sharing
[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14151 **[Test build #84896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84896/testReport)** for PR 14151 at commit [`021039b`](https://github.com/apache/spark/commit/021039bd1382392282faaec1e1f5c0d39e650a93).
[GitHub] spark pull request #19925: [SPARK-22732] Add Structured Streaming APIs to Da...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19925
[GitHub] spark pull request #19974: [SPARK-22779][sql] Resolve default values for fal...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19974#discussion_r156863889 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala --- @@ -280,4 +280,34 @@ class SQLConfSuite extends QueryTest with SharedSQLContext { spark.sessionState.conf.clear() } + + test("SPARK-22779: correctly compute default value for fallback configs") { +val fallback = SQLConf.buildConf("spark.sql.__test__.spark_22779") + .fallbackConf(SQLConf.PARQUET_COMPRESSION) --- End diff -- In SQLConf, all our confs are `TypedConfigBuilder`, instead of `ConfigBuilder`. The type-safe ConfigBuilder is unable to call `fallbackConf`, right?
[GitHub] spark issue #19925: [SPARK-22732] Add Structured Streaming APIs to DataSourc...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19925 Thanks! Merging to master to unblock #19926. If there are more comments, we can address them in #19926.
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156863124 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -217,6 +217,18 @@ class CodegenContext { splitExpressions(expressions = initCodes, funcName = "init", arguments = Nil) } + /** + * Return true if a given variable has been described as a global variable + */ + def isDeclaredMutableState(varName: String): Boolean = { +val j = varName.indexOf("[") --- End diff -- we don't need to deal with `[]` here; using them in a parameter name will fail to compile.
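For context on the `indexOf("[")` being discussed: the check normalizes a name like `mutableStateArray[3]` to its base identifier before looking it up among declared mutable states. A minimal sketch of that idea (simplified, in Java rather than the Scala under review; the class and set-based lookup are illustrative, not Spark's code):

```java
import java.util.Set;

// Illustrative version of the discussed check: strip an array subscript,
// if any, then test membership in the set of declared global names.
// Per the review comment, a plain parameter name can never contain "[]"
// (it would not compile), so stripping only matters when an array-element
// reference is passed through as an argument string.
final class GlobalVarCheck {
    static boolean isDeclaredMutableState(String varName, Set<String> declaredNames) {
        int j = varName.indexOf('[');
        String base = (j < 0) ? varName : varName.substring(0, j);
        return declaredNames.contains(base);
    }
}
```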
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19974 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84890/ Test PASSed.
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19974 **[Test build #84890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84890/testReport)** for PR 19974 at commit [`4402fc7`](https://github.com/apache/spark/commit/4402fc7ae33cf3bd1d5c2c8f2dcf380a90615f7c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19974 Merged build finished. Test PASSed.
[GitHub] spark pull request #19865: [SPARK-22668][SQL] Ensure no global variables in ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19865#discussion_r156862608
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `isDeclaredMutableState` hunk as quoted above)
+  def isDeclaredMutableState(varName: String): Boolean = {
--- End diff --
let's only enable this check in the test environment, in case it has bugs and breaks production jobs.
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19970 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84891/ Test FAILed.
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19970 Merged build finished. Test FAILed.
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19970 **[Test build #84891 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84891/testReport)** for PR 19970 at commit [`c38f58e`](https://github.com/apache/spark/commit/c38f58e80ef5e8c4c3b732695ea94681f22d85b4).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19971: [SPARK-22774][SQL][Test] Add compilation check in...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19971#discussion_r156861190
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala ---
@@ -348,13 +351,38 @@ class TPCDSQuerySuite extends QueryTest with SharedSQLContext with BeforeAndAfte
     "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
     "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99")

+  private def checkGeneratedCode(plan: SparkPlan): Unit = {
+    val codegenSubtrees = new collection.mutable.HashSet[WholeStageCodegenExec]()
+    plan foreach {
+      case s: WholeStageCodegenExec =>
+        codegenSubtrees += s
+      case s => s
+    }
+    codegenSubtrees.toSeq.map { subtree =>
+      val code = subtree.doCodeGen()._2
+      try {
+        // Just check the generated code can be properly compiled
+        CodeGenerator.compile(code)
+      } catch {
+        case e: Exception =>
+          logError(s"failed to compile: $e", e)
+          val msg =
+            s"Subtree:\n$subtree\n" +
+            s"Generated code:\n${CodeFormatter.format(code)}\n"
+          logDebug(msg)
--- End diff --
it's more useful to include the error message in the exception, e.g. `throw new Exception(msg, e)`
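The reviewer's suggestion can be sketched in plain Java (class and method names here are hypothetical stand-ins, not Spark's API): rather than only logging the failing subtree at debug level, attach the subtree and generated code to a rethrown exception so the test failure itself shows what failed to compile.

```java
// Hedged sketch of "throw new Exception(msg, e)": rethrow with full context
// instead of logDebug. `compile` is a stand-in for CodeGenerator.compile.
class CompileCheck {
    // Stand-in compiler: rejects any "generated code" containing BAD.
    static void compile(String code) {
        if (code.contains("BAD")) {
            throw new IllegalStateException("failed to compile");
        }
    }

    static void checkGeneratedCode(String subtree, String code) {
        try {
            compile(code);
        } catch (Exception e) {
            String msg = "Subtree:\n" + subtree + "\nGenerated code:\n" + code + "\n";
            // Per the review comment: include the context in the exception,
            // preserving the original failure as the cause.
            throw new RuntimeException(msg, e);
        }
    }
}
```

The original cause is kept via the exception's `cause` chain, so no stack-trace information is lost by wrapping.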
[GitHub] spark pull request #19971: [SPARK-22774][SQL][Test] Add compilation check in...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19971#discussion_r156860978
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala --- (same `checkGeneratedCode` hunk as quoted above)
+    codegenSubtrees.toSeq.map { subtree =>
--- End diff --
nit: `map` -> `foreach`
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19594#discussion_r156861642
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
@@ -191,8 +191,16 @@ case class JoinEstimation(join: Join) extends Logging {
       val rInterval = ValueInterval(rightKeyStat.min, rightKeyStat.max, rightKey.dataType)
       if (ValueInterval.isIntersected(lInterval, rInterval)) {
         val (newMin, newMax) = ValueInterval.intersect(lInterval, rInterval, leftKey.dataType)
-        val (card, joinStat) = computeByNdv(leftKey, rightKey, newMin, newMax)
-        keyStatsAfterJoin += (leftKey -> joinStat, rightKey -> joinStat)
+        val (card, joinStat) = (leftKeyStat.histogram, rightKeyStat.histogram) match {
+          case (Some(l: Histogram), Some(r: Histogram)) =>
+            computeByEquiHeightHistogram(leftKey, rightKey, l, r, newMin, newMax)
+          case _ =>
+            computeByNdv(leftKey, rightKey, newMin, newMax)
+        }
+        keyStatsAfterJoin += (
+          leftKey -> joinStat.copy(histogram = leftKeyStat.histogram),
+          rightKey -> joinStat.copy(histogram = rightKeyStat.histogram)
--- End diff --
Actually, keeping it unchanged is more memory efficient. We just pass around pointers, but updating the histogram means creating a new one. Let's keep it, and add some comments to explain it.
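The memory argument above can be illustrated with minimal stand-in classes (these are NOT Spark's `ColumnStat`/`Histogram`, just hypothetical sketches): copying the per-key statistics while keeping the existing histogram only copies a reference, whereas "updating" the histogram would force allocating a new bin array.

```java
// Minimal stand-ins illustrating pointer reuse when copying statistics.
class Histogram {
    final double[] binHeights;
    Histogram(double[] binHeights) { this.binHeights = binHeights; }
}

class KeyStat {
    final long distinctCount;
    final Histogram histogram;
    KeyStat(long distinctCount, Histogram histogram) {
        this.distinctCount = distinctCount;
        this.histogram = histogram;
    }
    // Analog of joinStat.copy(histogram = leftKeyStat.histogram):
    // new counts after the join, but the SAME histogram object.
    KeyStat withDistinctCount(long newCount) {
        return new KeyStat(newCount, this.histogram);
    }
}
```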
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156860058
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -179,11 +209,64 @@ class CodegenContext {
    * the list of default imports available.
    * Also, generic type arguments are accepted but ignored.
    * @param variableName Name of the field.
-   * @param initCode The statement(s) to put into the init() method to initialize this field.
-   *                 If left blank, the field will be default-initialized.
+   * @param initFunc Function includes statement(s) to put into the init() method to initialize
+   *                 this field. An argument is the name of the mutable state variable.
+   *                 If left blank, the field will be default-initialized.
+   * @param inline whether the declaration and initialization code may be inlined rather than
+   *               compacted.
+   * @param useFreshName If false and inline is true, the name is not changed
+   * @return the name of the mutable state variable, which is either the original name if the
+   *         variable is inlined to the outer class, or an array access if the variable is to be
+   *         stored in an array of variables of the same type and initialization.
+   *         There are two use cases. One is to use the original name for a global variable instead
+   *         of a fresh name. The second is to use the original initialization statement since it is
+   *         complex (e.g. allocates a multi-dimensional array, or the object constructor has variables).
+   *         Primitive type variables will be inlined into the outer class when the total number of
+   *         mutable variables is less than `CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD`;
+   *         the max size of an array for compaction is given by
+   *         `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT`.
    */
-  def addMutableState(javaType: String, variableName: String, initCode: String = ""): Unit = {
-    mutableStates += ((javaType, variableName, initCode))
+  def addMutableState(
+      javaType: String,
+      variableName: String,
+      initFunc: String => String = _ => "",
+      inline: Boolean = false,
+      useFreshName: Boolean = true): String = {
+    val varName = if (useFreshName) freshName(variableName) else variableName
+
+    if (inline ||
+        // want to put a primitive type variable at outerClass for performance
+        isPrimitiveType(javaType) &&
+          (mutableStates.length < CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD) ||
+        // type is multi-dimensional array
--- End diff --
We are changing a declared type for compaction (i.e. adding one dimension at the front) at `declareMutableStates`. The logic at `declareMutableStates` assumes that a given variable is a scalar or a one-dimensional array. We could support multi-dimensional arrays, but it requires slightly more complicated string operations. Should we support multi-dimensional arrays or keep it simple? @cloud-fan WDYT?
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19813 Probably, we'd better move this discussion to the JIRA?
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19813 can we have a google doc so that we can leave comments inline? thanks!
[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19971 **[Test build #84894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84894/testReport)** for PR 19971 at commit [`1480ae3`](https://github.com/apache/spark/commit/1480ae31ea28ddf0d5ea3beffb99ae9995932be2).
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84895/testReport)** for PR 19811 at commit [`49119a9`](https://github.com/apache/spark/commit/49119a9de53cd7bb2bf91df6b691358da22e1b00).
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19811 Just resolved conflicts. Next commit will address review comments.
[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...
Github user gczsjdy commented on the issue: https://github.com/apache/spark/pull/19862 cc @cloud-fan @hvanhovell @viirya
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 I wrote a design doc at https://paper.dropbox.com/doc/Split-deeply-nested-expressions-under-wholestage-codegen-WXkQ9iIlN3zkGdn8MHgiW The design is based on what is done in this PR. Please give me feedback if you have time to go through it. Thank you. cc @cloud-fan @kiszk @mgaido91 @maropu @gatorsmile
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/19955 @ericvandenbergfb - In the testing section, let's put the benchmark numbers for large shuffle heavy jobs we ran internally.
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19955 and also what's the behavior if the memory is used up but `spark.shuffle.async.num.sorter` is not hit yet?
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19955 shall we do it for all the external sorters?
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19955 cc @JoshRosen @jiangxb1987
[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17619 Can one of the admins verify this patch?
[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19020
[GitHub] spark issue #19020: [SPARK-3181] [ML] Implement huber loss for LinearRegress...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/19020 Merged into master, thanks for all your reviews.
[GitHub] spark pull request #19020: [SPARK-3181] [ML] Implement huber loss for Linear...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19020#discussion_r156856635
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -480,10 +640,14 @@ object LinearRegression extends DefaultParamsReadable[LinearRegression] {
 class LinearRegressionModel private[ml] (
     @Since("1.4.0") override val uid: String,
     @Since("2.0.0") val coefficients: Vector,
-    @Since("1.3.0") val intercept: Double)
+    @Since("1.3.0") val intercept: Double,
+    @Since("2.3.0") val scale: Double)
   extends RegressionModel[Vector, LinearRegressionModel]
   with LinearRegressionParams with MLWritable {

+  def this(uid: String, coefficients: Vector, intercept: Double) =
+    this(uid, coefficients, intercept, 1.0)
--- End diff --
```scale``` denotes that ```|y - X'w - c|``` is scaled down; I think it makes sense to set it to 1.0 for least squares regression.
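For context on why a `scale` of 1.0 is harmless for least squares: the Huber loss that this PR adds is quadratic near zero and linear in the tails. The sketch below shows the plain Huber loss for a single residual `r` with threshold `delta` — it deliberately omits the joint scale (σ) estimation the PR actually performs, and the names are illustrative, not Spark's API.

```java
// Plain Huber loss sketch: quadratic region for |r| <= delta,
// linear region beyond. NOT Spark's implementation (which also
// estimates a scale parameter sigma jointly with the coefficients).
class HuberLoss {
    static double huber(double r, double delta) {
        double a = Math.abs(r);
        if (a <= delta) {
            return 0.5 * r * r;               // quadratic near zero
        } else {
            return delta * (a - 0.5 * delta); // linear in the tails
        }
    }
}
```

Note the two branches agree at `|r| == delta` (both give `delta^2 / 2`), which is what makes the loss continuously differentiable.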
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156855675
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+        javaType.contains("[][]")) {
+      val initCode = initFunc(varName)
+      mutableStates += ((javaType, varName, initCode))
+      varName
+    } else {
+      // mutableStateArray has not been declared yet for the given type and name. Create a new name
+      // for the array, The mutableStateArray for the given type and name has been declared,
+      // update the max index of the array. Then, add an entry to keep track of current array name
+      // for type and init code is stored for code generation. Finally, return an array element
+      val (arrayName, newIdx) = {
+        val compactArrayName = "mutableStateArray"
--- End diff --
how about
```
class MutableStateArrays {
  val arrayNames = mutable.ListBuffer.empty[String]
  createNewArray()

  private[this] var currentIndex = 0

  def getNextSlot(): String = {
    if (currentIndex < CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT) {
      val res = s"${arrayNames.last}[$currentIndex]"
      currentIndex += 1
      res
    } else {
      createNewArray()
      currentIndex = 1
      s"${arrayNames.last}[0]"
    }
  }

  private def createNewArray() = arrayNames.append(freshName("mutableStateArray"))
}

// type name -> MutableStateArrays
val mutableStateArrayMap: Map[String, MutableStateArrays] = ...
```
and here the logic can be very simple
```
val arrays = mutableStateArrayMap.getOrElseUpdate(javaType, new MutableStateArrays)
arrays.getNextSlot()
```
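The slot-allocation scheme the reviewer proposes can be sketched as runnable Java (the size limit and the `mutableStateArrayN` naming are stand-ins for `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT` and `freshName`, not Spark's actual values): slots are handed out from the current backing array until the limit is reached, then a fresh array name is started.

```java
import java.util.ArrayList;
import java.util.List;

// Java analog of the proposed MutableStateArrays helper. Each call to
// getNextSlot() returns an array-access expression like "mutableStateArray0[2]";
// when the current array is full, a new array name is created and indexing
// restarts at 0.
class MutableStateArraysSketch {
    static final int SIZE_LIMIT = 3; // stand-in for MUTABLESTATEARRAY_SIZE_LIMIT
    final List<String> arrayNames = new ArrayList<>();
    private int currentIndex = 0;
    private int freshId = 0; // stand-in for freshName's uniquifier

    MutableStateArraysSketch() { createNewArray(); }

    String getNextSlot() {
        if (currentIndex < SIZE_LIMIT) {
            String res = last() + "[" + currentIndex + "]";
            currentIndex++;
            return res;
        } else {
            createNewArray();  // current array is full: start a new one
            currentIndex = 1;
            return last() + "[0]";
        }
    }

    private String last() { return arrayNames.get(arrayNames.size() - 1); }
    private void createNewArray() { arrayNames.add("mutableStateArray" + freshId++); }
}
```

Keeping one such allocator per Java type (the `type name -> MutableStateArrays` map above) is what lets `addMutableState` reduce to a `getOrElseUpdate` plus `getNextSlot()`.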
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156853851
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -205,18 +287,35 @@ class CodegenContext {
   def declareMutableStates(): String = {
     // It's possible that we add same mutable state twice, e.g. the `mergeExpressions` in
     // `TypedAggregateExpression`, we should call `distinct` here to remove the duplicated ones.
-    mutableStates.distinct.map { case (javaType, variableName, _) =>
+    val inlinedStates = mutableStates.distinct.map { case (javaType, variableName, _) =>
       s"private $javaType $variableName;"
-    }.mkString("\n")
+    }
+
+    val arrayStates = mutableStateArrayIdx.keys.map { case (javaType, arrayName) =>
+      val length = mutableStateArrayIdx((javaType, arrayName)) + 1
+      if (javaType.matches("^.*\\[\\]$")) {
--- End diff --
shall we just check `javaType.contains("[]")`?
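The two checks agree on ordinary array types and differ only when `[]` appears somewhere other than the end of the type string — for instance inside a generic type argument, which the surrounding doc says is "accepted but ignored". A quick illustration in plain Java (the type strings are hypothetical examples):

```java
// Compares the anchored regex "^.*\[\]$" (type ENDS with []) against the
// simpler contains("[]") check suggested in the review.
class TypeCheck {
    static boolean endsWithBrackets(String javaType) {
        return javaType.matches("^.*\\[\\]$");
    }

    static boolean containsBrackets(String javaType) {
        return javaType.contains("[]");
    }
}
```

For `"int[]"` and `"byte[][]"` both checks return true; for a generic like `"scala.Tuple2<int[], String>"` only `contains` matches, since the string does not end with `[]`. The `contains` form is also cheaper, as it avoids regex compilation on every call.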
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19975 **[Test build #84893 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84893/testReport)** for PR 19975 at commit [`1ab78ed`](https://github.com/apache/spark/commit/1ab78ed24c81903ae33d48f34d2e89f0a4eaa24e).
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156853203
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+        // type is multi-dimensional array
--- End diff --
Good catch. Now we can support it; this limitation came from creating a loop for initialization.
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156852077
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+    } else {
+      // mutableStateArray has not been declared yet for the given type and name. Create a new name
--- End diff --
nit: `If ...`
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156851936
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+        // type is multi-dimensional array
--- End diff --
why can't we support a `multi-dimensional array`?
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19811#discussion_r156851771
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- (same `addMutableState` hunk as quoted above)
+   * @param initFunc Function includes statement(s) to put into the init() method to initialize
+   *                 this field. An argument is the name of the mutable state variable.
--- End diff --
nit: `The Argument...`
[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19975 **[Test build #84892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84892/testReport)** for PR 19975 at commit [`66475b7`](https://github.com/apache/spark/commit/66475b737e2db96cb2bf491cf22d133bdcfe9b02).
[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19811#discussion_r156851696

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -179,11 +209,64 @@ class CodegenContext {
        * the list of default imports available.
        * Also, generic type arguments are accepted but ignored.
        * @param variableName Name of the field.
    -   * @param initCode The statement(s) to put into the init() method to initialize this field.
    -   *                 If left blank, the field will be default-initialized.
    +   * @param initFunc Function includes statement(s) to put into the init() method to initialize
    +   *this field. An argument is the name of the mutable state variable.

    --- End diff --

    nit: indentation
[GitHub] spark pull request #19975: [SPARK-22781][SS] Support creating streaming data...
GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/19975

    [SPARK-22781][SS] Support creating streaming dataset with ORC files

    ## What changes were proposed in this pull request?

    This PR supports creating streaming dataset with ORC file format.

    ## How was this patch tested?

    Pass the newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-22781

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19975.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19975

commit 66475b737e2db96cb2bf491cf22d133bdcfe9b02
Author: Dongjoon Hyun
Date: 2017-12-14T04:14:53Z

    [SPARK-22781][SS] Support creating streaming dataset with ORC file format
[GitHub] spark issue #19970: [SPARK-22775][SQL] move dictionary related APIs from Col...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19970 **[Test build #84891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84891/testReport)** for PR 19970 at commit [`c38f58e`](https://github.com/apache/spark/commit/c38f58e80ef5e8c4c3b732695ea94681f22d85b4).
[GitHub] spark issue #19972: [SPARK-22778][Kubernetes] Added the missing service meta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19972 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84888/ Test PASSed.
[GitHub] spark issue #19972: [SPARK-22778][Kubernetes] Added the missing service meta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19972 Merged build finished. Test PASSed.
[GitHub] spark issue #19972: [SPARK-22778][Kubernetes] Added the missing service meta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19972 **[Test build #84888 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84888/testReport)** for PR 19972 at commit [`c91c9a6`](https://github.com/apache/spark/commit/c91c9a606bb00ea57c50fe27081273e0d5a8bcee).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19813 I'm not sure; it depends on how many places we're already doing it and what the drawback would be if we forbid it. Let's have a design doc and gather more feedback. Thanks for your understanding! Both PRs are reverted.
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19864#discussion_r156847927

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -80,6 +80,14 @@ class CacheManager extends Logging {
         cachedData.isEmpty
       }
     
    +  private def extractStatsOfPlanForCache(plan: LogicalPlan): Option[Statistics] = {
    +    if (plan.stats.rowCount.isDefined) {

    --- End diff --

    The expected value should be `rowCount * avgRowSize`. Without CBO, I think the file size is the best we can get, although it may not be correct. That is to say, without CBO, a parquet relation may have an underestimated size and cause OOM; users need to turn on CBO to fix it. So the same thing should happen in table cache. We could fix this by defining `sizeInBytes` as `file size * some factor`, but that belongs in another PR.
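The sizing rule cloud-fan describes can be sketched as follows. This is an illustrative model, not the actual `CacheManager` code — `PlanStats` and `expectedCacheSize` are hypothetical names: with CBO statistics available, the expected value is `rowCount * avgRowSize`; otherwise the on-disk file size is the best available fallback.

```scala
// Sketch of the cache-size estimate discussed above (illustrative only).
case class PlanStats(
    fileSizeInBytes: BigInt,
    rowCount: Option[BigInt],
    avgRowSize: Option[BigInt])

def expectedCacheSize(stats: PlanStats): BigInt =
  (stats.rowCount, stats.avgRowSize) match {
    // With CBO: the expected value is rowCount * avgRowSize.
    case (Some(rows), Some(avg)) => rows * avg
    // Without CBO: file size is the best we can get, though it may
    // underestimate the in-memory footprint (e.g. compressed parquet).
    case _ => stats.fileSizeInBytes
  }
```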
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19594#discussion_r156847785

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala ---
    @@ -191,8 +191,16 @@ case class JoinEstimation(join: Join) extends Logging {
               val rInterval = ValueInterval(rightKeyStat.min, rightKeyStat.max, rightKey.dataType)
               if (ValueInterval.isIntersected(lInterval, rInterval)) {
                 val (newMin, newMax) = ValueInterval.intersect(lInterval, rInterval, leftKey.dataType)
    -            val (card, joinStat) = computeByNdv(leftKey, rightKey, newMin, newMax)
    -            keyStatsAfterJoin += (leftKey -> joinStat, rightKey -> joinStat)
    +            val (card, joinStat) = (leftKeyStat.histogram, rightKeyStat.histogram) match {
    +              case (Some(l: Histogram), Some(r: Histogram)) =>
    +                computeByEquiHeightHistogram(leftKey, rightKey, l, r, newMin, newMax)
    +              case _ =>
    +                computeByNdv(leftKey, rightKey, newMin, newMax)
    +            }
    +            keyStatsAfterJoin += (
    +              leftKey -> joinStat.copy(histogram = leftKeyStat.histogram),
    +              rightKey -> joinStat.copy(histogram = rightKeyStat.histogram)

    --- End diff --

    Currently we don't update histograms, since min/max can help us to know which bins are valid. It doesn't affect correctness. But updating histograms helps to reduce memory usage for histogram propagation. We can do this in both filter and join estimation in follow-up PRs.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 @cloud-fan Do you think we should still allow something like `a + 1` as the output of codegen?
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19594#discussion_r156847046

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +115,183 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns overlapped ranges between two histograms, in the given value range [newMin, newMax].
    +   */
    +  def getOverlappedRanges(
    +      leftHistogram: Histogram,
    +      rightHistogram: Histogram,
    +      newMin: Double,
    +      newMax: Double): Seq[OverlappedRange] = {
    +    val overlappedRanges = new ArrayBuffer[OverlappedRange]()
    +    // Only bins whose range intersect [newMin, newMax] have join possibility.
    +    val leftBins = leftHistogram.bins
    +      .filter(b => b.lo <= newMax && b.hi >= newMin)
    +    val rightBins = rightHistogram.bins
    +      .filter(b => b.lo <= newMax && b.hi >= newMin)
    +
    +    leftBins.foreach { lb =>
    +      rightBins.foreach { rb =>
    +        val (left, leftHeight) = trimBin(lb, leftHistogram.height, newMin, newMax)
    +        val (right, rightHeight) = trimBin(rb, rightHistogram.height, newMin, newMax)
    +        // Only collect overlapped ranges.
    +        if (left.lo <= right.hi && left.hi >= right.lo) {
    +          // Collect overlapped ranges.
    +          val range = if (left.lo == left.hi) {
    +            // Case1: the left bin has only one value
    +            OverlappedRange(
    +              lo = left.lo,
    +              hi = left.lo,
    +              leftNdv = 1,
    +              rightNdv = 1,
    +              leftNumRows = leftHeight,
    +              rightNumRows = rightHeight / right.ndv
    +            )
    +          } else if (right.lo == right.hi) {
    +            // Case2: the right bin has only one value
    +            OverlappedRange(
    +              lo = right.lo,
    +              hi = right.lo,
    +              leftNdv = 1,
    +              rightNdv = 1,
    +              leftNumRows = leftHeight / left.ndv,
    +              rightNumRows = rightHeight
    +            )
    +          } else if (right.lo >= left.lo && right.hi >= left.hi) {
    +            // Case3: the left bin is "smaller" than the right bin
    +            //      left.lo     right.lo     left.hi     right.hi
    +            // --------+------------+------------+------------+------->
    +            val leftRatio = (left.hi - right.lo) / (left.hi - left.lo)
    +            val rightRatio = (left.hi - right.lo) / (right.hi - right.lo)
    +            if (leftRatio == 0) {
    +              // The overlapped range has only one value.
    +              OverlappedRange(
    +                lo = right.lo,
    +                hi = right.lo,
    +                leftNdv = 1,
    +                rightNdv = 1,
    +                leftNumRows = leftHeight / left.ndv,
    +                rightNumRows = rightHeight / right.ndv
    +              )
    +            } else {
    +              OverlappedRange(
    +                lo = right.lo,
    +                hi = left.hi,
    +                leftNdv = left.ndv * leftRatio,
    +                rightNdv = right.ndv * rightRatio,
    +                leftNumRows = leftHeight * leftRatio,
    +                rightNumRows = rightHeight * rightRatio
    +              )
    +            }
    +          } else if (right.lo <= left.lo && right.hi <= left.hi) {
    +            // Case4: the left bin is "larger" than the right bin
    +            //      right.lo     left.lo     right.hi     left.hi
    +            // --------+------------+------------+------------+------->
    +            val leftRatio = (right.hi - left.lo) / (left.hi - left.lo)
    +            val rightRatio = (right.hi - left.lo) / (right.hi - right.lo)
    +            if (leftRatio == 0) {
    +              // The overlapped range has only one value.
    +              OverlappedRange(
    +                lo = right.hi,
    +                hi = right.hi,
    +                leftNdv = 1,
    +                rightNdv = 1,
    +                leftNumRows = leftHeight / left.ndv,
    +                rightNumRows = rightHeight / right.ndv
    +              )
    +            } else {
    +              OverlappedRange(
    +                lo = left.lo,
    +                hi = right.hi,
    +                leftNdv = left.ndv * leftRatio,
    +                rightNdv = right.ndv * rightRatio,
    +                leftNumRows = leftHeight * leftRatio,
    +                rightNumRows = rightHeight * rightRatio
    +              )
    +            }
    +          } else if (right.lo >= left.lo && right.hi <= left.hi) {
    +            // Case5: the left bin contains the right bin
    +            //      left.lo     right.lo
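For readers following the five cases in the quoted diff: the core operation is intersecting two histogram bins and scaling each side's ndv and row count by the fraction of its bin covered by the overlap. A simplified, self-contained sketch of that fraction computation — `Bin`, `coveredFraction`, and `overlappedNdv` are illustrative names, not the actual Spark code:

```scala
// Simplified model of a histogram bin: value range [lo, hi] with ndv
// distinct values.
case class Bin(lo: Double, hi: Double, ndv: Double)

// Fraction of `b` covered by the overlap with [lo, hi]; a single-value bin
// counts as fully covered (the degenerate Case1/Case2 in the diff).
def coveredFraction(b: Bin, lo: Double, hi: Double): Double = {
  if (b.hi == b.lo) 1.0
  else math.max(0.0, math.min(b.hi, hi) - math.max(b.lo, lo)) / (b.hi - b.lo)
}

// Scaled ndv contribution of bin `b` inside its overlap with `other`,
// mirroring `left.ndv * leftRatio` in the diff.
def overlappedNdv(b: Bin, other: Bin): Double =
  b.ndv * coveredFraction(b, other.lo, other.hi)
```

The same fraction scales the bin's row count (`height * ratio` in the diff); the special `ratio == 0` branches handle the boundary case where the overlap collapses to a single value.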
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19594#discussion_r156846872

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
    @@ -114,4 +115,183 @@ object EstimationUtils {
         }
       }
     
    +  /**
    +   * Returns overlapped ranges between two histograms, in the given value range [newMin, newMax].
    +   */
    +  def getOverlappedRanges(
    +      leftHistogram: Histogram,
    +      rightHistogram: Histogram,
    +      newMin: Double,
    +      newMax: Double): Seq[OverlappedRange] = {

    --- End diff --

    yea I think `upperBound/lowerBound` is better.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 Also needs to revert #19969 which is based on this.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 @cloud-fan Ok. Thanks.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19813 After some more thought, this PR makes a new contract that Spark didn't promise before: `Expression.genCode` must output something that can be used as a parameter name or literal. I do remember that in some places we just output statements like `a + 1` for codegen, but I could be wrong. At least we need to check all the places and document this new contract before merging this PR. Another solution is to not make this contract. From a quick look this seems hard to do, because by that point the code (method body) is already generated and we don't know how to replace a statement like `a + 1` with the generated parameter name inside the method body. We may need to do this fix earlier in the codegen procedure. I'm going to revert it; let's have a proper design doc and resubmit this. Sorry for the inconvenience!
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19955 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84889/ Test FAILed.
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19955 Merged build finished. Test FAILed.
[GitHub] spark issue #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19955 **[Test build #84889 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84889/testReport)** for PR 19955 at commit [`a3ea93c`](https://github.com/apache/spark/commit/a3ea93cd829debc13de6aac3fc94e49c3a03dabc).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19970: [SPARK-22775][SQL] move dictionary related APIs f...
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19970#discussion_r156843445

    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
    @@ -145,39 +135,39 @@
       public abstract double[] getDoubles(int rowId, int count);
     
       /**
    -   * Returns the length of the array at rowid.
    +   * Returns the length of the array for rowId.
        */
       public abstract int getArrayLength(int rowId);
     
       /**
    -   * Returns the offset of the array at rowid.
    +   * Returns the offset of the array for rowId.
        */
       public abstract int getArrayOffset(int rowId);
     
       /**
    -   * Returns a utility object to get structs.
    +   * Returns the struct for rowId.
        */
       public final ColumnarRow getStruct(int rowId) { return new ColumnarRow(this, rowId); }
     
       /**
    -   * Returns a utility object to get structs.
    -   * provided to keep API compatibility with InternalRow for code generation
    +   * A special version of {@link #getShort(int)}, which is only used as an adapter for Spark codegen

    --- End diff --

    `getStruct(int)` instead of `getShort(int)`?
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19974 **[Test build #84890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84890/testReport)** for PR 19974 at commit [`4402fc7`](https://github.com/apache/spark/commit/4402fc7ae33cf3bd1d5c2c8f2dcf380a90615f7c).
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19751 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84885/ Test PASSed.
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19751 Merged build finished. Test PASSed.
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19751 **[Test build #84885 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84885/testReport)** for PR 19751 at commit [`b02ea2c`](https://github.com/apache/spark/commit/b02ea2c4a96e27918c02f75c19d6639daae7cb2d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19974: [SPARK-22779][sql] Resolve default values for fallback c...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19974 @rxin I believe this is what you were trying to do?
[GitHub] spark pull request #19974: [SPARK-22779][sql] Resolve default values for fal...
GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/19974

    [SPARK-22779][sql] Resolve default values for fallback configs.

    SQLConf allows some callers to define a custom default value for configs, and that complicates a little bit the handling of fallback config entries, since most of the default value resolution is hidden by the config code.

    This change peeks into the internals of these fallback configs to figure out the correct default value, and also returns the current human-readable default when showing the default value (e.g. through "set -v").

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-22779

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19974.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19974

commit 4402fc7ae33cf3bd1d5c2c8f2dcf380a90615f7c
Author: Marcelo Vanzin
Date: 2017-12-14T02:49:28Z

    [SPARK-22779][sql] Resolve default values for fallback configs.

    SQLConf allows some callers to define a custom default value for configs, and that complicates a little bit the handling of fallback config entries, since most of the default value resolution is hidden by the config code.

    This change peeks into the internals of these fallback configs to figure out the correct default value, and also returns the current human-readable default when showing the default value (e.g. through "set -v").
[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19971 Merged build finished. Test PASSed.
[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19971 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84886/ Test PASSed.