[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2017-12-14 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/19848
  
i dunno what the requirements are -- I was hoping you would know the hadoop 
committer semantics better than me!  I suppose a UUID is really the only way to get 
something globally unique, as you could even have multiple independent spark 
contexts.  I have seen a committer creating a temp directory based on the ID, 
so you could end up with a collision, with both contexts writing to the same dir.

anyway, I'm willing to set this aside as a rare case; the fix here is 
still a huge improvement.
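
(For illustration, a minimal sketch of the idea discussed above, with all names hypothetical rather than taken from the PR: keying a committer's temp directory on a random UUID instead of a reusable application/job ID avoids collisions between independent Spark contexts.)

```scala
import java.util.UUID

// Hypothetical helper, not from the PR: build a staging path that stays unique
// even when multiple independent Spark contexts write under the same base dir.
object StagingDirs {
  def uniqueStagingDir(base: String): String = {
    // A random UUID is globally unique for practical purposes, unlike an
    // application or job ID that two contexts could both happen to use.
    s"$base/.spark-staging-${UUID.randomUUID()}"
  }
}
```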


---




[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19971
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19971
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84913/
Test PASSed.


---




[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19971
  
**[Test build #84913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84913/testReport)** for PR 19971 at commit [`dacb790`](https://github.com/apache/spark/commit/dacb79039c2abb09a082b817d7c52f2e0902de56).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19681: [SPARK-20652][sql] Store SQL UI data in the new a...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19681#discussion_r156987695
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusStore.scala ---
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.ui
+
+import java.lang.{Long => JLong}
+import java.util.Date
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.ArrayBuffer
+
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize
+
+import org.apache.spark.{JobExecutionStatus, SparkConf}
+import org.apache.spark.scheduler.SparkListener
+import org.apache.spark.status.AppStatusPlugin
+import org.apache.spark.status.KVUtils.KVIndexParam
+import org.apache.spark.ui.SparkUI
+import org.apache.spark.util.Utils
+import org.apache.spark.util.kvstore.KVStore
+
+/**
+ * Provides a view of a KVStore with methods that make it easy to query SQL-specific state.
+ * There's no state kept in this class, so it's ok to have multiple instances of it in an
+ * application.
+ */
+private[sql] class SQLAppStatusStore(
+    store: KVStore,
+    listener: Option[SQLAppStatusListener] = None) {
+
+  def executionsList(): Seq[SQLExecutionUIData] = {
+    store.view(classOf[SQLExecutionUIData]).asScala.toSeq
+  }
+
+  def execution(executionId: Long): Option[SQLExecutionUIData] = {
+    try {
+      Some(store.read(classOf[SQLExecutionUIData], executionId))
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  def executionsCount(): Long = {
+    store.count(classOf[SQLExecutionUIData])
+  }
+
+  def executionMetrics(executionId: Long): Map[Long, String] = {
+    def metricsFromStore(): Option[Map[Long, String]] = {
+      val exec = store.read(classOf[SQLExecutionUIData], executionId)
+      Option(exec.metricValues)
+    }
+
+    metricsFromStore()
+      .orElse(listener.flatMap(_.liveExecutionMetrics(executionId)))
+      // Try a second time in case the execution finished while this method is trying to
+      // get the metrics.
+      .orElse(metricsFromStore())
+      .getOrElse(Map())
+  }
+
+  def planGraph(executionId: Long): SparkPlanGraph = {
+    store.read(classOf[SparkPlanGraphWrapper], executionId).toSparkPlanGraph()
+  }
+
+}
+
+/**
+ * An AppStatusPlugin for handling the SQL UI and listeners.
+ */
+private[sql] class SQLAppStatusPlugin extends AppStatusPlugin {
+
+  override def setupListeners(
+      conf: SparkConf,
+      store: KVStore,
+      addListenerFn: SparkListener => Unit,
+      live: Boolean): Unit = {
+    // For live applications, the listener is installed in [[setupUI]]. This also avoids adding
+    // the listener when the UI is disabled. Force installation during testing, though.
+    if (!live || Utils.isTesting) {
+      val listener = new SQLAppStatusListener(conf, store, live, None)
+      addListenerFn(listener)
+    }
+  }
+
+  override def setupUI(ui: SparkUI): Unit = {
--- End diff --

Do we have a clear rule about when `setupListeners` is called and when 
`setupUI` is called?

Here we register `SQLAppStatusListener` in both `setupListeners` and 
`setupUI`. Will we register it twice?


---




[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19792
  
Will take another look tomorrow. Sorry this keeps getting delayed again 
and again, but I just realised this code path is a little bit tricky ...


---




[GitHub] spark issue #19975: [SPARK-22781][SS] Support creating streaming dataset wit...

2017-12-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19975
  
Hi, @tdas and @zsxwing.
Could you review this PR?


---




[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19792
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19792
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84918/
Test PASSed.


---




[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19792
  
**[Test build #84918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84918/testReport)** for PR 19792 at commit [`6d171dd`](https://github.com/apache/spark/commit/6d171dda179ecdbe95dbc959c961397e08b8b537).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19980: [SPARK-22785][SQL] remove ColumnVector.anyNullsSet

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19980
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84912/
Test PASSed.


---




[GitHub] spark issue #19980: [SPARK-22785][SQL] remove ColumnVector.anyNullsSet

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19980
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19980: [SPARK-22785][SQL] remove ColumnVector.anyNullsSet

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19980
  
**[Test build #84912 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84912/testReport)** for PR 19980 at commit [`c4d9a41`](https://github.com/apache/spark/commit/c4d9a41bae0bdcbb8d336a7fca012ad4c7e1efc1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19976: [SPARK-22660][BUILD] Use position() and limit() to fix a...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19976
  
**[Test build #4011 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4011/testReport)** for PR 19976 at commit [`bcc4c1a`](https://github.com/apache/spark/commit/bcc4c1af60c820877e83592f1bd6b38db498a17f).


---




[GitHub] spark issue #19976: [SPARK-22660][BUILD] Use position() and limit() to fix a...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19976
  
**[Test build #4010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4010/testReport)** for PR 19976 at commit [`bcc4c1a`](https://github.com/apache/spark/commit/bcc4c1af60c820877e83592f1bd6b38db498a17f).
 * This patch **fails PySpark pip packaging tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19977
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84910/
Test PASSed.


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19977
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19977
  
**[Test build #84910 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84910/testReport)** for PR 19977 at commit [`a252bab`](https://github.com/apache/spark/commit/a252babda0035b32d4fcb8d76cafeadc5e2549a3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19811
  
**[Test build #84919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84919/testReport)** for PR 19811 at commit [`a9d40e9`](https://github.com/apache/spark/commit/a9d40e9665bbe5e31a1c3dbd76e675c01730b487).


---




[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19971
  
LGTM


---




[GitHub] spark issue #19980: [SPARK-22785][SQL] remove ColumnVector.anyNullsSet

2017-12-14 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19980
  
LGTM


---




[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19792
  
**[Test build #84918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84918/testReport)** for PR 19792 at commit [`6d171dd`](https://github.com/apache/spark/commit/6d171dda179ecdbe95dbc959c961397e08b8b537).


---




[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19971
  
**[Test build #84917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84917/testReport)** for PR 19971 at commit [`44abd2f`](https://github.com/apache/spark/commit/44abd2f541701a1ff3cc4e10e686f0de56400a98).


---




[GitHub] spark pull request #19792: [SPARK-22566][PYTHON] Better error message for `_...

2017-12-14 Thread gberger
Github user gberger commented on a diff in the pull request:

https://github.com/apache/spark/pull/19792#discussion_r156971910
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1083,7 +1083,11 @@ def _infer_schema(row):
         elif hasattr(row, "_fields"):  # namedtuple
             items = zip(row._fields, tuple(row))
         else:
-            names = ['_%d' % i for i in range(1, len(row) + 1)]
+            if names is None:
+                names = ['_%d' % i for i in range(1, len(row) + 1)]
+            elif len(names) < len(row):
+                names = names[:]
--- End diff --

Yes, I did not want to modify the original list since `.extend` is an 
in-place operation. However, session.py#602 already creates a copy of the list 
passed by the user, so this copying in `_infer_schema` is actually not 
necessary. Removing now.


---




[GitHub] spark pull request #19971: [SPARK-22774][SQL][Test] Add compilation check in...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19971#discussion_r156970863
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala ---
@@ -348,13 +351,41 @@ class TPCDSQuerySuite extends QueryTest with SharedSQLContext with BeforeAndAfte
     "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
     "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99")
 
+  private def checkGeneratedCode(plan: SparkPlan): Unit = {
+    val codegenSubtrees = new collection.mutable.HashSet[WholeStageCodegenExec]()
+    plan foreach {
+      case s: WholeStageCodegenExec =>
+        codegenSubtrees += s
+      case s => s
+    }
+    codegenSubtrees.toSeq.foreach { subtree =>
+      val code = subtree.doCodeGen()._2
+      try {
+        // Just check the generated code can be properly compiled
+        CodeGenerator.compile(code)
+      } catch {
+        case e: Exception =>
+          val msg =
+            s"""
+               |failed to compile:
+               |Subtree:
+               |$subtree
+               |Generated code:
+               |${CodeFormatter.format(code)}
+             """
--- End diff --

Good catch, even though I am using IntelliJ...


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r156970195
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2UtilsSuite.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class DataSourceV2UtilsSuite extends QueryTest with SharedSQLContext {
+
+  private val keyPrefix = "userDefinedDataSource"
+
+  test("method withSessionConfig() should propagate session configs correctly") {
+    // Only match configs with keys start with "spark.datasource.${keyPrefix}".
+    withSQLConf(s"spark.datasource.$keyPrefix.foo.bar" -> "false",
--- End diff --

We don't need this heavy `withConf` stuff, just 
```
val conf = new SQLConf
conf.set(...)
...
```
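
(A sketch of the suggested lighter-weight test, assuming `withSessionConfig(keyPrefix, conf)` returns the extracted k/v pairs as described in the scaladoc quoted elsewhere in this thread; `setConfString` is the plain `SQLConf` setter.)

```scala
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils
import org.apache.spark.sql.internal.SQLConf

// Build a SQLConf directly instead of going through the heavier withSQLConf.
val conf = new SQLConf
conf.setConfString("spark.datasource.userDefinedDataSource.foo.bar", "false")
conf.setConfString("spark.sql.shuffle.partitions", "100") // unrelated key, should be ignored

// Only keys under `spark.datasource.${keyPrefix}.` should survive, with the prefix stripped.
val extracted = DataSourceV2Utils.withSessionConfig("userDefinedDataSource", conf)
assert(extracted == Map("foo.bar" -> "false"))
```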


---




[GitHub] spark issue #19979: [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression testsui...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19979
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84911/
Test PASSed.


---




[GitHub] spark issue #19979: [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression testsui...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19979
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19979: [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression testsui...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19979
  
**[Test build #84911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84911/testReport)** for PR 19979 at commit [`47dccdd`](https://github.com/apache/spark/commit/47dccdd2531f8f84f38c75139d9223c682a089d6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AFTSurvivalRegressionSuite extends MLTest with DefaultReadWriteTest `
  * `class DecisionTreeRegressorSuite extends MLTest with DefaultReadWriteTest `
  * `class GBTRegressorSuite extends MLTest with DefaultReadWriteTest `
  * `class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest `
  * `class IsotonicRegressionSuite extends MLTest with DefaultReadWriteTest `
  * `  case class CheckAnswerRowsByFunc(`


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r156969856
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala ---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+
+private[sql] object DataSourceV2Utils extends Logging {
+
+  /**
+   * Helper method that turns session configs with config keys that start with
+   * `spark.datasource.$keyPrefix` into k/v pairs, the k/v pairs will be used to create data
+   * source options.
+   * A session config `spark.datasource.$keyPrefix.xxx -> yyy` will be transformed into
+   * `xxx -> yyy`.
+   *
+   * @param keyPrefix the data source config key prefix to be matched
+   * @param conf the session conf
+   * @return an immutable map that contains all the extracted and transformed k/v pairs.
+   */
+  def withSessionConfig(
--- End diff --

how about we do more things here?
```
def extractSessionConfigs(ds: DataSourceV2, conf: SQLConf) = ds match {
  case cs: SessionConfigSupport => ...
  case _ => Map.empty
}
```
and the caller side
```
val options = new DataSourceV2Options((extraOptions ++ extractSessionConfigs(ds, conf)).asJava)
```
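
(Fleshing out the suggestion as a sketch: it assumes `SessionConfigSupport` exposes the data source's key prefix via a `keyPrefix()` method, which is an assumption here rather than something confirmed by the quoted diff.)

```scala
// Match on the data source: only sources opting in via SessionConfigSupport
// get session configs propagated; everything else sees an empty map.
def extractSessionConfigs(ds: DataSourceV2, conf: SQLConf): Map[String, String] = ds match {
  case cs: SessionConfigSupport =>
    DataSourceV2Utils.withSessionConfig(cs.keyPrefix(), conf)
  case _ => Map.empty
}
```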


---




[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...

2017-12-14 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19811
  
Another use case is a frequently-accessed variable (e.g. the row variable), 
for performance. We want to keep it as a non-array variable.
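
(To make the trade-off concrete, a small sketch using the `addMutableState` signature proposed in this PR; the return value is the actual variable name, or an array slot when the state gets compacted.)

```scala
// Hot state, touched on every row: force a plain member variable so access
// stays a direct field read instead of an array indexing operation.
val input = ctx.addMutableState("scala.collection.Iterator", "input",
  v => s"$v = inputs[0];", forceInline = true)

// Cold state: let the compaction machinery decide, possibly storing it in an
// array of same-typed states to stay under the constant pool limit.
val count = ctx.addMutableState("int", "count")
```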


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r156969297
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -238,7 +239,16 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
     if (classOf[DataSourceV2].isAssignableFrom(cls)) {
       cls.newInstance() match {
         case ds: WriteSupport =>
-          val options = new DataSourceV2Options(extraOptions.asJava)
+          val dataSource = cls.newInstance()
--- End diff --

we can use `ds` directly here.


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156968572
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -138,21 +138,50 @@ class CodegenContext {
 
   /**
    * Holding expressions' mutable states like `MonotonicallyIncreasingID.count` as a
-   * 3-tuple: java type, variable name, code to init it.
-   * As an example, ("int", "count", "count = 0;") will produce code:
+   * 2-tuple: java type, variable name.
+   * As an example, ("int", "count") will produce code:
    * {{{
    *   private int count;
    * }}}
-   * as a member variable, and add
-   * {{{
-   *   count = 0;
-   * }}}
-   * to the constructor.
+   * as a member variable
    *
    * They will be kept as member variables in generated classes like `SpecificProjection`.
    */
-  val mutableStates: mutable.ArrayBuffer[(String, String, String)] =
-    mutable.ArrayBuffer.empty[(String, String, String)]
+  val mutableStates: mutable.ArrayBuffer[(String, String)] =
+    mutable.ArrayBuffer.empty[(String, String)]
+
+  // An map keyed by mutable states' types holds the status of mutableStateArray
+  val mutableStateArrayMap: mutable.Map[String, MutableStateArrays] =
+    mutable.Map.empty[String, MutableStateArrays]
+
+  // An array holds the code that will initialize each state
+  val mutableStateInitCodes: mutable.ArrayBuffer[String] =
--- End diff --

Sure, I would appreciate it if you could put in a link to that note.


---




[GitHub] spark pull request #19971: [SPARK-22774][SQL][Test] Add compilation check in...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19971#discussion_r156968031
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala ---
@@ -348,13 +351,41 @@ class TPCDSQuerySuite extends QueryTest with SharedSQLContext with BeforeAndAfte
     "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
     "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99")
 
+  private def checkGeneratedCode(plan: SparkPlan): Unit = {
+    val codegenSubtrees = new collection.mutable.HashSet[WholeStageCodegenExec]()
+    plan foreach {
+      case s: WholeStageCodegenExec =>
+        codegenSubtrees += s
+      case s => s
+    }
+    codegenSubtrees.toSeq.foreach { subtree =>
+      val code = subtree.doCodeGen()._2
+      try {
+        // Just check the generated code can be properly compiled
+        CodeGenerator.compile(code)
+      } catch {
+        case e: Exception =>
+          val msg =
+            s"""
+               |failed to compile:
+               |Subtree:
+               |$subtree
+               |Generated code:
+               |${CodeFormatter.format(code)}
+             """
--- End diff --

Actually this should be generated by the IDE automatically...


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156967829
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -353,8 +353,8 @@ case class FileSourceScanExec(
     }
     val numOutputRows = metricTerm(ctx, "numOutputRows")
     // PhysicalRDD always just has one input
-    val input = ctx.freshName("input")
-    ctx.addMutableState("scala.collection.Iterator", input, s"$input = inputs[0];")
+    val input = ctx.addMutableState("scala.collection.Iterator", "input", v => s"$v = inputs[0];",
+      forceInline = true)
--- End diff --

ditto


---




[GitHub] spark pull request #19971: [SPARK-22774][SQL][Test] Add compilation check in...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19971#discussion_r156967793
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala ---
@@ -348,13 +351,41 @@ class TPCDSQuerySuite extends QueryTest with SharedSQLContext with BeforeAndAfte
     "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
     "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99")
 
+  private def checkGeneratedCode(plan: SparkPlan): Unit = {
+    val codegenSubtrees = new collection.mutable.HashSet[WholeStageCodegenExec]()
+    plan foreach {
+      case s: WholeStageCodegenExec =>
+        codegenSubtrees += s
+      case s => s
+    }
+    codegenSubtrees.toSeq.foreach { subtree =>
+      val code = subtree.doCodeGen()._2
+      try {
+        // Just check the generated code can be properly compiled
+        CodeGenerator.compile(code)
+      } catch {
+        case e: Exception =>
+          val msg =
+            s"""
+               |failed to compile:
+               |Subtree:
+               |$subtree
+               |Generated code:
+               |${CodeFormatter.format(code)}
+             """
--- End diff --

`stripMargin`


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156967588
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -110,8 +110,8 @@ case class RowDataSourceScanExec(
   override protected def doProduce(ctx: CodegenContext): String = {
     val numOutputRows = metricTerm(ctx, "numOutputRows")
     // PhysicalRDD always just has one input
-    val input = ctx.freshName("input")
-    ctx.addMutableState("scala.collection.Iterator", input, s"$input = inputs[0];")
+    val input = ctx.addMutableState("scala.collection.Iterator", "input", v => s"$v = inputs[0];",
+      forceInline = true)
--- End diff --

ditto


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156967475
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
@@ -68,30 +68,26 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport {
    */
   // TODO: return ColumnarBatch.Rows instead
   override protected def doProduce(ctx: CodegenContext): String = {
-    val input = ctx.freshName("input")
     // PhysicalRDD always just has one input
-    ctx.addMutableState("scala.collection.Iterator", input, s"$input = inputs[0];")
+    val input = ctx.addMutableState("scala.collection.Iterator", "input",
+      v => s"$v = inputs[0];", forceInline = true)
--- End diff --

In the original implementation, an assignment that refers to a variable was 
inlined. Let me stop inlining it.


---




[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19811
  
My major concern is how to reason about when we need to force inline. I can 
understand the use case of keeping the variable name unchanged (e.g. 
`MapObjects`), but what about other cases?


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156966971
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala ---
@@ -1205,14 +1203,14 @@ case class DecodeUsingSerializer[T](child: Expression, tag: ClassTag[T], kryo: B
     // try conf from env, otherwise create a new one
     val env = s"${classOf[SparkEnv].getName}.get()"
     val sparkConf = s"new ${classOf[SparkConf].getName}()"
-    val serializerInit = s"""
-      if ($env == null) {
-        $serializer = ($serializerInstanceClass) new $serializerClass($sparkConf).newInstance();
+    val serializer = ctx.addMutableState(serializerInstanceClass, "serializerForDecode",
+      v => s"""
+        if ($env == null) {
+          $v = ($serializerInstanceClass) new $serializerClass($sparkConf).newInstance();
         } else {
-          $serializer = ($serializerInstanceClass) new $serializerClass($env.conf()).newInstance();
+          $v = ($serializerInstanceClass) new $serializerClass($env.conf()).newInstance();
         }
-      """
-    ctx.addMutableState(serializerInstanceClass, serializer, serializerInit)
+      """, forceInline = true)
--- End diff --

ditto


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156967021
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -138,21 +138,50 @@ class CodegenContext {
 
   /**
    * Holding expressions' mutable states like `MonotonicallyIncreasingID.count` as a
-   * 3-tuple: java type, variable name, code to init it.
-   * As an example, ("int", "count", "count = 0;") will produce code:
+   * 2-tuple: java type, variable name.
+   * As an example, ("int", "count") will produce code:
    * {{{
    *   private int count;
    * }}}
-   * as a member variable, and add
-   * {{{
-   *   count = 0;
-   * }}}
-   * to the constructor.
+   * as a member variable
    *
    * They will be kept as member variables in generated classes like `SpecificProjection`.
    */
-  val mutableStates: mutable.ArrayBuffer[(String, String, String)] =
-    mutable.ArrayBuffer.empty[(String, String, String)]
+  val mutableStates: mutable.ArrayBuffer[(String, String)] =
+    mutable.ArrayBuffer.empty[(String, String)]
+
+  // An map keyed by mutable states' types holds the status of mutableStateArray
+  val mutableStateArrayMap: mutable.Map[String, MutableStateArrays] =
+    mutable.Map.empty[String, MutableStateArrays]
+
+  // An array holds the code that will initialize each state
+  val mutableStateInitCodes: mutable.ArrayBuffer[String] =
--- End diff --

yea let's use `code`; I picked `codes` previously by mistake. You can also 
fix all of them.


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156966787
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala ---
@@ -1159,14 +1158,14 @@ case class EncodeUsingSerializer(child: Expression, kryo: Boolean)
     // try conf from env, otherwise create a new one
     val env = s"${classOf[SparkEnv].getName}.get()"
     val sparkConf = s"new ${classOf[SparkConf].getName}()"
-    val serializerInit = s"""
-      if ($env == null) {
-        $serializer = ($serializerInstanceClass) new $serializerClass($sparkConf).newInstance();
-      } else {
-        $serializer = ($serializerInstanceClass) new $serializerClass($env.conf()).newInstance();
-      }
-     """
-    ctx.addMutableState(serializerInstanceClass, serializer, serializerInit)
+    val serializer = ctx.addMutableState(serializerInstanceClass, "serializerForEncode", v =>
+      s"""
+         |if ($env == null) {
+         |  $v = ($serializerInstanceClass) new $serializerClass($sparkConf).newInstance();
+         |} else {
+         |  $v = ($serializerInstanceClass) new $serializerClass($env.conf()).newInstance();
+         |}
+       """, forceInline = true)
--- End diff --

In the original implementation, a constructor call with a non-constant value 
was inlined. Let me stop inlining it.


---




[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19751
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19751
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84909/
Test PASSed.


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156966742
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -163,11 +192,49 @@ class CodegenContext {
    * the list of default imports available.
    * Also, generic type arguments are accepted but ignored.
    * @param variableName Name of the field.
-   * @param initCode The statement(s) to put into the init() method to initialize this field.
+   * @param initFunc Function includes statement(s) to put into the init() method to initialize
+   *                 this field. The argument is the name of the mutable state variable.
    *                 If left blank, the field will be default-initialized.
+   * @param forceInline whether the declaration and initialization code may be inlined rather than
+   *                    compacted. Please set `true` into forceInline, if you want to access the
+   *                    status fast (e.g. frequently accessed) or if you want to use the original
+   *                    variable name
+   * @param useFreshName If false and inline is true, the name is not changed
+   * @return the name of the mutable state variable, which is either the original name if the
+   *         variable is inlined to the outer class, or an array access if the variable is to be
+   *         stored in an array of variables of the same type and initialization.
+   *         There are two use cases. One is to use the original name for global variable instead
+   *         of fresh name. Second is to use the original initialization statement since it is
+   *         complex (e.g. allocate multi-dimensional array or object constructor has varibles).
+   *         Primitive type variables will be inlined into outer class when the total number of
+   *         mutable variables is less than `CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD`
+   *         the max size of an array for compaction is given by
+   *         `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT`.
    */
-  def addMutableState(javaType: String, variableName: String, initCode: String = ""): Unit = {
-    mutableStates += ((javaType, variableName, initCode))
+  def addMutableState(
+      javaType: String,
+      variableName: String,
+      initFunc: String => String = _ => "",
+      forceInline: Boolean = false,
+      useFreshName: Boolean = true): String = {
+    val varName = if (useFreshName) freshName(variableName) else variableName
--- End diff --

isn't it an existing problem? Let's fix it in another PR to make this PR 
more consistent with the previous code.


---




[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19751
  
**[Test build #84909 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84909/testReport)** for PR 19751 at commit [`b02ea2c`](https://github.com/apache/spark/commit/b02ea2c4a96e27918c02f75c19d6639daae7cb2d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18692
  
you are right, then I don't know if there is any valid use case for 
inferring join condition from literals...
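
(A hypothetical example of the pattern in question, not taken from the PR: both sides pin a column to the same literal, so `t1.id = t2.id` could in principle be inferred from the propagated constraints and the cross join rewritten as an equi-join.)

```scala
// Hypothetical tables t1/t2. The constraints t1.id = 1 and t2.id = 1 together
// imply t1.id = t2.id, which is the kind of condition the rule would infer.
spark.sql(
  """
    |SELECT *
    |FROM t1 CROSS JOIN t2
    |WHERE t1.id = 1 AND t2.id = 1
  """.stripMargin)
```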


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156964736
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala ---
@@ -484,18 +484,17 @@ case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCa
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     nullSafeCodeGen(ctx, ev, time => {
       val cal = classOf[Calendar].getName
-      val c = ctx.freshName("cal")
       val dtu = DateTimeUtils.getClass.getName.stripSuffix("$")
-      ctx.addMutableState(cal, c,
+      val c = ctx.addMutableState(cal, "cal", v =>
         s"""
-          $c = $cal.getInstance($dtu.getTimeZone("UTC"));
-          $c.setFirstDayOfWeek($cal.MONDAY);
-          $c.setMinimalDaysInFirstWeek(4);
+           |$v = $cal.getInstance($dtu.getTimeZone("UTC"));
+           |$v.setFirstDayOfWeek($cal.MONDAY);
+           |$v.setMinimalDaysInFirstWeek(4);
          """)
--- End diff --

good catch, thanks


---




[GitHub] spark issue #19941: [SPARK-22753][SQL] Get rid of dataSource.writeAndRead

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19941
  
**[Test build #84916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84916/testReport)** for PR 19941 at commit [`41a2c15`](https://github.com/apache/spark/commit/41a2c151c11a0d690626cae2a39623e712aae764).


---




[GitHub] spark pull request #19941: [SPARK-22753][SQL] Get rid of dataSource.writeAnd...

2017-12-14 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19941#discussion_r156960449
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -490,20 +489,19 @@ case class DataSource(
   }
 
   /**
-   * Writes the given [[LogicalPlan]] out to this [[DataSource]] and returns a [[BaseRelation]] for
-   * the following reading.
+   * Returns a [[BaseRelation]] for creating table after `planForWriting`. Only use
+   * in `CreateDataSourceTableAsSelectCommand` while saving data to non-existing table.
    */
-  def writeAndRead(mode: SaveMode, data: LogicalPlan): BaseRelation = {
+  def getRelation(mode: SaveMode, data: LogicalPlan): BaseRelation = {
     if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
       throw new AnalysisException("Cannot save interval data type into external storage.")
     }
 
     providingClass.newInstance() match {
-      case dataSource: CreatableRelationProvider =>
+      case dataSource: RelationProvider =>
--- End diff --

Ah, I see. 


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156959928
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -138,21 +138,50 @@ class CodegenContext {
 
   /**
    * Holding expressions' mutable states like `MonotonicallyIncreasingID.count` as a
-   * 3-tuple: java type, variable name, code to init it.
-   * As an example, ("int", "count", "count = 0;") will produce code:
+   * 2-tuple: java type, variable name.
+   * As an example, ("int", "count") will produce code:
    * {{{
    *   private int count;
    * }}}
-   * as a member variable, and add
-   * {{{
-   *   count = 0;
-   * }}}
-   * to the constructor.
+   * as a member variable
    *
    * They will be kept as member variables in generated classes like `SpecificProjection`.
    */
-  val mutableStates: mutable.ArrayBuffer[(String, String, String)] =
-    mutable.ArrayBuffer.empty[(String, String, String)]
+  val mutableStates: mutable.ArrayBuffer[(String, String)] =
+    mutable.ArrayBuffer.empty[(String, String)]
+
+  // An map keyed by mutable states' types holds the status of mutableStateArray
+  val mutableStateArrayMap: mutable.Map[String, MutableStateArrays] =
+    mutable.Map.empty[String, MutableStateArrays]
+
+  // An array holds the code that will initialize each state
+  val mutableStateInitCodes: mutable.ArrayBuffer[String] =
--- End diff --

I don't think it is a big deal, but I remember a note by @gatorsmile who 
advised not to use it anymore.


---




[GitHub] spark pull request #19941: [SPARK-22753][SQL] Get rid of dataSource.writeAnd...

2017-12-14 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19941#discussion_r156958793
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/InsertIntoDataSourceDirCommand.scala ---
@@ -67,8 +67,9 @@ case class InsertIntoDataSourceDirCommand(
 
     val saveMode = if (overwrite) SaveMode.Overwrite else SaveMode.ErrorIfExists
     try {
-      sparkSession.sessionState.executePlan(dataSource.planForWriting(saveMode, query))
-      dataSource.writeAndRead(saveMode, query)
--- End diff --

You're right. Tested with the query "INSERT OVERWRITE DIRECTORY 
'/home/liyuanjian/tmp' USING json SELECT 1 AS a, 'c' as b;":

![image](https://user-images.githubusercontent.com/4833765/33997144-6d159446-e11e-11e7-8bae-f485873d84c3.png)


---




[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2017-12-14 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/19893
  
The last suspicious big group of threads (at least for me) is 
broadcast-exchange.*, but as far as I've seen this is not a false positive, because the 
thread pool is never stopped. In BroadcastExchangeExec:141 a new daemon thread pool 
is created for broadcast joins, which dies when the JVM dies.
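
(A generic sketch of why such a pool is not a real leak, using plain JDK APIs as a stand-in for Spark's own thread utilities: daemon threads cannot keep the JVM alive, so the pool simply dies with it.)

```scala
import java.util.concurrent.{Executors, ThreadFactory}

// Threads created by this factory are daemons: JVM shutdown does not wait for
// them, so a never-shut-down pool built on it disappears when the JVM exits.
val daemonFactory = new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, "broadcast-exchange")
    t.setDaemon(true)
    t
  }
}
val pool = Executors.newCachedThreadPool(daemonFactory)
```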


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156954383
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -138,21 +138,50 @@ class CodegenContext {
 
   /**
    * Holding expressions' mutable states like `MonotonicallyIncreasingID.count` as a
-   * 3-tuple: java type, variable name, code to init it.
-   * As an example, ("int", "count", "count = 0;") will produce code:
+   * 2-tuple: java type, variable name.
+   * As an example, ("int", "count") will produce code:
    * {{{
    *   private int count;
    * }}}
-   * as a member variable, and add
-   * {{{
-   *   count = 0;
-   * }}}
-   * to the constructor.
+   * as a member variable
    *
    * They will be kept as member variables in generated classes like `SpecificProjection`.
    */
-  val mutableStates: mutable.ArrayBuffer[(String, String, String)] =
-    mutable.ArrayBuffer.empty[(String, String, String)]
+  val mutableStates: mutable.ArrayBuffer[(String, String)] =
+    mutable.ArrayBuffer.empty[(String, String)]
+
+  // An map keyed by mutable states' types holds the status of mutableStateArray
+  val mutableStateArrayMap: mutable.Map[String, MutableStateArrays] =
+    mutable.Map.empty[String, MutableStateArrays]
+
+  // An array holds the code that will initialize each state
+  val mutableStateInitCodes: mutable.ArrayBuffer[String] =
--- End diff --

You are right about the English. We are seeing some `codes` variables in the source 
files. Is this renaming a big deal? WDYT?


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156953610
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala ---
@@ -219,4 +219,31 @@ class GeneratedProjectionSuite extends SparkFunSuite {
     // - one is the mutableRow
     assert(globalVariables.length == 3)
   }
+
+  test("SPARK-18016: generated projections on wider table requiring state compaction") {
+    val N = 6000
+    val wideRow1 = new GenericInternalRow((0 until N).toArray[Any])
+    val schema1 = StructType((1 to N).map(i => StructField("", IntegerType)))
+    val wideRow2 = new GenericInternalRow(
+      (0 until N).map(i => UTF8String.fromString(i.toString)).toArray[Any])
+    val schema2 = StructType((1 to N).map(i => StructField("", StringType)))
+    val joined = new JoinedRow(wideRow1, wideRow2)
+    val joinedSchema = StructType(schema1 ++ schema2)
+    val nested = new JoinedRow(InternalRow(joined, joined), joined)
+    val nestedSchema = StructType(
+      Seq(StructField("", joinedSchema), StructField("", joinedSchema)) ++ joinedSchema)
+
+    val safeProj = FromUnsafeProjection(nestedSchema)
+    val result = safeProj(nested)
+
+    // test generated MutableProjection
+    val exprs = nestedSchema.fields.zipWithIndex.map { case (f, i) =>
+      BoundReference(i, f.dataType, true)
+    }
+    val mutableProj = GenerateMutableProjection.generate(exprs)
--- End diff --

cc @maropu


---




[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156953675
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -163,11 +192,49 @@ class CodegenContext {
    * the list of default imports available.
    * Also, generic type arguments are accepted but ignored.
    * @param variableName Name of the field.
-   * @param initCode The statement(s) to put into the init() method to initialize this field.
+   * @param initFunc Function includes statement(s) to put into the init() method to initialize
+   *                 this field. The argument is the name of the mutable state variable.
    *                 If left blank, the field will be default-initialized.
+   * @param forceInline whether the declaration and initialization code may be inlined rather than
+   *                    compacted. Please set `true` into forceInline, if you want to access the
+   *                    status fast (e.g. frequently accessed) or if you want to use the original
+   *                    variable name
+   * @param useFreshName If false and inline is true, the name is not changed
+   * @return the name of the mutable state variable, which is either the original name if the
+   *         variable is inlined to the outer class, or an array access if the variable is to be
+   *         stored in an array of variables of the same type and initialization.
+   *         There are two use cases. One is to use the original name for global variable instead
+   *         of fresh name. Second is to use the original initialization statement since it is
+   *         complex (e.g. allocate multi-dimensional array or object constructor has varibles).
+   *         Primitive type variables will be inlined into outer class when the total number of
+   *         mutable variables is less than `CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD`
+   *         the max size of an array for compaction is given by
+   *         `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT`.
    */
-  def addMutableState(javaType: String, variableName: String, initCode: String = ""): Unit = {
-    mutableStates += ((javaType, variableName, initCode))
+  def addMutableState(
+      javaType: String,
+      variableName: String,
+      initFunc: String => String = _ => "",
+      forceInline: Boolean = false,
+      useFreshName: Boolean = true): String = {
+    val varName = if (useFreshName) freshName(variableName) else variableName
--- End diff --

Good catch


To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19980: [SPARK-22785][SQL] remove ColumnVector.anyNullsSet

2017-12-14 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/19980
  
LGTM pending Jenkins.


---




[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19893
  
**[Test build #84915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84915/testReport)** for PR 19893 at commit [`ef00796`](https://github.com/apache/spark/commit/ef007969a576dd5f84440ce78a325d6617dc5278).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2017-12-14 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/19893
  
Seems like the new feature caught some false positives in SQL:

```
= THREAD AUDIT POST ACTION CALLED WITHOUT PRE ACTION IN SUITE 
o.a.s.sql.sources.DataSourceAnalysisSuite =
= THREAD AUDIT POST ACTION CALLED WITHOUT PRE ACTION IN SUITE 
o.a.s.sql.SessionStateSuite =
```

Fixed them as well.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19893
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84907/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19893
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19893: [SPARK-16139][TEST] Add logging functionality for leaked...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19893
  
**[Test build #84907 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84907/testReport)**
 for PR 19893 at commit 
[`e2da334`](https://github.com/apache/spark/commit/e2da334560f764af71b6677f2e4c7c772063c229).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19861
  
**[Test build #84914 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84914/testReport)**
 for PR 19861 at commit 
[`52aaf51`](https://github.com/apache/spark/commit/52aaf51a9ef0d3b2517ee26cff58d7f281433881).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19971
  
**[Test build #84913 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84913/testReport)**
 for PR 19971 at commit 
[`dacb790`](https://github.com/apache/spark/commit/dacb79039c2abb09a082b817d7c52f2e0902de56).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19971: [SPARK-22774][SQL][Test] Add compilation check in...

2017-12-14 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19971#discussion_r156945302
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala ---
@@ -29,7 +32,7 @@ import org.apache.spark.util.Utils
  * This test suite ensures all the TPC-DS queries can be successfully 
analyzed and optimized
  * without hitting the max iteration threshold.
  */
-class TPCDSQuerySuite extends QueryTest with SharedSQLContext with 
BeforeAndAfterAll {
+class TPCDSQuerySuite extends QueryTest with SharedSQLContext with 
BeforeAndAfterAll with Logging {
--- End diff --

nit: drop `Logging` and the import


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156943716
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala 
---
@@ -68,30 +68,26 @@ private[sql] trait ColumnarBatchScan extends 
CodegenSupport {
*/
   // TODO: return ColumnarBatch.Rows instead
   override protected def doProduce(ctx: CodegenContext): String = {
-val input = ctx.freshName("input")
 // PhysicalRDD always just has one input
-ctx.addMutableState("scala.collection.Iterator", input, s"$input = 
inputs[0];")
+val input = ctx.addMutableState("scala.collection.Iterator", "input",
+  v => s"$v = inputs[0];", forceInline = true)
 
 // metrics
 val numOutputRows = metricTerm(ctx, "numOutputRows")
 val scanTimeMetric = metricTerm(ctx, "scanTime")
-val scanTimeTotalNs = ctx.freshName("scanTime")
-ctx.addMutableState(ctx.JAVA_LONG, scanTimeTotalNs, s"$scanTimeTotalNs 
= 0;")
+val scanTimeTotalNs = ctx.addMutableState(ctx.JAVA_LONG, "scanTime")
--- End diff --

I see. Let us leave the default value as a comment for clarity.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19792: [SPARK-22566][PYTHON] Better error message for `_...

2017-12-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19792#discussion_r156941615
  
--- Diff: python/pyspark/sql/types.py ---
@@ -1083,7 +1083,11 @@ def _infer_schema(row):
 elif hasattr(row, "_fields"):  # namedtuple
 items = zip(row._fields, tuple(row))
 else:
-names = ['_%d' % i for i in range(1, len(row) + 1)]
+if names is None:
+names = ['_%d' % i for i in range(1, len(row) + 1)]
+elif len(names) < len(row):
+names = names[:]
--- End diff --

Hm .. why do we do this? To make a copy?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...

2017-12-14 Thread aokolnychyi
Github user aokolnychyi commented on the issue:

https://github.com/apache/spark/pull/18692
  
I am not sure we can infer ``a == b`` if ``a in (0, 2, 3, 4)`` and ``b in 
(0, 2, 3, 4)``. 

table 'a'
```
a1 a2
1  2
3  3
4  5
```

table 'b'
```
b1 b2
1  -1
2  -2
3  -4
```

```
SELECT * FROM a, b WHERE a1 in (1, 2) AND b1 in (1, 2)
// 1 2 1 -1
// 1 2 2 -2
```
```
SELECT * FROM a JOIN b ON a.a1 = b.b1 WHERE a1 in (1, 2) AND b1 in (1, 2)
// 1 2 1 -1
```

Am I missing anything?
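
For reference, a minimal spark-shell sketch of the same counterexample (the session setup is illustrative boilerplate, not part of this PR):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-inference").getOrCreate()
import spark.implicits._

Seq((1, 2), (3, 3), (4, 5)).toDF("a1", "a2").createOrReplaceTempView("a")
Seq((1, -1), (2, -2), (3, -4)).toDF("b1", "b2").createOrReplaceTempView("b")

// Cross join with IN filters on both sides: returns two rows.
spark.sql("SELECT * FROM a, b WHERE a1 in (1, 2) AND b1 in (1, 2)").show()

// Adding the inferred condition a1 = b1 drops a row, so the rewrite is unsound.
spark.sql("SELECT * FROM a JOIN b ON a.a1 = b.b1 WHERE a1 in (1, 2) AND b1 in (1, 2)").show()
```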



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156941474
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -163,11 +192,49 @@ class CodegenContext {
* the list of default imports available.
* Also, generic type arguments are accepted but ignored.
* @param variableName Name of the field.
-   * @param initCode The statement(s) to put into the init() method to 
initialize this field.
+   * @param initFunc Function includes statement(s) to put into the init() 
method to initialize
+   * this field. The argument is the name of the mutable 
state variable.
* If left blank, the field will be default-initialized.
+   * @param forceInline whether the declaration and initialization code 
may be inlined rather than
+   *compacted. Please set `true` into forceInline, if 
you want to access the
+   *status fast (e.g. frequently accessed) or if you 
want to use the original
+   *variable name
+   * @param useFreshName If false and inline is true, the name is not 
changed
+   * @return the name of the mutable state variable, which is either the 
original name if the
+   * variable is inlined to the outer class, or an array access if 
the variable is to be
+   * stored in an array of variables of the same type and 
initialization.
+   * There are two use cases. One is to use the original name for 
global variable instead
+   * of fresh name. Second is to use the original initialization 
statement since it is
+   * complex (e.g. allocate multi-dimensional array or object 
constructor has variables).
+   * Primitive type variables will be inlined into outer class 
when the total number of
+   * mutable variables is less than 
`CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD`
+   * the max size of an array for compaction is given by
+   * `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT`.
*/
-  def addMutableState(javaType: String, variableName: String, initCode: 
String = ""): Unit = {
-mutableStates += ((javaType, variableName, initCode))
+  def addMutableState(
+  javaType: String,
+  variableName: String,
+  initFunc: String => String = _ => "",
+  forceInline: Boolean = false,
+  useFreshName: Boolean = true): String = {
+val varName = if (useFreshName) freshName(variableName) else 
variableName
--- End diff --

Since I noticed that most caller sites execute `freshName`, I decided to use the new style, which simplifies the caller code. If a developer wants to guarantee that the given name is unique at the caller site (currently there are only a few such cases), it is OK to use `useFreshName = true`.

Do we need the redundant code at the caller side? WDYT? @cloud-fan
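
To make the trade-off concrete, here is a self-contained toy sketch (the class and names are illustrative, not Spark's actual `CodegenContext`) of where `freshName` runs in each style:

```
import scala.collection.mutable

class ToyCodegenContext {
  private val freshNameIds = mutable.HashMap.empty[String, Int]
  val states = mutable.ArrayBuffer.empty[(String, String)]

  // Appends a numeric suffix so repeated base names stay unique.
  def freshName(name: String): String = {
    val id = freshNameIds.getOrElse(name, 0)
    freshNameIds(name) = id + 1
    s"${name}_$id"
  }

  // New style: uniquify inside addMutableState, unless the caller opts out.
  def addMutableState(javaType: String, name: String, useFreshName: Boolean = true): String = {
    val varName = if (useFreshName) freshName(name) else name
    states += ((javaType, varName))
    varName
  }
}

val ctx = new ToyCodegenContext
ctx.addMutableState("int", "count")                               // "count_0"
ctx.addMutableState("int", "count")                               // "count_1"
ctx.addMutableState("boolean", "stopEarly", useFreshName = false) // "stopEarly", kept as-is
```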


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156940683
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
 ---
@@ -219,4 +219,31 @@ class GeneratedProjectionSuite extends SparkFunSuite {
 // - one is the mutableRow
 assert(globalVariables.length == 3)
   }
+
+  test("SPARK-18016: generated projections on wider table requiring state 
compaction") {
+val N = 6000
+val wideRow1 = new GenericInternalRow((0 until N).toArray[Any])
+val schema1 = StructType((1 to N).map(i => StructField("", 
IntegerType)))
+val wideRow2 = new GenericInternalRow(
+  (0 until N).map(i => UTF8String.fromString(i.toString)).toArray[Any])
+val schema2 = StructType((1 to N).map(i => StructField("", 
StringType)))
+val joined = new JoinedRow(wideRow1, wideRow2)
+val joinedSchema = StructType(schema1 ++ schema2)
+val nested = new JoinedRow(InternalRow(joined, joined), joined)
+val nestedSchema = StructType(
+  Seq(StructField("", joinedSchema), StructField("", joinedSchema)) ++ 
joinedSchema)
+
+val safeProj = FromUnsafeProjection(nestedSchema)
+val result = safeProj(nested)
+
+// test generated MutableProjection
+val exprs = nestedSchema.fields.zipWithIndex.map { case (f, i) =>
+  BoundReference(i, f.dataType, true)
+}
+val mutableProj = GenerateMutableProjection.generate(exprs)
--- End diff --

The current `GenerateUnsafeProjection` uses only a fixed number of mutable states. IIUC, we cannot write such a test for it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19980: [SPARK-22785][SQL] remove ColumnVector.anyNullsSet

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19980
  
**[Test build #84912 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84912/testReport)**
 for PR 19980 at commit 
[`c4d9a41`](https://github.com/apache/spark/commit/c4d9a41bae0bdcbb8d336a7fca012ad4c7e1efc1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19980: [SPARK-22785][SQL] remove ColumnVector.anyNullsSe...

2017-12-14 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/19980

[SPARK-22785][SQL] remove ColumnVector.anyNullsSet

## What changes were proposed in this pull request?
`ColumnVector.anyNullsSet` is not called anywhere except tests, and we can 
easily replace it with `ColumnVector.numNulls > 0`.
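
For readers skimming the digest, a tiny illustrative stand-in (not the real `ColumnVector` class) for the equivalence this change relies on:

```
trait ToyColumnVector {
  def numNulls: Int
  // The method being removed; trivially derivable from numNulls.
  final def anyNullsSet: Boolean = numNulls > 0
}

val noNulls = new ToyColumnVector { def numNulls: Int = 0 }
val someNulls = new ToyColumnVector { def numNulls: Int = 3 }
assert(!noNulls.anyNullsSet)   // callers write noNulls.numNulls > 0 instead
assert(someNulls.anyNullsSet)
```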

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19980.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19980


commit c4d9a41bae0bdcbb8d336a7fca012ad4c7e1efc1
Author: Wenchen Fan 
Date:   2017-12-14T12:53:20Z

remove ColumnVector.anyNullsSet




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19980: [SPARK-22785][SQL] remove ColumnVector.anyNullsSet

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19980
  
cc @ueshin @kiszk @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156935514
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -138,21 +138,50 @@ class CodegenContext {
 
   /**
* Holding expressions' mutable states like 
`MonotonicallyIncreasingID.count` as a
-   * 3-tuple: java type, variable name, code to init it.
-   * As an example, ("int", "count", "count = 0;") will produce code:
+   * 2-tuple: java type, variable name.
+   * As an example, ("int", "count") will produce code:
* {{{
*   private int count;
* }}}
-   * as a member variable, and add
-   * {{{
-   *   count = 0;
-   * }}}
-   * to the constructor.
+   * as a member variable
*
* They will be kept as member variables in generated classes like 
`SpecificProjection`.
*/
-  val mutableStates: mutable.ArrayBuffer[(String, String, String)] =
-mutable.ArrayBuffer.empty[(String, String, String)]
+  val mutableStates: mutable.ArrayBuffer[(String, String)] =
+mutable.ArrayBuffer.empty[(String, String)]
+
+  // An map keyed by mutable states' types holds the status of 
mutableStateArray
+  val mutableStateArrayMap: mutable.Map[String, MutableStateArrays] =
+mutable.Map.empty[String, MutableStateArrays]
+
+  // An array holds the code that will initialize each state
+  val mutableStateInitCodes: mutable.ArrayBuffer[String] =
--- End diff --

nit: since code is uncountable, maybe we can rename to 
`mutableStatesInitCode`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19974: [SPARK-22779][sql] Resolve default values for fal...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19974#discussion_r156935029
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala ---
@@ -280,4 +280,34 @@ class SQLConfSuite extends QueryTest with 
SharedSQLContext {
 
 spark.sessionState.conf.clear()
   }
+
+  test("SPARK-22779: correctly compute default value for fallback 
configs") {
+val fallback = SQLConf.buildConf("spark.sql.__test__.spark_22779")
+  .fallbackConf(SQLConf.PARQUET_COMPRESSION)
--- End diff --

Yea, sounds like this PR is fixing a non-existent issue.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156932945
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -71,17 +71,15 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
   }
 
   override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: 
ExprCode): String = {
-val stopEarly = ctx.freshName("stopEarly")
-ctx.addMutableState(ctx.JAVA_BOOLEAN, stopEarly, s"$stopEarly = 
false;")
+val stopEarly = ctx.addMutableState(ctx.JAVA_BOOLEAN, "stopEarly")
 
 ctx.addNewFunction("stopEarly", s"""
   @Override
   protected boolean stopEarly() {
 return $stopEarly;
   }
 """, inlineToOuterClass = true)
-val countTerm = ctx.freshName("count")
-ctx.addMutableState(ctx.JAVA_INT, countTerm, s"$countTerm = 0;")
+val countTerm = ctx.addMutableState(ctx.JAVA_INT, "count")
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156932923
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -71,17 +71,15 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
   }
 
   override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: 
ExprCode): String = {
-val stopEarly = ctx.freshName("stopEarly")
-ctx.addMutableState(ctx.JAVA_BOOLEAN, stopEarly, s"$stopEarly = 
false;")
+val stopEarly = ctx.addMutableState(ctx.JAVA_BOOLEAN, "stopEarly")
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156932789
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -163,11 +192,49 @@ class CodegenContext {
* the list of default imports available.
* Also, generic type arguments are accepted but ignored.
* @param variableName Name of the field.
-   * @param initCode The statement(s) to put into the init() method to 
initialize this field.
+   * @param initFunc Function includes statement(s) to put into the init() 
method to initialize
+   * this field. The argument is the name of the mutable 
state variable.
* If left blank, the field will be default-initialized.
+   * @param forceInline whether the declaration and initialization code 
may be inlined rather than
+   *compacted. Please set `true` into forceInline, if 
you want to access the
+   *status fast (e.g. frequently accessed) or if you 
want to use the original
+   *variable name
+   * @param useFreshName If false and inline is true, the name is not 
changed
+   * @return the name of the mutable state variable, which is either the 
original name if the
+   * variable is inlined to the outer class, or an array access if 
the variable is to be
+   * stored in an array of variables of the same type and 
initialization.
+   * There are two use cases. One is to use the original name for 
global variable instead
+   * of fresh name. Second is to use the original initialization 
statement since it is
+   * complex (e.g. allocate multi-dimensional array or object 
constructor has variables).
+   * Primitive type variables will be inlined into outer class 
when the total number of
+   * mutable variables is less than 
`CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD`
+   * the max size of an array for compaction is given by
+   * `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT`.
*/
-  def addMutableState(javaType: String, variableName: String, initCode: 
String = ""): Unit = {
-mutableStates += ((javaType, variableName, initCode))
+  def addMutableState(
+  javaType: String,
+  variableName: String,
+  initFunc: String => String = _ => "",
+  forceInline: Boolean = false,
+  useFreshName: Boolean = true): String = {
+val varName = if (useFreshName) freshName(variableName) else 
variableName
--- End diff --

I think this can be moved in the if for clarity


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156932214
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala 
---
@@ -68,30 +68,26 @@ private[sql] trait ColumnarBatchScan extends 
CodegenSupport {
*/
   // TODO: return ColumnarBatch.Rows instead
   override protected def doProduce(ctx: CodegenContext): String = {
-val input = ctx.freshName("input")
 // PhysicalRDD always just has one input
-ctx.addMutableState("scala.collection.Iterator", input, s"$input = 
inputs[0];")
+val input = ctx.addMutableState("scala.collection.Iterator", "input",
+  v => s"$v = inputs[0];", forceInline = true)
 
 // metrics
 val numOutputRows = metricTerm(ctx, "numOutputRows")
 val scanTimeMetric = metricTerm(ctx, "scanTime")
-val scanTimeTotalNs = ctx.freshName("scanTime")
-ctx.addMutableState(ctx.JAVA_LONG, scanTimeTotalNs, s"$scanTimeTotalNs 
= 0;")
+val scanTimeTotalNs = ctx.addMutableState(ctx.JAVA_LONG, "scanTime")
 
 val columnarBatchClz = classOf[ColumnarBatch].getName
-val batch = ctx.freshName("batch")
-ctx.addMutableState(columnarBatchClz, batch, s"$batch = null;")
+val batch = ctx.addMutableState(columnarBatchClz, "batch")
 
-val idx = ctx.freshName("batchIdx")
-ctx.addMutableState(ctx.JAVA_INT, idx, s"$idx = 0;")
-val colVars = output.indices.map(i => ctx.freshName("colInstance" + i))
+val idx = ctx.addMutableState(ctx.JAVA_INT, "batchIdx")
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156932172
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala 
---
@@ -68,30 +68,26 @@ private[sql] trait ColumnarBatchScan extends 
CodegenSupport {
*/
   // TODO: return ColumnarBatch.Rows instead
   override protected def doProduce(ctx: CodegenContext): String = {
-val input = ctx.freshName("input")
 // PhysicalRDD always just has one input
-ctx.addMutableState("scala.collection.Iterator", input, s"$input = 
inputs[0];")
+val input = ctx.addMutableState("scala.collection.Iterator", "input",
+  v => s"$v = inputs[0];", forceInline = true)
 
 // metrics
 val numOutputRows = metricTerm(ctx, "numOutputRows")
 val scanTimeMetric = metricTerm(ctx, "scanTime")
-val scanTimeTotalNs = ctx.freshName("scanTime")
-ctx.addMutableState(ctx.JAVA_LONG, scanTimeTotalNs, s"$scanTimeTotalNs 
= 0;")
+val scanTimeTotalNs = ctx.addMutableState(ctx.JAVA_LONG, "scanTime")
--- End diff --

I see that the initialization is not needed since 0 is the default value, 
but maybe we can leave it for clarity?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19976: [SPARK-22660][BUILD] Use position() and limit() to fix a...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19976
  
**[Test build #4010 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4010/testReport)**
 for PR 19976 at commit 
[`bcc4c1a`](https://github.com/apache/spark/commit/bcc4c1af60c820877e83592f1bd6b38db498a17f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19978: [SPARK-22784][CORE] Configure reading buffer size in Spa...

2017-12-14 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19978
  
That seems quite large, but I'm only speculating. If it defaulted to 
something that's just not tiny, would that help? 1M events?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156930498
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -189,15 +255,37 @@ class CodegenContext {
   def declareMutableStates(): String = {
 // It's possible that we add same mutable state twice, e.g. the 
`mergeExpressions` in
 // `TypedAggregateExpression`, we should call `distinct` here to 
remove the duplicated ones.
-mutableStates.distinct.map { case (javaType, variableName, _) =>
+val inlinedStates = mutableStates.distinct.map { case (javaType, 
variableName) =>
   s"private $javaType $variableName;"
-}.mkString("\n")
+}
+
+val arrayStates = mutableStateArrayMap.flatMap { case (javaType, 
mutableStateArrays) =>
+  val numArrays = mutableStateArrays.arrayNames.size
+  mutableStateArrays.arrayNames.zipWithIndex.map { case (arrayName, 
index) =>
+val length = if (index + 1 == numArrays) {
+  mutableStateArrays.getCurrentIndex
+} else {
+  CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT
--- End diff --

sorry, stupid question.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156930403
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -163,11 +192,49 @@ class CodegenContext {
* the list of default imports available.
* Also, generic type arguments are accepted but ignored.
* @param variableName Name of the field.
-   * @param initCode The statement(s) to put into the init() method to 
initialize this field.
+   * @param initFunc Function includes statement(s) to put into the init() 
method to initialize
+   * this field. The argument is the name of the mutable 
state variable.
* If left blank, the field will be default-initialized.
+   * @param forceInline whether the declaration and initialization code 
may be inlined rather than
+   *compacted. Please set `true` into forceInline, if 
you want to access the
+   *status fast (e.g. frequently accessed) or if you 
want to use the original
+   *variable name
+   * @param useFreshName If false and inline is true, the name is not 
changed
+   * @return the name of the mutable state variable, which is either the 
original name if the
+   * variable is inlined to the outer class, or an array access if 
the variable is to be
+   * stored in an array of variables of the same type and 
initialization.
+   * There are two use cases. One is to use the original name for 
global variable instead
+   * of fresh name. Second is to use the original initialization 
statement since it is
+   * complex (e.g. allocate multi-dimensional array or object 
constructor has variables).
+   * Primitive type variables will be inlined into outer class 
when the total number of
+   * mutable variables is less than 
`CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD`
+   * the max size of an array for compaction is given by
+   * `CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT`.
*/
-  def addMutableState(javaType: String, variableName: String, initCode: 
String = ""): Unit = {
-mutableStates += ((javaType, variableName, initCode))
+  def addMutableState(
+  javaType: String,
+  variableName: String,
+  initFunc: String => String = _ => "",
+  forceInline: Boolean = false,
+  useFreshName: Boolean = true): String = {
+val varName = if (useFreshName) freshName(variableName) else 
variableName
--- End diff --

instead of calling `freshName` here and adding a `useFreshName` parameter, 
can we follow the previous style and ask the caller side to guarantee the given 
name is unique? i.e. call `freshName` at caller side


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19971
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19971
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84906/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19971: [SPARK-22774][SQL][Test] Add compilation check into TPCD...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19971
  
**[Test build #84906 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84906/testReport)**
 for PR 19971 at commit 
[`7d2c599`](https://github.com/apache/spark/commit/7d2c5995070dd0ef56d3d75bca30d3fbfc8443d3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19979: [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression testsui...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19979
  
**[Test build #84911 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84911/testReport)**
 for PR 19979 at commit 
[`47dccdd`](https://github.com/apache/spark/commit/47dccdd2531f8f84f38c75139d9223c682a089d6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19979: [SPARK-22644][ML][TEST][FOLLOW-UP] ML regression ...

2017-12-14 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/19979

[SPARK-22644][ML][TEST][FOLLOW-UP] ML regression testsuite add 
StructuredStreaming test

## What changes were proposed in this pull request?

ML regression testsuite add StructuredStreaming test

In order to make the test suites easier to modify, a new helper function was added in `MLTest`:
```
def testTransformerByGlobalCheckFunc[A : Encoder](
  dataframe: DataFrame,
  transformer: Transformer,
  firstResultCol: String,
  otherResultCols: String*)
  (globalCheckFunction: Seq[Row] => Unit): Unit
```
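
A hypothetical call site, assuming a suite that extends `MLTest` (`df` and `model` are placeholders, not names from this PR):

```
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

testTransformerByGlobalCheckFunc[(Double, Vector)](df, model, "prediction") {
  rows: Seq[Row] =>
    // The function runs once over all collected rows, so cross-row
    // properties (counts, ordering, aggregates) can be asserted here.
    assert(rows.map(_.getDouble(0)).distinct.nonEmpty)
}
```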

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark ml_stream_test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19979.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19979


commit 47dccdd2531f8f84f38c75139d9223c682a089d6
Author: WeichenXu 
Date:   2017-12-14T12:14:56Z

init pr




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19978: [SPARK-22784][CORE] Configure reading buffer size in Spa...

2017-12-14 Thread MikhailErofeev
Github user MikhailErofeev commented on the issue:

https://github.com/apache/spark/pull/19978
  
I don't mind just setting it to a higher value; the current default (2048) is small in any case.
For my log files, a 30M buffer was the best value (a bigger one did not bring much additional speedup), although for other files the optimal value could be larger.

What do you think? Is it OK to keep the value at 30M? With 50 cores it could eat 1.5G.
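
For concreteness, a minimal sketch of a configurable read buffer (the 30M default echoes this thread and is an assumption, not the merged value):

```
import java.io.{BufferedInputStream, FileInputStream, InputStream}

// Wrap the event-log stream in a buffer whose size comes from configuration.
def openEventLog(path: String, bufferSizeBytes: Int = 30 * 1024 * 1024): InputStream =
  new BufferedInputStream(new FileInputStream(path), bufferSizeBytes)
```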


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156928231
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -110,8 +110,8 @@ case class RowDataSourceScanExec(
   override protected def doProduce(ctx: CodegenContext): String = {
 val numOutputRows = metricTerm(ctx, "numOutputRows")
 // PhysicalRDD always just has one input
-val input = ctx.freshName("input")
-ctx.addMutableState("scala.collection.Iterator", input, s"$input = 
inputs[0];")
+val input = ctx.addMutableState("scala.collection.Iterator", "input", 
v => s"$v = inputs[0];",
+  forceInline = true)
--- End diff --

why inline?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156928270
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -353,8 +353,8 @@ case class FileSourceScanExec(
 }
 val numOutputRows = metricTerm(ctx, "numOutputRows")
 // PhysicalRDD always just has one input
-val input = ctx.freshName("input")
-ctx.addMutableState("scala.collection.Iterator", input, s"$input = 
inputs[0];")
+val input = ctx.addMutableState("scala.collection.Iterator", "input", 
v => s"$v = inputs[0];",
+  forceInline = true)
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156928205
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala 
---
@@ -68,30 +68,26 @@ private[sql] trait ColumnarBatchScan extends 
CodegenSupport {
*/
   // TODO: return ColumnarBatch.Rows instead
   override protected def doProduce(ctx: CodegenContext): String = {
-val input = ctx.freshName("input")
 // PhysicalRDD always just has one input
-ctx.addMutableState("scala.collection.Iterator", input, s"$input = 
inputs[0];")
+val input = ctx.addMutableState("scala.collection.Iterator", "input",
+  v => s"$v = inputs[0];", forceInline = true)
--- End diff --

why inline?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156926848
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
 ---
@@ -219,4 +219,31 @@ class GeneratedProjectionSuite extends SparkFunSuite {
 // - one is the mutableRow
 assert(globalVariables.length == 3)
   }
+
+  test("SPARK-18016: generated projections on wider table requiring state 
compaction") {
+val N = 6000
+val wideRow1 = new GenericInternalRow((0 until N).toArray[Any])
+val schema1 = StructType((1 to N).map(i => StructField("", 
IntegerType)))
+val wideRow2 = new GenericInternalRow(
+  (0 until N).map(i => UTF8String.fromString(i.toString)).toArray[Any])
+val schema2 = StructType((1 to N).map(i => StructField("", 
StringType)))
+val joined = new JoinedRow(wideRow1, wideRow2)
+val joinedSchema = StructType(schema1 ++ schema2)
+val nested = new JoinedRow(InternalRow(joined, joined), joined)
+val nestedSchema = StructType(
+  Seq(StructField("", joinedSchema), StructField("", joinedSchema)) ++ 
joinedSchema)
+
+val safeProj = FromUnsafeProjection(nestedSchema)
+val result = safeProj(nested)
+
+// test generated MutableProjection
+val exprs = nestedSchema.fields.zipWithIndex.map { case (f, i) =>
+  BoundReference(i, f.dataType, true)
+}
+val mutableProj = GenerateMutableProjection.generate(exprs)
--- End diff --

shall we also test `GenerateUnsafeProjection`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...

2017-12-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19977
  
**[Test build #84910 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84910/testReport)**
 for PR 19977 at commit 
[`a252bab`](https://github.com/apache/spark/commit/a252babda0035b32d4fcb8d76cafeadc5e2549a3).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156926555
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
 ---
@@ -219,4 +219,31 @@ class GeneratedProjectionSuite extends SparkFunSuite {
 // - one is the mutableRow
 assert(globalVariables.length == 3)
   }
+
+  test("SPARK-18016: generated projections on wider table requiring state 
compaction") {
+val N = 6000
+val wideRow1 = new GenericInternalRow((0 until N).toArray[Any])
+val schema1 = StructType((1 to N).map(i => StructField("", 
IntegerType)))
+val wideRow2 = new GenericInternalRow(
+  (0 until N).map(i => UTF8String.fromString(i.toString)).toArray[Any])
--- End diff --

nit: `Array.tabulate[Any](N)(i => UTF8String.fromString(i.toString))`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156926377
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
 ---
@@ -219,4 +219,31 @@ class GeneratedProjectionSuite extends SparkFunSuite {
 // - one is the mutableRow
 assert(globalVariables.length == 3)
   }
+
+  test("SPARK-18016: generated projections on wider table requiring state 
compaction") {
+val N = 6000
+val wideRow1 = new GenericInternalRow((0 until N).toArray[Any])
--- End diff --

`(0 until N).toArray[Any]` -> `new Array[Any](N)`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19811: [SPARK-18016][SQL] Code Generation: Constant Pool...

2017-12-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19811#discussion_r156926146
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -401,4 +401,26 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 ctx.addReferenceObj("foo", foo)
 assert(ctx.mutableStates.isEmpty)
   }
+
+  test("SPARK-18016: def  mutable states by using an array") {
+val ctx1 = new CodegenContext
+for (i <- 1 to CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD + 10) {
+  ctx1.addMutableState(ctx1.JAVA_INT, "i", v => s"$v = $i;")
+}
+assert(ctx1.mutableStates.size == 
CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD)
+// When the number of primitive type mutable states is over the 
threshold, others are
+// allocated into an array
+
assert(ctx1.mutableStateArrayMap.get(ctx1.JAVA_INT).get.arrayNames.size == 1)
+assert(ctx1.mutableStateInitCodes.size == 
CodeGenerator.OUTER_CLASS_VARIABLES_THRESHOLD + 10)
+
+val ctx2 = new CodegenContext
+for (i <- 1 to CodeGenerator.MUTABLESTATEARRAY_SIZE_LIMIT + 10) {
+  ctx2.addMutableState("InternalRow[]", "r", v => s"$v = new 
InternalRow[$i];")
+}
+// When the number of non-primitive type mutable states is over the 
threshold, others are
+// allocated into a new array
+
assert(ctx2.mutableStateArrayMap.get("InternalRow[]").get.arrayNames.size == 2)
--- End diff --

and `assert(ctx2.mutableStateArrayMap("InternalRow[]").getCurrentIndex == 
10)`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14151
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84905/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14151
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


