date:20171114

[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19753
  
**[Test build #83884 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83884/testReport)**
 for PR 19753 at commit 
[`f684cd0`](https://github.com/apache/spark/commit/f684cd0e82bf237032c3efc90605f933ecb65c81).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19754: [BUILD] update release scripts

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19754
  
**[Test build #83883 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83883/testReport)**
 for PR 19754 at commit 
[`b860375`](https://github.com/apache/spark/commit/b860375d187f8560fefbb8ce394cf124a239ae0b).


---


[GitHub] spark pull request #19754: [BUILD] update release scripts

2017-11-14 Thread felixcheung

GitHub user felixcheung opened a pull request:

https://github.com/apache/spark/pull/19754

[BUILD] update release scripts

## What changes were proposed in this pull request?

Change to use dist.apache.org instead of the home directory.
SHA-512 checksums should have the .sha512 extension. From the ASF release signing doc: "The 
checksum SHOULD be generated using SHA-512. A .sha file SHOULD contain a SHA-1 
checksum, for historical reasons."
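
As a rough illustration of the naming rule quoted above, the sketch below writes a SHA-512 digest to a file with the `.sha512` extension. It is not taken from the release scripts: the helper name `write_sha512` and the `<digest>  <file name>` line layout are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def write_sha512(artifact: Path) -> Path:
    """Write `<artifact>.sha512` next to the artifact and return its path."""
    digest = hashlib.sha512()
    with artifact.open("rb") as fh:
        # Read in chunks so a large release tarball is not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    checksum_path = artifact.with_name(artifact.name + ".sha512")
    # Assumed layout "<hex digest>  <file name>"; the real scripts may differ.
    checksum_path.write_text(f"{digest.hexdigest()}  {artifact.name}\n")
    return checksum_path
```

Calling `write_sha512(Path("some-release.tgz"))` would produce `some-release.tgz.sha512` in the same directory; the file name here is only a placeholder.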

NOTE: I *think* this will require some changes to work with Jenkins' release 
build

## How was this patch tested?

manually

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/felixcheung/spark releasescript

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19754.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19754


commit e51029fef5b0359d71f56703e30fdc8acc43cc79
Author: Felix Cheung 
Date:   2017-11-14T10:42:15Z

svn

commit b860375d187f8560fefbb8ce394cf124a239ae0b
Author: Felix Cheung 
Date:   2017-11-15T05:40:32Z

fix repo publish




---


[GitHub] spark issue #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem i...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18641
  
**[Test build #83882 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83882/testReport)**
 for PR 18641 at commit 
[`5466ef0`](https://github.com/apache/spark/commit/5466ef0914e8e702d43019995067ef49ecb90696).


---


[GitHub] spark pull request #19630: wip: [SPARK-22409] Introduce function type argume...

2017-11-14 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19630#discussion_r151042155
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2247,16 +2142,20 @@ def pandas_udf(f=None, returnType=StringType()):
| 8|  JOHN DOE|  22|
+--+--++
 
-2. A `pandas.DataFrame` -> A `pandas.DataFrame`
+2. GROUP_MAP
 
-   This udf is only used with :meth:`pyspark.sql.GroupedData.apply`.
+   A group map UDF defines transformation: A `pandas.DataFrame` -> A 
`pandas.DataFrame`
The returnType should be a :class:`StructType` describing the 
schema of the returned
`pandas.DataFrame`.
+   The length of the returned `pandas.DataFrame` can arbitrary.
--- End diff --

nit: `can arbitrary` -> `can be arbitrary`?


---


[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...

2017-11-14 Thread wzhfy

Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19594
  
cc @cloud-fan @gatorsmile @ron8hu 


---


[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...

2017-11-14 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19747#discussion_r151043733
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 
---
@@ -895,6 +897,18 @@ private[hive] object HiveClientImpl {
 Option(hc.getComment).map(field.withComment).getOrElse(field)
   }
 
+  private def verifyColumnDataType(schema: StructType): Unit = {
+schema.map(col => {
--- End diff --

schema.foreach { col =>


---


[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...

2017-11-14 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19747#discussion_r151042051
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala 
---
@@ -68,6 +69,48 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils 
with TestHiveSingleton {
   import hiveContext._
   import spark.implicits._
 
+  test("SPARK-22431: table ctas - illegal nested type") {
--- End diff --

Put all the illegal cases together, since they share the same logic except for 
the SQL statements?


---


[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19747
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19747
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83879/
Test PASSed.


---


[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19747
  
**[Test build #83879 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83879/testReport)**
 for PR 19747 at commit 
[`6267033`](https://github.com/apache/spark/commit/626703310aa269a9351a2cf7b6ce23f8e4ab095a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem i...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18641
  
**[Test build #83880 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83880/testReport)**
 for PR 18641 at commit 
[`e69f126`](https://github.com/apache/spark/commit/e69f12636bee5f3496421d70f764976f4cb687b7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem i...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18641
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem i...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18641
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83880/
Test FAILed.


---


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-11-14 Thread jinxing64

Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19560
  
@wangyum 
Makes sense.
You can also try the approach in this PR. 
If there are many (tens of thousands of) ETLs in the warehouse, we cannot 
afford to give that many hints or fix all the inaccurate table properties in 
the metastore.


---


[GitHub] spark pull request #19740: [SPARK-22514][SQL] move ColumnVector.Array and Co...

2017-11-14 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19740#discussion_r151041303
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorBasedArray.java
 ---
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.vectorized;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.*;
+import org.apache.spark.unsafe.types.CalendarInterval;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Array abstraction in {@link ColumnVector}. The instance of this class 
is intended
+ * to be reused, callers should copy the data out if it needs to be stored.
+ */
+public final class VectorBasedArray extends ArrayData {
--- End diff --

`ColumnarArray`?


---


[GitHub] spark pull request #19740: [SPARK-22514][SQL] move ColumnVector.Array and Co...

2017-11-14 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19740#discussion_r151041177
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/VectorBasedRow.java
 ---
@@ -0,0 +1,328 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.vectorized;
+
+import java.math.BigDecimal;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.*;
+import org.apache.spark.unsafe.types.CalendarInterval;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Row abstraction in {@link ColumnVector}. The instance of this class is 
intended
+ * to be reused, callers should copy the data out if it needs to be stored.
+ */
+public final class VectorBasedRow extends InternalRow {
--- End diff --

How about `ColumnarRow`?


---


[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19631
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83876/
Test PASSed.


---


[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19631
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19631
  
**[Test build #83876 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83876/testReport)**
 for PR 19631 at commit 
[`08f47ca`](https://github.com/apache/spark/commit/08f47ca3fb54315c537b1134e31e0a1a912c285e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class ClientSuite extends SparkFunSuite with Matchers `


---


[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19753
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83881/
Test FAILed.


---


[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19753
  
**[Test build #83881 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83881/testReport)**
 for PR 19753 at commit 
[`108ce2b`](https://github.com/apache/spark/commit/108ce2b1daa6b9b908c8791654433e90c666).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, 
HasHandleInvalid, JavaMLReadable,`


---


[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19753
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19631
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83875/
Test PASSed.


---


[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19631
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19631
  
**[Test build #83875 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83875/testReport)**
 for PR 19631 at commit 
[`121bcf8`](https://github.com/apache/spark/commit/121bcf8a2758858cad8e88e7fb7d78566494765b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19753
  
**[Test build #83881 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83881/testReport)**
 for PR 19753 at commit 
[`108ce2b`](https://github.com/apache/spark/commit/108ce2b1daa6b9b908c8791654433e90c666).


---


[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-14 Thread WeichenXu123

GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/19753

[SPARK-22521][ML] VectorIndexerModel support handle unseen categories via 
handleInvalid: Python API

## What changes were proposed in this pull request?

Add a Python API for VectorIndexerModel to support handling unseen categories via 
handleInvalid.

## How was this patch tested?

doctest added.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark vector_indexer_invalid_py

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19753.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19753


commit 108ce2b1daa6b9b908c8791654433e90c666
Author: WeichenXu 
Date:   2017-11-15T06:04:56Z

init pr




---


[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19621
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19621
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83878/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19621
  
**[Test build #83878 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83878/testReport)**
 for PR 19621 at commit 
[`77bea32`](https://github.com/apache/spark/commit/77bea32984b167894be79736f56601a44b99).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark pull request #18853: [SPARK-21646][SQL] Add new type coercion to compa...

2017-11-14 Thread felixcheung

Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18853#discussion_r151037844
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1490,6 +1490,13 @@ that these options will be deprecated in future 
release as more optimizations ar
   Configures the number of partitions to use when shuffling data for 
joins or aggregations.
 
   
+  
+spark.sql.typeCoercion.mode
+default
+
+The default type coercion mode was used in spark 
prior to 2.3, and so it continues to be the default to avoid breaking behavior. 
However, it has logical inconsistencies. The hive mode is 
preferred for most new applications, though it may require additional manual 
casting.
--- End diff --

2.3 -> 2.3.0 to be clear?


---


[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...

2017-11-14 Thread kiszk

Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17436#discussion_r151037948
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/AggregateHashMap.java
 ---
@@ -40,7 +42,7 @@
  */
 public class AggregateHashMap {
 
-  private OnHeapColumnVector[] columnVectors;
+  private WritableColumnVector[] columnVectors;
--- End diff --

Thanks, I realized it this morning. I will revert changes in 
`AggregateHashMap`.


---


[GitHub] spark issue #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem i...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18641
  
**[Test build #83880 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83880/testReport)**
 for PR 18641 at commit 
[`e69f126`](https://github.com/apache/spark/commit/e69f12636bee5f3496421d70f764976f4cb687b7).


---


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-11-14 Thread wangyum

Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/19560
  
I also hit this issue:
```sql
select * from A join B on a.key = b.key
```
Table A is small, but table B is big and table B's stats are incorrect, so 
Spark will broadcast table B.

I tried to use a broadcast hint to solve this issue:
```sql
select /*+ MAPJOIN(A) */ * from A join B on a.key = b.key
```
But it doesn't work. I created a PR to fix it: 
https://github.com/apache/spark/pull/19714


---


[GitHub] spark pull request #19640: [SPARK-16986][WEB-UI] Converter Started, Complete...

2017-11-14 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/19640#discussion_r151034910
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js ---
@@ -46,3 +46,25 @@ function formatBytes(bytes, type) {
 var i = Math.floor(Math.log(bytes) / Math.log(k));
 return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + 
sizes[i];
 }
+
+function padZeroes(num) {
+  return ("0" + num).slice(-2);
+}
+
+function formatTimeMillis(timeMillis) {
+  if (timeMillis <= 0) {
+return "-";
+  } else {
+var dt = new Date(timeMillis);
+return dt.getFullYear() + "-" +
+  padZeroes(dt.getMonth() + 1) + "-" +
+  padZeroes(dt.getDate()) + " " +
+  padZeroes(dt.getHours()) + ":" +
+  padZeroes(dt.getMinutes()) + ":" +
+  padZeroes(dt.getSeconds());
+  }
+}
+
+function getTimeZone() {
+  return new Date().toString().match(/\(([A-Za-z\s].*)\)/)[1];
--- End diff --

This time zone seems incorrect. Safari gets `Asia/Shanghai`, but Chrome gets 
`America/Chicago` on the same computer.

How about just including this change in the release notes?
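
To make the behaviour of the pattern in `getTimeZone` easier to reason about, here is a small Python sketch that applies the same regular expression to a few date strings. The sample strings are hypothetical and were not captured from Safari or Chrome; the helper name `extract_time_zone` is also made up for the example.

```python
import re
from typing import Optional

# Same pattern as in the diff, written as a Python regular expression.
TZ_PATTERN = re.compile(r"\(([A-Za-z\s].*)\)")


def extract_time_zone(date_string: str) -> Optional[str]:
    """Return the parenthesised zone label, or None when there is none."""
    match = TZ_PATTERN.search(date_string)
    return match.group(1) if match else None


# Hypothetical inputs: an abbreviation, a long name, and no label at all.
samples = [
    "Tue Nov 14 2017 22:00:00 GMT+0800 (CST)",
    "Tue Nov 14 2017 08:00:00 GMT-0600 (Central Standard Time)",
    "Tue Nov 14 2017 14:00:00 GMT+0000",
]
labels = [extract_time_zone(s) for s in samples]
```

The result depends entirely on what the runtime puts inside the parentheses. Note also that in the JavaScript version, `match(...)` returns `null` when nothing matches, so indexing it with `[1]` would throw; the sketch returns `None` in that case instead.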


---


[GitHub] spark pull request #19640: [SPARK-16986][WEB-UI] Converter Started, Complete...

2017-11-14 Thread wangyum

Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/19640#discussion_r151033625
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js ---
@@ -46,3 +46,25 @@ function formatBytes(bytes, type) {
 var i = Math.floor(Math.log(bytes) / Math.log(k));
 return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + 
sizes[i];
 }
+
+function padZeroes(num) {
+  return ("0" + num).slice(-2);
--- End diff --

[TagNameQuery("a")](https://github.com/apache/spark/blob/4a78965c22f11fbda7c9ba843ee266048bf6d319/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala#L349)
 gets its result from 
[WebBrowser](https://github.com/apache/spark/blob/4a78965c22f11fbda7c9ba843ee266048bf6d319/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala#L43).
This browser doesn't support this function.


---


[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19594
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83870/
Test PASSed.


---


[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19594
  
Build finished. Test PASSed.


---


[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19594
  
**[Test build #83870 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83870/testReport)**
 for PR 19594 at commit 
[`67bd651`](https://github.com/apache/spark/commit/67bd65153bd0afc30c6ef4799caa02a05a19).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `  case class OverlappedRange(`


---


[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19594
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83871/
Test PASSed.


---


[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19594
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19692: [SPARK-22469][SQL] Accuracy problem in comparison with s...

2017-11-14 Thread liutang123

Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19692
  
Jenkins, retest this please


---


[GitHub] spark issue #19594: [SPARK-21984] [SQL] Join estimation based on equi-height...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19594
  
**[Test build #83871 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83871/testReport)**
 for PR 19594 at commit 
[`8b2084a`](https://github.com/apache/spark/commit/8b2084a4bec8fdd58cca809b2d2b26bdc939436d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class OverlappedRange(`


---


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

2017-11-14 Thread hhbyyh

Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/19588#discussion_r151029101
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
@@ -311,22 +346,39 @@ class VectorIndexerModel private[ml] (
   // TODO: Check more carefully about whether this whole class will be included in a closure.
 
   /** Per-vector transform function */
-  private val transformFunc: Vector => Vector = {
+  private lazy val transformFunc: Vector => Vector = {
     val sortedCatFeatureIndices = categoryMaps.keys.toArray.sorted
     val localVectorMap = categoryMaps
     val localNumFeatures = numFeatures
+    val localHandleInvalid = getHandleInvalid
     val f: Vector => Vector = { (v: Vector) =>
       assert(v.size == localNumFeatures, "VectorIndexerModel expected vector of length" +
         s" $numFeatures but found length ${v.size}")
       v match {
         case dv: DenseVector =>
+          var hasInvalid = false
           val tmpv = dv.copy
           localVectorMap.foreach { case (featureIndex: Int, categoryMap: Map[Double, Int]) =>
-            tmpv.values(featureIndex) = categoryMap(tmpv(featureIndex))
+            try {
--- End diff --

The try part is fast, yet the catch part can be very slow by comparison.
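
The cost difference can be illustrated with a small sketch. This is not the PR's code: `UnseenCategoryLookup`, `lookupWithTry`, and `lookupWithGet` are hypothetical names, and the map stands in for one feature's `categoryMap`. A miss on `Map.apply` throws `NoSuchElementException`, and constructing an exception captures a stack trace, whereas `Map.get` reports a miss as `None` without throwing.

```scala
// Hypothetical sketch: two ways to look up a possibly unseen category value.
object UnseenCategoryLookup {
  // Exception-based lookup: cheap when the key exists, but every miss
  // throws and catches a NoSuchElementException.
  def lookupWithTry(categoryMap: Map[Double, Int], value: Double): Option[Int] =
    try Some(categoryMap(value))
    catch { case _: NoSuchElementException => None }

  // Exception-free lookup: a miss is reported as None.
  def lookupWithGet(categoryMap: Map[Double, Int], value: Double): Option[Int] =
    categoryMap.get(value)

  def main(args: Array[String]): Unit = {
    val categoryMap = Map(0.0 -> 0, 2.0 -> 1, 5.0 -> 2)
    println(lookupWithTry(categoryMap, 2.0))
    println(lookupWithGet(categoryMap, 3.0))
  }
}
```

Both functions return the same results; they differ only in how much work a miss costs, which matters if unseen values are common in the input.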


---

[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19381
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83877/
Test PASSed.


---

[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19381
  
Merged build finished. Test PASSed.


---

[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19381
  
**[Test build #83877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83877/testReport)** for PR 19381 at commit [`de84ca5`](https://github.com/apache/spark/commit/de84ca501d17b44f9153577ad2118e1254d80d34).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19747
  
**[Test build #83879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83879/testReport)** for PR 19747 at commit [`6267033`](https://github.com/apache/spark/commit/626703310aa269a9351a2cf7b6ce23f8e4ab095a).


---

[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-14 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19747
  
ok to test


---

[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-14 Thread mallman

Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r151026919
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -961,6 +961,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data.")
+      .booleanConf
+      .createWithDefault(true)
--- End diff --

Giving it more thought, I believe it's prudent to choose correctness over 
performance. I will change the default to `false`. "Power users" will set it to 
`true` and (hopefully) report a problem if they run into one.
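
For readers who want to opt in once the default is `false`, a minimal sketch follows. It assumes `spark-sql` is on the classpath, that a local master is acceptable, and that the key keeps the name shown in the diff; the object name and app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical opt-in example; requires the spark-sql dependency.
object EnableNestedSchemaPruning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-schema-pruning-opt-in")
      // Key taken from the diff; the default is assumed to be false after the change.
      .config("spark.sql.nestedSchemaPruning.enabled", "true")
      .getOrCreate()

    // Read the value back to confirm what this session will use.
    println(spark.conf.get("spark.sql.nestedSchemaPruning.enabled"))
    spark.stop()
  }
}
```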


---

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19621
  
**[Test build #83878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83878/testReport)** for PR 19621 at commit [`77bea32`](https://github.com/apache/spark/commit/77bea32984b167894be79736f56601a44b99).


---

[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19631
  
**[Test build #83876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83876/testReport)** for PR 19631 at commit [`08f47ca`](https://github.com/apache/spark/commit/08f47ca3fb54315c537b1134e31e0a1a912c285e).


---

[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19381
  
**[Test build #83877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83877/testReport)** for PR 19381 at commit [`de84ca5`](https://github.com/apache/spark/commit/de84ca501d17b44f9153577ad2118e1254d80d34).


---

[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19746
  
Merged build finished. Test PASSed.


---

[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19746
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83869/
Test PASSed.


---

[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19746
  
**[Test build #83869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83869/testReport)** for PR 19746 at commit [`73fe1d8`](https://github.com/apache/spark/commit/73fe1d8087cfc2d59ac5b9af48b4cf5f5b86f920).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  class InvalidEntryException(msg: String) extends Exception(msg) `


---

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19433
  
**[Test build #83874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83874/testReport)** for PR 19433 at commit [`d86dd18`](https://github.com/apache/spark/commit/d86dd18e47451c2e4463c68db441f92a898ac765).
 * This patch **fails to generate documentation**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19433
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83874/
Test FAILed.


---

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19433
  
Merged build finished. Test FAILed.


---

[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19631
  
**[Test build #83875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83875/testReport)** for PR 19631 at commit [`121bcf8`](https://github.com/apache/spark/commit/121bcf8a2758858cad8e88e7fb7d78566494765b).


---

[GitHub] spark issue #19640: [SPARK-16986][WEB-UI] Converter Started, Completed and L...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19640
  
Merged build finished. Test PASSed.


---

[GitHub] spark issue #19640: [SPARK-16986][WEB-UI] Converter Started, Completed and L...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19640
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83868/
Test PASSed.


---

[GitHub] spark issue #19640: [SPARK-16986][WEB-UI] Converter Started, Completed and L...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19640
  
**[Test build #83868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83868/testReport)** for PR 19640 at commit [`4a78965`](https://github.com/apache/spark/commit/4a78965c22f11fbda7c9ba843ee266048bf6d319).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19433
  
**[Test build #83873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83873/testReport)** for PR 19433 at commit [`0b27c56`](https://github.com/apache/spark/commit/0b27c56d1ea4e1108a62b77e9eca8ae160740756).
 * This patch **fails to generate documentation**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19433
  
Merged build finished. Test FAILed.


---

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19433
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83873/
Test FAILed.


---

[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2017-11-14 Thread smurching

Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19381#discussion_r151020618
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala
 ---
@@ -19,14 +19,16 @@ package org.apache.spark.ml.regression
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.feature.LabeledPoint
+import org.apache.spark.ml.linalg.Vector
 import org.apache.spark.ml.tree.impl.TreeTests
 import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
+import org.apache.spark.ml.util.TestingUtils._
--- End diff --

Nit: unused import


---

[GitHub] spark pull request #19381: [SPARK-10884][ML] Support prediction on single in...

2017-11-14 Thread smurching

Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19381#discussion_r151020666
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GBTRegressorSuite.scala ---
@@ -19,15 +19,16 @@ package org.apache.spark.ml.regression
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.feature.LabeledPoint
-import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.tree.impl.TreeTests
 import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
+import org.apache.spark.ml.util.TestingUtils._
--- End diff --

Nit: unused import


---

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19433
  
**[Test build #83874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83874/testReport)** for PR 19433 at commit [`d86dd18`](https://github.com/apache/spark/commit/d86dd18e47451c2e4463c68db441f92a898ac765).


---

[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-11-14 Thread vanzin

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r151020385
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -745,15 +739,20 @@ private[spark] class Client(
       // Save the YARN configuration into a separate file that will be overlayed on top of the
       // cluster's Hadoop conf.
       confStream.putNextEntry(new ZipEntry(SPARK_HADOOP_CONF_FILE))
-      yarnConf.writeXml(confStream)
+      hadoopConf.writeXml(confStream)
       confStream.closeEntry()
 
       // Save Spark configuration to a file in the archive.
       val props = new Properties()
       sparkConf.getAll.foreach { case (k, v) => props.setProperty(k, v) }
       // Override spark.yarn.key to point to the location in distributed cache which will be used
       // by AM.
-      Option(amKeytabFileName).foreach { k => props.setProperty(KEYTAB.key, k) }
+      Option(amKeytabFileName).foreach { k =>
+        // Do not propagate the app's secret using the config file.
+        if (k != SecurityManager.SPARK_AUTH_SECRET_CONF) {
--- End diff --

Oh, I think this is the wrong place for the check.
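
One reading of this comment is that the check in the diff compares the keytab file name, not a configuration key, against the secret's key, so the filter presumably belongs in the loop that copies configuration entries into `props`. A minimal sketch under that assumption: `ConfFilter` and `toProperties` are hypothetical names, and `"spark.authenticate.secret"` is assumed to be the value of `SecurityManager.SPARK_AUTH_SECRET_CONF`.

```scala
import java.util.Properties

// Hypothetical sketch: drop the auth secret while copying config entries.
object ConfFilter {
  // Assumed value of SecurityManager.SPARK_AUTH_SECRET_CONF.
  val SecretKey = "spark.authenticate.secret"

  // Copy every (key, value) pair except the secret into a Properties object.
  def toProperties(conf: Seq[(String, String)]): Properties = {
    val props = new Properties()
    conf.foreach { case (k, v) =>
      if (k != SecretKey) props.setProperty(k, v)
    }
    props
  }
}
```

Filtering on the configuration key keeps the secret out of the written file regardless of what the keytab file is called.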


---

[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-11-14 Thread vanzin

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r151020337
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -412,8 +412,6 @@ class SparkContext(config: SparkConf) extends Logging {
       }
     }
 
-    if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
--- End diff --

This change is removing all references to `SPARK_YARN_MODE`.


---

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19621
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83872/
Test FAILed.


---

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19621
  
**[Test build #83872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83872/testReport)** for PR 19621 at commit [`b0b14b0`](https://github.com/apache/spark/commit/b0b14b0971a7b941abbadf52d03dbb7d77e93adc).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19621
  
Merged build finished. Test FAILed.


---

[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19433
  
**[Test build #83873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83873/testReport)** for PR 19433 at commit [`0b27c56`](https://github.com/apache/spark/commit/0b27c56d1ea4e1108a62b77e9eca8ae160740756).


---

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-14 Thread smurching

Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r151019591
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -852,6 +662,41 @@ private[spark] object RandomForest extends Logging {
   }
 
   /**
+   * Find the best split for a node.
+   *
+   * @param binAggregates Bin statistics.
+   * @return tuple for best split: (Split, information gain, prediction at node)
+   */
+  private[tree] def binsToBestSplit(
+      binAggregates: DTStatsAggregator,
+      splits: Array[Array[Split]],
+      featuresForNode: Option[Array[Int]],
+      node: LearningNode): (Split, ImpurityStats) = {
+    val validFeatureSplits = getNonConstantFeatures(binAggregates.metadata, featuresForNode)
+    // For each (feature, split), calculate the gain, and select the best (feature, split).
+    val parentImpurityCalc = if (node.stats == null) None else Some(node.stats.impurityCalculator)
--- End diff --

I believe so: the nodes at the top level are created ([RandomForest.scala:178](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L178)) with [`LearningNode.emptyNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L341), which sets `node.stats = null`.

I could change this to check node depth (via node index), but if we're 
planning on deprecating node indices in the future it might be best not to.
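
As an aside, the null check in the quoted line can also be written with `Option(...)`, which maps `null` to `None`. The sketch below uses hypothetical stand-in types (`Calc`, `Stats`, `Node`) rather than Spark's `ImpurityCalculator`, `ImpurityStats`, and `LearningNode`, so it only illustrates the idiom.

```scala
// Hypothetical stand-ins for the Spark tree types, for illustration only.
object ParentCalcSketch {
  final case class Calc(count: Long)
  final class Stats(val impurityCalculator: Calc)
  final class Node(var stats: Stats)

  // Option(null) is None, so a node without stats yields None;
  // otherwise the calculator is wrapped in Some.
  def parentCalc(node: Node): Option[Calc] =
    Option(node.stats).map(_.impurityCalculator)
}
```

This is equivalent to `if (node.stats == null) None else Some(node.stats.impurityCalculator)` whenever `impurityCalculator` itself is non-null.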


---

[GitHub] spark issue #19588: [SPARK-12375][ML] VectorIndexerModel support handle unse...

2017-11-14 Thread WeichenXu123

Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19588
  
Python API JIRA created here: https://issues.apache.org/jira/browse/SPARK-22521


---

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19621
  
@WeichenXu123 I will try to look into this today.


---

[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-11-14 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r151015745
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -412,8 +412,6 @@ class SparkContext(config: SparkConf) extends Logging {
       }
     }
 
-    if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
--- End diff --

Not sure why this is no longer required?


---

[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-11-14 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r151017494
  
--- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala ---
@@ -216,7 +216,9 @@ private[spark] object CoarseGrainedExecutorBackend extends Logging {
       if (driverConf.contains("spark.yarn.credentials.file")) {
         logInfo("Will periodically update credentials from: " +
           driverConf.get("spark.yarn.credentials.file"))
-        SparkHadoopUtil.get.startCredentialUpdater(driverConf)
+        Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil")
--- End diff --

I see, thanks for the explanation. This kind of reflection does not seem very elegant, though.
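
For context, the general pattern under discussion is loading a class by name at runtime, presumably because the calling module cannot reference the class at compile time (the explanation itself is not part of this thread). The sketch below uses the plain JVM `Class.forName` as a stand-in for Spark's `Utils.classForName`; `OptionalClassLoader` and `loadIfPresent` are hypothetical names.

```scala
import scala.util.Try

// Hypothetical sketch of optional, reflection-based class loading.
object OptionalClassLoader {
  // Returns the class if it is on the classpath, None otherwise.
  // Class.forName throws ClassNotFoundException on a miss, which Try captures.
  def loadIfPresent(className: String): Option[Class[_]] =
    Try(Class.forName(className)).toOption
}
```

The trade-off is that a misspelled class name only shows up at runtime, which is one reason this style can feel inelegant.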


---

[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-11-14 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r151018454
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -745,15 +739,20 @@ private[spark] class Client(
       // Save the YARN configuration into a separate file that will be overlayed on top of the
       // cluster's Hadoop conf.
       confStream.putNextEntry(new ZipEntry(SPARK_HADOOP_CONF_FILE))
-      yarnConf.writeXml(confStream)
+      hadoopConf.writeXml(confStream)
       confStream.closeEntry()
 
       // Save Spark configuration to a file in the archive.
       val props = new Properties()
       sparkConf.getAll.foreach { case (k, v) => props.setProperty(k, v) }
       // Override spark.yarn.key to point to the location in distributed cache which will be used
       // by AM.
-      Option(amKeytabFileName).foreach { k => props.setProperty(KEYTAB.key, k) }
+      Option(amKeytabFileName).foreach { k =>
+        // Do not propagate the app's secret using the config file.
+        if (k != SecurityManager.SPARK_AUTH_SECRET_CONF) {
--- End diff --

Is it necessary to add a check here? I'm not sure how this could happen.


---

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread WeichenXu123

Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19621
  
@viirya @MLnick Thanks!


---

[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19621
  
**[Test build #83872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83872/testReport)** for PR 19621 at commit [`b0b14b0`](https://github.com/apache/spark/commit/b0b14b0971a7b941abbadf52d03dbb7d77e93adc).


---

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-14 Thread smurching

Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r151017375
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
@@ -0,0 +1,215 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import org.apache.spark.ml.tree.{CategoricalSplit, Split}
+import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
+import org.apache.spark.mllib.tree.model.ImpurityStats
+
+/** Utility methods for choosing splits during local & distributed tree training. */
+private[impl] object SplitUtils {
+
+  /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */
+  private def sortByCentroid(
+      binAggregates: DTStatsAggregator,
+      featureIndex: Int,
+      featureIndexIdx: Int): List[Int] = {
+    /* Each bin is one category (feature value).
+     * The bins are ordered based on centroidForCategories, and this ordering determines which
+     * splits are considered.  (With K categories, we consider K - 1 possible splits.)
+     *
+     * centroidForCategories is a list: (category, centroid)
+     */
+    val numCategories = binAggregates.metadata.numBins(featureIndex)
+    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+
+    val centroidForCategories = Range(0, numCategories).map { featureValue =>
+      val categoryStats =
+        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
+      val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats)
+      (featureValue, centroid)
+    }
+    // TODO(smurching): How to handle logging statements like these?
+    // logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(","))
+    // bins sorted by centroids
+    val categoriesSortedByCentroid = centroidForCategories.toList.sortBy(_._2).map(_._1)
+    // logDebug("Sorted centroids for categorical variable = " +
+    //   categoriesSortedByCentroid.mkString(","))
+    categoriesSortedByCentroid
+  }
+
+  /**
+   * Find the best split for an unordered categorical feature at a single node.
+   *
+   * Algorithm:
+   *  - Considers all possible subsets (exponentially many)
+   *
+   * @param featureIndex  Global index of feature being split.
+   * @param featureIndexIdx Index of feature being split within subset of features for current node.
+   * @param featureSplits Array of splits for the current feature
+   * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node
+   * @return  (best split, statistics for split)  If no valid split was found, the returned
+   *  ImpurityStats instance will be invalid (have member valid = false).
+   */
+  private[impl] def chooseUnorderedCategoricalSplit(
+      binAggregates: DTStatsAggregator,
+      featureIndex: Int,
+      featureIndexIdx: Int,
+      featureSplits: Array[Split],
+      parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = {
+    // Unordered categorical feature
+    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
+    val numSplits = binAggregates.metadata.numSplits(featureIndex)
+    var parentCalc = parentCalculator
+    val (bestFeatureSplitIndex, bestFeatureGainStats) =
+      Range(0, numSplits).map { splitIndex =>
+        val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex)
+        val rightChildStats = binAggregates.getParentImpurityCalculator()
+          .subtract(leftChildStats)
+        val gainAndImpurityStats = ImpurityUtils.calculateImpurityStats(parentCalc,
+          leftChildStats, rightChildStats, binAggregates.metadata)
+        // Compute parent stats once, when considering first split for current feature
+        if
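
The quoted diff is cut off at this point in the archive. The centroid ordering described in the `sortByCentroid` comment (K ordered categories, K - 1 candidate splits) can be sketched independently of Spark's aggregator types. `CentroidOrdering`, `candidateLeftSets`, and the sample centroids are hypothetical and only illustrate the ordering step.

```scala
// Hypothetical, Spark-free sketch of ordering categories by label centroid.
object CentroidOrdering {
  // Input: (category, centroid) pairs. Output: categories sorted by centroid.
  def sortByCentroid(centroids: Seq[(Int, Double)]): List[Int] =
    centroids.toList.sortBy(_._2).map(_._1)

  // With K ordered categories, candidate split i sends the first i categories
  // to the left child, for i = 1 .. K - 1, giving K - 1 candidates.
  def candidateLeftSets(ordered: List[Int]): List[List[Int]] =
    (1 until ordered.length).map(i => ordered.take(i)).toList
}
```

For three categories this yields two candidate splits, in contrast to the unordered case in the diff, which considers subsets of categories.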

[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19751
  
Merged build finished. Test PASSed.


---

[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19751
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83867/
Test PASSed.


---


[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19751
  
**[Test build #83867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83867/testReport)** for PR 19751 at commit [`8c346a1`](https://github.com/apache/spark/commit/8c346a148d7be78b0f53aadb9c8ca78098b0ea6c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19750
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83865/
Test FAILed.


---


[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19750
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19750
  
**[Test build #83865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83865/testReport)** for PR 19750 at commit [`406779b`](https://github.com/apache/spark/commit/406779bf05cbab7afd8c632ebb7035fb0f2cbd28).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #19594: [WIP] [SPARK-21984] Join estimation based on equi-height...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19594
  
**[Test build #83871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83871/testReport)** for PR 19594 at commit [`8b2084a`](https://github.com/apache/spark/commit/8b2084a4bec8fdd58cca809b2d2b26bdc939436d).


---


[GitHub] spark issue #19752: [SPARK-22520][SQL] Support code generation for large Cas...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19752
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83866/
Test PASSed.


---


[GitHub] spark issue #19752: [SPARK-22520][SQL] Support code generation for large Cas...

2017-11-14 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19752
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #19752: [SPARK-22520][SQL] Support code generation for large Cas...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19752
  
**[Test build #83866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83866/testReport)** for PR 19752 at commit [`98eaae9`](https://github.com/apache/spark/commit/98eaae9436adf63ec3023ee077f2fff8e23dfa35).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class CaseWhen(`


---


[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-14 Thread smurching

Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r151011913
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -627,221 +621,37 @@ private[spark] object RandomForest extends Logging {
   }
 
   /**
-   * Calculate the impurity statistics for a given (feature, split) based upon left/right
-   * aggregates.
-   *
-   * @param stats the recycle impurity statistics for this feature's all splits,
-   *              only 'impurity' and 'impurityCalculator' are valid between each iteration
-   * @param leftImpurityCalculator left node aggregates for this (feature, split)
-   * @param rightImpurityCalculator right node aggregate for this (feature, split)
-   * @param metadata learning and dataset metadata for DecisionTree
-   * @return Impurity statistics for this (feature, split)
+   * Return a list of pairs (featureIndexIdx, featureIndex) where featureIndex is the global
+   * (across all trees) index of a feature and featureIndexIdx is the index of a feature within the
+   * list of features for a given node. Filters out constant features (features with 0 splits)
    */
-  private def calculateImpurityStats(
-      stats: ImpurityStats,
-      leftImpurityCalculator: ImpurityCalculator,
-      rightImpurityCalculator: ImpurityCalculator,
-      metadata: DecisionTreeMetadata): ImpurityStats = {
-
-    val parentImpurityCalculator: ImpurityCalculator = if (stats == null) {
-      leftImpurityCalculator.copy.add(rightImpurityCalculator)
-    } else {
-      stats.impurityCalculator
-    }
-
-    val impurity: Double = if (stats == null) {
-      parentImpurityCalculator.calculate()
-    } else {
-      stats.impurity
-    }
-
-    val leftCount = leftImpurityCalculator.count
-    val rightCount = rightImpurityCalculator.count
-
-    val totalCount = leftCount + rightCount
-
-    // If left child or right child doesn't satisfy minimum instances per node,
-    // then this split is invalid, return invalid information gain stats.
-    if ((leftCount < metadata.minInstancesPerNode) ||
-      (rightCount < metadata.minInstancesPerNode)) {
-      return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
-    }
-
-    val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0
-    val rightImpurity = rightImpurityCalculator.calculate()
-
-    val leftWeight = leftCount / totalCount.toDouble
-    val rightWeight = rightCount / totalCount.toDouble
-
-    val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
-
-    // if information gain doesn't satisfy minimum information gain,
-    // then this split is invalid, return invalid information gain stats.
-    if (gain < metadata.minInfoGain) {
-      return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
+  private[impl] def getNonConstantFeatures(
+      metadata: DecisionTreeMetadata,
+      featuresForNode: Option[Array[Int]]): Seq[(Int, Int)] = {
+    Range(0, metadata.numFeaturesPerNode).map { featureIndexIdx =>
--- End diff --

At some point when refactoring I was hitting errors caused by a stateful operation within a `map` over the output of this method (IIRC the result of the `map` was accessed repeatedly, causing the stateful operation to inadvertently be run multiple times).

However, using `withFilter` and `view` now seems to work, so I'll change it back :)
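
The failure mode described above can be sketched in isolation. This is a minimal illustration, not code from the PR: the object and method names are hypothetical, and the counter stands in for whatever stateful operation was involved. It assumes only standard Scala collections, where a `view` is lazy and re-applies the mapping function on every traversal, while a strict `map` applies it once.

```scala
object ViewReevaluationSketch {
  // Hypothetical helper: counts how many times the mapping function runs
  // when the mapped collection is traversed twice.
  def callsAfterTwoTraversals(useView: Boolean): Int = {
    var calls = 0
    // Stateful function: increments the counter as a side effect.
    val f: Int => Int = { i => calls += 1; i * 2 }
    if (useView) {
      // Lazy view: nothing is computed here; f runs on each traversal.
      val mapped = Range(0, 3).view.map(f)
      mapped.foreach(_ => ())
      mapped.foreach(_ => ())
    } else {
      // Strict map: f runs once per element, results are stored.
      val mapped = Range(0, 3).map(f)
      mapped.foreach(_ => ())
      mapped.foreach(_ => ())
    }
    calls
  }
}
```

With three elements, the lazy variant runs the function six times across two traversals and the strict variant runs it three times, which is why a side effect inside a lazily mapped sequence can fire more often than intended.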


---


[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

2017-11-14 Thread smurching

Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19433#discussion_r151011879
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala ---
@@ -112,7 +113,7 @@ private[spark] object ImpurityStats {
    * minimum number of instances per node.
    */
   def getInvalidImpurityStats(impurityCalculator: ImpurityCalculator): ImpurityStats = {
-    new ImpurityStats(Double.MinValue, impurityCalculator.calculate(),
+    new ImpurityStats(Double.MinValue, impurity = -1,
--- End diff --

I changed this to be -1 here since node impurity would eventually get set to -1 anyway when `LearningNodes` with invalid `ImpurityStats` were converted into decision tree leaf nodes (see [`LearningNode.toNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L279)).
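
As a rough sketch of how the gain computation and the invalid sentinel fit together, following the `calculateImpurityStats` logic quoted in the RandomForest.scala diff of this PR: `Stats`, `Invalid`, and `splitStats` are hypothetical, simplified stand-ins, not Spark's `ImpurityStats` API, and the impurity values are passed in directly rather than computed by an `ImpurityCalculator`.

```scala
object InvalidStatsSketch {
  // Hypothetical, simplified stand-in for Spark's ImpurityStats.
  final case class Stats(gain: Double, impurity: Double, valid: Boolean)

  // Sentinel mirroring the change under review: gain = Double.MinValue, impurity = -1.
  val Invalid: Stats = Stats(Double.MinValue, impurity = -1.0, valid = false)

  def splitStats(
      parentImpurity: Double,
      leftCount: Long,
      leftImpurity: Double,
      rightCount: Long,
      rightImpurity: Double,
      minInstancesPerNode: Int,
      minInfoGain: Double): Stats = {
    if (leftCount < minInstancesPerNode || rightCount < minInstancesPerNode) {
      // A child with too few instances makes the split invalid.
      Invalid
    } else {
      val total = (leftCount + rightCount).toDouble
      // Gain is parent impurity minus the count-weighted child impurities.
      val gain = parentImpurity -
        (leftCount / total) * leftImpurity -
        (rightCount / total) * rightImpurity
      // Too little gain also makes the split invalid.
      if (gain < minInfoGain) Invalid else Stats(gain, parentImpurity, valid = true)
    }
  }
}
```

In this sketch, a parent with impurity 0.5 split into two pure children of five instances each yields a gain of 0.5, while any split that leaves a child below `minInstancesPerNode` returns the sentinel with impurity -1.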


---


[GitHub] spark pull request #19708: [SPARK-22479][SQL] Exclude credentials from Savei...

2017-11-14 Thread ash211

Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19708#discussion_r151010772
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommandSuite.scala ---
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.test.SharedSQLContext
+
+class SaveIntoDataSourceCommandSuite extends SharedSQLContext {
+
+  override protected def sparkConf: SparkConf = super.sparkConf
+    .set("spark.redaction.string.regex", "(?i)password|url")
+
+  test("treeString is redacted") {
--- End diff --

Old test name? We're not modifying the treeString anymore; it's just the `SaveIntoDataSourceCommand`.
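
For readers unfamiliar with the regex set in the suite above, here is a minimal sketch of what a case-insensitive pattern like `(?i)password|url` would match when applied to option keys. It uses only `scala.util.matching.Regex`; the `redact` helper and the `<redacted>` placeholder are hypothetical and are not Spark's redaction utility or its replacement text.

```scala
import scala.util.matching.Regex

object RedactionSketch {
  // Same pattern as in the test configuration above.
  val pattern: Regex = "(?i)password|url".r

  // Hypothetical placeholder; Spark's actual replacement text may differ.
  val Placeholder = "<redacted>"

  // Replace the value of every option whose key matches the pattern.
  def redact(options: Map[String, String]): Map[String, String] =
    options.map { case (key, value) =>
      if (pattern.findFirstIn(key).isDefined) key -> Placeholder else key -> value
    }
}
```

Because of the leading `(?i)` flag, keys such as `Password` and `URL` match regardless of case, while an unrelated key such as `driver` is left untouched.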


---


[GitHub] spark issue #19594: [WIP] [SPARK-21984] Join estimation based on equi-height...

2017-11-14 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19594
  
**[Test build #83870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83870/testReport)** for PR 19594 at commit [`67bd651`](https://github.com/apache/spark/commit/67bd65153bd0afc30c6ef4799caa02a05a19).


---


[GitHub] spark pull request #19588: [SPARK-12375][ML] VectorIndexerModel support hand...

2017-11-14 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19588


---

