[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19601
  
**[Test build #83931 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83931/testReport)**
 for PR 19601 at commit 
[`9a41914`](https://github.com/apache/spark/commit/9a41914694c8f1f56f294cc2380bd6ecf1ce73b8).


---




[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...

2017-11-15 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17436#discussion_r151340713
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala
 ---
@@ -75,9 +77,14 @@ class VectorizedHashMapGenerator(
   }
 }.mkString("\n").concat(";")
 
+val columnVector = if (SparkEnv.get.memoryManager.tungstenMemoryMode == MemoryMode.ON_HEAP) {
+  "OnHeapColumnVector"
+} else {
+  "OffHeapColumnVector"
+}
s"""
-   |  private org.apache.spark.sql.execution.vectorized.OnHeapColumnVector[] batchVectors;
-   |  private org.apache.spark.sql.execution.vectorized.OnHeapColumnVector[] bufferVectors;
+   |  private org.apache.spark.sql.execution.vectorized.WritableColumnVector[] batchVectors;
+   |  private org.apache.spark.sql.execution.vectorized.WritableColumnVector[] bufferVectors;
--- End diff --

Good catch, thanks


---




[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19601
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19601
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83930/
Test FAILed.


---




[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19601
  
**[Test build #83930 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83930/testReport)**
 for PR 19601 at commit 
[`17449b4`](https://github.com/apache/spark/commit/17449b4748c5c32539227c7f50c4b6ec236ab4ee).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...

2017-11-15 Thread gczsjdy
Github user gczsjdy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19763#discussion_r151339166
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -473,16 +477,41 @@ private[spark] class MapOutputTrackerMaster(
   }
 
   /**
+   * Try to equally divide Range(0, num) to divisor slices
+   */
+  def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = {
+assert(divisor > 0, "Divisor should be positive")
+val (each, remain) = (num / divisor, num % divisor)
+val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each)
+if (each != 0) {
+  smaller.grouped(each) ++ bigger.grouped(each + 1)
+} else {
+  bigger.grouped(each + 1)
+}
+  }
+
+  /**
* Return statistics about all of the outputs for a given shuffle.
*/
   def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = {
 shuffleStatuses(dep.shuffleId).withMapStatuses { statuses =>
   val totalSizes = new Array[Long](dep.partitioner.numPartitions)
-  for (s <- statuses) {
-for (i <- 0 until totalSizes.length) {
-  totalSizes(i) += s.getSizeForBlock(i)
-}
+  val mapStatusSubmitTasks = ArrayBuffer[Future[_]]()
+  var taskSlices = parallelism
+
+  equallyDivide(totalSizes.length, taskSlices).foreach {
+reduceIds =>
+  mapStatusSubmitTasks += threadPoolMapStats.submit(
+new Runnable {
+  override def run(): Unit = {
+for (s <- statuses; i <- reduceIds) {
+  totalSizes(i) += s.getSizeForBlock(i)
+}
+  }
+}
+  )
   }
+  mapStatusSubmitTasks.foreach(_.get())
--- End diff --

Should I use the `scala.concurrent.ExecutionContext.Implicits.global` 
ExecutionContext?
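
As an aside, the slicing behavior of the proposed `equallyDivide` can be checked in isolation; the function below is copied from the diff above, and the asserts are a minimal sketch of its contract (adjacent slice sizes differ by at most one):

```scala
// equallyDivide as proposed in the diff above
def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = {
  assert(divisor > 0, "Divisor should be positive")
  val (each, remain) = (num / divisor, num % divisor)
  val (smaller, bigger) = (0 until num).splitAt((divisor - remain) * each)
  if (each != 0) {
    smaller.grouped(each) ++ bigger.grouped(each + 1)
  } else {
    bigger.grouped(each + 1)
  }
}

val slices = equallyDivide(10, 3).toList
assert(slices.map(_.size) == List(3, 3, 4))    // sizes differ by at most one
assert(slices.flatten == (0 until 10).toList)  // covers every reduce id exactly once
```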


---




[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19601
  
**[Test build #83930 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83930/testReport)**
 for PR 19601 at commit 
[`17449b4`](https://github.com/apache/spark/commit/17449b4748c5c32539227c7f50c4b6ec236ab4ee).


---




[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19764
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...

2017-11-15 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19621
  
It seems that in the frequency-based string orderings, the order of labels with the same frequency is non-deterministic.


---




[GitHub] spark pull request #19764: [SPARK-22539][SQL] Add second order for rangepart...

2017-11-15 Thread caneGuy
GitHub user caneGuy opened a pull request:

https://github.com/apache/spark/pull/19764

[SPARK-22539][SQL] Add second order for rangepartitioner since partition number may be small if the specified key is skewed

## What changes were proposed in this pull request?

The RangePartitioner generated from a shuffle exchange may cause partition skew if the sort key is skewed.
This patch adds a second ordering to the RangePartitioner to avoid this situation.
This is an improvement drawn from a real-world case.
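
A hedged sketch of the symptom, runnable in spark-shell (`sc` in scope); `tieBreaker` below is a hypothetical secondary field, not this patch's code:

```scala
import org.apache.spark.RangePartitioner

// Almost every key is identical, so the sampled range bounds collapse and the
// resulting partitioner has far fewer partitions than the 8 requested.
val skewed = sc.parallelize(Seq.fill(100000)("hot") ++ Seq("a", "b")).map((_, 1))
val p = new RangePartitioner(8, skewed)
println(p.numPartitions)  // much smaller than 8

// Sorting on a composite key restores distinct bounds, e.g.
//   rdd.sortBy(r => (r.skewedKey, r.tieBreaker))
```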

## How was this patch tested?

Manually tested.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/caneGuy/spark zhoukang/add-secondorder

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19764.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19764


commit 29d2c869ffc6dd4d1a3cf7606cde5d03e72fa171
Author: zhoukang 
Date:   2017-11-15T07:24:59Z

[SPARK][SQL] Add second order for rangepartitioner since partition number 
may be small if the specified key is skewed




---




[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...

2017-11-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19630#discussion_r151333004
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2208,26 +2089,39 @@ def udf(f=None, returnType=StringType()):
 | 8|  JOHN DOE|  22|
 +--+--++
 """
-return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.NORMAL_UDF)
+# decorator @udf, @udf(), @udf(dataType())
+if f is None or isinstance(f, (str, DataType)):
+# If DataType has been passed as a positional argument
+# for decorator use it as a returnType
+return_type = f or returnType
+return functools.partial(_create_udf, returnType=return_type,
+ evalType=PythonEvalType.SQL_BATCHED_UDF)
+else:
+return _create_udf(f=f, returnType=returnType,
+   evalType=PythonEvalType.SQL_BATCHED_UDF)
 
 
 @since(2.3)
-def pandas_udf(f=None, returnType=StringType()):
+def pandas_udf(f=None, returnType=None, functionType=None):
 """
 Creates a vectorized user defined function (UDF).
 
 :param f: user-defined function. A python function if used as a 
standalone function
 :param returnType: a :class:`pyspark.sql.types.DataType` object
+:param functionType: an enum value in 
:class:`pyspark.sql.functions.PandasUdfType`.
--- End diff --

`PandasUdfType` -> `PandasUDFType`?


---




[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...

2017-11-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19630#discussion_r151335091
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala
 ---
@@ -137,15 +138,18 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with 
PredicateHelper {
   udf.references.subsetOf(child.outputSet)
 }
 if (validUdfs.nonEmpty) {
-  if (validUdfs.exists(_.pythonUdfType == PythonUdfType.PANDAS_GROUPED_UDF)) {
-throw new IllegalArgumentException("Can not use grouped vectorized UDFs")
-  }
+  require(validUdfs.forall(udf =>
+udf.evalType == PythonEvalType.SQL_BATCHED_UDF ||
+udf.evalType == PythonEvalType.PANDAS_SCALAR_UDF
+  ), "Can only extract scalar vectorized udf or sql batch udf")
--- End diff --

```require(validUdfs.forall(_.evalType != 
PythonEvalType.PANDAS_GROUP_MAP_UDF))```?


---




[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...

2017-11-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19630#discussion_r151333227
  
--- Diff: python/pyspark/rdd.py ---
@@ -56,6 +56,20 @@
 __all__ = ["RDD"]
 
 
+class PythonEvalType(object):
+"""
+Evaluation type of python rdd.
+
+These values are internal to PySpark.
--- End diff --

And should match with enum values in 
`org.apache.spark.api.python.PythonEvalType`?


---




[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...

2017-11-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19630#discussion_r151332296
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -34,9 +34,11 @@ import org.apache.spark.util._
  */
 private[spark] object PythonEvalType {
   val NON_UDF = 0
-  val SQL_BATCHED_UDF = 1
-  val SQL_PANDAS_UDF = 2
-  val SQL_PANDAS_GROUPED_UDF = 3
+
+  val SQL_BATCHED_UDF = 100
+
+  val PANDAS_SCALAR_UDF = 200
+  val PANDAS_GROUP_MAP_UDF = 201
--- End diff --

nit: Is it better to keep `SQL` prefix?


---




[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...

2017-11-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19630#discussion_r151333891
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2208,26 +2089,39 @@ def udf(f=None, returnType=StringType()):
 | 8|  JOHN DOE|  22|
 +--+--++
 """
-return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.NORMAL_UDF)
+# decorator @udf, @udf(), @udf(dataType())
+if f is None or isinstance(f, (str, DataType)):
+# If DataType has been passed as a positional argument
+# for decorator use it as a returnType
+return_type = f or returnType
+return functools.partial(_create_udf, returnType=return_type,
+ evalType=PythonEvalType.SQL_BATCHED_UDF)
+else:
+return _create_udf(f=f, returnType=returnType,
+   evalType=PythonEvalType.SQL_BATCHED_UDF)
 
 
 @since(2.3)
-def pandas_udf(f=None, returnType=StringType()):
+def pandas_udf(f=None, returnType=None, functionType=None):
 """
 Creates a vectorized user defined function (UDF).
 
 :param f: user-defined function. A python function if used as a 
standalone function
 :param returnType: a :class:`pyspark.sql.types.DataType` object
+:param functionType: an enum value in 
:class:`pyspark.sql.functions.PandasUdfType`.
+ Default: SCALAR.
 
-The user-defined function can define one of the following 
transformations:
+The function type of the UDF can be one of the following:
 
-1. One or more `pandas.Series` -> A `pandas.Series`
+1. SCALAR
 
-   This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and
-   :meth:`pyspark.sql.DataFrame.select`.
+   A scalar UDF defines a transformation: One or more `pandas.Series` 
-> A `pandas.Series`.
The returnType should be a primitive data type, e.g., 
`DoubleType()`.
The length of the returned `pandas.Series` must be of the same as 
the input `pandas.Series`.
 
+   Scalar UDFs are used with :meth:`pyspark.sql.DataFrame.withColumn` 
and
+   :meth:`pyspark.sql.DataFrame.select`.
+
>>> from pyspark.sql.types import IntegerType, StringType
>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
>>> @pandas_udf(returnType=StringType())
--- End diff --

In this doctest, there are two pandas_udfs. Please explicitly assign `PandasUDFType.SCALAR` as the `functionType` of one of the udfs.


---




[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...

2017-11-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/17436#discussion_r151333181
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala
 ---
@@ -75,9 +77,14 @@ class VectorizedHashMapGenerator(
   }
 }.mkString("\n").concat(";")
 
+val columnVector = if (SparkEnv.get.memoryManager.tungstenMemoryMode == MemoryMode.ON_HEAP) {
+  "OnHeapColumnVector"
+} else {
+  "OffHeapColumnVector"
+}
s"""
-   |  private org.apache.spark.sql.execution.vectorized.OnHeapColumnVector[] batchVectors;
-   |  private org.apache.spark.sql.execution.vectorized.OnHeapColumnVector[] bufferVectors;
+   |  private org.apache.spark.sql.execution.vectorized.WritableColumnVector[] batchVectors;
+   |  private org.apache.spark.sql.execution.vectorized.WritableColumnVector[] bufferVectors;
--- End diff --

We can use `$columnVector` instead of `WritableColumnVector`?
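
For instance, something along these lines; a minimal sketch of the suggestion with an assumed `columnVector` value, not the actual generator source:

```scala
// Sketch: interpolate the concrete class chosen by the memory-mode check
// instead of hard-coding WritableColumnVector in the generated Java source.
val columnVector = "OnHeapColumnVector"  // or "OffHeapColumnVector", per the diff above
val generated =
  s"""
     |  private org.apache.spark.sql.execution.vectorized.$columnVector[] batchVectors;
     |  private org.apache.spark.sql.execution.vectorized.$columnVector[] bufferVectors;
   """.stripMargin
```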


---




[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...

2017-11-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19630#discussion_r151331041
  
--- Diff: python/pyspark/worker.py ---
@@ -89,6 +90,26 @@ def verify_result_length(*a):
 return lambda *a: (verify_result_length(*a), arrow_return_type)
 
 
+def wrap_pandas_group_map_udf(f, return_type):
+def wrapped(*series):
+import pandas as pd
+
+result = f(pd.concat(series, axis=1))
--- End diff --

Can we remove `keys=columns`? The original includes it as the 3rd argument.


---




[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...

2017-11-15 Thread gczsjdy
Github user gczsjdy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19763#discussion_r151332369
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -473,16 +477,41 @@ private[spark] class MapOutputTrackerMaster(
   }
 
   /**
+   * Try to equally divide Range(0, num) to divisor slices
+   */
+  def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = {
+assert(divisor > 0, "Divisor should be positive")
+val (each, remain) = (num / divisor, num % divisor)
+val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each)
+if (each != 0) {
+  smaller.grouped(each) ++ bigger.grouped(each + 1)
+} else {
+  bigger.grouped(each + 1)
+}
+  }
+
+  /**
* Return statistics about all of the outputs for a given shuffle.
*/
   def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = {
 shuffleStatuses(dep.shuffleId).withMapStatuses { statuses =>
   val totalSizes = new Array[Long](dep.partitioner.numPartitions)
-  for (s <- statuses) {
-for (i <- 0 until totalSizes.length) {
-  totalSizes(i) += s.getSizeForBlock(i)
-}
+  val mapStatusSubmitTasks = ArrayBuffer[Future[_]]()
+  var taskSlices = parallelism
+
+  equallyDivide(totalSizes.length, taskSlices).foreach {
+reduceIds =>
+  mapStatusSubmitTasks += threadPoolMapStats.submit(
+new Runnable {
+  override def run(): Unit = {
+for (s <- statuses; i <- reduceIds) {
+  totalSizes(i) += s.getSizeForBlock(i)
+}
+  }
+}
+  )
   }
+  mapStatusSubmitTasks.foreach(_.get())
--- End diff --

Good idea, thx!


---




[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...

2017-11-15 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18424
  
I guess this is inactive now.


---




[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...

2017-11-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19763#discussion_r151331447
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -473,16 +477,41 @@ private[spark] class MapOutputTrackerMaster(
   }
 
   /**
+   * Try to equally divide Range(0, num) to divisor slices
+   */
+  def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = {
+assert(divisor > 0, "Divisor should be positive")
+val (each, remain) = (num / divisor, num % divisor)
+val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each)
+if (each != 0) {
+  smaller.grouped(each) ++ bigger.grouped(each + 1)
+} else {
+  bigger.grouped(each + 1)
+}
+  }
+
+  /**
* Return statistics about all of the outputs for a given shuffle.
*/
   def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = {
 shuffleStatuses(dep.shuffleId).withMapStatuses { statuses =>
   val totalSizes = new Array[Long](dep.partitioner.numPartitions)
-  for (s <- statuses) {
-for (i <- 0 until totalSizes.length) {
-  totalSizes(i) += s.getSizeForBlock(i)
-}
+  val mapStatusSubmitTasks = ArrayBuffer[Future[_]]()
+  var taskSlices = parallelism
--- End diff --

Why `var`?


---




[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...

2017-11-15 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/19747#discussion_r151331207
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 
---
@@ -507,6 +508,7 @@ private[hive] class HiveClientImpl(
 // these properties are still available to the others that share the 
same Hive metastore.
 // If users explicitly alter these Hive-specific properties through 
ALTER TABLE DDL, we respect
 // these user-specified values.
+verifyColumnDataType(table.dataSchema)
--- End diff --

Thanks @gatorsmile for the review. I'll incorporate your other comments in my next commit.

In the current codeline, another recent PR changed verifyColumnNames to verifyDataSchema.

The reason I could not put the check in verifyDataSchema (or the old verifyColumnNames):
- verifyDataSchema is called at the beginning of the doCreateTable method, but we cannot error out that early: later in that method we create the datasource table, and if the datasource table cannot be stored in a Hive-compatible format, it falls back to the Spark SQL-specific format, which works fine.
- For example, if I put the check there, then the following create of a datasource table would throw an exception right away, which we do not want:

```CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET```


---




[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19330
  
**[Test build #83929 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83929/testReport)**
 for PR 19330 at commit 
[`6aa4b8b`](https://github.com/apache/spark/commit/6aa4b8b3ad7b12a9196cd9b6740c0abe07fccf2c).


---




[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19330
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19330
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83926/
Test FAILed.


---




[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19330
  
**[Test build #83926 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83926/testReport)**
 for PR 19330 at commit 
[`b1886f6`](https://github.com/apache/spark/commit/b1886f6ff3d917ea062ace95c19de7ca18ab18a4).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class Max(child: Expression) extends DeclarativeAggregate with 
OrderSpecified `
  * `case class Min(child: Expression) extends DeclarativeAggregate with 
OrderSpecified `
  * `case class Least(children: Seq[Expression]) extends Expression with 
OrderSpecified `
  * `case class Greatest(children: Seq[Expression]) extends Expression with 
OrderSpecified `
  * `case class OrderMaps(child: Expression) extends UnaryExpression with 
ExpectsInputTypes `
  * `class InterpretedOrdering(orders: Seq[SortOrder]) extends 
Ordering[InternalRow] `
  * `case class In(value: Expression, list: Seq[Expression]) extends 
Predicate with OrderSpecified `
  * `case class InSet(child: Expression, hset: Set[Any])`
  * `abstract class BinaryComparison extends BinaryOperator with Predicate 
with OrderSpecified `


---




[GitHub] spark issue #19757: [SPARK-22529] [SQL] Relation stats should be consistent ...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19757
  
**[Test build #83928 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83928/testReport)**
 for PR 19757 at commit 
[`c75ae70`](https://github.com/apache/spark/commit/c75ae7062f550008155529156f0001ecc9a67e88).


---




[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...

2017-11-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19601#discussion_r151329119
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/UnsafeColumnVector.java
 ---
@@ -0,0 +1,517 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.vectorized;
+
+import java.nio.ByteBuffer;
+
+import org.apache.commons.lang.NotImplementedException;
+
+import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData;
+import org.apache.spark.sql.types.*;
+import org.apache.spark.unsafe.Platform;
+
+/**
+ * A column backed by UnsafeArrayData on byte[].
+ */
+public final class UnsafeColumnVector extends WritableColumnVector {
--- End diff --

Since this `UnsafeColumnVector` represents an array column, will we use APIs like `getBoolean`, `getBooleans`, etc.?


---




[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...

2017-11-15 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19601#discussion_r151325703
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/UnsafeColumnVector.java
 ---
@@ -0,0 +1,517 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.vectorized;
+
+import java.nio.ByteBuffer;
+
+import org.apache.commons.lang.NotImplementedException;
+
+import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData;
+import org.apache.spark.sql.types.*;
+import org.apache.spark.unsafe.Platform;
+
+/**
+ * A column backed by UnsafeArrayData on byte[].
+ */
+public final class UnsafeColumnVector extends WritableColumnVector {
--- End diff --

You are right, `UnsafeColumnVector.putByteArray` is used to put the whole array in `byte[]` for `UnsafeArrayData`. I will add a comment explaining the usage of this API to make it clear.

Good catch on `getBooleans`. That is my mistake, since it has to take `rowId` into account. I will fix this.


---




[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...

2017-11-15 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19601#discussion_r151324069
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java
 ---
@@ -93,28 +93,6 @@ public static void populate(WritableColumnVector col, 
InternalRow row, int field
 }
   }
 
-  /**
-   * Returns the array data as the java primitive array.
-   * For example, an array of IntegerType will return an int[].
-   * Throws exceptions for unhandled schemas.
-   */
-  public static Object toPrimitiveJavaArray(ColumnarArray array) {
--- End diff --

This method was used only for test. I removed this by replacing this with 
another method.


---




[GitHub] spark pull request #19762: [SPARK-22535][PySpark] Sleep before killing the p...

2017-11-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19762


---




[GitHub] spark issue #19763: [SPARK-22537][core] Aggregation of map output statistics...

2017-11-15 Thread CodingCat
Github user CodingCat commented on the issue:

https://github.com/apache/spark/pull/19763
  
my question is: how often have we seen this operation of collecting statistics become the bottleneck?


---




[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/19762
  
Thanks! merging to master.


---




[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...

2017-11-15 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19763#discussion_r151322438
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -473,16 +477,41 @@ private[spark] class MapOutputTrackerMaster(
   }
 
   /**
+   * Try to equally divide Range(0, num) to divisor slices
+   */
+  def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = {
+assert(divisor > 0, "Divisor should be positive")
+val (each, remain) = (num / divisor, num % divisor)
+val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each)
+if (each != 0) {
+  smaller.grouped(each) ++ bigger.grouped(each + 1)
+} else {
+  bigger.grouped(each + 1)
+}
+  }
+
+  /**
* Return statistics about all of the outputs for a given shuffle.
*/
   def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = {
 shuffleStatuses(dep.shuffleId).withMapStatuses { statuses =>
   val totalSizes = new Array[Long](dep.partitioner.numPartitions)
-  for (s <- statuses) {
-for (i <- 0 until totalSizes.length) {
-  totalSizes(i) += s.getSizeForBlock(i)
-}
+  val mapStatusSubmitTasks = ArrayBuffer[Future[_]]()
+  var taskSlices = parallelism
+
+  equallyDivide(totalSizes.length, taskSlices).foreach {
+reduceIds =>
+  mapStatusSubmitTasks += threadPoolMapStats.submit(
+new Runnable {
+  override def run(): Unit = {
+for (s <- statuses; i <- reduceIds) {
+  totalSizes(i) += s.getSizeForBlock(i)
+}
+  }
+}
+  )
   }
+  mapStatusSubmitTasks.foreach(_.get())
--- End diff --

this part can be simplified by using scala's Future, 

```scala
val futureArray = equallyDivide(totalSizes.length, taskSlices).map {
  reduceIds => Future {
    // whatever you want to do here
  }
}
Await.result(Future.sequence(futureArray), Duration.Inf) // or some timeout value you prefer
```
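
For completeness, a self-contained sketch of that suggestion, assuming the global `ExecutionContext` in place of a dedicated pool and a `statuses: Seq[Int => Long]` stand-in for the `MapStatus.getSizeForBlock` calls in the diff:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each slice owns a disjoint set of reduce ids, so the futures never write
// to the same slot of totalSizes.
def aggregateSizes(
    statuses: Seq[Int => Long],   // stand-in for MapStatus.getSizeForBlock
    numPartitions: Int,
    slices: Iterator[Seq[Int]]): Array[Long] = {
  val totalSizes = new Array[Long](numPartitions)
  val futureArray = slices.map { reduceIds =>
    Future {
      for (s <- statuses; i <- reduceIds) {
        totalSizes(i) += s(i)
      }
    }
  }.toList
  Await.result(Future.sequence(futureArray), Duration.Inf)
  totalSizes
}

// 4 identical map statuses over 6 reduce partitions, split into 3 slices:
val sizes = aggregateSizes(Seq.fill(4)(i => i.toLong), 6,
  Iterator(Seq(0, 1), Seq(2, 3), Seq(4, 5)))
assert(sizes.toList == List(0L, 4L, 8L, 12L, 16L, 20L))  // 4 * i per slot
```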


---




[GitHub] spark pull request #19751: [SPARK-20653][core] Add cleaning of old elements ...

2017-11-15 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19751#discussion_r151321285
  
--- Diff: 
core/src/main/scala/org/apache/spark/status/ElementTrackingStore.scala ---
@@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.status
+
+import java.util.concurrent.TimeUnit
+
+import scala.collection.mutable.{HashMap, ListBuffer}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.util.{ThreadUtils, Utils}
+import org.apache.spark.util.kvstore._
+
+/**
+ * A KVStore wrapper that allows tracking the number of elements of 
specific types, and triggering
+ * actions once they reach a threshold. This allows writers, for example, 
to control how much data
+ * is stored by potentially deleting old data as new data is added.
+ *
+ * This store is used when populating data either from a live UI or an 
event log. On top of firing
+ * triggers when elements reach a certain threshold, it provides two extra 
bits of functionality:
+ *
+ * - a generic worker thread that can be used to run expensive tasks 
asynchronously; the tasks can
+ *   be configured to run on the calling thread when more determinism is 
desired (e.g. unit tests).
+ * - a generic flush mechanism so that listeners can be notified about 
when they should flush
+ *   internal state to the store (e.g. after the SHS finishes parsing an 
event log).
+ *
+ * The configured triggers are run on the same thread that triggered the 
write, after the write
+ * has completed.
+ */
+private[spark] class ElementTrackingStore(store: KVStore, conf: SparkConf) 
extends KVStore {
+
+  import config._
+
+  private val triggers = new HashMap[Class[_], Seq[Trigger[_]]]()
+  private val flushTriggers = new ListBuffer[() => Unit]()
+  private val executor = if (conf.get(ASYNC_TRACKING_ENABLED)) {
+
Some(ThreadUtils.newDaemonSingleThreadExecutor("element-tracking-store-worker"))
+  } else {
+None
+  }
+
+  @volatile private var stopped = false
+
+  /**
+   * Register a trigger that will be fired once the number of elements of 
a given type reaches
+   * the given threshold.
+   *
+   * Triggers are fired in a separate thread, so that they can do more 
expensive operations
+   * than would be allowed on the main threads populating the store.
+   *
+   * @param klass The type to monitor.
+   * @param threshold The number of elements that should trigger the 
action.
+   * @param action Action to run when the threshold is reached; takes as a 
parameter the number
+   *   of elements of the registered type currently known to 
be in the store.
+   */
+  def addTrigger(klass: Class[_], threshold: Long)(action: Long => Unit): 
Unit = {
+val existing = triggers.getOrElse(klass, Seq())
+triggers(klass) = existing :+ Trigger(threshold, action)
+  }
+
+  /**
+   * Adds a trigger to be executed before the store is flushed. This 
normally happens before
+   * closing, and is useful for flushing intermediate state to the store, 
e.g. when replaying
+   * in-progress applications through the SHS.
+   *
+   * Flush triggers are called synchronously in the same thread that is 
closing the store.
+   */
+  def onFlush(action: => Unit): Unit = {
+flushTriggers += { () => action }
+  }
+
+  /**
+   * Enqueues an action to be executed asynchronously.
+   */
+  def doAsync(fn: => Unit): Unit = {
+executor match {
+  case Some(exec) =>
+exec.submit(new Runnable() {
+  override def run(): Unit = Utils.tryLog { fn }
+})
+
+  case _ =>
+fn
+}
+  }
+
+  override def read[T](klass: Class[T], naturalKey: Any): T = 
store.read(klass, naturalKey)
+
+  override def write(value: Any): Unit = store.write(value)
+
+  /** Write an element to the store, optio
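
For context, a hedged usage sketch of the trigger API quoted above; the element class, threshold, and pruning logic are hypothetical stand-ins, not names from Spark:

```scala
import org.apache.spark.status.ElementTrackingStore

class MyTaskData  // hypothetical element type

def installTriggers(store: ElementTrackingStore): Unit = {
  val maxRetained = 1000L  // hypothetical retention limit

  // Fires on the store's worker thread once writes push the element count of
  // MyTaskData past the threshold; the action receives the current count.
  store.addTrigger(classOf[MyTaskData], maxRetained) { count =>
    // prune (count - maxRetained) oldest elements here
  }

  // Runs synchronously on the closing thread before the store shuts down.
  store.onFlush {
    // flush any intermediate listener state to the store here
  }
}
```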

[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-11-15 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19631
  
Did another round of review, LGTM overall. @tgravescs do you have any comments?


---




[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-11-15 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r151320052
  
--- Diff: 
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 ---
@@ -216,7 +216,9 @@ private[spark] object CoarseGrainedExecutorBackend 
extends Logging {
   if (driverConf.contains("spark.yarn.credentials.file")) {
 logInfo("Will periodically update credentials from: " +
   driverConf.get("spark.yarn.credentials.file"))
-SparkHadoopUtil.get.startCredentialUpdater(driverConf)
+
Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil")
--- End diff --

#19272 is already in, do you plan to update it here in this PR or in a 
separate PR?


---




[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19751
  
**[Test build #83927 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83927/testReport)**
 for PR 19751 at commit 
[`8b150e0`](https://github.com/apache/spark/commit/8b150e0b2b4909239482038ee0742ced6e4c7511).


---




[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...

2017-11-15 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/19751
  
@vanzin looks like this PR has conflicts now.


---




[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19762
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83925/
Test PASSed.


---




[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19762
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19762
  
**[Test build #83925 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83925/testReport)**
 for PR 19762 at commit 
[`dffcb05`](https://github.com/apache/spark/commit/dffcb05e3c7b7ef0973769337092b2b7adbc7992).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19330
  
**[Test build #83926 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83926/testReport)**
 for PR 19330 at commit 
[`b1886f6`](https://github.com/apache/spark/commit/b1886f6ff3d917ea062ace95c19de7ca18ab18a4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19763: [SPARK-22537] Aggregation of map output statistics on dr...

2017-11-15 Thread gczsjdy
Github user gczsjdy commented on the issue:

https://github.com/apache/spark/pull/19763
  
cc @cloud-fan @viirya @gatorsmile @chenghao-intel 


---




[GitHub] spark issue #19763: [SPARK-22537] Aggregation of map output statistics on dr...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19763
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19763: [SPARK-22537] Aggregation of map output statistic...

2017-11-15 Thread gczsjdy
GitHub user gczsjdy opened a pull request:

https://github.com/apache/spark/pull/19763

[SPARK-22537] Aggregation of map output statistics on driver faces single 
point bottleneck

## What changes were proposed in this pull request?

In adaptive execution, the map output statistics of all mappers are aggregated after the previous stage executes successfully. The driver performs this aggregation, and it gets slow when the number of `mappers * shuffle partitions` is large, since it uses only a single thread to compute. This PR uses multiple threads to address this single-point bottleneck.

## How was this patch tested?

Test cases are in `MapOutputTrackerSuite.scala`


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gczsjdy/spark single_point_mapstatistics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19763


commit 5dd04872e983de861a301c22a124dd8923ccc8c6
Author: GuoChenzhao 
Date:   2017-11-16T02:58:22Z

Use multi-thread to solve single point bottleneck

commit 819774fc7087c51a4b7b03213bfb330331d6f108
Author: GuoChenzhao 
Date:   2017-11-16T03:01:21Z

Add test case

commit da028258bd172b6d3ff89504097fb6651f5c05c0
Author: GuoChenzhao 
Date:   2017-11-16T03:24:47Z

Style




---




[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19762
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83921/
Test PASSed.


---




[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19762
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19762
  
**[Test build #83921 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83921/testReport)**
 for PR 19762 at commit 
[`a1f054b`](https://github.com/apache/spark/commit/a1f054ba9aff2bef5caf4534ef56dfd5e22eb96e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-15 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19753
  
This LGTM, @jkbradley would you be able to give this a look?


---




[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...

2017-11-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19601#discussion_r151311790
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java
 ---
@@ -93,28 +93,6 @@ public static void populate(WritableColumnVector col, 
InternalRow row, int field
 }
   }
 
-  /**
-   * Returns the array data as the java primitive array.
-   * For example, an array of IntegerType will return an int[].
-   * Throws exceptions for unhandled schemas.
-   */
-  public static Object toPrimitiveJavaArray(ColumnarArray array) {
--- End diff --

Why this method? Looks it is just used in test.


---




[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...

2017-11-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19601#discussion_r151315786
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/UnsafeColumnVector.java
 ---
@@ -0,0 +1,517 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.vectorized;
+
+import java.nio.ByteBuffer;
+
+import org.apache.commons.lang.NotImplementedException;
+
+import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData;
+import org.apache.spark.sql.types.*;
+import org.apache.spark.unsafe.Platform;
+
+/**
+ * A column backed by UnsafeArrayData on byte[].
+ */
+public final class UnsafeColumnVector extends WritableColumnVector {
--- End diff --

This abstraction looks confusing at first glance. It doesn't seem to follow the usual `ColumnVector` API usage. It looks like this uses `putByteArray` to set up the byte array `data` which stores the data of this array column.

IIUC, this is proposed to represent only array columns, but some of the API implementations look weird. For example, `getBoolean` respects the `rowId` parameter while `getBooleans` doesn't.
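
To make that contract concrete, a toy illustration in Scala (not the PR's Java code) of a bulk getter that applies the same `rowId` offset as the single-element getter:

```scala
// Toy boolean column: both getters must index relative to rowId.
class ToyColumn(data: Array[Boolean]) {
  def getBoolean(rowId: Int): Boolean = data(rowId)

  // Correct bulk getter: offsets every element by rowId, matching getBoolean.
  def getBooleans(rowId: Int, count: Int): Array[Boolean] =
    Array.tabulate(count)(i => data(rowId + i))
}

val col = new ToyColumn(Array(true, false, true))
assert(col.getBooleans(1, 2).sameElements(
  Array(col.getBoolean(1), col.getBoolean(2))))
```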


---




[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19753#discussion_r151315757
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, JavaMLReadable, Ja
   "(>= 2). If a feature is found to have > 
maxCategories values, then " +
   "it is declared continuous.", 
typeConverter=TypeConverters.toInt)
 
+handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle 
invalid data " +
+  "(unseen labels or NULL values). Options are 
'skip' (filter out " +
+  "rows with invalid data), 'error' (throw an 
error), or 'keep' (put " +
+  "invalid data in a special additional bucket, at 
index numCategories).",
+  typeConverter=TypeConverters.toString)
+
 @keyword_only
-def __init__(self, maxCategories=20, inputCol=None, outputCol=None):
+def __init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
 """
-__init__(self, maxCategories=20, inputCol=None, outputCol=None)
+__init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error")
 """
 super(VectorIndexer, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid)
-self._setDefault(maxCategories=20)
+self._setDefault(maxCategories=20, handleInvalid="error")
 kwargs = self._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.4.0")
-def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+def setParams(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
--- End diff --

Thanks for the explanation, that makes sense.


---




[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19761
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19761
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83924/
Test PASSed.


---




[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19761
  
**[Test build #83924 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83924/testReport)**
 for PR 19761 at commit 
[`adbf94e`](https://github.com/apache/spark/commit/adbf94eef67780b2ef547eff250ea3e11e4deda2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19671: [SPARK-22297][CORE TESTS] Flaky test: BlockManagerSuite ...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19671
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83920/
Test PASSed.


---




[GitHub] spark issue #19671: [SPARK-22297][CORE TESTS] Flaky test: BlockManagerSuite ...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19671
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19671: [SPARK-22297][CORE TESTS] Flaky test: BlockManagerSuite ...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19671
  
**[Test build #83920 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83920/testReport)**
 for PR 19671 at commit 
[`6da1d34`](https://github.com/apache/spark/commit/6da1d348ee5a14cb4cb34d6c2ca8be824aab5264).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19750
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19750
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83919/
Test PASSed.


---




[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19750
  
**[Test build #83919 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83919/testReport)**
 for PR 19750 at commit 
[`81b3f3d`](https://github.com/apache/spark/commit/81b3f3dd223785e02ad2df2a6e40046a76539681).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19753#discussion_r151311832
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, JavaMLReadable, Ja
   "(>= 2). If a feature is found to have > 
maxCategories values, then " +
   "it is declared continuous.", 
typeConverter=TypeConverters.toInt)
 
+handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle 
invalid data " +
+  "(unseen labels or NULL values). Options are 
'skip' (filter out " +
+  "rows with invalid data), 'error' (throw an 
error), or 'keep' (put " +
+  "invalid data in a special additional bucket, at 
index numCategories).",
+  typeConverter=TypeConverters.toString)
+
 @keyword_only
-def __init__(self, maxCategories=20, inputCol=None, outputCol=None):
+def __init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
 """
-__init__(self, maxCategories=20, inputCol=None, outputCol=None)
+__init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error")
 """
 super(VectorIndexer, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid)
-self._setDefault(maxCategories=20)
+self._setDefault(maxCategories=20, handleInvalid="error")
 kwargs = self._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.4.0")
-def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+def setParams(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
--- End diff --

You can also check the `Param._set` method in PySpark; you will find that it 
skips input params whose value is `None`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19753#discussion_r151311569
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, JavaMLReadable, Ja
   "(>= 2). If a feature is found to have > 
maxCategories values, then " +
   "it is declared continuous.", 
typeConverter=TypeConverters.toInt)
 
+handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle 
invalid data " +
+  "(unseen labels or NULL values). Options are 
'skip' (filter out " +
+  "rows with invalid data), 'error' (throw an 
error), or 'keep' (put " +
+  "invalid data in a special additional bucket, at 
index numCategories).",
+  typeConverter=TypeConverters.toString)
+
 @keyword_only
-def __init__(self, maxCategories=20, inputCol=None, outputCol=None):
+def __init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
 """
-__init__(self, maxCategories=20, inputCol=None, outputCol=None)
+__init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error")
 """
 super(VectorIndexer, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid)
-self._setDefault(maxCategories=20)
+self._setDefault(maxCategories=20, handleInvalid="error")
 kwargs = self._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.4.0")
-def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+def setParams(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
--- End diff --

Ah, but, unfortunately, I think you're wrong. The `inputCol=None` means that 
if the user does not specify the inputCol, there is no default value, and an 
exception will be thrown.
Duplicating default params is an issue, but it already exists in all the 
pyspark.ml estimators/models.
E.g., you can check `StringIndexer` in pyspark; it also has a `handleInvalid` 
param.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19746
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19746
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83918/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19746
  
**[Test build #83918 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83918/testReport)**
 for PR 19746 at commit 
[`03bd63c`](https://github.com/apache/spark/commit/03bd63c654cc3f8982ec726c50472759f704918d).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class VectorSizeHint @Since(\"2.3.0\") (@Since(\"2.3.0\") override val 
uid: String)`
  * `  class InvalidEntryException(msg: String) extends Exception(msg)`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #12162: [SPARK-14289][WIP] Support multiple eviction strategies ...

2017-11-15 Thread michaelmior
Github user michaelmior commented on the issue:

https://github.com/apache/spark/pull/12162
  
As best I can tell, the code that was pushed here is incomplete. However, 
Spark's default cache eviction policy is LRU. You can find the code which 
performs eviction 
[here](https://github.com/apache/spark/blob/1e82335413bc2384073ead0d6d581c862036d0f5/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L501).
 It basically just works by storing all the data in a `LinkedHashMap` 
configured to track which elements were accessed most recently.
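
For illustration, a minimal, self-contained sketch of that access-ordered 
`LinkedHashMap` trick (illustrative names, not the `MemoryStore` internals): 
passing `true` as the third constructor argument makes iteration run from 
least- to most-recently accessed, so evicting from the front of the iterator 
is LRU eviction.

    import java.util.{LinkedHashMap => JLinkedHashMap}
    import scala.collection.JavaConverters._

    val entries = new JLinkedHashMap[String, Long](32, 0.75f, /* accessOrder = */ true)
    entries.put("block-1", 100L)
    entries.put("block-2", 200L)
    entries.get("block-1")  // touching block-1 makes block-2 the LRU entry

    // Iteration starts at the least-recently-accessed entry.
    val lruFirst = entries.entrySet().iterator().asScala.map(_.getKey).toList
    println(lruFirst)  // List(block-2, block-1)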


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19741: [SPARK-14228][CORE][YARN] Lost executor of RPC di...

2017-11-15 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19741#discussion_r151308593
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -268,8 +268,13 @@ private[spark] abstract class YarnSchedulerBackend(
 logWarning(reason.toString)
 driverEndpoint.ask[Boolean](r).onFailure {
--- End diff --

IIRC there are two different `RemoveExecutor` messages in the code, as 
confusing as that may be.

But if this one is used in multiple places then it's probably not worth 
changing right now, unless you're up for verifying the return value is not 
needed in the other places.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...

2017-11-15 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19633#discussion_r151308496
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -424,11 +424,19 @@ case class FileSourceScanExec(
 val defaultMaxSplitBytes =
   fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
 val openCostInBytes = 
fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
-val defaultParallelism = 
fsRelation.sparkSession.sparkContext.defaultParallelism
-val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + 
openCostInBytes)).sum
-val bytesPerCore = totalBytes / defaultParallelism
 
-val maxSplitBytes = Math.min(defaultMaxSplitBytes, 
Math.max(openCostInBytes, bytesPerCore))
+// Ignore bytesPerCore when dynamic allocation is enabled. See 
SPARK-22411
+val maxSplitBytes =
+  if 
(Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf))
 {
+defaultMaxSplitBytes
--- End diff --

> For small data, sometimes users care about the number of output files 
generated as well.

Can you please elaborate: do you want more output files or fewer? If fewer, I 
think I already covered that in the comment above.

It seems your patch already ignores `bytesPerCore` when dynamic allocation is 
enabled; are you suggesting something different?
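
For reference, a worked example of the heuristic in the diff above, with 
hypothetical numbers (the two config defaults are the stock values; the 
cluster size is made up): a large `defaultParallelism` drives the split size 
far below `spark.sql.files.maxPartitionBytes`, which is how small data ends 
up with many small tasks and output files.

    val defaultMaxSplitBytes = 128L * 1024 * 1024      // spark.sql.files.maxPartitionBytes default
    val openCostInBytes      = 4L * 1024 * 1024        // spark.sql.files.openCostInBytes default
    val defaultParallelism   = 400                     // hypothetical large cluster
    val totalBytes           = 2L * 1024 * 1024 * 1024 // ~2 GB of input (per-file open costs ignored)

    val bytesPerCore  = totalBytes / defaultParallelism           // ~5 MB
    val maxSplitBytes = math.min(defaultMaxSplitBytes,
      math.max(openCostInBytes, bytesPerCore))                    // ~5 MB instead of 128 MB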


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...

2017-11-15 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19643#discussion_r151307924
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1838,12 +1852,21 @@ class SparkContext(config: SparkConf) extends 
Logging {
   case _ => path
 }
   }
+
   if (key != null) {
 val timestamp = System.currentTimeMillis
 if (addedJars.putIfAbsent(key, timestamp).isEmpty) {
   logInfo(s"Added JAR $path at $key with timestamp $timestamp")
   postEnvironmentUpdate()
 }
+
+if (addToCurrentClassLoader) {
+  Utils.getContextOrSparkClassLoader match {
+case cl: MutableURLClassLoader => 
cl.addURL(Utils.resolveURI(path).toURL)
--- End diff --

I'm not sure whether it supports remote jars on HTTPS or Hadoop FileSystems. 
On the executor side, we handle this explicitly by downloading jars to local 
disk and adding them to the classpath, but here it looks like we don't have 
such logic. I'm not sure how this `URLClassLoader` would communicate with 
Hadoop or HTTPS without certificates.

`addJar` just adds jars to the file server, so that executors can fetch them 
from the driver and add them to their classpath; it does not affect the 
driver's classpath. If we support adding jars to the current driver's 
classloader, then how do we leverage these newly added jars?
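
For what it's worth, a rough sketch of the "download first, then add to the 
classpath" approach described above, for plain HTTP(S) URLs only (this is not 
Spark's actual fetch logic; Hadoop FileSystems would need something like 
`FileSystem.copyToLocalFile` instead):

    import java.io.{File, InputStream}
    import java.net.URL
    import java.nio.file.{Files, StandardCopyOption}

    def fetchToLocal(remote: String): File = {
      val target = Files.createTempFile("fetched-jar-", ".jar")
      val in: InputStream = new URL(remote).openStream()
      try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
      target.toFile
    }

    // A MutableURLClassLoader (or any classloader exposing addURL) then only
    // ever sees a local file URL:
    // loader.addURL(fetchToLocal("https://example.com/some.jar").toURI.toURL)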


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19170: [SPARK-21961][Core] Filter out BlockStatuses Accumulator...

2017-11-15 Thread zhouyejoe
Github user zhouyejoe commented on the issue:

https://github.com/apache/spark/pull/19170
  
I will work on it. Thanks for review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19741: [SPARK-14228][CORE][YARN] Lost executor of RPC di...

2017-11-15 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19741#discussion_r151305271
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -268,8 +268,13 @@ private[spark] abstract class YarnSchedulerBackend(
 logWarning(reason.toString)
 driverEndpoint.ask[Boolean](r).onFailure {
--- End diff --

I think when you use `send`, all the related things you mentioned above 
should be changed as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-11-15 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19468
  
I am actually quite swamped right now, hence the delay in getting to this 
PR.
I will try to take a pass over the weekend; but if it gets merged before then 
because @vanzin is ok with it, that's fine by me!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19741: [SPARK-14228][CORE][YARN] Lost executor of RPC di...

2017-11-15 Thread devaraj-kavali
Github user devaraj-kavali commented on a diff in the pull request:

https://github.com/apache/spark/pull/19741#discussion_r151303638
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -268,8 +268,13 @@ private[spark] abstract class YarnSchedulerBackend(
 logWarning(reason.toString)
 driverEndpoint.ask[Boolean](r).onFailure {
--- End diff --

Thanks @vanzin for the review.

I think it is a good suggestion. There are other places where the 
`RemoveExecutor` message is sent using `ask`; are you suggesting we change 
those as well?

> (And the driver endpoint will probably need a minor change too.)

You mean moving the case `RemoveExecutor(executorId, reason)` from 
`receiveAndReply` to `receive`?
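
For clarity, a minimal sketch (assumed shapes, not the actual patch) of what 
switching from `ask`/`receiveAndReply` to `send`/`receive` would look like:

    // Caller side: fire-and-forget instead of waiting for a Boolean reply.
    driverEndpoint.send(RemoveExecutor(executorId, reason))

    // Driver endpoint side: the case moves from receiveAndReply to receive,
    // and no longer calls context.reply(...).
    override def receive: PartialFunction[Any, Unit] = {
      case RemoveExecutor(executorId, reason) =>
        removeExecutor(executorId, reason)
    }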



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19760
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83916/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19760
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19760
  
**[Test build #83916 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83916/testReport)**
 for PR 19760 at commit 
[`fcf3fbf`](https://github.com/apache/spark/commit/fcf3fbfc47fbf4ea7290f74e65f324eee68f7aba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...

2017-11-15 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19601
  
@cloud-fan could you please review this again? I have merged with the 
`ColumnarArray` changes. As you suggested, the latest implementation does not 
change `ColumnVector` or `ColumnarArray`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19750
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83915/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19750
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19750
  
**[Test build #83915 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83915/testReport)**
 for PR 19750 at commit 
[`a004ce9`](https://github.com/apache/spark/commit/a004ce915a21e0dc8093bfdde4fe4290e9fa5fc2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19753#discussion_r151298582
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, JavaMLReadable, Ja
   "(>= 2). If a feature is found to have > 
maxCategories values, then " +
   "it is declared continuous.", 
typeConverter=TypeConverters.toInt)
 
+handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle 
invalid data " +
+  "(unseen labels or NULL values). Options are 
'skip' (filter out " +
+  "rows with invalid data), 'error' (throw an 
error), or 'keep' (put " +
+  "invalid data in a special additional bucket, at 
index numCategories).",
+  typeConverter=TypeConverters.toString)
+
 @keyword_only
-def __init__(self, maxCategories=20, inputCol=None, outputCol=None):
+def __init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
 """
-__init__(self, maxCategories=20, inputCol=None, outputCol=None)
+__init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error")
 """
 super(VectorIndexer, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid)
-self._setDefault(maxCategories=20)
+self._setDefault(maxCategories=20, handleInvalid="error")
 kwargs = self._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.4.0")
-def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+def setParams(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
--- End diff --

Another Q: I see there's a pattern of `setParams` using `None` as the default 
value for all/most of its arguments in other featurizers; perhaps we should 
do the same here (i.e. have a default argument of `handleInvalid=None`)? IMO 
specifying the default parameter value in one place is preferable to 
duplicating it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...

2017-11-15 Thread smurching
Github user smurching commented on a diff in the pull request:

https://github.com/apache/spark/pull/19753#discussion_r151298765
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, JavaMLReadable, Ja
   "(>= 2). If a feature is found to have > 
maxCategories values, then " +
   "it is declared continuous.", 
typeConverter=TypeConverters.toInt)
 
+handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle 
invalid data " +
+  "(unseen labels or NULL values). Options are 
'skip' (filter out " +
+  "rows with invalid data), 'error' (throw an 
error), or 'keep' (put " +
+  "invalid data in a special additional bucket, at 
index numCategories).",
+  typeConverter=TypeConverters.toString)
+
 @keyword_only
-def __init__(self, maxCategories=20, inputCol=None, outputCol=None):
+def __init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
 """
-__init__(self, maxCategories=20, inputCol=None, outputCol=None)
+__init__(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error")
 """
 super(VectorIndexer, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid)
-self._setDefault(maxCategories=20)
+self._setDefault(maxCategories=20, handleInvalid="error")
 kwargs = self._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.4.0")
-def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
+def setParams(self, maxCategories=20, inputCol=None, outputCol=None, 
handleInvalid="error"):
--- End diff --

The same goes for the constructor (IMO we should default to 
`handleInvalid=None` there too), but I'm open to hearing your thoughts.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19762
  
**[Test build #83925 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83925/testReport)**
 for PR 19762 at commit 
[`dffcb05`](https://github.com/apache/spark/commit/dffcb05e3c7b7ef0973769337092b2b7adbc7992).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19761
  
**[Test build #83924 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83924/testReport)**
 for PR 19761 at commit 
[`adbf94e`](https://github.com/apache/spark/commit/adbf94eef67780b2ef547eff250ea3e11e4deda2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19761
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19761
  
**[Test build #83923 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83923/testReport)**
 for PR 19761 at commit 
[`ba4d590`](https://github.com/apache/spark/commit/ba4d590b192cfe220bfed0e1bd690af746cf2ad1).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19761
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83923/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for addJar ...

2017-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19643
  
Hi @jerryshao, would you maybe have some time to take a look at this one, 
please?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...

2017-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19643#discussion_r151296077
  
--- Diff: R/pkg/R/context.R ---
@@ -319,6 +319,27 @@ spark.addFile <- function(path, recursive = FALSE) {
   invisible(callJMethod(sc, "addFile", 
suppressWarnings(normalizePath(path)), recursive))
 }
 
+#' Adds a JAR dependency for Spark tasks to be executed in the future.
+#'
+#' The \code{path} passed can be either a local file, a file in HDFS (or 
other Hadoop-supported
+#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on 
every worker node.
+#' If \code{addToCurrentClassLoader} is true, add the jar to the current 
driver.
--- End diff --

@mariusvniekerk are you okay with proceeding with this here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19761
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19761
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83922/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19761
  
**[Test build #83922 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83922/testReport)**
 for PR 19761 at commit 
[`fd36d2f`](https://github.com/apache/spark/commit/fd36d2f54939a27406f4077a04a6f7dad328f840).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...

2017-11-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19630
  
Looks clean and pretty solid in general. Let me take another look to 
double-check, probably within this weekend, and maybe I will leave it to 
@ueshin if he can take a look ahead of me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19761
  
**[Test build #83923 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83923/testReport)**
 for PR 19761 at commit 
[`ba4d590`](https://github.com/apache/spark/commit/ba4d590b192cfe220bfed0e1bd690af746cf2ad1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19761
  
**[Test build #83922 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83922/testReport)**
 for PR 19761 at commit 
[`fd36d2f`](https://github.com/apache/spark/commit/fd36d2f54939a27406f4077a04a6f7dad328f840).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19762
  
**[Test build #83921 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83921/testReport)**
 for PR 19762 at commit 
[`a1f054b`](https://github.com/apache/spark/commit/a1f054ba9aff2bef5caf4534ef56dfd5e22eb96e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19762: [SPARK-22535][PySpark] Sleep before killing the p...

2017-11-15 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/19762

[SPARK-22535][PySpark] Sleep before killing the python worker in 
PythonRunner.MonitorThread

## What changes were proposed in this pull request?

`PythonRunner.MonitorThread` should give the task a little time to finish 
before forcibly killing the python worker. This will greatly reduce the 
chance of hitting the race condition. I also improved the logging a bit to 
identify the task to blame when it's stuck.
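
A minimal, generic sketch of that idea (illustrative only, not the actual 
patch): the monitor re-checks completion after a grace period before 
destroying the worker process.

    // Generic stand-in for a monitor thread's kill path.
    class GracefulMonitor(isCompleted: () => Boolean, worker: Process) extends Thread {
      private val graceMs = 2000L  // hypothetical grace period

      override def run(): Unit = {
        if (!isCompleted()) {
          Thread.sleep(graceMs)          // give the task a chance to finish cleanly
          if (!isCompleted()) {
            System.err.println("Task still incomplete after grace period; killing worker")
            worker.destroy()             // forcible kill only as a last resort
          }
        }
      }
    }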

## How was this patch tested?

Jenkins


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-22535

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19762.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19762


commit a1f054ba9aff2bef5caf4534ef56dfd5e22eb96e
Author: Shixiong Zhu 
Date:   2017-11-16T00:35:09Z

Sleep before killing the python worker in PythonRunner.MonitorThread




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...

2017-11-15 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19762
  
cc @ueshin


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...

2017-11-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19753
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


