[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19601 **[Test build #83931 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83931/testReport)** for PR 19601 at commit [`9a41914`](https://github.com/apache/spark/commit/9a41914694c8f1f56f294cc2380bd6ecf1ce73b8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/17436#discussion_r151340713 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala --- @@ -75,9 +77,14 @@ class VectorizedHashMapGenerator( } }.mkString("\n").concat(";") +val columnVector = if (SparkEnv.get.memoryManager.tungstenMemoryMode == MemoryMode.ON_HEAP) { + "OnHeapColumnVector" +} else { + "OffHeapColumnVector" +} s""" - | private org.apache.spark.sql.execution.vectorized.OnHeapColumnVector[] batchVectors; - | private org.apache.spark.sql.execution.vectorized.OnHeapColumnVector[] bufferVectors; + | private org.apache.spark.sql.execution.vectorized.WritableColumnVector[] batchVectors; + | private org.apache.spark.sql.execution.vectorized.WritableColumnVector[] bufferVectors; --- End diff -- Good catch, thanks --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19601 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19601 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83930/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19601 **[Test build #83930 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83930/testReport)** for PR 19601 at commit [`17449b4`](https://github.com/apache/spark/commit/17449b4748c5c32539227c7f50c4b6ec236ab4ee). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/19763#discussion_r151339166 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -473,16 +477,41 @@ private[spark] class MapOutputTrackerMaster( } /** + * Try to equally divide Range(0, num) to divisor slices + */ + def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = { +assert(divisor > 0, "Divisor should be positive") +val (each, remain) = (num / divisor, num % divisor) +val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each) +if (each != 0) { + smaller.grouped(each) ++ bigger.grouped(each + 1) +} else { + bigger.grouped(each + 1) +} + } + + /** * Return statistics about all of the outputs for a given shuffle. */ def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = { shuffleStatuses(dep.shuffleId).withMapStatuses { statuses => val totalSizes = new Array[Long](dep.partitioner.numPartitions) - for (s <- statuses) { -for (i <- 0 until totalSizes.length) { - totalSizes(i) += s.getSizeForBlock(i) -} + val mapStatusSubmitTasks = ArrayBuffer[Future[_]]() + var taskSlices = parallelism + + equallyDivide(totalSizes.length, taskSlices).foreach { +reduceIds => + mapStatusSubmitTasks += threadPoolMapStats.submit( +new Runnable { + override def run(): Unit = { +for (s <- statuses; i <- reduceIds) { + totalSizes(i) += s.getSizeForBlock(i) +} + } +} + ) } + mapStatusSubmitTasks.foreach(_.get()) --- End diff -- Should I use the `scala.concurrent.ExecutionContext.Implicits.global` ExecutionContextï¼ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19601 **[Test build #83930 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83930/testReport)** for PR 19601 at commit [`17449b4`](https://github.com/apache/spark/commit/17449b4748c5c32539227c7f50c4b6ec236ab4ee). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19764: [SPARK-22539][SQL] Add second order for rangepartitioner...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19764 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19621 Seems in the frequency-based string orders, the order of labels with same frequency is non-deterministic. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19764: [SPARK-22539][SQL] Add second order for rangepart...
GitHub user caneGuy opened a pull request: https://github.com/apache/spark/pull/19764 [SPARK-22539][SQL] Add second order for rangepartitioner since partition nu⦠â¦mber may be small if the specified key is skewed ## What changes were proposed in this pull request? The rangepartitioner generated from shuffle exchange may cause partiton skew if sort key is skewed. This patch add second order for rangepartitioner to avoid this situation. This is an improvement from real case. ## How was this patch tested? Manully test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/caneGuy/spark zhoukang/add-secondorder Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19764.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19764 commit 29d2c869ffc6dd4d1a3cf7606cde5d03e72fa171 Author: zhoukang Date: 2017-11-15T07:24:59Z [SPARK][SQL] Add second order for rangepartitioner since partition number may be small if the specified key is skewed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19630#discussion_r151333004 --- Diff: python/pyspark/sql/functions.py --- @@ -2208,26 +2089,39 @@ def udf(f=None, returnType=StringType()): | 8| JOHN DOE| 22| +--+--++ """ -return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.NORMAL_UDF) +# decorator @udf, @udf(), @udf(dataType()) +if f is None or isinstance(f, (str, DataType)): +# If DataType has been passed as a positional argument +# for decorator use it as a returnType +return_type = f or returnType +return functools.partial(_create_udf, returnType=return_type, + evalType=PythonEvalType.SQL_BATCHED_UDF) +else: +return _create_udf(f=f, returnType=returnType, + evalType=PythonEvalType.SQL_BATCHED_UDF) @since(2.3) -def pandas_udf(f=None, returnType=StringType()): +def pandas_udf(f=None, returnType=None, functionType=None): """ Creates a vectorized user defined function (UDF). :param f: user-defined function. A python function if used as a standalone function :param returnType: a :class:`pyspark.sql.types.DataType` object +:param functionType: an enum value in :class:`pyspark.sql.functions.PandasUdfType`. --- End diff -- `PandasUdfType` -> `PandasUDFType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19630#discussion_r151335091 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala --- @@ -137,15 +138,18 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper { udf.references.subsetOf(child.outputSet) } if (validUdfs.nonEmpty) { - if (validUdfs.exists(_.pythonUdfType == PythonUdfType.PANDAS_GROUPED_UDF)) { -throw new IllegalArgumentException("Can not use grouped vectorized UDFs") - } + require(validUdfs.forall(udf => +udf.evalType == PythonEvalType.SQL_BATCHED_UDF || +udf.evalType == PythonEvalType.PANDAS_SCALAR_UDF + ), "Can only extract scalar vectorized udf or sql batch udf") --- End diff -- ```require(validUdfs.forall(_.evalType != PythonEvalType.PANDAS_GROUP_MAP_UDF))```? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19630#discussion_r151333227 --- Diff: python/pyspark/rdd.py --- @@ -56,6 +56,20 @@ __all__ = ["RDD"] +class PythonEvalType(object): +""" +Evaluation type of python rdd. + +These values are internal to PySpark. --- End diff -- And should match with enum values in `org.apache.spark.api.python.PythonEvalType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19630#discussion_r151332296 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -34,9 +34,11 @@ import org.apache.spark.util._ */ private[spark] object PythonEvalType { val NON_UDF = 0 - val SQL_BATCHED_UDF = 1 - val SQL_PANDAS_UDF = 2 - val SQL_PANDAS_GROUPED_UDF = 3 + + val SQL_BATCHED_UDF = 100 + + val PANDAS_SCALAR_UDF = 200 + val PANDAS_GROUP_MAP_UDF = 201 --- End diff -- nit: Is it better to keep `SQL` prefix? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19630#discussion_r151333891 --- Diff: python/pyspark/sql/functions.py --- @@ -2208,26 +2089,39 @@ def udf(f=None, returnType=StringType()): | 8| JOHN DOE| 22| +--+--++ """ -return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.NORMAL_UDF) +# decorator @udf, @udf(), @udf(dataType()) +if f is None or isinstance(f, (str, DataType)): +# If DataType has been passed as a positional argument +# for decorator use it as a returnType +return_type = f or returnType +return functools.partial(_create_udf, returnType=return_type, + evalType=PythonEvalType.SQL_BATCHED_UDF) +else: +return _create_udf(f=f, returnType=returnType, + evalType=PythonEvalType.SQL_BATCHED_UDF) @since(2.3) -def pandas_udf(f=None, returnType=StringType()): +def pandas_udf(f=None, returnType=None, functionType=None): """ Creates a vectorized user defined function (UDF). :param f: user-defined function. A python function if used as a standalone function :param returnType: a :class:`pyspark.sql.types.DataType` object +:param functionType: an enum value in :class:`pyspark.sql.functions.PandasUdfType`. + Default: SCALAR. -The user-defined function can define one of the following transformations: +The function type of the UDF can be one of the following: -1. One or more `pandas.Series` -> A `pandas.Series` +1. SCALAR - This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and - :meth:`pyspark.sql.DataFrame.select`. + A scalar UDF defines a transformation: One or more `pandas.Series` -> A `pandas.Series`. The returnType should be a primitive data type, e.g., `DoubleType()`. The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`. + Scalar UDFs are used with :meth:`pyspark.sql.DataFrame.withColumn` and + :meth:`pyspark.sql.DataFrame.select`. + >>> from pyspark.sql.types import IntegerType, StringType >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) >>> @pandas_udf(returnType=StringType()) --- End diff -- In this doctest, there are two pandas_udf. Please explicitly assign `PandasUDFType.SCALAR` as the `functionType` of one of udfs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/17436#discussion_r151333181 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala --- @@ -75,9 +77,14 @@ class VectorizedHashMapGenerator( } }.mkString("\n").concat(";") +val columnVector = if (SparkEnv.get.memoryManager.tungstenMemoryMode == MemoryMode.ON_HEAP) { + "OnHeapColumnVector" +} else { + "OffHeapColumnVector" +} s""" - | private org.apache.spark.sql.execution.vectorized.OnHeapColumnVector[] batchVectors; - | private org.apache.spark.sql.execution.vectorized.OnHeapColumnVector[] bufferVectors; + | private org.apache.spark.sql.execution.vectorized.WritableColumnVector[] batchVectors; + | private org.apache.spark.sql.execution.vectorized.WritableColumnVector[] bufferVectors; --- End diff -- We can use `$columnVector` instead of `WritableColumnVector`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19630#discussion_r151331041 --- Diff: python/pyspark/worker.py --- @@ -89,6 +90,26 @@ def verify_result_length(*a): return lambda *a: (verify_result_length(*a), arrow_return_type) +def wrap_pandas_group_map_udf(f, return_type): +def wrapped(*series): +import pandas as pd + +result = f(pd.concat(series, axis=1)) --- End diff -- Can we remove `keys=columns`? The original includes it as the 3rd argument. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/19763#discussion_r151332369 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -473,16 +477,41 @@ private[spark] class MapOutputTrackerMaster( } /** + * Try to equally divide Range(0, num) to divisor slices + */ + def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = { +assert(divisor > 0, "Divisor should be positive") +val (each, remain) = (num / divisor, num % divisor) +val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each) +if (each != 0) { + smaller.grouped(each) ++ bigger.grouped(each + 1) +} else { + bigger.grouped(each + 1) +} + } + + /** * Return statistics about all of the outputs for a given shuffle. */ def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = { shuffleStatuses(dep.shuffleId).withMapStatuses { statuses => val totalSizes = new Array[Long](dep.partitioner.numPartitions) - for (s <- statuses) { -for (i <- 0 until totalSizes.length) { - totalSizes(i) += s.getSizeForBlock(i) -} + val mapStatusSubmitTasks = ArrayBuffer[Future[_]]() + var taskSlices = parallelism + + equallyDivide(totalSizes.length, taskSlices).foreach { +reduceIds => + mapStatusSubmitTasks += threadPoolMapStats.submit( +new Runnable { + override def run(): Unit = { +for (s <- statuses; i <- reduceIds) { + totalSizes(i) += s.getSizeForBlock(i) +} + } +} + ) } + mapStatusSubmitTasks.foreach(_.get()) --- End diff -- Good idea, thx! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18424 I guess this is inactive now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19763#discussion_r151331447 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -473,16 +477,41 @@ private[spark] class MapOutputTrackerMaster( } /** + * Try to equally divide Range(0, num) to divisor slices + */ + def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = { +assert(divisor > 0, "Divisor should be positive") +val (each, remain) = (num / divisor, num % divisor) +val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each) +if (each != 0) { + smaller.grouped(each) ++ bigger.grouped(each + 1) +} else { + bigger.grouped(each + 1) +} + } + + /** * Return statistics about all of the outputs for a given shuffle. */ def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = { shuffleStatuses(dep.shuffleId).withMapStatuses { statuses => val totalSizes = new Array[Long](dep.partitioner.numPartitions) - for (s <- statuses) { -for (i <- 0 until totalSizes.length) { - totalSizes(i) += s.getSizeForBlock(i) -} + val mapStatusSubmitTasks = ArrayBuffer[Future[_]]() + var taskSlices = parallelism --- End diff -- Why `var`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...
Github user skambha commented on a diff in the pull request: https://github.com/apache/spark/pull/19747#discussion_r151331207 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -507,6 +508,7 @@ private[hive] class HiveClientImpl( // these properties are still available to the others that share the same Hive metastore. // If users explicitly alter these Hive-specific properties through ALTER TABLE DDL, we respect // these user-specified values. +verifyColumnDataType(table.dataSchema) --- End diff -- Thanks @gatorsmile for the review. I'll incorporate your other comments in my next commit. In the current codeline, another recent PR changed verifyColumnNames to verifyDataSchema. The reason I could not put the check in verifyDataSchema ( or the old verifyColumnNames): - verifyDataSchema is called in the beginning of the doCreateTable method. But we cannot error out that early as later on in the doCreateTable method, as later on in that method, we create the datasource table. If the datasource table cannot be stored in hive compatible format, it falls back to storing it in Spark sql specific format which will work fine. - For e.g If I put the check there, then the create datasource table would throw an exception right away, which we do not want. ```CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19330 **[Test build #83929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83929/testReport)** for PR 19330 at commit [`6aa4b8b`](https://github.com/apache/spark/commit/6aa4b8b3ad7b12a9196cd9b6740c0abe07fccf2c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19330 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19330 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83926/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19330 **[Test build #83926 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83926/testReport)** for PR 19330 at commit [`b1886f6`](https://github.com/apache/spark/commit/b1886f6ff3d917ea062ace95c19de7ca18ab18a4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Max(child: Expression) extends DeclarativeAggregate with OrderSpecified ` * `case class Min(child: Expression) extends DeclarativeAggregate with OrderSpecified ` * `case class Least(children: Seq[Expression]) extends Expression with OrderSpecified ` * `case class Greatest(children: Seq[Expression]) extends Expression with OrderSpecified ` * `case class OrderMaps(child: Expression) extends UnaryExpression with ExpectsInputTypes ` * `class InterpretedOrdering(orders: Seq[SortOrder]) extends Ordering[InternalRow] ` * `case class In(value: Expression, list: Seq[Expression]) extends Predicate with OrderSpecified ` * `case class InSet(child: Expression, hset: Set[Any])` * `abstract class BinaryComparison extends BinaryOperator with Predicate with OrderSpecified ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19757: [SPARK-22529] [SQL] Relation stats should be consistent ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19757 **[Test build #83928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83928/testReport)** for PR 19757 at commit [`c75ae70`](https://github.com/apache/spark/commit/c75ae7062f550008155529156f0001ecc9a67e88). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19601#discussion_r151329119 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/UnsafeColumnVector.java --- @@ -0,0 +1,517 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.vectorized; + +import java.nio.ByteBuffer; + +import org.apache.commons.lang.NotImplementedException; + +import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData; +import org.apache.spark.sql.types.*; +import org.apache.spark.unsafe.Platform; + +/** + * A column backed by UnsafeArrayData on byte[]. + */ +public final class UnsafeColumnVector extends WritableColumnVector { --- End diff -- Since this `UnsafeColumnVector` represents array column, will we use the APIs like `getBoolean`, `getBooleans`...etc.? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19601#discussion_r151325703 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/UnsafeColumnVector.java --- @@ -0,0 +1,517 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.vectorized; + +import java.nio.ByteBuffer; + +import org.apache.commons.lang.NotImplementedException; + +import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData; +import org.apache.spark.sql.types.*; +import org.apache.spark.unsafe.Platform; + +/** + * A column backed by UnsafeArrayData on byte[]. + */ +public final class UnsafeColumnVector extends WritableColumnVector { --- End diff -- You are right `UnsafeColumnVector.putByteArray` is used to put the whole array in `byte[]` for `UnsafeArrayData`. I will put some comment to explain the usage of this API to make it clear. Good catch for `getBooleans`. It seems to be my fault since it has to take care of `rowId`. I will fix this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19601#discussion_r151324069 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java --- @@ -93,28 +93,6 @@ public static void populate(WritableColumnVector col, InternalRow row, int field } } - /** - * Returns the array data as the java primitive array. - * For example, an array of IntegerType will return an int[]. - * Throws exceptions for unhandled schemas. - */ - public static Object toPrimitiveJavaArray(ColumnarArray array) { --- End diff -- This method was used only for test. I removed this by replacing this with another method. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19762: [SPARK-22535][PySpark] Sleep before killing the p...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19762 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19763: [SPARK-22537][core] Aggregation of map output statistics...
Github user CodingCat commented on the issue: https://github.com/apache/spark/pull/19763 my question is "how many times we have seen this operation of collecting statistics is the bottleneck?" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19762 Thanks! merging to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19763: [SPARK-22537][core] Aggregation of map output sta...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/19763#discussion_r151322438 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -473,16 +477,41 @@ private[spark] class MapOutputTrackerMaster( } /** + * Try to equally divide Range(0, num) to divisor slices + */ + def equallyDivide(num: Int, divisor: Int): Iterator[Seq[Int]] = { +assert(divisor > 0, "Divisor should be positive") +val (each, remain) = (num / divisor, num % divisor) +val (smaller, bigger) = (0 until num).splitAt((divisor-remain) * each) +if (each != 0) { + smaller.grouped(each) ++ bigger.grouped(each + 1) +} else { + bigger.grouped(each + 1) +} + } + + /** * Return statistics about all of the outputs for a given shuffle. */ def getStatistics(dep: ShuffleDependency[_, _, _]): MapOutputStatistics = { shuffleStatuses(dep.shuffleId).withMapStatuses { statuses => val totalSizes = new Array[Long](dep.partitioner.numPartitions) - for (s <- statuses) { -for (i <- 0 until totalSizes.length) { - totalSizes(i) += s.getSizeForBlock(i) -} + val mapStatusSubmitTasks = ArrayBuffer[Future[_]]() + var taskSlices = parallelism + + equallyDivide(totalSizes.length, taskSlices).foreach { +reduceIds => + mapStatusSubmitTasks += threadPoolMapStats.submit( +new Runnable { + override def run(): Unit = { +for (s <- statuses; i <- reduceIds) { + totalSizes(i) += s.getSizeForBlock(i) +} + } +} + ) } + mapStatusSubmitTasks.foreach(_.get()) --- End diff -- this part can be simplified by using scala's Future, ```scala val futureArray = equallyDivide(totalSizes.length, taskSlices).map { reduceIds => Future { // whatever you want to do here } } Await.result(Future.sequence(futureArray), Duration.Inf) // or some timeout value you prefer ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19751: [SPARK-20653][core] Add cleaning of old elements ...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/19751#discussion_r151321285 --- Diff: core/src/main/scala/org/apache/spark/status/ElementTrackingStore.scala --- @@ -0,0 +1,168 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.status + +import java.util.concurrent.TimeUnit + +import scala.collection.mutable.{HashMap, ListBuffer} + +import org.apache.spark.SparkConf +import org.apache.spark.util.{ThreadUtils, Utils} +import org.apache.spark.util.kvstore._ + +/** + * A KVStore wrapper that allows tracking the number of elements of specific types, and triggering + * actions once they reach a threshold. This allows writers, for example, to control how much data + * is stored by potentially deleting old data as new data is added. + * + * This store is used when populating data either from a live UI or an event log. On top of firing + * triggers when elements reach a certain threshold, it provides two extra bits of functionality: + * + * - a generic worker thread that can be used to run expensive tasks asynchronously; the tasks can + * be configured to run on the calling thread when more determinism is desired (e.g. unit tests). + * - a generic flush mechanism so that listeners can be notified about when they should flush + * internal state to the store (e.g. after the SHS finishes parsing an event log). + * + * The configured triggers are run on the same thread that triggered the write, after the write + * has completed. + */ +private[spark] class ElementTrackingStore(store: KVStore, conf: SparkConf) extends KVStore { + + import config._ + + private val triggers = new HashMap[Class[_], Seq[Trigger[_]]]() + private val flushTriggers = new ListBuffer[() => Unit]() + private val executor = if (conf.get(ASYNC_TRACKING_ENABLED)) { + Some(ThreadUtils.newDaemonSingleThreadExecutor("element-tracking-store-worker")) + } else { +None + } + + @volatile private var stopped = false + + /** + * Register a trigger that will be fired once the number of elements of a given type reaches + * the given threshold. + * + * Triggers are fired in a separate thread, so that they can do more expensive operations + * than would be allowed on the main threads populating the store. + * + * @param klass The type to monitor. + * @param threshold The number of elements that should trigger the action. + * @param action Action to run when the threshold is reached; takes as a parameter the number + * of elements of the registered type currently known to be in the store. + */ + def addTrigger(klass: Class[_], threshold: Long)(action: Long => Unit): Unit = { +val existing = triggers.getOrElse(klass, Seq()) +triggers(klass) = existing :+ Trigger(threshold, action) + } + + /** + * Adds a trigger to be executed before the store is flushed. This normally happens before + * closing, and is useful for flushing intermediate state to the store, e.g. when replaying + * in-progress applications through the SHS. + * + * Flush triggers are called synchronously in the same thread that is closing the store. + */ + def onFlush(action: => Unit): Unit = { +flushTriggers += { () => action } + } + + /** + * Enqueues an action to be executed asynchronously. + */ + def doAsync(fn: => Unit): Unit = { +executor match { + case Some(exec) => +exec.submit(new Runnable() { + override def run(): Unit = Utils.tryLog { fn } +}) + + case _ => +fn +} + } + + override def read[T](klass: Class[T], naturalKey: Any): T = store.read(klass, naturalKey) + + override def write(value: Any): Unit = store.write(value) + + /** Write an element to the store, optio
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19631 Did another round of review, LGTM overall. @tgravescs do you any comment? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r151320052 --- Diff: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala --- @@ -216,7 +216,9 @@ private[spark] object CoarseGrainedExecutorBackend extends Logging { if (driverConf.contains("spark.yarn.credentials.file")) { logInfo("Will periodically update credentials from: " + driverConf.get("spark.yarn.credentials.file")) -SparkHadoopUtil.get.startCredentialUpdater(driverConf) + Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil") --- End diff -- #19272 is already in, do you plan to update it here in this PR or in a separate PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19751 **[Test build #83927 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83927/testReport)** for PR 19751 at commit [`8b150e0`](https://github.com/apache/spark/commit/8b150e0b2b4909239482038ee0742ced6e4c7511). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19751: [SPARK-20653][core] Add cleaning of old elements from th...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/19751 @vanzin looks like this PR has conflicts now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19762 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83925/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19762 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19762 **[Test build #83925 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83925/testReport)** for PR 19762 at commit [`dffcb05`](https://github.com/apache/spark/commit/dffcb05e3c7b7ef0973769337092b2b7adbc7992). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19330 **[Test build #83926 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83926/testReport)** for PR 19330 at commit [`b1886f6`](https://github.com/apache/spark/commit/b1886f6ff3d917ea062ace95c19de7ca18ab18a4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19763: [SPARK-22537] Aggregation of map output statistics on dr...
Github user gczsjdy commented on the issue: https://github.com/apache/spark/pull/19763 cc @cloud-fan @viirya @gatorsmile @chenghao-intel --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19763: [SPARK-22537] Aggregation of map output statistics on dr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19763 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19763: [SPARK-22537] Aggregation of map output statistic...
GitHub user gczsjdy opened a pull request: https://github.com/apache/spark/pull/19763 [SPARK-22537] Aggregation of map output statistics on driver faces single point bottleneck ## What changes were proposed in this pull request? In adaptive execution, the map output statistics of all mappers will be aggregated after previous stage is successfully executed. Driver takes the aggregation job while it will get slow when the number of `mapper * shuffle partitions` is large, since it only uses single thread to compute. This PR uses multi-thread to deal with this single point bottleneck. ## How was this patch tested? Test cases are in `MapOutputTrackerSuite.scala` You can merge this pull request into a Git repository by running: $ git pull https://github.com/gczsjdy/spark single_point_mapstatistics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19763.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19763 commit 5dd04872e983de861a301c22a124dd8923ccc8c6 Author: GuoChenzhao Date: 2017-11-16T02:58:22Z Use multi-thread to solve single point bottleneck commit 819774fc7087c51a4b7b03213bfb330331d6f108 Author: GuoChenzhao Date: 2017-11-16T03:01:21Z Add test case commit da028258bd172b6d3ff89504097fb6651f5c05c0 Author: GuoChenzhao Date: 2017-11-16T03:24:47Z Style --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19762 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83921/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19762 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19762 **[Test build #83921 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83921/testReport)** for PR 19762 at commit [`a1f054b`](https://github.com/apache/spark/commit/a1f054ba9aff2bef5caf4534ef56dfd5e22eb96e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19753 This LGTM, @jkbradley would you be able to give this a look? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19601#discussion_r151311790 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java --- @@ -93,28 +93,6 @@ public static void populate(WritableColumnVector col, InternalRow row, int field } } - /** - * Returns the array data as the java primitive array. - * For example, an array of IntegerType will return an int[]. - * Throws exceptions for unhandled schemas. - */ - public static Object toPrimitiveJavaArray(ColumnarArray array) { --- End diff -- Why this method? Looks it is just used in test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19601: [SPARK-22383][SQL] Generate code to directly get ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19601#discussion_r151315786 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/UnsafeColumnVector.java --- @@ -0,0 +1,517 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.vectorized; + +import java.nio.ByteBuffer; + +import org.apache.commons.lang.NotImplementedException; + +import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData; +import org.apache.spark.sql.types.*; +import org.apache.spark.unsafe.Platform; + +/** + * A column backed by UnsafeArrayData on byte[]. + */ +public final class UnsafeColumnVector extends WritableColumnVector { --- End diff -- This abstraction looks confused at first glance. It seems not following some `ColumnVector` APIs usage. Looks like this uses `putByteArray` to set up byte array `data` which stores the data of this array column. IIUC, this is proposed to represent only array column, but some APIs implementation looks weird. For example, `getBoolean` respects `rowId` parameter and `getBooleans` doesn't. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151315757 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja "(>= 2). If a feature is found to have > maxCategories values, then " + "it is declared continuous.", typeConverter=TypeConverters.toInt) +handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle invalid data " + + "(unseen labels or NULL values). Options are 'skip' (filter out " + + "rows with invalid data), 'error' (throw an error), or 'keep' (put " + + "invalid data in a special additional bucket, at index numCategories).", + typeConverter=TypeConverters.toString) + @keyword_only -def __init__(self, maxCategories=20, inputCol=None, outputCol=None): +def __init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): """ -__init__(self, maxCategories=20, inputCol=None, outputCol=None) +__init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error") """ super(VectorIndexer, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid) -self._setDefault(maxCategories=20) +self._setDefault(maxCategories=20, handleInvalid="error") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.4.0") -def setParams(self, maxCategories=20, inputCol=None, outputCol=None): +def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): --- End diff -- Thanks for the explanation, that makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19761 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19761 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83924/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19761 **[Test build #83924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83924/testReport)** for PR 19761 at commit [`adbf94e`](https://github.com/apache/spark/commit/adbf94eef67780b2ef547eff250ea3e11e4deda2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19671: [SPARK-22297][CORE TESTS] Flaky test: BlockManagerSuite ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19671 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83920/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19671: [SPARK-22297][CORE TESTS] Flaky test: BlockManagerSuite ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19671 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19671: [SPARK-22297][CORE TESTS] Flaky test: BlockManagerSuite ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19671 **[Test build #83920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83920/testReport)** for PR 19671 at commit [`6da1d34`](https://github.com/apache/spark/commit/6da1d348ee5a14cb4cb34d6c2ca8be824aab5264). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83919/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19750 **[Test build #83919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83919/testReport)** for PR 19750 at commit [`81b3f3d`](https://github.com/apache/spark/commit/81b3f3dd223785e02ad2df2a6e40046a76539681). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151311832 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja "(>= 2). If a feature is found to have > maxCategories values, then " + "it is declared continuous.", typeConverter=TypeConverters.toInt) +handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle invalid data " + + "(unseen labels or NULL values). Options are 'skip' (filter out " + + "rows with invalid data), 'error' (throw an error), or 'keep' (put " + + "invalid data in a special additional bucket, at index numCategories).", + typeConverter=TypeConverters.toString) + @keyword_only -def __init__(self, maxCategories=20, inputCol=None, outputCol=None): +def __init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): """ -__init__(self, maxCategories=20, inputCol=None, outputCol=None) +__init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error") """ super(VectorIndexer, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid) -self._setDefault(maxCategories=20) +self._setDefault(maxCategories=20, handleInvalid="error") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.4.0") -def setParams(self, maxCategories=20, inputCol=None, outputCol=None): +def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): --- End diff -- You can also check `Param._set` method in pyspark, you will find, it skips input params which value is `None` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151311569 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja "(>= 2). If a feature is found to have > maxCategories values, then " + "it is declared continuous.", typeConverter=TypeConverters.toInt) +handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle invalid data " + + "(unseen labels or NULL values). Options are 'skip' (filter out " + + "rows with invalid data), 'error' (throw an error), or 'keep' (put " + + "invalid data in a special additional bucket, at index numCategories).", + typeConverter=TypeConverters.toString) + @keyword_only -def __init__(self, maxCategories=20, inputCol=None, outputCol=None): +def __init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): """ -__init__(self, maxCategories=20, inputCol=None, outputCol=None) +__init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error") """ super(VectorIndexer, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid) -self._setDefault(maxCategories=20) +self._setDefault(maxCategories=20, handleInvalid="error") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.4.0") -def setParams(self, maxCategories=20, inputCol=None, outputCol=None): +def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): --- End diff -- ah, but, unfortunately, I think you're wrong. The `inputCol=None` represent, if user do not specify the inputCol, there is no default value, and exception will be thrown. Duplicating default params is an issue, but already exists in all the pyspark.ml estimator/models. e.g., you can check `StringIndexer` in pyspark, it also has `handleInvalid` param. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19746 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19746 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83918/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19746 **[Test build #83918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83918/testReport)** for PR 19746 at commit [`03bd63c`](https://github.com/apache/spark/commit/03bd63c654cc3f8982ec726c50472759f704918d). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class VectorSizeHint @Since(\"2.3.0\") (@Since(\"2.3.0\") override val uid: String)` * ` class InvalidEntryException(msg: String) extends Exception(msg)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12162: [SPARK-14289][WIP] Support multiple eviction strategies ...
Github user michaelmior commented on the issue: https://github.com/apache/spark/pull/12162 As best I can tell, the code that was pushed here is incomplete. However, Spark's default cache eviction policy is LRU. You can find the code which performs eviction [here](https://github.com/apache/spark/blob/1e82335413bc2384073ead0d6d581c862036d0f5/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L501). It basically just works by storing all the data in a `LinkedHashMap` configured to track which elements were accessed most recently. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19741: [SPARK-14228][CORE][YARN] Lost executor of RPC di...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19741#discussion_r151308593 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala --- @@ -268,8 +268,13 @@ private[spark] abstract class YarnSchedulerBackend( logWarning(reason.toString) driverEndpoint.ask[Boolean](r).onFailure { --- End diff -- IIRC there are two different `RemoveExecutor` messages in the code, as confusing as that may be. But if this one is used in multiple places then it's probably not worth changing right now, unless you're up for verifying the return value is not needed in the other places. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19633: [SPARK-22411][SQL] Disable the heuristic to calcu...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19633#discussion_r151308496 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -424,11 +424,19 @@ case class FileSourceScanExec( val defaultMaxSplitBytes = fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes -val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism -val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum -val bytesPerCore = totalBytes / defaultParallelism -val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) +// Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411 +val maxSplitBytes = + if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) { +defaultMaxSplitBytes --- End diff -- >For small data, sometimes users care about the number of output files generated as well. Can you please elaborate more, do you want more output files or less output files. if less, I think I already mentioned in the above comment. Seems you patch already ignored `bytesPerCore` when dynamic allocation is enabled, are you suggesting something different? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19643#discussion_r151307924 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1838,12 +1852,21 @@ class SparkContext(config: SparkConf) extends Logging { case _ => path } } + if (key != null) { val timestamp = System.currentTimeMillis if (addedJars.putIfAbsent(key, timestamp).isEmpty) { logInfo(s"Added JAR $path at $key with timestamp $timestamp") postEnvironmentUpdate() } + +if (addToCurrentClassLoader) { + Utils.getContextOrSparkClassLoader match { +case cl: MutableURLClassLoader => cl.addURL(Utils.resolveURI(path).toURL) --- End diff -- I'm not sure does it support remote jars on HTTPS or Hadoop FileSystemsï¼In the executor side, we handle this explicitly by downloading jars to local and add to classpath, but here looks like we don't have such logic. I'm not sure how this `URLClassLoader` communicate with Hadoop or Https without certificates. The `addJar` is just adding jars to fileserver, so that executor could fetch them from driver and add to classpath. It will not affect driver's classpath. If we support adding jars to current driver's classloader, then how do we leverage this newly added jars? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19170: [SPARK-21961][Core] Filter out BlockStatuses Accumulator...
Github user zhouyejoe commented on the issue: https://github.com/apache/spark/pull/19170 I will work on it. Thanks for review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19741: [SPARK-14228][CORE][YARN] Lost executor of RPC di...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19741#discussion_r151305271 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala --- @@ -268,8 +268,13 @@ private[spark] abstract class YarnSchedulerBackend( logWarning(reason.toString) driverEndpoint.ask[Boolean](r).onFailure { --- End diff -- I think when you use `send`, all the related things as you mentioned above should be changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19468 I am actually quite swamped right now, hence the delay in getting to this PR. I will try to take a pass over the weekend; but if it does get merged if @vanzin is ok with it, fine by me ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19741: [SPARK-14228][CORE][YARN] Lost executor of RPC di...
Github user devaraj-kavali commented on a diff in the pull request: https://github.com/apache/spark/pull/19741#discussion_r151303638 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala --- @@ -268,8 +268,13 @@ private[spark] abstract class YarnSchedulerBackend( logWarning(reason.toString) driverEndpoint.ask[Boolean](r).onFailure { --- End diff -- Thanks @vanzin for the review. I think it is a good suggestion, there are other places where the `RemoveExecutor` message is sending using the `ask`, are you suggesting to change those as well? > (And the driver endpoint will probably need a minor change too.) You mean moving the case `RemoveExecutor(executorId, reason)` from `receiveAndReply` to `receive`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19760 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83916/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19760 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19760 **[Test build #83916 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83916/testReport)** for PR 19760 at commit [`fcf3fbf`](https://github.com/apache/spark/commit/fcf3fbfc47fbf4ea7290f74e65f324eee68f7aba). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19601: [SPARK-22383][SQL] Generate code to directly get value o...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19601 @cloud-fan could you please review this again? I merged with the `ColumnarArray`. As you suggested, the latest implementation does not change `ColumnVector` and `ColumnarArray`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83915/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19750 **[Test build #83915 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83915/testReport)** for PR 19750 at commit [`a004ce9`](https://github.com/apache/spark/commit/a004ce915a21e0dc8093bfdde4fe4290e9fa5fc2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151298582 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja "(>= 2). If a feature is found to have > maxCategories values, then " + "it is declared continuous.", typeConverter=TypeConverters.toInt) +handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle invalid data " + + "(unseen labels or NULL values). Options are 'skip' (filter out " + + "rows with invalid data), 'error' (throw an error), or 'keep' (put " + + "invalid data in a special additional bucket, at index numCategories).", + typeConverter=TypeConverters.toString) + @keyword_only -def __init__(self, maxCategories=20, inputCol=None, outputCol=None): +def __init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): """ -__init__(self, maxCategories=20, inputCol=None, outputCol=None) +__init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error") """ super(VectorIndexer, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid) -self._setDefault(maxCategories=20) +self._setDefault(maxCategories=20, handleInvalid="error") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.4.0") -def setParams(self, maxCategories=20, inputCol=None, outputCol=None): +def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): --- End diff -- Another Q: I see there's a pattern of `setParams` using `None` as a default value for all/most of its arguments in other featurizers, perhaps we should do the same (i.e. have a default argument of `handleValid=None` here)? IMO specifying the default parameter value in one place is preferable to duplicating it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19753: [SPARK-22521][ML] VectorIndexerModel support hand...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19753#discussion_r151298765 --- Diff: python/pyspark/ml/feature.py --- @@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, Ja "(>= 2). If a feature is found to have > maxCategories values, then " + "it is declared continuous.", typeConverter=TypeConverters.toInt) +handleInvalid = Param(Params._dummy(), "handleInvalid", "How to handle invalid data " + + "(unseen labels or NULL values). Options are 'skip' (filter out " + + "rows with invalid data), 'error' (throw an error), or 'keep' (put " + + "invalid data in a special additional bucket, at index numCategories).", + typeConverter=TypeConverters.toString) + @keyword_only -def __init__(self, maxCategories=20, inputCol=None, outputCol=None): +def __init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): """ -__init__(self, maxCategories=20, inputCol=None, outputCol=None) +__init__(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error") """ super(VectorIndexer, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.VectorIndexer", self.uid) -self._setDefault(maxCategories=20) +self._setDefault(maxCategories=20, handleInvalid="error") kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.4.0") -def setParams(self, maxCategories=20, inputCol=None, outputCol=None): +def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"): --- End diff -- The same goes for the constructor (IMO we should default to `handleInvalid=None` there too), but open to hearing your thoughts. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19762 **[Test build #83925 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83925/testReport)** for PR 19762 at commit [`dffcb05`](https://github.com/apache/spark/commit/dffcb05e3c7b7ef0973769337092b2b7adbc7992). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19761 **[Test build #83924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83924/testReport)** for PR 19761 at commit [`adbf94e`](https://github.com/apache/spark/commit/adbf94eef67780b2ef547eff250ea3e11e4deda2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19761 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19761 **[Test build #83923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83923/testReport)** for PR 19761 at commit [`ba4d590`](https://github.com/apache/spark/commit/ba4d590b192cfe220bfed0e1bd690af746cf2ad1). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [WIP][SPARK-22479][SQL][BRANCH-2.2] Exclude credentials ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19761 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83923/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for addJar ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19643 Hi @jerryshao. Would you maybe have some time to take a look for this one please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19643#discussion_r151296077 --- Diff: R/pkg/R/context.R --- @@ -319,6 +319,27 @@ spark.addFile <- function(path, recursive = FALSE) { invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive)) } +#' Adds a JAR dependency for Spark tasks to be executed in the future. +#' +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. +#' If \code{addToCurrentClassLoader} is true, add the jar to the current driver. --- End diff -- @mariusvniekerk are you okay with proceeding this here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19761 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19761 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83922/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19761 **[Test build #83922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83922/testReport)** for PR 19761 at commit [`fd36d2f`](https://github.com/apache/spark/commit/fd36d2f54939a27406f4077a04a6f7dad328f840). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19630 Looks clean and pretty solid in general. Let me take another look to double check, probably, within this weekend and maybe I will leave it to @ueshin if I can take the look ahead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19761 **[Test build #83923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83923/testReport)** for PR 19761 at commit [`ba4d590`](https://github.com/apache/spark/commit/ba4d590b192cfe220bfed0e1bd690af746cf2ad1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19761: [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19761 **[Test build #83922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83922/testReport)** for PR 19761 at commit [`fd36d2f`](https://github.com/apache/spark/commit/fd36d2f54939a27406f4077a04a6f7dad328f840). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19762 **[Test build #83921 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83921/testReport)** for PR 19762 at commit [`a1f054b`](https://github.com/apache/spark/commit/a1f054ba9aff2bef5caf4534ef56dfd5e22eb96e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19762: [SPARK-22535][PySpark] Sleep before killing the p...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/19762 [SPARK-22535][PySpark] Sleep before killing the python worker in PythonRunner.MonitorThread ## What changes were proposed in this pull request? `PythonRunner.MonitorThread` should give the task a little time to finish before forcibly killing the python worker. This will reduce the chance of the race condition a lot. I also improved the log a bit to find out the task to blame when it's stuck. ## How was this patch tested? Jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-22535 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19762.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19762 commit a1f054ba9aff2bef5caf4534ef56dfd5e22eb96e Author: Shixiong Zhu Date: 2017-11-16T00:35:09Z Sleep before killing the python worker in PythonRunner.MonitorThread --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19762: [SPARK-22535][PySpark] Sleep before killing the python w...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19762 cc @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19753: [SPARK-22521][ML] VectorIndexerModel support handle unse...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19753 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org