[GitHub] spark issue #20239: [SPARK-23047][PYTHON][SQL] Change MapVector to NullableM...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20239 Btw, I don't mean to block this pr but why does only `MapVector` have `Nullable` version, just out of curiosity. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161157993 --- Diff: core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala --- @@ -52,6 +54,10 @@ private[spark] class BroadcastManager( private val nextBroadcastId = new AtomicLong(0) + private[broadcast] val cachedValues = { +new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK) --- End diff -- ohh, I see. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161157178 --- Diff: core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala --- @@ -52,6 +54,10 @@ private[spark] class BroadcastManager( private val nextBroadcastId = new AtomicLong(0) + private[broadcast] val cachedValues = { +new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK) --- End diff -- `can remove mappings...` That why I say entries can be removed automatically. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20204#discussion_r161155869 --- Diff: python/run-tests-with-coverage --- @@ -0,0 +1,69 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +set -o pipefail +set -e + +# This variable indicates which coverage executable to run to combine coverages +# and generate HTMLs, for example, 'coverage3' in Python 3. +COV_EXEC="${COV_EXEC:-coverage}" +FWDIR="$(cd "`dirname $0`"; pwd)" +pushd "$FWDIR" > /dev/null + +# Ensure that coverage executable is installed. +if ! hash $COV_EXEC 2>/dev/null; then + echo "Missing coverage executable in your path, skipping PySpark coverage" + exit 1 +fi + +# Set up the directories for coverage results. +export COVERAGE_DIR="$FWDIR/test_coverage" +rm -fr "$COVERAGE_DIR/coverage_data" +rm -fr "$COVERAGE_DIR/htmlcov" +mkdir -p "$COVERAGE_DIR/coverage_data" + +# Current directory are added in the python path so that it doesn't refer our built +# pyspark zip library first. +export PYTHONPATH="$FWDIR:$PYTHONPATH" +# Also, our sitecustomize.py and coverage_daemon.py are included in the path. +export PYTHONPATH="$COVERAGE_DIR:$PYTHONPATH" + +# We use 'spark.python.daemon.module' configuration to insert the coverage supported workers. +export SPARK_CONF_DIR="$COVERAGE_DIR/conf" + +# This environment variable enables the coverage. +export COVERAGE_PROCESS_START="$FWDIR/.coveragerc" + +# If you'd like to run a specific unittest class, you could do such as +# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests +./run-tests $@ --- End diff -- nit: `"$@"` instead of `$@`, just in case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20214: [SPARK-23023][SQL] Cast field data to strings in ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/20214#discussion_r161156501 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -237,13 +237,18 @@ class Dataset[T] private[sql]( private[sql] def showString( _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = { val numRows = _numRows.max(0).min(Int.MaxValue - 1) -val takeResult = toDF().take(numRows + 1) +val newDf = toDF() +val castExprs = newDf.schema.map { f => f.dataType match { + // Since binary types in top-level schema fields have a specific format to print, + // so we do not cast them to strings here. + case BinaryType => s"`${f.name}`" + case _: UserDefinedType[_] => s"`${f.name}`" --- End diff -- oh, yea. I missed that. Thanks, I'll make a separate pr. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20245 **[Test build #86024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86024/testReport)** for PR 20245 at commit [`473b37a`](https://github.com/apache/spark/commit/473b37a1d2abd74f66d29883a1e8bb488cf6bb94). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161156210 --- Diff: core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala --- @@ -52,6 +54,10 @@ private[spark] class BroadcastManager( private val nextBroadcastId = new AtomicLong(0) + private[broadcast] val cachedValues = { +new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK) --- End diff -- BTW - welcome back @jerryshao ! long time no see! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161156002 --- Diff: core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala --- @@ -52,6 +54,10 @@ private[spark] class BroadcastManager( private val nextBroadcastId = new AtomicLong(0) + private[broadcast] val cachedValues = { +new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK) --- End diff -- That should be fine even if that hard references not removed, since the memory consumption should be quite minor. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20245 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20245 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20245 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86019/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20245 **[Test build #86019 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86019/testReport)** for PR 20245 at commit [`473b37a`](https://github.com/apache/spark/commit/473b37a1d2abd74f66d29883a1e8bb488cf6bb94). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20183 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161154730 --- Diff: core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala --- @@ -52,6 +54,10 @@ private[spark] class BroadcastManager( private val nextBroadcastId = new AtomicLong(0) + private[broadcast] val cachedValues = { +new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK) --- End diff -- This is what I found from the doc: >Hash-based Map implementation that allows mappings to be removed by the garbage collector. When you construct a ReferenceMap, you can specify what kind of references are used to store the map's keys and values. If non-hard references are used, then the garbage collector can remove mappings if a key or value becomes unreachable, or if the JVM's memory is running low. For information on how the different reference types behave, see Reference. It only mentions that non-hard references can be removed by GC, please correct me if I'm wrong. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20183 thanks, merging to master/2.3! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...
Github user ivoson commented on the issue: https://github.com/apache/spark/pull/20244 This is the stack trace of the Exception. ``` java.lang.ClassCastException: org.apache.spark.rdd.CheckpointRDDPartition cannot be cast to org.apache.spark.streaming.rdd.MapWithStateRDDPartition at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:152) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336) at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161154173 --- Diff: core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala --- @@ -52,6 +54,10 @@ private[spark] class BroadcastManager( private val nextBroadcastId = new AtomicLong(0) + private[broadcast] val cachedValues = { +new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK) --- End diff -- according to the document of `ReferenceMap`, if key or value is eligible for GC, the entry will be removed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20244 @ivoson Tengfei, please post the full stack trace of the `ClassCastException`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20229: [SPARK-23045][ML][SparkR] Update RFormula to use ...
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/20229#discussion_r161153997 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -230,16 +231,17 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) val encodedTerms = resolvedFormula.terms.map { case Seq(term) if dataset.schema(term).dataType == StringType => val encodedCol = tmpColumn("onehot") -var encoder = new OneHotEncoder() - .setInputCol(indexed(term)) - .setOutputCol(encodedCol) // Formula w/o intercept, one of the categories in the first category feature is // being used as reference category, we will not drop any category for that feature. if (!hasIntercept && !keepReferenceCategory) { - encoder = encoder.setDropLast(false) + encoderStages += new OneHotEncoderEstimator(uid) +.setInputCols(Array(indexed(term))) +.setOutputCols(Array(encodedCol)) +.setDropLast(false) --- End diff -- There is at most 1 encoder with `dropLast(false)`, the next line sets `keepReferenceCategory = true` to ensure we won't take this code path for the remaining columns. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20242 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86015/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20242 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20183 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20183 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86016/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20242 **[Test build #86015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86015/testReport)** for PR 20242 at commit [`752daa3`](https://github.com/apache/spark/commit/752daa38736d9f620efc47d65fe9277315a397e9). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20183 **[Test build #86016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86016/testReport)** for PR 20183 at commit [`a6e5e81`](https://github.com/apache/spark/commit/a6e5e810402d0fc807d86dba9f699d2851be1f3a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20214: [SPARK-23023][SQL] Cast field data to strings in ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20214#discussion_r161153123 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -237,13 +237,18 @@ class Dataset[T] private[sql]( private[sql] def showString( _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = { val numRows = _numRows.max(0).min(Int.MaxValue - 1) -val takeResult = toDF().take(numRows + 1) +val newDf = toDF() +val castExprs = newDf.schema.map { f => f.dataType match { + // Since binary types in top-level schema fields have a specific format to print, + // so we do not cast them to strings here. + case BinaryType => s"`${f.name}`" + case _: UserDefinedType[_] => s"`${f.name}`" --- End diff -- How about something like: ```scala case udt: UserDefinedType[_] => (c, evPrim, evNull) => { val udtTerm = ctx.addReferenceObj("udt", udt) s"$evPrim = UTF8String.fromString($udtTerm.deserialize($c).toString());" } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20240: [SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20240 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20240: [SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20240 **[Test build #86014 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86014/testReport)** for PR 20240 at commit [`33ae3ca`](https://github.com/apache/spark/commit/33ae3ca34aa237c630927c96d9421ea53ed6a775). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20240: [SPARK-23049][SQL] `spark.sql.files.ignoreCorruptFiles` ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20240 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86014/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20217: [SPARK-23026] [PySpark] Add RegisterUDF to PySpark
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20217 @cloud-fan, do you prefer to have a new API just to be clear, BTW? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161151338 --- Diff: core/src/main/scala/org/apache/spark/broadcast/BroadcastManager.scala --- @@ -52,6 +54,10 @@ private[spark] class BroadcastManager( private val nextBroadcastId = new AtomicLong(0) + private[broadcast] val cachedValues = { +new ReferenceMap(AbstractReferenceMap.HARD, AbstractReferenceMap.WEAK) --- End diff -- If the key is a hard reference, does it mean that this key will never be cleaned from map automatically based on GC? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161149468 --- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala --- @@ -206,36 +206,50 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long) private def readBroadcastBlock(): T = Utils.tryOrIOException { TorrentBroadcast.synchronized { - setConf(SparkEnv.get.conf) - val blockManager = SparkEnv.get.blockManager - blockManager.getLocalValues(broadcastId) match { -case Some(blockResult) => - if (blockResult.data.hasNext) { -val x = blockResult.data.next().asInstanceOf[T] -releaseLock(broadcastId) -x - } else { -throw new SparkException(s"Failed to get locally stored broadcast data: $broadcastId") - } -case None => - logInfo("Started reading broadcast variable " + id) - val startTimeMs = System.currentTimeMillis() - val blocks = readBlocks() - logInfo("Reading broadcast variable " + id + " took" + Utils.getUsedTimeMs(startTimeMs)) - - try { -val obj = TorrentBroadcast.unBlockifyObject[T]( - blocks.map(_.toInputStream()), SparkEnv.get.serializer, compressionCodec) -// Store the merged copy in BlockManager so other tasks on this executor don't -// need to re-fetch it. -val storageLevel = StorageLevel.MEMORY_AND_DISK -if (!blockManager.putSingle(broadcastId, obj, storageLevel, tellMaster = false)) { - throw new SparkException(s"Failed to store $broadcastId in BlockManager") + val broadcastCache = SparkEnv.get.broadcastManager.cachedValues + + Option(broadcastCache.get(broadcastId)).map(_.asInstanceOf[T]).getOrElse { --- End diff -- I see, thanks for pointing out. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20183 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86010/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20183 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20163 **[Test build #86023 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86023/testReport)** for PR 20163 at commit [`d307cee`](https://github.com/apache/spark/commit/d307ceeaa581dc219ae744482a6d3c374b3a3025). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20242 Thanks @dongwang218, LGTM It seems like the java linter checks are not included in https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-lint/. I'll update the scripts so that they're automatically checked. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20183: [SPARK-22986][Core] Use a cache to avoid instantiating m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20183 **[Test build #86010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86010/testReport)** for PR 20183 at commit [`8e13585`](https://github.com/apache/spark/commit/8e13585f563162362d4e20f473438a0d0f9ce3d3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20226 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20226 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86022/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20226 **[Test build #86022 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86022/testReport)** for PR 20226 at commit [`9bcd905`](https://github.com/apache/spark/commit/9bcd9052416d7d6127b560f9d7ce77965a0005ce). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20163 jenkins retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20224 As high level comment, to add IDs helps performance/error diagnosis in production environments. I strongly support to always enable this. Let me look at technical detail later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user ho3rexqj commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161148920 --- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala --- @@ -206,36 +206,50 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long) private def readBroadcastBlock(): T = Utils.tryOrIOException { TorrentBroadcast.synchronized { - setConf(SparkEnv.get.conf) - val blockManager = SparkEnv.get.blockManager - blockManager.getLocalValues(broadcastId) match { -case Some(blockResult) => - if (blockResult.data.hasNext) { -val x = blockResult.data.next().asInstanceOf[T] -releaseLock(broadcastId) -x - } else { -throw new SparkException(s"Failed to get locally stored broadcast data: $broadcastId") - } -case None => - logInfo("Started reading broadcast variable " + id) - val startTimeMs = System.currentTimeMillis() - val blocks = readBlocks() - logInfo("Reading broadcast variable " + id + " took" + Utils.getUsedTimeMs(startTimeMs)) - - try { -val obj = TorrentBroadcast.unBlockifyObject[T]( - blocks.map(_.toInputStream()), SparkEnv.get.serializer, compressionCodec) -// Store the merged copy in BlockManager so other tasks on this executor don't -// need to re-fetch it. -val storageLevel = StorageLevel.MEMORY_AND_DISK -if (!blockManager.putSingle(broadcastId, obj, storageLevel, tellMaster = false)) { - throw new SparkException(s"Failed to store $broadcastId in BlockManager") + val broadcastCache = SparkEnv.get.broadcastManager.cachedValues + + Option(broadcastCache.get(broadcastId)).map(_.asInstanceOf[T]).getOrElse { --- End diff -- `ReferenceMap` is not thread safe, no - however, all operations on `broadcastCache` occur within the context of a synchronized block; TorrentBroadcast.scala lines 208-254. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19001 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86013/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19001 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19001 **[Test build #86013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86013/testReport)** for PR 19001 at commit [`7b8a072`](https://github.com/apache/spark/commit/7b8a0729b38ba2fbdc1c4359fcb82a1b6cde5b5c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` throw new IOException(\"Cannot find class \" + inputFormatClassName, e);` * ` throw new IOException(\"Unable to find the InputFormat class \" + inputFormatClassName, e);` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161147892 --- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala --- @@ -206,36 +206,50 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long) private def readBroadcastBlock(): T = Utils.tryOrIOException { TorrentBroadcast.synchronized { - setConf(SparkEnv.get.conf) - val blockManager = SparkEnv.get.blockManager - blockManager.getLocalValues(broadcastId) match { -case Some(blockResult) => - if (blockResult.data.hasNext) { -val x = blockResult.data.next().asInstanceOf[T] -releaseLock(broadcastId) -x - } else { -throw new SparkException(s"Failed to get locally stored broadcast data: $broadcastId") - } -case None => - logInfo("Started reading broadcast variable " + id) - val startTimeMs = System.currentTimeMillis() - val blocks = readBlocks() - logInfo("Reading broadcast variable " + id + " took" + Utils.getUsedTimeMs(startTimeMs)) - - try { -val obj = TorrentBroadcast.unBlockifyObject[T]( - blocks.map(_.toInputStream()), SparkEnv.get.serializer, compressionCodec) -// Store the merged copy in BlockManager so other tasks on this executor don't -// need to re-fetch it. -val storageLevel = StorageLevel.MEMORY_AND_DISK -if (!blockManager.putSingle(broadcastId, obj, storageLevel, tellMaster = false)) { - throw new SparkException(s"Failed to store $broadcastId in BlockManager") + val broadcastCache = SparkEnv.get.broadcastManager.cachedValues + + Option(broadcastCache.get(broadcastId)).map(_.asInstanceOf[T]).getOrElse { --- End diff -- Is this `ReferenceMap` thread safe? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20243 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86017/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20226: [SPARK-23034][SQL][UI] Display tablename for `Hiv...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/20226#discussion_r161147457 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala --- @@ -62,6 +62,8 @@ case class HiveTableScanExec( override def conf: SQLConf = sparkSession.sessionState.conf + override def nodeName: String = s"${super.nodeName} (${relation.tableMeta.qualifiedName})" --- End diff -- @gatorsmile : I have updated the PR after going through all the *ScanExec implementations Changes introduced in this PR: Scan impl | overridden `nodeName` | - DataSourceV2ScanExec | `Scan DataSourceV2 [output_attribute1, output_attribute2, ..]` ExternalRDDScanExec | `Scan ExternalRDD [output_attribute1, output_attribute2, ..]` FileSourceScanExec | `Scan FileSource ${tableIdentifier.map(_.unquotedString).getOrElse(relation.location)}"` HiveTableScanExec | `Scan HiveTable relation.tableMeta.qualifiedName` InMemoryTableScanExec | `Scan In-memory relation.tableName` LocalTableScanExec | `Scan LocalTable [output_attribute1, output_attribute2, ..]` RDDScanExec | `Scan RDD name [output_attribute1, output_attribute2, ..]` RowDataSourceScanExec | `Scan FileSource ${tableIdentifier.map(_.unquotedString).getOrElse(relation)}` Things not affected: - DataSourceScanExec : already uses `Scan relation tableIdentifier.unquotedString` - RDDScanExec forces clients to specify the `nodeName` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20226: [SPARK-23034][SQL][UI] Display tablename for `HiveTableS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20226 **[Test build #86022 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86022/testReport)** for PR 20226 at commit [`9bcd905`](https://github.com/apache/spark/commit/9bcd9052416d7d6127b560f9d7ce77965a0005ce). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20243 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20243: [SPARK-23052][SS] Migrate ConsoleSink to data source V2 ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20243 **[Test build #86017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86017/testReport)** for PR 20243 at commit [`71cc6e4`](https://github.com/apache/spark/commit/71cc6e41cc19af8e672c67624ca16f330804ccc8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r161146793 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -457,13 +458,26 @@ class RelationalGroupedDataset protected[sql]( val groupingNamedExpressions = groupingExprs.map { case ne: NamedExpression => ne - case other => Alias(other, other.toString)() + case other => Alias(other, toPrettySQL(other))() } val groupingAttributes = groupingNamedExpressions.map(_.toAttribute) val child = df.logicalPlan val project = Project(groupingNamedExpressions ++ child.output, child) -val output = expr.dataType.asInstanceOf[StructType].toAttributes -val plan = FlatMapGroupsInPandas(groupingAttributes, expr, output, project) +val udfOutput: Seq[Attribute] = expr.dataType.asInstanceOf[StructType].toAttributes +val additionalGroupingAttributes = mutable.ArrayBuffer[Attribute]() + +for (attribute <- groupingAttributes) { + if (!udfOutput.map(_.name).contains(attribute.name)) { --- End diff -- Maybe this relates to the discussion above (https://github.com/apache/spark/pull/20211#discussion_r160524679). Let's wait and see for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20214 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20214 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86012/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20214: [SPARK-23023][SQL] Cast field data to strings in showStr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20214 **[Test build #86012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86012/testReport)** for PR 20214 at commit [`afe0af5`](https://github.com/apache/spark/commit/afe0af504b8a799dadfcd8c18ade339432f889b0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
Github user ivoson commented on a diff in the pull request: https://github.com/apache/spark/pull/20244#discussion_r161145542 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2399,6 +2417,93 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi } } + /** + * In this test, we simply simulate the scene in concurrent jobs using the same + * rdd which is marked to do checkpoint: + * Job one has already finished the spark job, and start the process of doCheckpoint; + * Job two is submitted, and submitMissingTasks is called. + * In submitMissingTasks, if taskSerialization is called before doCheckpoint is done, + * while part calculates from stage.rdd.partitions is called after doCheckpoint is done, + * we may get a ClassCastException when execute the task because of some rdd will do + * Partition cast. + * + * With this test case, just want to indicate that we should do taskSerialization and + * part calculate in submitMissingTasks with the same rdd checkpoint status. + */ + test("task part misType with checkpoint rdd in concurrent execution scenes") { +// set checkpointDir. +val tempDir = Utils.createTempDir() +val checkpointDir = File.createTempFile("temp", "", tempDir) +checkpointDir.delete() +sc.setCheckpointDir(checkpointDir.toString) + +val latch = new CountDownLatch(2) +val semaphore1 = new Semaphore(0) +val semaphore2 = new Semaphore(0) + +val rdd = new WrappedRDD(sc.makeRDD(1 to 100, 4)) +rdd.checkpoint() + +val checkpointRunnable = new Runnable { + override def run() = { +// Simply simulate what RDD.doCheckpoint() do here. +rdd.doCheckpointCalled = true +val checkpointData = rdd.checkpointData.get +RDDCheckpointData.synchronized { + if (checkpointData.cpState == CheckpointState.Initialized) { +checkpointData.cpState = CheckpointState.CheckpointingInProgress + } +} + +val newRDD = checkpointData.doCheckpoint() + +// Release semaphore1 after job triggered in checkpoint finished. +semaphore1.release() +semaphore2.acquire() +// Update our state and truncate the RDD lineage. +RDDCheckpointData.synchronized { + checkpointData.cpRDD = Some(newRDD) + checkpointData.cpState = CheckpointState.Checkpointed + rdd.markCheckpointed() +} +semaphore1.release() + +latch.countDown() + } +} + +val submitMissingTasksRunnable = new Runnable { + override def run() = { +// Simply simulate the process of submitMissingTasks. +val ser = SparkEnv.get.closureSerializer.newInstance() +semaphore1.acquire() +// Simulate task serialization while submitMissingTasks. +// Task serialized with rdd checkpoint not finished. +val cleanedFunc = sc.clean(Utils.getIteratorSize _) +val func = (ctx: TaskContext, it: Iterator[Int]) => cleanedFunc(it) +val taskBinaryBytes = JavaUtils.bufferToArray( + ser.serialize((rdd, func): AnyRef)) +semaphore2.release() +semaphore1.acquire() +// Part calculated with rdd checkpoint already finished. +val (taskRdd, taskFunc) = ser.deserialize[(RDD[Int], (TaskContext, Iterator[Int]) => Unit)]( + ByteBuffer.wrap(taskBinaryBytes), Thread.currentThread.getContextClassLoader) +val part = rdd.partitions(0) +intercept[ClassCastException] { --- End diff -- it is a reproduce case, i will fix this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
Github user ivoson commented on a diff in the pull request: https://github.com/apache/spark/pull/20244#discussion_r161145538 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2399,6 +2417,93 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi } } + /** + * In this test, we simply simulate the scene in concurrent jobs using the same + * rdd which is marked to do checkpoint: + * Job one has already finished the spark job, and start the process of doCheckpoint; + * Job two is submitted, and submitMissingTasks is called. + * In submitMissingTasks, if taskSerialization is called before doCheckpoint is done, + * while part calculates from stage.rdd.partitions is called after doCheckpoint is done, + * we may get a ClassCastException when execute the task because of some rdd will do + * Partition cast. + * + * With this test case, just want to indicate that we should do taskSerialization and + * part calculate in submitMissingTasks with the same rdd checkpoint status. + */ + test("task part misType with checkpoint rdd in concurrent execution scenes") { --- End diff -- thanks for the suggest. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
Github user ivoson commented on a diff in the pull request: https://github.com/apache/spark/pull/20244#discussion_r161145547 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -96,6 +98,22 @@ class MyRDD( override def toString: String = "DAGSchedulerSuiteRDD " + id } +/** Wrapped rdd partition. */ +class WrappedPartition(val partition: Partition) extends Partition { + def index: Int = partition.index +} + +/** Wrapped rdd with WrappedPartition. */ +class WrappedRDD(parent: RDD[Int]) extends RDD[Int](parent) { + protected def getPartitions: Array[Partition] = { +parent.partitions.map(p => new WrappedPartition(p)) + } + + def compute(split: Partition, context: TaskContext): Iterator[Int] = { +parent.compute(split.asInstanceOf[WrappedPartition].partition, context) --- End diff -- thanks for the comment, i will work on this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20244 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #86021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86021/testReport)** for PR 20222 at commit [`440d11f`](https://github.com/apache/spark/commit/440d11f53a6f4b7531d96659c88a7e7961021778). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20244 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20222 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/20244#discussion_r161141879 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2399,6 +2417,93 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi } } + /** + * In this test, we simply simulate the scene in concurrent jobs using the same + * rdd which is marked to do checkpoint: + * Job one has already finished the spark job, and start the process of doCheckpoint; + * Job two is submitted, and submitMissingTasks is called. + * In submitMissingTasks, if taskSerialization is called before doCheckpoint is done, + * while part calculates from stage.rdd.partitions is called after doCheckpoint is done, + * we may get a ClassCastException when execute the task because of some rdd will do + * Partition cast. + * + * With this test case, just want to indicate that we should do taskSerialization and + * part calculate in submitMissingTasks with the same rdd checkpoint status. + */ + test("task part misType with checkpoint rdd in concurrent execution scenes") { --- End diff -- maybe "SPARK-23053: avoid CastException in concurrent execution with checkpoint" better? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/20244#discussion_r161141499 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -96,6 +98,22 @@ class MyRDD( override def toString: String = "DAGSchedulerSuiteRDD " + id } +/** Wrapped rdd partition. */ +class WrappedPartition(val partition: Partition) extends Partition { + def index: Int = partition.index +} + +/** Wrapped rdd with WrappedPartition. */ +class WrappedRDD(parent: RDD[Int]) extends RDD[Int](parent) { + protected def getPartitions: Array[Partition] = { +parent.partitions.map(p => new WrappedPartition(p)) + } + + def compute(split: Partition, context: TaskContext): Iterator[Int] = { +parent.compute(split.asInstanceOf[WrappedPartition].partition, context) --- End diff -- I think this line is the key point for `WrppedPartition` and `WrappedRDD`, please give comments for explaining your intention. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/20244#discussion_r161144809 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -2399,6 +2417,93 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi } } + /** + * In this test, we simply simulate the scene in concurrent jobs using the same + * rdd which is marked to do checkpoint: + * Job one has already finished the spark job, and start the process of doCheckpoint; + * Job two is submitted, and submitMissingTasks is called. + * In submitMissingTasks, if taskSerialization is called before doCheckpoint is done, + * while part calculates from stage.rdd.partitions is called after doCheckpoint is done, + * we may get a ClassCastException when execute the task because of some rdd will do + * Partition cast. + * + * With this test case, just want to indicate that we should do taskSerialization and + * part calculate in submitMissingTasks with the same rdd checkpoint status. + */ + test("task part misType with checkpoint rdd in concurrent execution scenes") { +// set checkpointDir. +val tempDir = Utils.createTempDir() +val checkpointDir = File.createTempFile("temp", "", tempDir) +checkpointDir.delete() +sc.setCheckpointDir(checkpointDir.toString) + +val latch = new CountDownLatch(2) +val semaphore1 = new Semaphore(0) +val semaphore2 = new Semaphore(0) + +val rdd = new WrappedRDD(sc.makeRDD(1 to 100, 4)) +rdd.checkpoint() + +val checkpointRunnable = new Runnable { + override def run() = { +// Simply simulate what RDD.doCheckpoint() do here. +rdd.doCheckpointCalled = true +val checkpointData = rdd.checkpointData.get +RDDCheckpointData.synchronized { + if (checkpointData.cpState == CheckpointState.Initialized) { +checkpointData.cpState = CheckpointState.CheckpointingInProgress + } +} + +val newRDD = checkpointData.doCheckpoint() + +// Release semaphore1 after job triggered in checkpoint finished. +semaphore1.release() +semaphore2.acquire() +// Update our state and truncate the RDD lineage. +RDDCheckpointData.synchronized { + checkpointData.cpRDD = Some(newRDD) + checkpointData.cpState = CheckpointState.Checkpointed + rdd.markCheckpointed() +} +semaphore1.release() + +latch.countDown() + } +} + +val submitMissingTasksRunnable = new Runnable { + override def run() = { +// Simply simulate the process of submitMissingTasks. +val ser = SparkEnv.get.closureSerializer.newInstance() +semaphore1.acquire() +// Simulate task serialization while submitMissingTasks. +// Task serialized with rdd checkpoint not finished. +val cleanedFunc = sc.clean(Utils.getIteratorSize _) +val func = (ctx: TaskContext, it: Iterator[Int]) => cleanedFunc(it) +val taskBinaryBytes = JavaUtils.bufferToArray( + ser.serialize((rdd, func): AnyRef)) +semaphore2.release() +semaphore1.acquire() +// Part calculated with rdd checkpoint already finished. +val (taskRdd, taskFunc) = ser.deserialize[(RDD[Int], (TaskContext, Iterator[Int]) => Unit)]( + ByteBuffer.wrap(taskBinaryBytes), Thread.currentThread.getContextClassLoader) +val part = rdd.partitions(0) +intercept[ClassCastException] { --- End diff -- I think this not a "test", this just a "reproduce" for the problem you want to fix. We should prove your code added in `DAGScheduler.scala` can fix that problem and with the original code base, a `ClassCastException` raised. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20222 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86008/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20222 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20222: [SPARK-23028] Bump master branch version to 2.4.0-SNAPSH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20222 **[Test build #86008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86008/testReport)** for PR 20222 at commit [`440d11f`](https://github.com/apache/spark/commit/440d11f53a6f4b7531d96659c88a7e7961021778). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20242: [MINOR][BUILD] Fix Java linter errors
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20242 LGTM, @dongjoon-hyun is the current changes include all the lint issues, or you still have further changes? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OOM when ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20184 @liutang123 , can you please tell us how to produce your issue easily? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20216: [SPARK-23024][WEB-UI]Spark ui about the contents of the ...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/20216 @ajbozarth @srowen Fix the code, increase the arrow of the form page, maintain the consistency of the function. after fix: ![4](https://user-images.githubusercontent.com/26266482/34861201-29dcd1e4-f79e-11e7-8015-28c320e4b4bc.png) ![5](https://user-images.githubusercontent.com/26266482/34861202-2a114334-f79e-11e7-8deb-428836770bef.png) ![6](https://user-images.githubusercontent.com/26266482/34861203-2a3f6a3e-f79e-11e7-9c2a-924ea67b12c7.png) ![7](https://user-images.githubusercontent.com/26266482/34861204-2a70296c-f79e-11e7-915a-fc6a64e07108.png) ![8](https://user-images.githubusercontent.com/26266482/34861205-2aa42262-f79e-11e7-8c97-828ec24e0f71.png) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20225 **[Test build #86020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86020/testReport)** for PR 20225 at commit [`49f1eb6`](https://github.com/apache/spark/commit/49f1eb63840b80984542553ff1037f61e7149a83). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20239: [SPARK-23047][PYTHON][SQL] Change MapVector to NullableM...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20239 I'm not sure we can change to `NullableMapVector` and I'm just worrying whether the `MapVector` is never happened here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20225 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user jose-torres commented on the issue: https://github.com/apache/spark/pull/20225 The most recent test build failure is from an earlier commit which I think is obsoleted. I think #86007 is correct but we should retest this please to confirm. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20222: [SPARK-23028] Bump master branch version to 2.4.0...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20222#discussion_r161140353 --- Diff: dev/run-tests-jenkins.py --- @@ -181,8 +181,8 @@ def main(): short_commit_hash = ghprb_actual_commit[0:7] # format: http://linux.die.net/man/1/timeout -# must be less than the timeout configured on Jenkins (currently 300m) -tests_timeout = "250m" +# must be less than the timeout configured on Jenkins (currently 350m) +tests_timeout = "300m" --- End diff -- Ah, this was the root causes. Thanks, @HyukjinKwon . Sorry, @shaneknapp . I sent an annoying email before knowing this here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20245 nitpicking though, could you check? @gatorsmile @wzhfy @mbasmanova --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types for co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20245 **[Test build #86019 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86019/testReport)** for PR 20245 at commit [`473b37a`](https://github.com/apache/spark/commit/473b37a1d2abd74f66d29883a1e8bb488cf6bb94). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20245: [SPARK-21213][SQL][FOLLOWUP] Use compatible types...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/20245 [SPARK-21213][SQL][FOLLOWUP] Use compatible types for comparisons in compareAndGetNewStats ## What changes were proposed in this pull request? This pr fixed code to compare values in `compareAndGetNewStats`. The test below fails in the current master; ``` val oldStats2 = CatalogStatistics(sizeInBytes = BigInt(Long.MaxValue) * 2) val newStats5 = CommandUtils.compareAndGetNewStats( Some(oldStats2), newTotalSize = BigInt(Long.MaxValue) * 2, None) assert(newStats5.isEmpty) ``` ## How was this patch tested? Added some tests in `CommandUtilsSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/maropu/spark SPARK-21213-FOLLOWUP Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20245.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20245 commit 473b37a1d2abd74f66d29883a1e8bb488cf6bb94 Author: Takeshi Yamamuro Date: 2018-01-12T04:57:32Z Use compatible types for comparisons in compareAndGetNewStats --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
GitHub user ivoson reopened a pull request: https://github.com/apache/spark/pull/20244 [SPARK-23053][CORE] taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status â¦d is the same when calculate taskSerialization and task partitions Change-Id: Ib9839ca552653343d264135c116742effa6feb60 ## What changes were proposed in this pull request? When we run concurrent jobs using the same rdd which is marked to do checkpoint. If one job has finished running the job, and start the process of RDD.doCheckpoint, while another job is submitted, then submitStage and submitMissingTasks will be called. In [submitMissingTasks](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961), will serialize taskBinaryBytes and calculate task partitions which are both affected by the status of checkpoint, if the former is calculated before doCheckpoint finished, while the latter is calculated after doCheckpoint finished, when run task, rdd.compute will be called, for some rdds with particular partition type such as [MapWithStateRDD](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala) who will do partition type cast, will get a ClassCastException because the part params is actually a CheckpointRDDPartition. ## How was this patch tested? the exist uts and also add a test case in DAGScheduerSuite to show the exception case. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ivoson/spark branch-taskpart-mistype Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20244.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20244 commit 0dea573e9e724d591803b73f678e14f94e0af447 Author: huangtengfei Date: 2018-01-12T02:53:29Z submitMissingTasks should make sure the checkpoint status of stage.rdd is the same when calculate taskSerialization and task partitions Change-Id: Ib9839ca552653343d264135c116742effa6feb60 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20244 reopen this... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...
Github user ivoson commented on the issue: https://github.com/apache/spark/pull/20244 @xuanyuanking could review this please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20244: [SPARK-23053][CORE] taskBinarySerialization and task par...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20244 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
Github user ivoson closed the pull request at: https://github.com/apache/spark/pull/20244 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20225 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86007/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20244: [SPARK-23053][CORE] taskBinarySerialization and t...
GitHub user ivoson opened a pull request: https://github.com/apache/spark/pull/20244 [SPARK-23053][CORE] taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status â¦d is the same when calculate taskSerialization and task partitions Change-Id: Ib9839ca552653343d264135c116742effa6feb60 ## What changes were proposed in this pull request? When we run concurrent jobs using the same rdd which is marked to do checkpoint. If one job has finished running the job, and start the process of RDD.doCheckpoint, while another job is submitted, then submitStage and submitMissingTasks will be called. In [submitMissingTasks](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961), will serialize taskBinaryBytes and calculate task partitions which are both affected by the status of checkpoint, if the former is calculated before doCheckpoint finished, while the latter is calculated after doCheckpoint finished, when run task, rdd.compute will be called, for some rdds with particular partition type such as [MapWithStateRDD](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/rdd/MapWithStateRDD.scala) who will do partition type cast, will get a ClassCastException because the part params is actually a CheckpointRDDPartition. ## How was this patch tested? the exist uts and also add a test case in DAGScheduerSuite to show the exception case. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ivoson/spark branch-taskpart-mistype Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20244.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20244 commit 0dea573e9e724d591803b73f678e14f94e0af447 Author: huangtengfei Date: 2018-01-12T02:53:29Z submitMissingTasks should make sure the checkpoint status of stage.rdd is the same when calculate taskSerialization and task partitions Change-Id: Ib9839ca552653343d264135c116742effa6feb60 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20225 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20163 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20163 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86003/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with returnType=S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20163 **[Test build #86003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86003/testReport)** for PR 20163 at commit [`d307cee`](https://github.com/apache/spark/commit/d307ceeaa581dc219ae744482a6d3c374b3a3025). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20232 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86002/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20232 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20232 **[Test build #86002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86002/testReport)** for PR 20232 at commit [`b5a7ad2`](https://github.com/apache/spark/commit/b5a7ad28eceb01d32ce15e49ddbb1710c6973d34). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20209 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20209 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86001/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20209 **[Test build #86001 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86001/testReport)** for PR 20209 at commit [`c7db798`](https://github.com/apache/spark/commit/c7db798d7faf343a27c778fd48f2cdd77d732107). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20225 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85997/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20225: [SPARK-23033] Don't use task level retry for continuous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20225 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20183: [SPARK-22986][Core] Use a cache to avoid instanti...
Github user ho3rexqj commented on a diff in the pull request: https://github.com/apache/spark/pull/20183#discussion_r161135870 --- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala --- @@ -206,37 +206,51 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long) private def readBroadcastBlock(): T = Utils.tryOrIOException { TorrentBroadcast.synchronized { - setConf(SparkEnv.get.conf) - val blockManager = SparkEnv.get.blockManager - blockManager.getLocalValues(broadcastId) match { -case Some(blockResult) => - if (blockResult.data.hasNext) { -val x = blockResult.data.next().asInstanceOf[T] -releaseLock(broadcastId) -x - } else { -throw new SparkException(s"Failed to get locally stored broadcast data: $broadcastId") - } -case None => - logInfo("Started reading broadcast variable " + id) - val startTimeMs = System.currentTimeMillis() - val blocks = readBlocks() - logInfo("Reading broadcast variable " + id + " took" + Utils.getUsedTimeMs(startTimeMs)) - - try { -val obj = TorrentBroadcast.unBlockifyObject[T]( - blocks.map(_.toInputStream()), SparkEnv.get.serializer, compressionCodec) -// Store the merged copy in BlockManager so other tasks on this executor don't -// need to re-fetch it. -val storageLevel = StorageLevel.MEMORY_AND_DISK -if (!blockManager.putSingle(broadcastId, obj, storageLevel, tellMaster = false)) { - throw new SparkException(s"Failed to store $broadcastId in BlockManager") + val broadcastCache = SparkEnv.get.broadcastManager.cachedValues + + Option(broadcastCache.get(broadcastId)).map(_.asInstanceOf[T]).getOrElse({ +setConf(SparkEnv.get.conf) --- End diff -- No, sorry - the cache update takes place within that block. With the exception of those blocks (lines 220-222 and lines 244-246), yes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org