[GitHub] spark issue #20767: [SPARK-23623] [SS] Avoid concurrent use of cached consum...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20767 **[Test build #88285 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88285/testReport)** for PR 20767 at commit [`5363ea8`](https://github.com/apache/spark/commit/5363ea8c4a22340fda257658aba122839fe5a15a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20767: [SPARK-23623] [SS] Avoid concurrent use of cached consum...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20767 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20767: [SPARK-23623] [SS] Avoid concurrent use of cached consum...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20767 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88285/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20687: [SPARK-23500][SQL] Fix complex type simplificatio...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20687#discussion_r174974450 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala --- @@ -331,4 +330,47 @@ class ComplexTypesSuite extends PlanTest with ExpressionEvalHelper { .analyze comparePlans(Optimizer execute rel, expected) } + + test("SPARK-23500: Simplify complex ops that aren't at the plan root") { +val structRel = relation + .select(GetStructField(CreateNamedStruct(Seq("att1", 'nullable_id)), 0, None) as "foo") + .groupBy($"foo")("1").analyze +val structExpected = relation + .select('nullable_id as "foo") + .groupBy($"foo")("1").analyze +comparePlans(Optimizer execute structRel, structExpected) + +// These tests must use nullable attributes from the base relation for the following reason: +// in the 'original' plans below, the Aggregate node produced by groupBy() has a +// nullable AttributeReference to a1, because both array indexing and map lookup are +// nullable expressions. After optimization, the same attribute is now non-nullable, +// but the AttributeReference is not updated to reflect this. In the 'expected' plans, +// the grouping expressions have the same nullability as the original attribute in the +// relation. If that attribute is non-nullable, the tests will fail as the plans will +// compare differently, so for these tests we must use a nullable attribute. See +// SPARK-23634. +val arrayRel = relation + .select(GetArrayItem(CreateArray(Seq('nullable_id, 'nullable_id + 1L)), 0) as "a1") + .groupBy($"a1")("1").analyze +val arrayExpected = relation.select('nullable_id as "a1").groupBy($"a1")("1").analyze +comparePlans(Optimizer execute arrayRel, arrayExpected) + +val mapRel = relation + .select(GetMapValue(CreateMap(Seq("id", 'nullable_id)), "id") as "m1") + .groupBy($"m1")("1").analyze +val mapExpected = relation + .select('nullable_id as "m1") + .groupBy($"m1")("1").analyze +comparePlans(Optimizer execute mapRel, mapExpected) + --- End diff -- It seems that the current test case become too long. For the following negative cases, let's split to another test case. Maybe, with the following title? ``` test("SPARK-23500: Aggregation expressions should not be simplified.") ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20687: [SPARK-23500][SQL] Fix complex type simplificatio...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20687#discussion_r174974033 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala --- @@ -19,57 +19,47 @@ package org.apache.spark.sql.catalyst.optimizer import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.plans.logical.Aggregate --- End diff -- This should be before line 21 in alphabetical order. You can check this locally with `dev/scalastyle`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20767: [SPARK-23623] [SS] Avoid concurrent use of cached consum...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20767 **[Test build #88285 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88285/testReport)** for PR 20767 at commit [`5363ea8`](https://github.com/apache/spark/commit/5363ea8c4a22340fda257658aba122839fe5a15a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20767: [SPARK-23623] [SS] Avoid concurrent use of cached consum...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20767 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20767: [SPARK-23623] [SS] Avoid concurrent use of cached consum...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20767 @tedyu @zsxwing thank you very much for catching the bugs. I have simplified the logic quite a bit. Note that I removed the invariant that I had introduced earlier. Additionally, I locally ran the stress test with 100 threads and 1 read attempts, which ran for 2 mins. It passed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20767: [SPARK-23623] [SS] Avoid concurrent use of cached consum...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20767 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1545/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20767: [SPARK-23623] [SS] Avoid concurrent use of cached...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20767#discussion_r174973494 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala --- @@ -342,80 +415,103 @@ private[kafka010] object CachedKafkaConsumer extends Logging { } } - def releaseKafkaConsumer( - topic: String, - partition: Int, - kafkaParams: ju.Map[String, Object]): Unit = { -val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String] -val topicPartition = new TopicPartition(topic, partition) -val key = CacheKey(groupId, topicPartition) - + private def releaseConsumer(intConsumer: InternalKafkaConsumer): Unit = { synchronized { - val consumer = cache.get(key) - if (consumer != null) { -consumer.inuse = false - } else { -logWarning(s"Attempting to release consumer that does not exist") - } -} - } - /** - * Removes (and closes) the Kafka Consumer for the given topic, partition and group id. - */ - def removeKafkaConsumer( - topic: String, - partition: Int, - kafkaParams: ju.Map[String, Object]): Unit = { -val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String] -val topicPartition = new TopicPartition(topic, partition) -val key = CacheKey(groupId, topicPartition) - -synchronized { - val removedConsumer = cache.remove(key) - if (removedConsumer != null) { -removedConsumer.close() + // If it has been marked for close, then do it any way + if (intConsumer.inuse && intConsumer.markedForClose) intConsumer.close() --- End diff -- I rewrote the logic. Hopefully, it's simpler to reason about it now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20799: [SPARK-23635][YARN] AM env variable should not overwrite...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/20799 looks good to me --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20827: [SPARK-23666][SQL] Do not display exprIds of Alias in us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20827 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20827: [SPARK-23666][SQL] Do not display exprIds of Alias in us...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20827 **[Test build #88284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88284/testReport)** for PR 20827 at commit [`68650ff`](https://github.com/apache/spark/commit/68650ff8c2f3a90c55b5bf4345c16a92fda3782a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20827: [SPARK-23666][SQL] Do not display exprIds of Alias in us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20827 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1544/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20813: [SPARK-23670][SQL] Fix memory leak on SparkPlanGr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20813 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20827: [SPARK-23666][SQL] Do not display exprIds of Alias in us...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20827 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20659 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88275/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20659 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20831: [SPARK-23614][SQL] Fix incorrect reuse exchange w...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/20831#discussion_r174969959 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala --- @@ -68,6 +69,15 @@ case class InMemoryRelation( override protected def innerChildren: Seq[SparkPlan] = Seq(child) + override def doCanonicalize(): logical.LogicalPlan = +copy(output = output.map(QueryPlan.normalizeExprId(_, child.output)), + storageLevel = StorageLevel.NONE, + child = child.canonicalized, + tableName = None)( + _cachedColumnBuffers, + sizeInBytesStats, + statsOfPlanToCache) --- End diff -- `statsOfPlanToCache` and `sizeInBytesStats`, too? For instance, `ResolveHint` drops hints in canonicalization: https://github.com/apache/spark/blob/3675af7247e841e9a689666dc20891ba55c612b3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala#L44 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20659 **[Test build #88275 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88275/testReport)** for PR 20659 at commit [`e041704`](https://github.com/apache/spark/commit/e041704c8898234299b8af1d85b09c1acc2532ab). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20813: [SPARK-23670][SQL] Fix memory leak on SparkPlanGraphWrap...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20813 I forgot to publish my comment above when I looked at this 2 days ago, so I'll fix it during merge. Merging to master / 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20744: [SPARK-23608][CORE][WebUI] Add synchronization in...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20744 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20839: [SPARK-23699][PYTHON][SQL] Raise same type of error caug...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20839 @HyukjinKwon, @ueshin please take a look when you can --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20814: [SPARK-23671][core] Fix condition to enable the S...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20814 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20744: [SPARK-23608][CORE][WebUI] Add synchronization in SHS be...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20744 This doesn't have anything to do with pyspark, so we can ignore those. Merging to master / 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20814: [SPARK-23671][core] Fix condition to enable the SHS thre...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20814 Merging to master / 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20785: [SPARK-23640][CORE] Fix hadoop config may overrid...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20785#discussion_r174968586 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2434,7 +2434,8 @@ private[spark] object Utils extends Logging { */ def getSparkOrYarnConfig(conf: SparkConf, key: String, default: String): String = { val sparkValue = conf.get(key, default) -if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn") { +if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn" --- End diff -- No. The logic you want here is the equivalent of: ``` if conf.contains(key) get spark conf elif is_running_on_yarn() get conf from yarn else return default ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20767: [SPARK-23623] [SS] Avoid concurrent use of cached...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20767#discussion_r174968594 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala --- @@ -342,80 +415,103 @@ private[kafka010] object CachedKafkaConsumer extends Logging { } } - def releaseKafkaConsumer( - topic: String, - partition: Int, - kafkaParams: ju.Map[String, Object]): Unit = { -val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String] -val topicPartition = new TopicPartition(topic, partition) -val key = CacheKey(groupId, topicPartition) - + private def releaseConsumer(intConsumer: InternalKafkaConsumer): Unit = { synchronized { - val consumer = cache.get(key) - if (consumer != null) { -consumer.inuse = false - } else { -logWarning(s"Attempting to release consumer that does not exist") - } -} - } - /** - * Removes (and closes) the Kafka Consumer for the given topic, partition and group id. - */ - def removeKafkaConsumer( - topic: String, - partition: Int, - kafkaParams: ju.Map[String, Object]): Unit = { -val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String] -val topicPartition = new TopicPartition(topic, partition) -val key = CacheKey(groupId, topicPartition) - -synchronized { - val removedConsumer = cache.remove(key) - if (removedConsumer != null) { -removedConsumer.close() + // If it has been marked for close, then do it any way + if (intConsumer.inuse && intConsumer.markedForClose) intConsumer.close() + intConsumer.inuse = false + + // Clear the consumer from the cache if this is indeed the consumer present in the cache + val key = new CacheKey(intConsumer.topicPartition, intConsumer.kafkaParams) + val cachedIntConsumer = cache.get(key) + if (cachedIntConsumer != null) { +if (cachedIntConsumer.eq(intConsumer)) { + // The released consumer is indeed the cached one. + cache.remove(key) +} else { + // The released consumer is not the cached one. Don't do anything. + // This should not happen as long as we maintain the invariant mentioned above. + logWarning( +s"Cached consumer not the same one as the one being release" + + s"\ncached = $cachedIntConsumer [${System.identityHashCode(cachedIntConsumer)}]" + + s"\nreleased = $intConsumer [${System.identityHashCode(intConsumer)}]") +} + } else { +// The released consumer is not in the cache. Don't do anything. +// This should not happen as long as we maintain the invariant mentioned above. +logWarning(s"Attempting to release consumer that is not in the cache") } } } /** * Get a cached consumer for groupId, assigned to topic and partition. * If matching consumer doesn't already exist, will be created using kafkaParams. + * The returned consumer must be released explicitly using [[KafkaDataConsumer.release()]]. + * + * Note: This method guarantees that the consumer returned is not currently in use by any one + * else. Within this guarantee, this will make a best effort attempt to re-use consumers by + * caching them and tracking when they are in use. */ - def getOrCreate( - topic: String, - partition: Int, - kafkaParams: ju.Map[String, Object]): CachedKafkaConsumer = synchronized { -val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String] -val topicPartition = new TopicPartition(topic, partition) -val key = CacheKey(groupId, topicPartition) - -// If this is reattempt at running the task, then invalidate cache and start with -// a new consumer + def acquire( + topicPartition: TopicPartition, + kafkaParams: ju.Map[String, Object], + useCache: Boolean): KafkaDataConsumer = synchronized { +val key = new CacheKey(topicPartition, kafkaParams) +val existingInternalConsumer = cache.get(key) + +lazy val newInternalConsumer = new InternalKafkaConsumer(topicPartition, kafkaParams) + if (TaskContext.get != null && TaskContext.get.attemptNumber >= 1) { - removeKafkaConsumer(topic, partition, kafkaParams) - val consumer = new CachedKafkaConsumer(topicPartition, kafkaParams) - consumer.inuse = true - cache.put(key, consumer) - consumer -} else { - if
[GitHub] spark issue #20327: [SPARK-12963][CORE] NM host for driver end points
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20327 **[Test build #88283 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88283/testReport)** for PR 20327 at commit [`ae4ad4a`](https://github.com/apache/spark/commit/ae4ad4a7568cf5845861237d848468c4dc8cf840). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20815: [SPARK-23658][LAUNCHER] InProcessAppHandle uses t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20815 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20767: [SPARK-23623] [SS] Avoid concurrent use of cached...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20767#discussion_r174968294 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala --- @@ -342,80 +415,103 @@ private[kafka010] object CachedKafkaConsumer extends Logging { } } - def releaseKafkaConsumer( - topic: String, - partition: Int, - kafkaParams: ju.Map[String, Object]): Unit = { -val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String] -val topicPartition = new TopicPartition(topic, partition) -val key = CacheKey(groupId, topicPartition) - + private def releaseConsumer(intConsumer: InternalKafkaConsumer): Unit = { synchronized { - val consumer = cache.get(key) - if (consumer != null) { -consumer.inuse = false - } else { -logWarning(s"Attempting to release consumer that does not exist") - } -} - } - /** - * Removes (and closes) the Kafka Consumer for the given topic, partition and group id. - */ - def removeKafkaConsumer( - topic: String, - partition: Int, - kafkaParams: ju.Map[String, Object]): Unit = { -val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String] -val topicPartition = new TopicPartition(topic, partition) -val key = CacheKey(groupId, topicPartition) - -synchronized { - val removedConsumer = cache.remove(key) - if (removedConsumer != null) { -removedConsumer.close() + // If it has been marked for close, then do it any way + if (intConsumer.inuse && intConsumer.markedForClose) intConsumer.close() + intConsumer.inuse = false + + // Clear the consumer from the cache if this is indeed the consumer present in the cache + val key = new CacheKey(intConsumer.topicPartition, intConsumer.kafkaParams) + val cachedIntConsumer = cache.get(key) + if (cachedIntConsumer != null) { +if (cachedIntConsumer.eq(intConsumer)) { + // The released consumer is indeed the cached one. + cache.remove(key) +} else { + // The released consumer is not the cached one. Don't do anything. + // This should not happen as long as we maintain the invariant mentioned above. + logWarning( +s"Cached consumer not the same one as the one being release" + + s"\ncached = $cachedIntConsumer [${System.identityHashCode(cachedIntConsumer)}]" + + s"\nreleased = $intConsumer [${System.identityHashCode(intConsumer)}]") +} + } else { +// The released consumer is not in the cache. Don't do anything. +// This should not happen as long as we maintain the invariant mentioned above. +logWarning(s"Attempting to release consumer that is not in the cache") } } } /** * Get a cached consumer for groupId, assigned to topic and partition. * If matching consumer doesn't already exist, will be created using kafkaParams. + * The returned consumer must be released explicitly using [[KafkaDataConsumer.release()]]. + * + * Note: This method guarantees that the consumer returned is not currently in use by any one + * else. Within this guarantee, this will make a best effort attempt to re-use consumers by + * caching them and tracking when they are in use. */ - def getOrCreate( - topic: String, - partition: Int, - kafkaParams: ju.Map[String, Object]): CachedKafkaConsumer = synchronized { -val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String] -val topicPartition = new TopicPartition(topic, partition) -val key = CacheKey(groupId, topicPartition) - -// If this is reattempt at running the task, then invalidate cache and start with -// a new consumer + def acquire( + topicPartition: TopicPartition, + kafkaParams: ju.Map[String, Object], + useCache: Boolean): KafkaDataConsumer = synchronized { +val key = new CacheKey(topicPartition, kafkaParams) +val existingInternalConsumer = cache.get(key) + +lazy val newInternalConsumer = new InternalKafkaConsumer(topicPartition, kafkaParams) + if (TaskContext.get != null && TaskContext.get.attemptNumber >= 1) { - removeKafkaConsumer(topic, partition, kafkaParams) - val consumer = new CachedKafkaConsumer(topicPartition, kafkaParams) - consumer.inuse = true - cache.put(key, consumer) - consumer -} else { - if
[GitHub] spark issue #20840: [SPARK-23702][SS] Forbid watermarks on both sides of a s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20840 **[Test build #88282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88282/testReport)** for PR 20840 at commit [`9d742ab`](https://github.com/apache/spark/commit/9d742ab4d22021654aa2778d601d806b7a7d106a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20327: [SPARK-12963][CORE] NM host for driver end points
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20327 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20840: [SPARK-23702][SS] Forbid watermarks on both sides of a s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20840 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20840: [SPARK-23702][SS] Forbid watermarks on both sides...
GitHub user jose-torres opened a pull request: https://github.com/apache/spark/pull/20840 [SPARK-23702][SS] Forbid watermarks on both sides of a streaming aggregate. ## What changes were proposed in this pull request? Forbid watermarks on both sides of a streaming aggregate. As detailed in the jira, we don't currently support this and doing so would require significant revision to the execution model. ## How was this patch tested? new unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/jose-torres/spark disallowAgg Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20840.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20840 commit 9d742ab4d22021654aa2778d601d806b7a7d106a Author: Jose TorresDate: 2018-03-16T00:03:54Z disallow watermark on both sides --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20815: [SPARK-23658][LAUNCHER] InProcessAppHandle uses the wron...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20815 Merging to master / 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20813: [SPARK-23670][SQL] Fix memory leak on SparkPlanGr...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20813#discussion_r174202008 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusStore.scala --- @@ -53,6 +53,9 @@ class SQLAppStatusStore( def executionsCount(): Long = { store.count(classOf[SQLExecutionUIData]) } + def planGraphCount(): Long = { --- End diff -- nit: add empty line before --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20767: [SPARK-23623] [SS] Avoid concurrent use of cached consum...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20767 @koeninger good question Cody! I think we should fix this limitation eventually. The only reason I am not doing that in this PR is to keep the changes minimum for backporting to 2.3.x. Eventually, we should not do such cache management, and rather use something like [Apache Common Pool](https://commons.apache.org/proper/commons-pool/index.html). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20839: [SPARK-23699][PYTHON][SQL] Raise same type of error caug...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20839 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20839: [SPARK-23699][PYTHON][SQL] Raise same type of error caug...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20839 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88281/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20839: [SPARK-23699][PYTHON][SQL] Raise same type of error caug...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20839 **[Test build #88281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88281/testReport)** for PR 20839 at commit [`5caf63c`](https://github.com/apache/spark/commit/5caf63cc32a7546823e64d774faee9fb63a6b286). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20807: SPARK-23660: Fix exception in yarn cluster mode w...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20807#discussion_r174964951 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -417,8 +417,11 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends } } - private def sparkContextInitialized(sc: SparkContext) = { + private def sparkContextInitialized(sc: SparkContext) = synchronized { --- End diff -- Some other code in this class uses synchronization on `this`, so I think it would be better to synchronize on `sparkContextPromise` in this case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20807: SPARK-23660: Fix exception in yarn cluster mode w...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20807#discussion_r174965130 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -497,6 +500,8 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends // if the user app did not create a SparkContext. throw new IllegalStateException("User did not initialize spark context!") } + // After initialisation notify user class thread to continue + synchronized { notify() } --- End diff -- Since you have to do this in two places, I'd create a method (e.g. `resumeDriver`) close to where `sparkContextInitialized` is declared, so that it's easier to find the context of why this is needed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20807: SPARK-23660: Fix exception in yarn cluster mode w...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20807#discussion_r174965182 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -497,6 +500,8 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends // if the user app did not create a SparkContext. throw new IllegalStateException("User did not initialize spark context!") } + // After initialisation notify user class thread to continue --- End diff -- nit: rest of the code uses American spelling ("initialization"), so this should be consistent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20829 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20829 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88278/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20829 **[Test build #88278 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88278/testReport)** for PR 20829 at commit [`5ce7671`](https://github.com/apache/spark/commit/5ce7671267a4dd240b80b6230e9e3729c96de7e6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20839: [SPARK-23699][PYTHON][SQL] Raise same type of error caug...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20839 **[Test build #88281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88281/testReport)** for PR 20839 at commit [`5caf63c`](https://github.com/apache/spark/commit/5caf63cc32a7546823e64d774faee9fb63a6b286). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20839: [SPARK-23699][PYTHON][SQL] Raise same type of error caug...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20839 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20839: [SPARK-23699][PYTHON][SQL] Raise same type of error caug...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20839 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1543/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20838 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20838 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20839: [SPARK-23699][PYTHON][SQL] Raise same type of err...
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/20839 [SPARK-23699][PYTHON][SQL] Raise same type of error caught with Arrow enabled ## What changes were proposed in this pull request? When using Arrow for createDataFrame or toPandas and an error is encountered with fallback disabled, this will raise the same type of error instead of a RuntimeError. ## How was this patch tested? Updated existing tests to verify error type. You can merge this pull request into a Git repository by running: $ git pull https://github.com/BryanCutler/spark arrow-raise-same-error-SPARK-23699 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20839.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20839 commit 5caf63cc32a7546823e64d774faee9fb63a6b286 Author: Bryan CutlerDate: 2018-03-15T23:07:27Z raise same type of error when not falling back --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20838: [SPARK-23698] Resolve undefined names in Python 3
GitHub user cclauss opened a pull request: https://github.com/apache/spark/pull/20838 [SPARK-23698] Resolve undefined names in Python 3 ## What changes were proposed in this pull request? Fix issues arising from the fact that __file__, __long__, __raw_input()__, __unicode__, __xrange()__, etc. were remove from Python 3. __Undefined names__ have the potential to raise NameError at runtime. Apply the Python porting best practice: [Use feature detection instead of version detection](https://docs.python.org/3/howto/pyporting.html#use-feature-detection-instead-of-version-detection). ## How was this patch tested? * $ __python2 -m flake8 . --count --select=E9,F82 --show-source --statistics__ * $ __python3 -m flake8 . --count --select=E9,F82 --show-source --statistics__ Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cclauss/spark fix-undefined-names Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20838.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20838 commit bdcd740a5144efef0db1f2b6c29b64c85f3d8ef5 Author: cclaussDate: 2018-03-15T23:08:04Z [SPARK-23698] Reduce undefined names in Python 3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20687: [SPARK-23500][SQL] Fix complex type simplification rules...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20687 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88280/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20687: [SPARK-23500][SQL] Fix complex type simplification rules...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20687 **[Test build #88280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88280/testReport)** for PR 20687 at commit [`8adaa47`](https://github.com/apache/spark/commit/8adaa4748b7b0d7990b45e58d82c7d9f2248923f). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20687: [SPARK-23500][SQL] Fix complex type simplification rules...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20687 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20687: [SPARK-23500][SQL] Fix complex type simplification rules...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20687 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1542/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20687: [SPARK-23500][SQL] Fix complex type simplification rules...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20687 **[Test build #88280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88280/testReport)** for PR 20687 at commit [`8adaa47`](https://github.com/apache/spark/commit/8adaa4748b7b0d7990b45e58d82c7d9f2248923f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20687: [SPARK-23500][SQL] Fix complex type simplification rules...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20687 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20744: [SPARK-23608][CORE][WebUI] Add synchronization in SHS be...
Github user zhouyejoe commented on the issue: https://github.com/apache/spark/pull/20744 It is weird that the PySpark unit tests failed, I don't think it is related. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20829 **[Test build #88278 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88278/testReport)** for PR 20829 at commit [`5ce7671`](https://github.com/apache/spark/commit/5ce7671267a4dd240b80b6230e9e3729c96de7e6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20579 **[Test build #88279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88279/testReport)** for PR 20579 at commit [`3588af3`](https://github.com/apache/spark/commit/3588af39eb889ab72f6800b546ba9f2107f15dc0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20579 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1541/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20579 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler
Github user yogeshg commented on the issue: https://github.com/apache/spark/pull/20829 I fixed code paths that failed tests, waiting for @SparkQA . Offline talk with @MrBago suggests that we can perhaps decrease the number of maps in `transform` method. Looking into that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20831: [SPARK-23614][SQL] Fix incorrect reuse exchange when cac...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20831 cc @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20777 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20777 **[Test build #88276 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88276/testReport)** for PR 20777 at commit [`d6cd73a`](https://github.com/apache/spark/commit/d6cd73aacd0683a96a6250d6ab0f9cb9b5eca6f7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20777 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88276/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20829 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88277/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20829 **[Test build #88277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88277/testReport)** for PR 20829 at commit [`bf2f5b3`](https://github.com/apache/spark/commit/bf2f5b39b3056b2301afd1e19582fea0e3700f3a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20829 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20829 **[Test build #88277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88277/testReport)** for PR 20829 at commit [`bf2f5b3`](https://github.com/apache/spark/commit/bf2f5b39b3056b2301afd1e19582fea0e3700f3a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20829: [SPARK-23690][ML] Add handleinvalid to VectorAsse...
Github user yogeshg commented on a diff in the pull request: https://github.com/apache/spark/pull/20829#discussion_r174941546 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -234,7 +234,7 @@ class StringIndexerModel ( val metadata = NominalAttribute.defaultAttr .withName($(outputCol)).withValues(filteredLabels).toMetadata() // If we are skipping invalid records, filter them out. -val (filteredDataset, keepInvalid) = getHandleInvalid match { --- End diff -- Thanks for picking this out! I changed this because I was matching on `$(handleInvalid)` in VectorAssembler and that seems to be the recommended way of doing this. Should I include this in the current PR and add a note or open a separate PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20777 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20777 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1540/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to Python ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20777 **[Test build #88276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88276/testReport)** for PR 20777 at commit [`d6cd73a`](https://github.com/apache/spark/commit/d6cd73aacd0683a96a6250d6ab0f9cb9b5eca6f7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20799: [SPARK-23635][YARN] AM env variable should not overwrite...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20799 Thanks for the changes @jerryshao, they look good to me. Do you have any other thoughts/comments @tgravescs ? Since this is a behavior change, I want to make sure we are not missing anything ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20777#discussion_r174935305 --- Diff: python/pyspark/ml/feature.py --- @@ -465,26 +473,26 @@ class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, " Default False", typeConverter=TypeConverters.toBoolean) @keyword_only -def __init__(self, minTF=1.0, minDF=1.0, vocabSize=1 << 18, binary=False, inputCol=None, - outputCol=None): +def __init__(self, minTF=1.0, minDF=1.0, maxDF=sys.maxsize, vocabSize=1 << 18, binary=False, --- End diff -- Will make the change now. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20828: [SPARK-23687][SS] Add a memory source for continuous pro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20828 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88271/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20828: [SPARK-23687][SS] Add a memory source for continuous pro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20828 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20611: [SPARK-23425][SQL]Support wildcard in HDFS path for load...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20611 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20828: [SPARK-23687][SS] Add a memory source for continuous pro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20828 **[Test build #88271 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88271/testReport)** for PR 20828 at commit [`7c8d30e`](https://github.com/apache/spark/commit/7c8d30e00ca7497cd4ed621fc0f5dacc807efc04). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20611: [SPARK-23425][SQL]Support wildcard in HDFS path for load...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20611 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88272/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20611: [SPARK-23425][SQL]Support wildcard in HDFS path for load...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20611 **[Test build #88272 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88272/testReport)** for PR 20611 at commit [`8341465`](https://github.com/apache/spark/commit/8341465594f9316f9f52678fd70faa7f910edfcc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20579 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88273/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20579 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20579 **[Test build #88273 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88273/testReport)** for PR 20579 at commit [`ad15411`](https://github.com/apache/spark/commit/ad154115e520b82eec0b252fa19e66abdc1da832). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20786: [SPARK-14681][ML] Provide label/impurity stats fo...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20786#discussion_r174921341 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala --- @@ -84,35 +86,73 @@ private[ml] object Node { /** * Create a new Node from the old Node format, recursively creating child nodes as needed. */ - def fromOld(oldNode: OldNode, categoricalFeatures: Map[Int, Int]): Node = { + def fromOld(oldNode: OldNode, categoricalFeatures: Map[Int, Int], +isClassification: Boolean): Node = { if (oldNode.isLeaf) { // TODO: Once the implementation has been moved to this API, then include sufficient // statistics here. - new LeafNode(prediction = oldNode.predict.predict, -impurity = oldNode.impurity, impurityStats = null) + if (isClassification) { +new ClassificationLeafNode(prediction = oldNode.predict.predict, + impurity = oldNode.impurity, impurityStats = null) + } else { +new RegressionLeafNode(prediction = oldNode.predict.predict, + impurity = oldNode.impurity, impurityStats = null) + } } else { val gain = if (oldNode.stats.nonEmpty) { oldNode.stats.get.gain } else { 0.0 } - new InternalNode(prediction = oldNode.predict.predict, impurity = oldNode.impurity, -gain = gain, leftChild = fromOld(oldNode.leftNode.get, categoricalFeatures), -rightChild = fromOld(oldNode.rightNode.get, categoricalFeatures), -split = Split.fromOld(oldNode.split.get, categoricalFeatures), impurityStats = null) + if (isClassification) { +new ClassificationInternalNode(prediction = oldNode.predict.predict, + impurity = oldNode.impurity, gain = gain, + leftChild = fromOld(oldNode.leftNode.get, categoricalFeatures, true) +.asInstanceOf[ClassificationNode], + rightChild = fromOld(oldNode.rightNode.get, categoricalFeatures, true) +.asInstanceOf[ClassificationNode], + split = Split.fromOld(oldNode.split.get, categoricalFeatures), impurityStats = null) + } else { +new RegressionInternalNode(prediction = oldNode.predict.predict, + impurity = oldNode.impurity, gain = gain, + leftChild = fromOld(oldNode.leftNode.get, categoricalFeatures, false) +.asInstanceOf[RegressionNode], + rightChild = fromOld(oldNode.rightNode.get, categoricalFeatures, false) +.asInstanceOf[RegressionNode], + split = Split.fromOld(oldNode.split.get, categoricalFeatures), impurityStats = null) + } } } } -/** - * Decision tree leaf node. - * @param prediction Prediction this node makes - * @param impurity Impurity measure at this node (for training data) - */ -class LeafNode private[ml] ( --- End diff -- While it would be nice to have this be a trait instead of a class, I am worried about breaking public APIs. However, one could argue that this isn't a public API since the constructor is private (though the class is public). I'll CC people on https://issues.apache.org/jira/browse/SPARK-7131 where these classes were made public to get feedback. Let's give a couple of days for feedback before proceeding. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20750: [SPARK-23581][SQL] Add interpreted unsafe projection
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20750 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88270/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20750: [SPARK-23581][SQL] Add interpreted unsafe projection
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20750 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20750: [SPARK-23581][SQL] Add interpreted unsafe projection
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20750 **[Test build #88270 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88270/testReport)** for PR 20750 at commit [`7e15f30`](https://github.com/apache/spark/commit/7e15f3061b5fbd48344ce36e773766a83b2bf410). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20659 **[Test build #88275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88275/testReport)** for PR 20659 at commit [`e041704`](https://github.com/apache/spark/commit/e041704c8898234299b8af1d85b09c1acc2532ab). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20659 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1539/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20659 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20659 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20777#discussion_r174911085 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -70,19 +70,21 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit def getMinDF: Double = $(minDF) /** - * Specifies the maximum number of different documents a term must appear in to be included - * in the vocabulary. - * If this is an integer greater than or equal to 1, this specifies the number of documents - * the term must appear in; if this is a double in [0,1), then this specifies the fraction of - * documents. + * Specifies the maximum number of different documents a term could appear in to be included + * in the vocabulary. A term that appears more than the threshold will be ignored. If this is an + * integer greater than or equal to 1, this specifies the maximum number of documents the term + * could appear in; if this is a double in [0,1), then this specifies the maximum fraction of + * documents the term could appear in. --- End diff -- Agree, your wording is clearer. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20837: [SPARK-23686][ML][WIP] Better instrumentation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20837 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88274/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20837: [SPARK-23686][ML][WIP] Better instrumentation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20837 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org