[GitHub] spark pull request: [SPARK-12993][PYSPARK] Remove usage of ADD_FIL...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10913#issuecomment-174819214 **[Test build #50067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50067/consoleFull)** for PR 10913 at commit [`f8c09de`](https://github.com/apache/spark/commit/f8c09de63aff3bcb220f5fa80926e83f4479c8b1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10914#issuecomment-174860449 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-10086] [MLlib] [Streaming] [PySpark] ig...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/10909#issuecomment-174860765
Recent failures in the last 4 days:
* https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50016/testReport/
* https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49996/testReport/
* https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49989/testReport/
* https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49870/testReport/
Merged into master.
[GitHub] spark pull request: [SQL][Minor] A few minor tweaks to CSV reader.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10919#issuecomment-174865380 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50076/ Test FAILed.
[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/10920
[SPARK-12937][SQL] bloom filter serialization
This PR adds serialization support for BloomFilter. A version number is added to identify the serialized binary format.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark bloom-filter
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10920.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10920
commit 4b05a35d58cdabccd915582894d303ba437bee0f
Author: Wenchen Fan
Date: 2016-01-26T07:23:51Z
bloom filter serialization
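To illustrate what a version-tagged binary format can look like, here is a minimal Scala sketch. It is not the code from this PR: `TinyFilter`, `TinyFilterCodec`, the field layout, and the value of `FormatVersion` are all hypothetical names and choices made for illustration. The only idea taken from the description above is that a version number is written as part of the serialized form.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical stand-in for a bloom filter's state: hash count plus the bit words.
final case class TinyFilter(numHashFunctions: Int, words: Array[Long])

object TinyFilterCodec {
  // Assumed version tag for this illustrative layout; not the value used by the PR.
  val FormatVersion: Int = 1

  def write(f: TinyFilter): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(FormatVersion) // written first so a reader can check it before anything else
    out.writeInt(f.numHashFunctions)
    out.writeInt(f.words.length)
    f.words.foreach(w => out.writeLong(w))
    out.flush()
    bytes.toByteArray
  }

  def read(data: Array[Byte]): TinyFilter = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val version = in.readInt()
    require(version == FormatVersion, s"Unexpected format version: $version")
    val numHashFunctions = in.readInt()
    // Array.fill evaluates its second argument once per element, in order.
    val words = Array.fill(in.readInt())(in.readLong())
    TinyFilter(numHashFunctions, words)
  }
}
```

Putting the version first means a future reader could branch on it to support older layouts; this sketch simply rejects anything it does not recognize.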
[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10920#discussion_r50801787
--- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/Version.java ---
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.sketch;
+
+/**
+ * Version number of the serialized binary format for bloom filter or count-min sketch.
+ */
+public enum Version {
--- End diff --
Bloom filter and count-min sketch can have different version values, but they can share the same version class.
[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10920#issuecomment-174869956 cc @rxin @liancheng
[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/10835#discussion_r50802605 --- Diff: core/src/test/scala/org/apache/spark/executor/TaskMetricsSuite.scala --- @@ -17,12 +17,543 @@ package org.apache.spark.executor -import org.apache.spark.SparkFunSuite +import org.scalatest.Assertions + +import org.apache.spark._ +import org.apache.spark.scheduler.AccumulableInfo +import org.apache.spark.storage.{BlockId, BlockStatus, StorageLevel, TestBlockId} + class TaskMetricsSuite extends SparkFunSuite { - test("[SPARK-5701] updateShuffleReadMetrics: ShuffleReadMetrics not added when no shuffle deps") { -val taskMetrics = new TaskMetrics() -taskMetrics.mergeShuffleReadMetrics() -assert(taskMetrics.shuffleReadMetrics.isEmpty) + import AccumulatorParam._ + import InternalAccumulator._ + import StorageLevel._ + import TaskMetricsSuite._ + + test("create") { +val internalAccums = InternalAccumulator.create() +val tm1 = new TaskMetrics +val tm2 = new TaskMetrics(internalAccums) +assert(tm1.accumulatorUpdates().size === internalAccums.size) +assert(tm1.shuffleReadMetrics.isEmpty) +assert(tm1.shuffleWriteMetrics.isEmpty) +assert(tm1.inputMetrics.isEmpty) +assert(tm1.outputMetrics.isEmpty) +assert(tm2.accumulatorUpdates().size === internalAccums.size) +assert(tm2.shuffleReadMetrics.isEmpty) +assert(tm2.shuffleWriteMetrics.isEmpty) +assert(tm2.inputMetrics.isEmpty) +assert(tm2.outputMetrics.isEmpty) +// TaskMetrics constructor expects minimal set of initial accumulators +intercept[IllegalArgumentException] { new TaskMetrics(Seq.empty[Accumulator[_]]) } + } + + test("create with unnamed accum") { +intercept[IllegalArgumentException] { + new TaskMetrics( +InternalAccumulator.create() ++ Seq( + new Accumulator(0, IntAccumulatorParam, None, internal = true))) +} + } + + test("create with duplicate name accum") { +intercept[IllegalArgumentException] { + new TaskMetrics( +InternalAccumulator.create() ++ Seq( + new Accumulator(0, 
IntAccumulatorParam, Some(RESULT_SIZE), internal = true))) +} + } + + test("create with external accum") { +intercept[IllegalArgumentException] { + new TaskMetrics( +InternalAccumulator.create() ++ Seq( + new Accumulator(0, IntAccumulatorParam, Some("x" +} + } + + test("create shuffle read metrics") { +import shuffleRead._ +val accums = InternalAccumulator.createShuffleReadAccums() + .map { a => (a.name.get, a) }.toMap[String, Accumulator[_]] +accums(REMOTE_BLOCKS_FETCHED).setValueAny(1) +accums(LOCAL_BLOCKS_FETCHED).setValueAny(2) +accums(REMOTE_BYTES_READ).setValueAny(3L) +accums(LOCAL_BYTES_READ).setValueAny(4L) +accums(FETCH_WAIT_TIME).setValueAny(5L) +accums(RECORDS_READ).setValueAny(6L) +val sr = new ShuffleReadMetrics(accums) +assert(sr.remoteBlocksFetched === 1) +assert(sr.localBlocksFetched === 2) +assert(sr.remoteBytesRead === 3L) +assert(sr.localBytesRead === 4L) +assert(sr.fetchWaitTime === 5L) +assert(sr.recordsRead === 6L) + } + + test("create shuffle write metrics") { +import shuffleWrite._ +val accums = InternalAccumulator.createShuffleWriteAccums() + .map { a => (a.name.get, a) }.toMap[String, Accumulator[_]] +accums(BYTES_WRITTEN).setValueAny(1L) +accums(RECORDS_WRITTEN).setValueAny(2L) +accums(WRITE_TIME).setValueAny(3L) +val sw = new ShuffleWriteMetrics(accums) +assert(sw.bytesWritten === 1L) +assert(sw.recordsWritten === 2L) +assert(sw.writeTime === 3L) + } + + test("create input metrics") { +import input._ +val accums = InternalAccumulator.createInputAccums() + .map { a => (a.name.get, a) }.toMap[String, Accumulator[_]] +accums(BYTES_READ).setValueAny(1L) +accums(RECORDS_READ).setValueAny(2L) +accums(READ_METHOD).setValueAny(DataReadMethod.Hadoop.toString) +val im = new InputMetrics(accums) +assert(im.bytesRead === 1L) +assert(im.recordsRead === 2L) +assert(im.readMethod === DataReadMethod.Hadoop) + } + + test("create output metrics") { +import output._ +val accums = InternalAccumulator.createOutputAccums() + .map { a => (a.name.get, a) 
}.toMap[String, Accumulator[_]] +accums(BYTES_WRITTEN).setValueAny(1L) +accums(RECORDS_WRITTEN).setValueAny(2L) +
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10914#issuecomment-174878692 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10762#discussion_r50803015
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -919,6 +919,7 @@ object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper {
 (rightFilterConditions ++ commonFilterCondition).
 reduceLeftOption(And).map(Filter(_, newJoin)).getOrElse(newJoin)
 case FullOuter => f // DO Nothing for Full Outer Join
+case NaturalJoin(_) => sys.error("Untransformed NaturalJoin node")
--- End diff --
Do we need to catch it? I think we can guarantee there is no `NaturalJoin` after `CheckAnalysis`.
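For readers weighing the question above, the following Scala sketch shows the trade-off in isolation. The types are simplified, hypothetical stand-ins (they are not Catalyst's `JoinType` hierarchy), and `PushDownSketch.describe` is invented for illustration. With a sealed hierarchy, dropping the last case would typically produce a non-exhaustive-match warning and a `MatchError` at runtime, while the explicit case fails with a more descriptive message.

```scala
// Simplified, hypothetical stand-ins for join types; not the Catalyst definitions.
sealed trait JoinType
case object Inner extends JoinType
case object FullOuter extends JoinType
final case class NaturalJoin(tpe: JoinType) extends JoinType

object PushDownSketch {
  // Returns a description of what an optimizer rule might do for each join type.
  def describe(joinType: JoinType): String = joinType match {
    case Inner => "push filters to both sides"
    case FullOuter => "do nothing"
    // Assumed to be unreachable once analysis has rewritten natural joins;
    // sys.error throws a RuntimeException carrying this message.
    case NaturalJoin(_) => sys.error("Untransformed NaturalJoin node")
  }
}
```

Whether the guard is worth keeping depends on how much one trusts the earlier phase; the sketch only shows what each choice looks like.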
[GitHub] spark pull request: [SPARK-12993][PYSPARK] Remove usage of ADD_FIL...
GitHub user zjffdu opened a pull request: https://github.com/apache/spark/pull/10913
[SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zjffdu/spark SPARK-12993
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10913.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10913
commit f8c09de63aff3bcb220f5fa80926e83f4479c8b1
Author: Jeff Zhang
Date: 2016-01-26T04:17:48Z
[SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark
[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/10915
[SPARK-11780][SQL] Add catalyst type aliases backwards compatibility
Changed the target to branch-1.6 from #10635.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/maropu/spark pr9935-v3
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10915.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10915
commit 9ef7185f5a9ce1f672559e00a34854c5afa4
Author: Takeshi YAMAMURO
Date: 2016-01-26T05:15:47Z
Add catalyst type aliases backwards compatibility
[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10915#issuecomment-174845577 **[Test build #50070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50070/consoleFull)** for PR 10915 at commit [`9ef7185`](https://github.com/apache/spark/commit/9ef7185f5a9ce1f672559e00a34854c5afa4).
[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10911#issuecomment-174847369 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50061/ Test FAILed.
[GitHub] spark pull request: [SPARK-12834] Change ser/de of JavaArray and J...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10772#issuecomment-174858967 LGTM. Merging with master. Thanks!
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10085
[GitHub] spark pull request: [SPARK-12865][SPARK-12866][SQL] Migrate SparkS...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10905#discussion_r50800538
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ASTNode.scala ---
@@ -60,6 +60,12 @@ case class ASTNode(
 /** Source text. */
 lazy val source = stream.toString(startIndex, stopIndex)
+ /** Get the source text that remains after this token. */
+ lazy val remainder = {
--- End diff --
If you are updating the PR, can you add explicit types for all the public vals?
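As a small illustration of the request above, this Scala sketch annotates public `lazy val`s with explicit types. `TokenSketch` is a hypothetical class, and the assumption that `stopIndex` is inclusive is mine, not taken from `ASTNode`. Without the annotations the compiler infers the types from the right-hand side, so a later change to the body could silently change the public signature.

```scala
// Hypothetical token over a source string; stopIndex is assumed to be inclusive.
final case class TokenSketch(text: String, startIndex: Int, stopIndex: Int) {
  /** Source text covered by this token; the explicit `: String` pins the public type. */
  lazy val source: String = text.substring(startIndex, stopIndex + 1)

  /** Source text that remains after this token. */
  lazy val remainder: String = text.substring(stopIndex + 1)
}
```

For example, `TokenSketch("SELECT a FROM t", 0, 5).source` yields `"SELECT"`.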
[GitHub] spark pull request: [SPARK-12854][SQL] Implement complex types sup...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10820#issuecomment-174867931 **[Test build #50080 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50080/consoleFull)** for PR 10820 at commit [`f378335`](https://github.com/apache/spark/commit/f378335858c1c10400936f046430f8e7f4c70c3c).
[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10911#discussion_r50802122 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -309,4 +311,84 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = { sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed) } + + /** + * Builds a Count-min Sketch over a specified column. + * + * @param colName name of the column over which the sketch is built + * @param depth depth of the sketch + * @param width width of the sketch + * @param seed random seed + * @return a [[CountMinSketch]] over column `colName` + * @since 2.0.0 + */ + def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch = { +countMinSketch(Column(colName), depth, width, seed) + } + + /** + * Builds a Count-min Sketch over a specified column. + * + * @param colName name of the column over which the sketch is built + * @param eps relative error of the sketch + * @param confidence confidence of the sketch + * @param seed random seed + * @return a [[CountMinSketch]] over column `colName` + * @since 2.0.0 + */ + def countMinSketch( + colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch = { +countMinSketch(Column(colName), eps, confidence, seed) + } + + /** + * Builds a Count-min Sketch over a specified column. + * + * @param col the column over which the sketch is built + * @param depth depth of the sketch + * @param width width of the sketch + * @param seed random seed + * @return a [[CountMinSketch]] over column `colName` + * @since 2.0.0 + */ + def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = { +countMinSketch(col, CountMinSketch.create(depth, width, seed)) + } + + /** + * Builds a Count-min Sketch over a specified column. 
+ * + * @param col the column over which the sketch is built + * @param eps relative error of the sketch + * @param confidence confidence of the sketch + * @param seed random seed + * @return a [[CountMinSketch]] over column `colName` + * @since 2.0.0 + */ + def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = { +countMinSketch(col, CountMinSketch.create(eps, confidence, seed)) + } + + private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = { +val singleCol = df.select(col) +val colType = singleCol.schema.head.dataType +val supportedTypes: Set[DataType] = Set(ByteType, ShortType, IntegerType, LongType, StringType) + +require( + supportedTypes.contains(colType), + s"Count-min Sketch only supports string type and integral types, " + +s"and does not support type $colType." +) + +singleCol.rdd.aggregate(zero)( --- End diff -- Maybe we can improve it by UDAF in the future.
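The `aggregate(zero)(...)` call in the diff above folds values into a sketch within each partition and then merges the per-partition sketches. The Scala sketch below mimics that shape on plain collections. `MiniCountMin` is a hypothetical, deliberately tiny count-min sketch (it is not `org.apache.spark.util.sketch.CountMinSketch`), its hashing scheme is an arbitrary choice, and `MiniCountMinDemo.build` stands in for the RDD aggregation.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical minimal count-min sketch over Long items; for illustration only.
final class MiniCountMin(val depth: Int, val width: Int, val seed: Int) {
  private val table: Array[Array[Long]] = Array.ofDim[Long](depth, width)

  // Column index for an item in a given row; the hash choice here is arbitrary.
  private def bucket(item: Long, row: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.productHash((item, row), seed), width)

  // Per-item update: increments one counter in every row.
  def add(item: Long): this.type = {
    for (row <- 0 until depth) table(row)(bucket(item, row)) += 1L
    this
  }

  // Combines two sketches with identical shape and seed by adding counters cell by cell.
  def merge(other: MiniCountMin): this.type = {
    require(depth == other.depth && width == other.width && seed == other.seed,
      "incompatible sketches")
    for (r <- 0 until depth; c <- 0 until width) table(r)(c) += other.table(r)(c)
    this
  }

  // The minimum over rows never underestimates the true count.
  def estimate(item: Long): Long =
    (0 until depth).map(r => table(r)(bucket(item, r))).min
}

object MiniCountMinDemo {
  // Each inner Seq plays the role of one partition.
  def build(partitions: Seq[Seq[Long]], depth: Int, width: Int, seed: Int): MiniCountMin =
    partitions
      .map(p => p.foldLeft(new MiniCountMin(depth, width, seed))((acc, item) => acc.add(item)))
      .foldLeft(new MiniCountMin(depth, width, seed))((acc, part) => acc.merge(part))
}
```

A UDAF-based version, as suggested in the comment, would keep the same two operations (per-row update and merge) but let the SQL engine drive them.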
[GitHub] spark pull request: [SPARK-12926][SQL] SQLContext to disallow user...
Github user tejasapatil commented on the pull request: https://github.com/apache/spark/pull/10849#issuecomment-174873141 Fixed scala style test
[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/10835#discussion_r50802114
--- Diff: core/src/main/scala/org/apache/spark/status/api/v1/AllStagesResource.scala ---
@@ -237,7 +237,8 @@ private[v1] object AllStagesResource {
 }
 def convertAccumulableInfo(acc: InternalAccumulableInfo): AccumulableInfo = {
-new AccumulableInfo(acc.id, acc.name, acc.update, acc.value)
+new AccumulableInfo(
+ acc.id, acc.name, acc.update.map(_.toString), acc.value.map(_.toString).orNull)
--- End diff --
This was kind of confusing at first glance until I remembered that we have the weird UI AccumulableInfo and the other version, which is used elsewhere and has been renamed to `InternalAccumulableInfo` here.
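The conversion in the diff above turns optional values into strings, keeping one field as an `Option` and flattening the other to a nullable `String`. A minimal Scala sketch of that pattern follows; `InternalInfoSketch`, `UiInfoSketch`, and their field types are hypothetical stand-ins chosen for illustration and are not the actual Spark classes.

```scala
// Hypothetical stand-ins for the internal and UI-facing info classes.
final case class InternalInfoSketch(id: Long, name: String, update: Option[Any], value: Option[Any])
final case class UiInfoSketch(id: Long, name: String, update: Option[String], value: String)

object ConvertSketch {
  def convert(acc: InternalInfoSketch): UiInfoSketch =
    UiInfoSketch(
      acc.id,
      acc.name,
      acc.update.map(_.toString),       // stays optional: None when there is no update
      acc.value.map(_.toString).orNull) // flattened: null when there is no value
}
```

`orNull` is convenient at an API boundary that expects a plain `String`, at the cost of pushing a null check onto the consumer.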
[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/10835#discussion_r50802781 --- Diff: core/src/main/scala/org/apache/spark/InternalAccumulator.scala --- @@ -17,42 +17,193 @@ package org.apache.spark +import org.apache.spark.storage.{BlockId, BlockStatus} -// This is moved to its own file because many more things will be added to it in SPARK-10620. + +/** + * A collection of fields and methods concerned with internal accumulators that represent + * task level metrics. + */ private[spark] object InternalAccumulator { - val PEAK_EXECUTION_MEMORY = "peakExecutionMemory" - val TEST_ACCUMULATOR = "testAccumulator" - - // For testing only. - // This needs to be a def since we don't want to reuse the same accumulator across stages. - private def maybeTestAccumulator: Option[Accumulator[Long]] = { -if (sys.props.contains("spark.testing")) { - Some(new Accumulator( -0L, AccumulatorParam.LongAccumulatorParam, Some(TEST_ACCUMULATOR), internal = true)) -} else { - None + + import AccumulatorParam._ + + // Prefixes used in names of internal task level metrics + val METRICS_PREFIX = "internal.metrics." + val SHUFFLE_READ_METRICS_PREFIX = METRICS_PREFIX + "shuffle.read." + val SHUFFLE_WRITE_METRICS_PREFIX = METRICS_PREFIX + "shuffle.write." + val OUTPUT_METRICS_PREFIX = METRICS_PREFIX + "output." + val INPUT_METRICS_PREFIX = METRICS_PREFIX + "input." 
+ + // Names of internal task level metrics + val EXECUTOR_DESERIALIZE_TIME = METRICS_PREFIX + "executorDeserializeTime" + val EXECUTOR_RUN_TIME = METRICS_PREFIX + "executorRunTime" + val RESULT_SIZE = METRICS_PREFIX + "resultSize" + val JVM_GC_TIME = METRICS_PREFIX + "jvmGCTime" + val RESULT_SERIALIZATION_TIME = METRICS_PREFIX + "resultSerializationTime" + val MEMORY_BYTES_SPILLED = METRICS_PREFIX + "memoryBytesSpilled" + val DISK_BYTES_SPILLED = METRICS_PREFIX + "diskBytesSpilled" + val PEAK_EXECUTION_MEMORY = METRICS_PREFIX + "peakExecutionMemory" + val UPDATED_BLOCK_STATUSES = METRICS_PREFIX + "updatedBlockStatuses" + val TEST_ACCUM = METRICS_PREFIX + "testAccumulator" + + // scalastyle:off + + // Names of shuffle read metrics + object shuffleRead { +val REMOTE_BLOCKS_FETCHED = SHUFFLE_READ_METRICS_PREFIX + "remoteBlocksFetched" +val LOCAL_BLOCKS_FETCHED = SHUFFLE_READ_METRICS_PREFIX + "localBlocksFetched" +val REMOTE_BYTES_READ = SHUFFLE_READ_METRICS_PREFIX + "remoteBytesRead" +val LOCAL_BYTES_READ = SHUFFLE_READ_METRICS_PREFIX + "localBytesRead" +val FETCH_WAIT_TIME = SHUFFLE_READ_METRICS_PREFIX + "fetchWaitTime" +val RECORDS_READ = SHUFFLE_READ_METRICS_PREFIX + "recordsRead" + } + + // Names of shuffle write metrics + object shuffleWrite { +val BYTES_WRITTEN = SHUFFLE_WRITE_METRICS_PREFIX + "bytesWritten" +val RECORDS_WRITTEN = SHUFFLE_WRITE_METRICS_PREFIX + "recordsWritten" +val WRITE_TIME = SHUFFLE_WRITE_METRICS_PREFIX + "writeTime" + } + + // Names of output metrics + object output { +val WRITE_METHOD = OUTPUT_METRICS_PREFIX + "writeMethod" +val BYTES_WRITTEN = OUTPUT_METRICS_PREFIX + "bytesWritten" +val RECORDS_WRITTEN = OUTPUT_METRICS_PREFIX + "recordsWritten" + } + + // Names of input metrics + object input { +val READ_METHOD = INPUT_METRICS_PREFIX + "readMethod" +val BYTES_READ = INPUT_METRICS_PREFIX + "bytesRead" +val RECORDS_READ = INPUT_METRICS_PREFIX + "recordsRead" + } + + // scalastyle:on + + /** + * Create an internal [[Accumulator]] by name, 
which must begin with [[METRICS_PREFIX]]. + */ + def create(name: String): Accumulator[_] = { +assert(name.startsWith(METRICS_PREFIX), + s"internal accumulator name must start with '$METRICS_PREFIX': $name") +getParam(name) match { + case p @ LongAccumulatorParam => newMetric[Long](0L, name, p) + case p @ IntAccumulatorParam => newMetric[Int](0, name, p) + case p @ StringAccumulatorParam => newMetric[String]("", name, p) + case p @ UpdatedBlockStatusesAccumulatorParam => +newMetric[Seq[(BlockId, BlockStatus)]](Seq(), name, p) + case p => throw new IllegalArgumentException( +s"unsupported accumulator param '${p.getClass.getSimpleName}' for metric '$name'.") +} + } + + /** + * Get the [[AccumulatorParam]] associated with the internal metric name, + * which must begin with [[METRICS_PREFIX]]. + */ + def getParam(name: String): AccumulatorParam[_] = { +assert(name.startsWith(METRICS_PREFIX), + s"internal accumulator name must
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174881926 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50079/ Test FAILed.
[GitHub] spark pull request: [SPARK-11775][PYSPARK][SQL] Allow PySpark to r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9766#issuecomment-174817125 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11775][PYSPARK][SQL] Allow PySpark to r...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9766#issuecomment-174817117 **[Test build #50066 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50066/consoleFull)** for PR 9766 at commit [`2e17865`](https://github.com/apache/spark/commit/2e178651b4f4e9c44f1cbdcba821492ebd48ebc1).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11775][PYSPARK][SQL] Allow PySpark to r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9766#issuecomment-174817126 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50066/ Test FAILed.
[GitHub] spark pull request: [SPARK-12401][SQL] Add integration tests for p...
Github user maropu commented on the pull request: https://github.com/apache/spark/pull/10596#issuecomment-174823978 @liancheng Okay and ready to merge.
[GitHub] spark pull request: [SPARK-11775][PYSPARK][SQL] Allow PySpark to r...
Github user zjffdu commented on the pull request: https://github.com/apache/spark/pull/9766#issuecomment-174832497 please test it again.
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user zjffdu commented on the pull request: https://github.com/apache/spark/pull/10914#issuecomment-174852141 Thanks @jerryshao
[GitHub] spark pull request: [SPARK-12834] Change ser/de of JavaArray and J...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10772
[GitHub] spark pull request: [SPARK-12995][GraphX] Remove deprecate APIs fr...
Github user maropu commented on the pull request: https://github.com/apache/spark/pull/10918#issuecomment-174859672 @srowen This follows up on the discussion in #4402. I checked that GraphX has deprecated APIs that are used only in Pregel, and this PR removes them. If there are no problems, I'll also remove the deprecated ones from the GraphX test code.
[GitHub] spark pull request: [SPARK-12983] [CORE] [DOC] Correct metrics.pro...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/10902#issuecomment-174859378 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10920#discussion_r50801991
--- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java ---
@@ -32,13 +38,14 @@ static int numWords(long numBits) {
   }
   BitArray(long numBits) {
-    if (numBits <= 0) {
-      throw new IllegalArgumentException("numBits must be positive");
-    }
-    this.data = new long[numWords(numBits)];
+    this(new long[numWords(numBits)]);
+  }
+
+  private BitArray(long[] data) {
+    this.data = data;
     long bitCount = 0;
-    for (long value : data) {
-      bitCount += Long.bitCount(value);
+    for (long datum : data) {
--- End diff --
It is a little bit weird to say "datum" here, since you are actually working with 64 "datum" at once. Maybe "word"?
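To make the naming point concrete, here is a minimal, self-contained sketch of counting set bits over a `long[]`, with the loop variable named `word` because each element carries 64 bits. The class name `BitCountSketch` and the body of `numWords` are assumptions for illustration only; the diff shows only the signature of `numWords`, not its implementation.

```java
// Hypothetical stand-in for illustration; this is not Spark's BitArray.
class BitCountSketch {
    // Assumed implementation: number of 64-bit words needed to hold numBits bits.
    static int numWords(long numBits) {
        if (numBits <= 0) {
            throw new IllegalArgumentException("numBits must be positive");
        }
        return (int) ((numBits + 63) / 64); // ceiling division by 64
    }

    // Sums the population count of every 64-bit word in the array.
    static long countSetBits(long[] data) {
        long bitCount = 0;
        for (long word : data) {
            bitCount += Long.bitCount(word);
        }
        return bitCount;
    }

    public static void main(String[] args) {
        long[] data = new long[numWords(130)]; // 130 bits need 3 words
        data[0] = 0b1011L;                     // three bits set
        data[2] = 1L << 63;                    // one bit set
        System.out.println(countSetBits(data)); // prints 4
    }
}
```

Each iteration handles a whole 64-bit word rather than a single data item, which is why `word` describes the loop variable better than `datum`.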
[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10911#discussion_r50801910
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -309,4 +311,84 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
   def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
     sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
   }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param colName name of the column over which the sketch is built
+   * @param depth depth of the sketch
+   * @param width width of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch = {
+    countMinSketch(Column(colName), depth, width, seed)
+  }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param colName name of the column over which the sketch is built
+   * @param eps relative error of the sketch
+   * @param confidence confidence of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(
+      colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
+    countMinSketch(Column(colName), eps, confidence, seed)
+  }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param col the column over which the sketch is built
+   * @param depth depth of the sketch
+   * @param width width of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = {
+    countMinSketch(col, CountMinSketch.create(depth, width, seed))
+  }
+
+  /**
+   * Builds a Count-min Sketch over a specified column.
+   *
+   * @param col the column over which the sketch is built
+   * @param eps relative error of the sketch
+   * @param confidence confidence of the sketch
+   * @param seed random seed
+   * @return a [[CountMinSketch]] over column `colName`
+   * @since 2.0.0
+   */
+  def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
+    countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
+  }
+
+  private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = {
+    val singleCol = df.select(col)
+    val colType = singleCol.schema.head.dataType
+    val supportedTypes: Set[DataType] = Set(ByteType, ShortType, IntegerType, LongType, StringType)
+
+    require(
+      supportedTypes.contains(colType),
--- End diff --
How about `colType == StringType || colType.isInstanceOf[IntegralType]`?
[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10920#discussion_r50802030
--- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
@@ -83,7 +87,7 @@
    * bloom filters are appropriately sized to avoid saturating them.
    *
    * @param other The bloom filter to combine this bloom filter with. It is not mutated.
-   * @throws IllegalArgumentException if {@code isCompatible(that) == false}
+   * @throws IncompatibleMergeException if {@code isCompatible(that) == false}
--- End diff --
You are using "other" instead of "that" here. Make them consistent.
[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10920#issuecomment-174881467 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50081/ Test PASSed.
[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10762#discussion_r50803405
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -474,6 +474,7 @@ class DataFrame private[sql](
           val rightCol = withPlan(joined.right).resolve(col).toAttribute.withNullability(true)
           Alias(Coalesce(Seq(leftCol, rightCol)), col)()
         }
+      case NaturalJoin(_) => sys.error("NaturalJoin with using clause is not supported.")
--- End diff --
Then this case is unreachable, as `JoinType.apply` won't produce a natural join.
[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10911#issuecomment-174816049 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50055/ Test FAILed.
[GitHub] spark pull request: [SPARK-11775][PYSPARK][SQL] Allow PySpark to r...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9766#issuecomment-174816148 **[Test build #50066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50066/consoleFull)** for PR 9766 at commit [`2e17865`](https://github.com/apache/spark/commit/2e178651b4f4e9c44f1cbdcba821492ebd48ebc1).
[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10911#issuecomment-174816044 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-529] [core] [yarn] Add type-safe config...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10205#issuecomment-174821201 **[Test build #50068 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50068/consoleFull)** for PR 10205 at commit [`d125e03`](https://github.com/apache/spark/commit/d125e03362d298a08a242ee52a20910e95dfaaa0).
[GitHub] spark pull request: [SPARK-529] [core] [yarn] Add type-safe config...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10205#issuecomment-174821494 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-529] [core] [yarn] Add type-safe config...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10205#issuecomment-174821491 **[Test build #50068 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50068/consoleFull)** for PR 10205 at commit [`d125e03`](https://github.com/apache/spark/commit/d125e03362d298a08a242ee52a20910e95dfaaa0).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-529] [core] [yarn] Add type-safe config...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10205#issuecomment-174821495 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50068/ Test FAILed.
[GitHub] spark pull request: [SPARK-12993][PYSPARK] Remove usage of ADD_FIL...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10913#issuecomment-174826905 **[Test build #50067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50067/consoleFull)** for PR 10913 at commit [`f8c09de`](https://github.com/apache/spark/commit/f8c09de63aff3bcb220f5fa80926e83f4479c8b1).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12993][PYSPARK] Remove usage of ADD_FIL...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10913#issuecomment-174827000 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50067/ Test FAILed.
[GitHub] spark pull request: [SPARK-12993][PYSPARK] Remove usage of ADD_FIL...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10913#issuecomment-174826995 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10835#issuecomment-174828808 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50064/ Test PASSed.
[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10835#issuecomment-174828804 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10917#issuecomment-174849282 **[Test build #50072 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50072/consoleFull)** for PR 10917 at commit [`8207dc1`](https://github.com/apache/spark/commit/8207dc109f21527438cbd80894e9b49d63159f12).
[GitHub] spark pull request: [SPARK-12995][GraphX] Remove deprecate APIs fr...
GitHub user maropu opened a pull request:
https://github.com/apache/spark/pull/10918
[SPARK-12995][GraphX] Remove deprecate APIs from Pregel
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/maropu/spark RemoveDeprecateInPregel
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10918.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #10918
commit fea631129df389b97f5695c11a6bb0c1fef0fb0c
Author: Takeshi YAMAMURO
Date: 2016-01-26T06:21:17Z
Remove deprecate APIs from Pregel
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10914#issuecomment-174860248 **[Test build #50069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50069/consoleFull)** for PR 10914 at commit [`90118ca`](https://github.com/apache/spark/commit/90118ca76c2cbe381bc06614c02cd3b089951c10).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12995][GraphX] Remove deprecate APIs fr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10918#issuecomment-174860265 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10914#issuecomment-174860451 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50069/ Test FAILed.
[GitHub] spark pull request: [SPARK-12995][GraphX] Remove deprecate APIs fr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10918#issuecomment-174860277 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50074/ Test FAILed.
[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/10639#discussion_r50801341
--- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala ---
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.optim
+
+import org.apache.spark.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.rdd.RDD
+
+/**
+ * Model fitted by [[IterativelyReweightedLeastSquares]].
+ * @param coefficients model coefficients
+ * @param intercept model intercept
+ */
+private[ml] class IterativelyReweightedLeastSquaresModel(
+    val coefficients: DenseVector,
+    val intercept: Double) extends Serializable
+
+/**
+ * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
+ * certain optimization problems by an iterative method. In each step of the iterations, it
+ * involves solving a weighted lease squares (WLS) problem by [[WeightedLeastSquares]].
+ * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
+ * find M-estimator in robust regression and other optimization problems.
--- End diff --
It would be good to provide a reference about IRLS. The IRLS page on Wikipedia is specialized for Lp regression. I would recommend Green's paper as a reference: http://www.jstor.org/stable/2345503
[GitHub] spark pull request: [SPARK-12937][SQL] bloom filter serialization
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10920#discussion_r50802349
--- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java ---
@@ -161,4 +194,24 @@ public BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeExcep
     this.bits.putAll(that.bits);
     return this;
   }
+
+  @Override
+  public void writeTo(OutputStream out) throws IOException {
+    DataOutputStream dos = new DataOutputStream(out);
+
+    dos.writeInt(Version.V1.getVersionNumber());
+    bits.writeTo(dos);
+    dos.writeInt(numHashFunctions);
+  }
+
+  public static BloomFilterImpl readFrom(InputStream in) throws IOException {
+    DataInputStream dis = new DataInputStream(in);
+
+    int version = dis.readInt();
+    if (version != Version.V1.getVersionNumber()) {
+      throw new IOException("Unexpected Bloom Filter version number (" + version + ")");
--- End diff --
BloomFilter, or Bloom filter
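The versioned write/read pattern in the diff above can be sketched in isolation as a round trip through `DataOutputStream` and `DataInputStream`. This is a minimal sketch under stated assumptions: the class `VersionedFilterSketch`, the constant `VERSION_1`, and the word-array layout (a length followed by the words) are hypothetical and are not taken from the PR, which delegates the bit layout to `bits.writeTo(dos)`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical stand-in for illustration; this is not Spark's BloomFilterImpl.
class VersionedFilterSketch {
    static final int VERSION_1 = 1; // assumed version number

    final long[] words;
    final int numHashFunctions;

    VersionedFilterSketch(long[] words, int numHashFunctions) {
        this.words = words;
        this.numHashFunctions = numHashFunctions;
    }

    // Writes the version first so a reader can reject unknown formats early.
    void writeTo(OutputStream out) throws IOException {
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeInt(VERSION_1);
        dos.writeInt(words.length); // assumed layout: word count, then the words
        for (long word : words) {
            dos.writeLong(word);
        }
        dos.writeInt(numHashFunctions);
        dos.flush();
    }

    // Reads fields in exactly the order writeTo emitted them.
    static VersionedFilterSketch readFrom(InputStream in) throws IOException {
        DataInputStream dis = new DataInputStream(in);
        int version = dis.readInt();
        if (version != VERSION_1) {
            throw new IOException("Unexpected Bloom filter version number (" + version + ")");
        }
        long[] words = new long[dis.readInt()];
        for (int i = 0; i < words.length; i++) {
            words[i] = dis.readLong();
        }
        int numHashFunctions = dis.readInt();
        return new VersionedFilterSketch(words, numHashFunctions);
    }

    public static void main(String[] args) throws IOException {
        VersionedFilterSketch original = new VersionedFilterSketch(new long[] {5L, -1L}, 3);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.writeTo(buffer);
        VersionedFilterSketch copy = readFrom(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(Arrays.equals(original.words, copy.words)
            && original.numHashFunctions == copy.numHashFunctions); // prints true
    }
}
```

Putting the version number first means a future format change can be detected before any payload bytes are interpreted.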
[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10917#issuecomment-174874774 **[Test build #50072 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50072/consoleFull)** for PR 10917 at commit [`8207dc1`](https://github.com/apache/spark/commit/8207dc109f21527438cbd80894e9b49d63159f12). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10917#issuecomment-174874622 @nongli It does nothing special to get the hash code of an int field, but it does a [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153) to get the hash code of the row.
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174881888 **[Test build #50079 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50079/consoleFull)** for PR 10916 at commit [`43beb4b`](https://github.com/apache/spark/commit/43beb4ba499814c698df7537018ab6fafefa738e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174881924 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10762#discussion_r50803483
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1159,6 +1161,25 @@ class Analyzer(
       }
     }
   }
+
+  /**
+   * Removes natural joins.
--- End diff --
I think we need more comments here: how do we resolve a natural join to a normal join?
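As background for the question above, the following is a rough sketch of how a natural join is conventionally rewritten into an ordinary join under standard SQL semantics: the columns shared by both sides become equality conditions, and each shared column appears only once in the output. This is not the Analyzer rule from the PR; `NaturalJoinSketch`, its method names, and the `l.`/`r.` aliases are all hypothetical, and column matching by plain string equality is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of natural-join resolution on column names only.
final class NaturalJoinSketch {
  // Columns present on both sides, in left-side order; these become the join keys.
  static List<String> commonColumns(List<String> left, List<String> right) {
    List<String> common = new ArrayList<>();
    for (String c : left) {
      if (right.contains(c)) {
        common.add(c);
      }
    }
    return common;
  }

  // Equi-join condition over the shared columns. With no shared columns the
  // condition is trivially true, i.e. the join degenerates to a cross join.
  static String joinCondition(List<String> common) {
    if (common.isEmpty()) {
      return "TRUE";
    }
    List<String> parts = new ArrayList<>();
    for (String c : common) {
      parts.add("l." + c + " = r." + c);
    }
    return String.join(" AND ", parts);
  }

  // Output columns: shared columns once, then left-only, then right-only.
  static List<String> outputColumns(List<String> left, List<String> right) {
    List<String> common = commonColumns(left, right);
    List<String> out = new ArrayList<>(common);
    for (String c : left) {
      if (!common.contains(c)) {
        out.add(c);
      }
    }
    for (String c : right) {
      if (!common.contains(c)) {
        out.add(c);
      }
    }
    return out;
  }
}
```

For a left side with columns `id, name, dept` and a right side with `dept, id, salary`, this yields the condition `l.id = r.id AND l.dept = r.dept` and the output columns `id, dept, name, salary`.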
[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10762#discussion_r50803540
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -474,6 +474,7 @@ class DataFrame private[sql](
           val rightCol = withPlan(joined.right).resolve(col).toAttribute.withNullability(true)
           Alias(Coalesce(Seq(leftCol, rightCol)), col)()
         }
+      case NaturalJoin(_) => sys.error("NaturalJoin with using clause is not supported.")
--- End diff --
yup - although we should still throw some exception here just in case we refactor code in the future so this is reachable.
[GitHub] spark pull request: [SPARK-8171] [Web UI] Simulated infinite scrol...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10910#issuecomment-174801261 **[Test build #50065 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50065/consoleFull)** for PR 10910 at commit [`4d7c433`](https://github.com/apache/spark/commit/4d7c43373d43126b488540d7659274277665f51c).
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/10914#discussion_r50796394
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -537,10 +537,11 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
     }

     _executorAllocationManager =
-      if (dynamicAllocationEnabled) {
+      if (dynamicAllocationEnabled && !isLocal) {
         Some(new ExecutorAllocationManager(this, listenerBus, _conf))
       } else {
         None
+
--- End diff --
Remove this empty line.
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10914#issuecomment-174837394 **[Test build #50069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50069/consoleFull)** for PR 10914 at commit [`90118ca`](https://github.com/apache/spark/commit/90118ca76c2cbe381bc06614c02cd3b089951c10).
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174848191 **[Test build #50071 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50071/consoleFull)** for PR 10916 at commit [`46737b5`](https://github.com/apache/spark/commit/46737b5c9fecbc68b1e4e830b2a1b189a2e72158).
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174855435 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50071/ Test FAILed.
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174855251 **[Test build #50071 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50071/consoleFull)** for PR 10916 at commit [`46737b5`](https://github.com/apache/spark/commit/46737b5c9fecbc68b1e4e830b2a1b189a2e72158). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SetDatabaseCommand(databaseName: String) extends RunnableCommand`
[GitHub] spark pull request: [SPARK-11622][MLLIB] Make LibSVMRelation exten...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/9595#issuecomment-174861333 test this please
[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...
Github user nongli commented on the pull request: https://github.com/apache/spark/pull/10917#issuecomment-174866182 @cloud-fan Simple is just a single int right? It's not even doing anything in the previous case?
[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10915#issuecomment-174866516 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10915#issuecomment-174866518 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50070/ Test FAILed.
[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10915#issuecomment-174866357 **[Test build #50070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50070/consoleFull)** for PR 10915 at commit [`9ef7185`](https://github.com/apache/spark/commit/9ef7185f5a9ce1f672559e00a34854c5afa4). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12926][SQL] SQLContext to disallow user...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10849#issuecomment-174874233 **[Test build #50082 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50082/consoleFull)** for PR 10849 at commit [`f982d54`](https://github.com/apache/spark/commit/f982d5449fc52ef9b844761f92306fb7d238b542).
[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/10835#discussion_r50802278 --- Diff: core/src/test/scala/org/apache/spark/executor/TaskMetricsSuite.scala --- @@ -17,12 +17,345 @@ package org.apache.spark.executor -import org.apache.spark.SparkFunSuite +import org.apache.spark._ +import org.apache.spark.storage.{BlockId, BlockStatus, StorageLevel, TestBlockId} + class TaskMetricsSuite extends SparkFunSuite { - test("[SPARK-5701] updateShuffleReadMetrics: ShuffleReadMetrics not added when no shuffle deps") { -val taskMetrics = new TaskMetrics() -taskMetrics.mergeShuffleReadMetrics() -assert(taskMetrics.shuffleReadMetrics.isEmpty) + import AccumulatorParam._ + import InternalAccumulator._ + import StorageLevel._ + import TaskMetricsSuite._ + + test("create") { --- End diff -- Cool, thanks!
[GitHub] spark pull request: [SPARK-12828][SQL]add natural join support
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10762#discussion_r50803191
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -474,6 +474,7 @@ class DataFrame private[sql](
           val rightCol = withPlan(joined.right).resolve(col).toAttribute.withNullability(true)
           Alias(Coalesce(Seq(leftCol, rightCol)), col)()
         }
+      case NaturalJoin(_) => sys.error("NaturalJoin with using clause is not supported.")
--- End diff --
Are we going to support natural join in `DataFrame`? If so, I think we should also change `JoinType.apply`.
[GitHub] spark pull request: [SPARK-12895][SPARK-12896] Migrate TaskMetrics...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10835#issuecomment-174828051 **[Test build #50064 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50064/consoleFull)** for PR 10835 at commit [`7e7c2f4`](https://github.com/apache/spark/commit/7e7c2f41f8d8cd302a89cc1ef15b552fb5e28e2d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8171] [Web UI] Simulated infinite scrol...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10910#issuecomment-174837910 **[Test build #50065 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50065/consoleFull)** for PR 10910 at commit [`4d7c433`](https://github.com/apache/spark/commit/4d7c43373d43126b488540d7659274277665f51c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8171] [Web UI] Simulated infinite scrol...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10910#issuecomment-174838126 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50065/ Test PASSed.
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/10916 [SPARK-12968][SQL] Implement command to set current database JIRA: https://issues.apache.org/jira/browse/SPARK-12968 Implement command to set current database. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 ddl-use-database Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10916.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10916 commit 46737b5c9fecbc68b1e4e830b2a1b189a2e72158 Author: Liang-Chi Hsieh Date: 2016-01-26T05:33:13Z Implement command to set current database.
[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10917#issuecomment-174846863 cc @nongli @rxin
[GitHub] spark pull request: [SPARK-12888][SQL][follow-up] benchmark the ne...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/10917 [SPARK-12888][SQL][follow-up] benchmark the new hash expression
Adds the benchmark results as comments. The codegen version is slower than the interpreted version for the `simple` case for 3 reasons:
1. The codegen version uses a more complex hash algorithm than the interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
2. The codegen version writes the hash value to a row first and then reads it out. I tried to create a `GenerateHasher` that can generate code to return the hash value directly and got about a 60% speedup for the `simple` case. Is it worth it?
3. The row in the `simple` case only has one int field, so the cost of the runtime reflection may be removed by branch prediction, which makes the interpreted version faster.
The `array` case is also slow for similar reasons, e.g. the array elements are of the same type, so the interpreted version can probably get rid of the runtime reflection cost through branch prediction.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark hash-benchmark Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10917.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10917 commit 8207dc109f21527438cbd80894e9b49d63159f12 Author: Wenchen Fan Date: 2016-01-26T02:24:38Z add benchmark results
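To make the first reason in the pull request description concrete, here is a small sketch contrasting the two styles of hashing it mentions. `RowHashSketch` is a hypothetical class: `multiplyAddHash` assumes a seed and multiplier of 37 for illustration and is not copied from `rows.scala`, and `murmur3HashInt` is a plain rendering of the public Murmur3 x86_32 algorithm for one 4-byte value rather than Spark's `Murmur3_x86_32` class.

```java
// Hypothetical comparison of a multiply-and-add row hash with Murmur3 for an int.
final class RowHashSketch {
  // Multiply-and-add over per-field hash codes: one multiply and one add per field.
  // The constant 37 is an assumption made for this sketch.
  static int multiplyAddHash(int[] fieldHashes) {
    int result = 37;
    for (int h : fieldHashes) {
      result = 37 * result + h;
    }
    return result;
  }

  // Murmur3 x86_32 applied to a single 4-byte int: mix the input into the
  // seed, then run the finalization (avalanche) steps.
  static int murmur3HashInt(int input, int seed) {
    int k1 = input * 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;

    int h1 = seed ^ k1;
    h1 = Integer.rotateLeft(h1, 13);
    h1 = h1 * 5 + 0xe6546b64;

    // Finalization; 4 is the input length in bytes.
    h1 ^= 4;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }
}
```

The Murmur3 path performs several multiplications, rotations, and xor-shifts per value, while the multiply-and-add path performs one multiply and one add per field, which is consistent with the interpreted version being cheaper for a row with a single int field.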
[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10911#issuecomment-174848875 cc @JoshRosen are the Python tests broken?
```
Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log
Error: unrecognized module 'root'. Supported modules: pyspark-mllib, pyspark-core, pyspark-ml, pyspark-sql, pyspark-streaming
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/python/run-tests --modules=pyspark-mllib,pyspark-ml,pyspark-sql,root --parallelism=4 ; received return code 255
```
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10085#issuecomment-174857126 LGTM Merging with master Thanks for the PR!
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174861508 retest this please.
[GitHub] spark pull request: [SPARK-10086] [MLlib] [Streaming] [PySpark] ig...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10909
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10914#issuecomment-174878554 **[Test build #50073 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50073/consoleFull)** for PR 10914 at commit [`0467617`](https://github.com/apache/spark/commit/0467617746590b3083deafaa763ee4cae50d4dc0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10914#issuecomment-174878693 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50073/ Test FAILed.
[GitHub] spark pull request: [SPARK-12401][SQL] Add integration tests for p...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10596#issuecomment-174811227 **[Test build #50062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50062/consoleFull)** for PR 10596 at commit [`dbc6829`](https://github.com/apache/spark/commit/dbc6829ca8584009972826e48864ba416ded6479). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
GitHub user zjffdu opened a pull request: https://github.com/apache/spark/pull/10914 [SPARK-12994][CORE] It is not necessary to create ExecutorAllocationManager in local mode You can merge this pull request into a Git repository by running: $ git pull https://github.com/zjffdu/spark SPARK-12994 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10914.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10914 commit 90118ca76c2cbe381bc06614c02cd3b089951c10 Author: Jeff Zhang Date: 2016-01-26T05:02:27Z [SPARK-12994][CORE] It is not necessary to create ExecutorAllocationManager in local mode
[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...
Github user zhonghaihua commented on the pull request: https://github.com/apache/spark/pull/10794#issuecomment-174830247 @marmbrus @liancheng @yhuai Could you verify this patch?
[GitHub] spark pull request: [SPARK-8171] [Web UI] Simulated infinite scrol...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10910#issuecomment-174838124 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12993][PYSPARK] Remove usage of ADD_FIL...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10913#issuecomment-174849947 Can you update the pull request description to describe why we are removing this?
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174849740 cc @hvanhovell for review.
[GitHub] spark pull request: [SPARK-12994][CORE] It is not necessary to cre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10914#issuecomment-174852807 **[Test build #50073 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50073/consoleFull)** for PR 10914 at commit [`0467617`](https://github.com/apache/spark/commit/0467617746590b3083deafaa763ee4cae50d4dc0).
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174866893 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10639#discussion_r50801095

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala ---
    @@ -0,0 +1,100 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.ml.feature.Instance
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[IterativelyReweightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class IterativelyReweightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
    + * certain optimization problems by an iterative method. In each step of the iterations, it
    + * involves solving a weighted lease squares (WLS) problem by [[WeightedLeastSquares]].
    + * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
    + * find M-estimator in robust regression and other optimization problems.
    + *
    + * @param initialModel the initial guess model.
    + * @param reweightFunc the reweight function which is used to update offsets and weights
    + *                     at each iteration.
    + * @param fitIntercept whether to fit intercept.
    + * @param regParam L2 regularization parameter used by WLS.
    + * @param maxIter maximum number of iterations.
    + * @param tol the convergence tolerance.
    + */
    +private[ml] class IterativelyReweightedLeastSquares(
    +    val initialModel: WeightedLeastSquaresModel,
    +    val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, Double),
    +    val fitIntercept: Boolean,
    +    val regParam: Double,
    +    val maxIter: Int,
    +    val tol: Double) extends Logging with Serializable {
    +
    +  def fit(instances: RDD[Instance]): IterativelyReweightedLeastSquaresModel = {
    +
    +    var converged = false
    +    var iter = 0
    +
    +    var offsetsAndWeights: RDD[(Double, Double)] = null
    +    var model: WeightedLeastSquaresModel = initialModel
    +    var oldModel: WeightedLeastSquaresModel = initialModel
    +
    +    while (iter < maxIter && !converged) {
    +
    +      oldModel = model
    +
    +      // Update offsets and weights using reweightFunc
    +      offsetsAndWeights = instances.map { instance => reweightFunc(instance, oldModel) }
    +
    +      // Estimate new model
    +      val newInstances = instances.zip(offsetsAndWeights).map {
    --- End diff --

    `zip` is not efficient. Generate `newInstances` directly:

    ~~~scala
    val newInstances = instances.map { instance =>
      val (newOffset, newWeight) = reweightFunc(instance, oldModel)
      Instance(newOffset, newWeight, instance.features)
    }
    ~~~
[GitHub] spark pull request: [SPARK-9835] [ML] Implement IterativelyReweigh...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10639#discussion_r50801093

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala ---
    @@ -0,0 +1,100 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.optim
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.ml.feature.Instance
    +import org.apache.spark.mllib.linalg._
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * Model fitted by [[IterativelyReweightedLeastSquares]].
    + * @param coefficients model coefficients
    + * @param intercept model intercept
    + */
    +private[ml] class IterativelyReweightedLeastSquaresModel(
    +    val coefficients: DenseVector,
    +    val intercept: Double) extends Serializable
    +
    +/**
    + * Implements the method of iteratively reweighted least squares (IRLS) which is used to solve
    + * certain optimization problems by an iterative method. In each step of the iterations, it
    + * involves solving a weighted lease squares (WLS) problem by [[WeightedLeastSquares]].
    + * It can be used to find maximum likelihood estimates of a generalized linear model (GLM),
    + * find M-estimator in robust regression and other optimization problems.
    + *
    + * @param initialModel the initial guess model.
    + * @param reweightFunc the reweight function which is used to update offsets and weights
    + *                     at each iteration.
    + * @param fitIntercept whether to fit intercept.
    + * @param regParam L2 regularization parameter used by WLS.
    + * @param maxIter maximum number of iterations.
    + * @param tol the convergence tolerance.
    + */
    +private[ml] class IterativelyReweightedLeastSquares(
    +    val initialModel: WeightedLeastSquaresModel,
    +    val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, Double),
    +    val fitIntercept: Boolean,
    +    val regParam: Double,
    +    val maxIter: Int,
    +    val tol: Double) extends Logging with Serializable {
    +
    +  def fit(instances: RDD[Instance]): IterativelyReweightedLeastSquaresModel = {
    +
    +    var converged = false
    +    var iter = 0
    +
    +    var offsetsAndWeights: RDD[(Double, Double)] = null
    +    var model: WeightedLeastSquaresModel = initialModel
    +    var oldModel: WeightedLeastSquaresModel = initialModel
    --- End diff --

    `= null`
[GitHub] spark pull request: [SPARK-12968][SQL] Implement command to set cu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10916#issuecomment-174866819 **[Test build #50079 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50079/consoleFull)** for PR 10916 at commit [`43beb4b`](https://github.com/apache/spark/commit/43beb4ba499814c698df7537018ab6fafefa738e).