[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3429#discussion_r22027516

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -33,6 +31,147 @@ import org.apache.spark.sql.catalyst.types.decimal.Decimal

 /* Implicit conversions */
 import scala.collection.JavaConversions._

+/**
+ * 1. The underlying data types in Catalyst and in Hive
+ * In Catalyst:
+ *   Primitive =
+ *     java.lang.String
+ *     int / scala.Int
+ *     boolean / scala.Boolean
+ *     float / scala.Float
+ *     double / scala.Double
+ *     long / scala.Long
+ *     short / scala.Short
+ *     byte / scala.Byte
+ *     org.apache.spark.sql.catalyst.types.decimal.Decimal
+ *     Array[Byte]
+ *     java.sql.Date
+ *     java.sql.Timestamp
+ *   Complicated Types =
--- End diff --

"Complicated" should be "Complex".

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3429#discussion_r22027824

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -119,12 +271,44 @@ private[hive] trait HiveInspectors {
         System.arraycopy(writable.getBytes, 0, temp, 0, temp.length)
         temp
       case poi: WritableConstantDateObjectInspector => poi.getWritableConstantValue.get()
-    case hvoi: HiveVarcharObjectInspector => hvoi.getPrimitiveJavaObject(data).getValue
-    case hdoi: HiveDecimalObjectInspector => HiveShim.toCatalystDecimal(hdoi, data)
-    // org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
-    // if next timestamp is null, so Timestamp object is cloned
-    case ti: TimestampObjectInspector => ti.getPrimitiveJavaObject(data).clone()
-    case pi: PrimitiveObjectInspector => pi.getPrimitiveJavaObject(data)
+    case mi: StandardConstantMapObjectInspector =>
+      // take the value from the map inspector object, rather than the input data
+      mi.getWritableConstantValue.map { case (k, v) =>
+        (unwrap(k, mi.getMapKeyObjectInspector),
+         unwrap(v, mi.getMapValueObjectInspector))
+      }.toMap
+    case li: StandardConstantListObjectInspector =>
+      // take the value from the list inspector object, rather than the input data
+      li.getWritableConstantValue.map(unwrap(_, li.getListElementObjectInspector)).toSeq
+    // if the value is null, we don't care about the object inspector type
+    case _ if data == null => null
+    case poi: VoidObjectInspector => null // always be null for void object inspector
+    case pi: PrimitiveObjectInspector => pi match {
+      // We think HiveVarchar is also a String
+      case hvoi: HiveVarcharObjectInspector if hvoi.preferWritable() =>
+        hvoi.getPrimitiveWritableObject(data).getHiveVarchar.getValue
+      case hvoi: HiveVarcharObjectInspector => hvoi.getPrimitiveJavaObject(data).getValue
+      case x: StringObjectInspector if x.preferWritable() =>
+        x.getPrimitiveWritableObject(data).toString
--- End diff --

I guess we should return a `Writable`, namely a `Text` object, rather than a `String` here? Should we remove the `toString` call?
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3429#discussion_r22027948

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -119,12 +271,44 @@ private[hive] trait HiveInspectors {
         System.arraycopy(writable.getBytes, 0, temp, 0, temp.length)
         temp
       case poi: WritableConstantDateObjectInspector => poi.getWritableConstantValue.get()
-    case hvoi: HiveVarcharObjectInspector => hvoi.getPrimitiveJavaObject(data).getValue
-    case hdoi: HiveDecimalObjectInspector => HiveShim.toCatalystDecimal(hdoi, data)
-    // org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
-    // if next timestamp is null, so Timestamp object is cloned
-    case ti: TimestampObjectInspector => ti.getPrimitiveJavaObject(data).clone()
-    case pi: PrimitiveObjectInspector => pi.getPrimitiveJavaObject(data)
+    case mi: StandardConstantMapObjectInspector =>
+      // take the value from the map inspector object, rather than the input data
+      mi.getWritableConstantValue.map { case (k, v) =>
+        (unwrap(k, mi.getMapKeyObjectInspector),
+         unwrap(v, mi.getMapValueObjectInspector))
+      }.toMap
+    case li: StandardConstantListObjectInspector =>
+      // take the value from the list inspector object, rather than the input data
+      li.getWritableConstantValue.map(unwrap(_, li.getListElementObjectInspector)).toSeq
+    // if the value is null, we don't care about the object inspector type
+    case _ if data == null => null
+    case poi: VoidObjectInspector => null // always be null for void object inspector
+    case pi: PrimitiveObjectInspector => pi match {
+      // We think HiveVarchar is also a String
+      case hvoi: HiveVarcharObjectInspector if hvoi.preferWritable() =>
+        hvoi.getPrimitiveWritableObject(data).getHiveVarchar.getValue
+      case hvoi: HiveVarcharObjectInspector => hvoi.getPrimitiveJavaObject(data).getValue
+      case x: StringObjectInspector if x.preferWritable() =>
+        x.getPrimitiveWritableObject(data).toString
--- End diff --

Oh, I see where I was wrong: we need Catalyst objects rather than Hive objects here.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3429#discussion_r22028090

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveInspectorSuite.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.sql.Date
+import java.util
+
+import org.apache.hadoop.hive.serde2.io.DoubleWritable
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.types.decimal.Decimal
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.hive.ql.udf.UDAFPercentile
+import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, StructObjectInspector, ObjectInspectorFactory}
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions
+import org.apache.hadoop.io.LongWritable
+
+import org.apache.spark.sql.catalyst.expressions.{Literal, Row}
+
+class HiveInspectorSuite extends FunSuite with HiveInspectors {
--- End diff --

A general comment about this test suite: it would be better to use `===` rather than `==` in assertions, to enable friendlier error messages that report the actual data values when tests fail.
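To illustrate the reviewer's point, here is a hypothetical miniature of an `===`-style assertion (ScalaTest's real `===` comes from `org.scalatest.Assertions` and is more sophisticated; `assertEq` is an invented name used only for this sketch). Unlike a bare `assert(a == b)`, its failure message carries the actual and expected values:

```scala
// Hypothetical helper mimicking the style of ScalaTest's `===` failure
// messages; NOT the actual ScalaTest implementation.
def assertEq[A](actual: A, expected: A): Unit =
  if (actual != expected)
    throw new AssertionError(s"$actual did not equal $expected")

assertEq(Seq(1, 2, 3).sum, 6)   // passes silently
// assertEq(Seq(1, 2, 3).sum, 7) would fail with "6 did not equal 7"
```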
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3731#issuecomment-67454633

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24575/ Test PASSed.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
GitHub user adrian-wang opened a pull request: https://github.com/apache/spark/pull/3732

[SPARK-4508] [SQL] build native date type to conform behavior to Hive

Store daysSinceEpoch as an Int value (4 bytes) to represent DateType in the Catalyst row, instead of using java.sql.Date (8 bytes, as a Long). This ensures the same comparison behavior between Hive and Catalyst. Subsumes #3381.

I think there are already some tests in JavaSQLSuite, and for Python this will not affect the datetime class.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adrian-wang/spark datenative

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3732.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3732

commit b9fbdb53b36b7635e5d3bd6c73c4318665dfb41c
Author: Daoyuan Wang daoyuan.w...@intel.com
Date: 2014-11-20T05:02:52Z

    spark native date type

commit 9110ef072bc1aaf4c43bf307ae4e438130863661
Author: Daoyuan Wang daoyuan.w...@intel.com
Date: 2014-11-20T07:17:26Z

    api change

commit 2e167a4a8e71e6827d320106fa5727434100958c
Author: Daoyuan Wang daoyuan.w...@intel.com
Date: 2014-11-20T07:18:17Z

    remove outdated files
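The core idea of the PR, representing a date as an `Int` of days since the Unix epoch instead of an 8-byte `java.sql.Date`, can be sketched roughly as follows (the helper names `toDays`/`fromDays` are illustrative, not the PR's actual API):

```scala
import java.sql.Date
import java.time.LocalDate

// Illustrative helpers (not the PR's actual code): a date stored as an Int
// of days since 1970-01-01 compares with plain Int ordering, matching Hive.
def toDays(d: Date): Int = d.toLocalDate.toEpochDay.toInt
def fromDays(days: Int): Date = Date.valueOf(LocalDate.ofEpochDay(days.toLong))

val d1 = toDays(Date.valueOf("2014-11-20"))
val d2 = toDays(Date.valueOf("2014-12-18"))
assert(d1 < d2)                                      // comparison is just Int comparison
assert(fromDays(d2) == Date.valueOf("2014-12-18"))   // round-trips losslessly
```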
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3731#issuecomment-67454627

[Test build #24575 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24575/consoleFull) for PR 3731 at commit [`b9843f2`](https://github.com/apache/spark/commit/b9843f2c673f30c5111f2a2a29e15dcde00042db).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67454747

[Test build #24576 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24576/consoleFull) for PR 3732 at commit [`2e167a4`](https://github.com/apache/spark/commit/2e167a4a8e71e6827d320106fa5727434100958c).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67454924

In general this LGTM except for some minor styling comments, thanks!
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67455049

[Test build #24576 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24576/consoleFull) for PR 3732 at commit [`2e167a4`](https://github.com/apache/spark/commit/2e167a4a8e71e6827d320106fa5727434100958c).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `final class Date extends Ordered[Date] with Serializable `
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67455053

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24576/ Test FAILed.
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user ilganeli commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67455654

Hi @JoshRosen, I have made the updates we discussed. Example output is shown below for two cases of unserializable RDDs: the first is when an individual RDD is unserializable, the second is when a nested RDD dependency is unserializable.

Case 1:

```scala
val unserializableRdd = new MyRDD(sc, 1, Nil) {
  class UnserializableClass
  val unserializable = new UnserializableClass
}
val trace: Array[SerializedRef] = scheduler.tryToSerializeRdd(unserializableRdd)
```

Depth 0: DAGSchedulerSuiteRDD 0 - Failed to serialize parent.

Un-serializable reference trace for DAGSchedulerSuiteRDD 0:
**
DAGSchedulerSuiteRDD 0:
--- Ref (class scala.Tuple2, Hash: 1240412896)
--- Ref (org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2$$anon$1$UnserializableClass@315df4bb, Hash: 828241083)
--- Ref (StorageLevel(false, false, false, false, 1), Hash: 1370224403)
--- Ref (None, Hash: 1353759820)
--- Ref (List(), Hash: 1599566873)
--- Ref (scala.Tuple2, Hash: 254955665)
--- Ref (DAGSchedulerSuiteRDD 0, Hash: 279781579)
**
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2$$anon$1$UnserializableClass@315df4bb:
--- Ref (class scala.Tuple2, Hash: 1240412896)
--- Ref (StorageLevel(false, false, false, false, 1), Hash: 1370224403)
--- Ref (None, Hash: 1353759820)
--- Ref (List(), Hash: 1599566873)
--- Ref (scala.Tuple2, Hash: 254955665)
--- Ref (DAGSchedulerSuiteRDD 0, Hash: 279781579)
--- Ref (org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2$$anon$1$UnserializableClass@315df4bb, Hash: 828241083)
**

Case 2:

```scala
val baseRdd = new MyRDD(sc, 1, Nil)
val midRdd = new MyRDD(sc, 1, List(new OneToOneDependency(baseRdd)))
val finalRdd = new MyRDD(sc, 1, List(new OneToOneDependency(midRdd))) {
  class UnserializableClass
  val unserializable = new UnserializableClass
}
val trace: Array[SerializedRef] = scheduler.tryToSerializeRdd(finalRdd)
```

Depth 0: DAGSchedulerSuiteRDD 2 - Failed to serialize parent.
Depth 1: DAGSchedulerSuiteRDD 1 - Success
Depth 2: DAGSchedulerSuiteRDD 0 - Success

Un-serializable reference trace for DAGSchedulerSuiteRDD 2:
**
DAGSchedulerSuiteRDD 2:
--- Ref (DAGSchedulerSuiteRDD 0, Hash: 1968196847)
--- Ref (org.apache.spark.OneToOneDependency@29d37757, Hash: 701724503)
--- Ref (List(org.apache.spark.OneToOneDependency@29d37757), Hash: 1255445356)
--- Ref (DAGSchedulerSuiteRDD 1, Hash: 1787987889)
--- Ref (org.apache.spark.OneToOneDependency@3e598df9, Hash: 1046056441)
--- Ref (class scala.Tuple2, Hash: 1240412896)
--- Ref (StorageLevel(false, false, false, false, 1), Hash: 1370224403)
--- Ref (None, Hash: 1353759820)
--- Ref (org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$3$$anon$2$UnserializableClass@3887cf88, Hash: 948424584)
--- Ref (List(org.apache.spark.OneToOneDependency@3e598df9), Hash: 1618683794)
--- Ref (List(), Hash: 1599566873)
--- Ref (scala.Tuple2, Hash: 550572371)
--- Ref (DAGSchedulerSuiteRDD 2, Hash: 1726715997)
**
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$3$$anon$2$UnserializableClass@3887cf88:
--- Ref (DAGSchedulerSuiteRDD 0, Hash: 1968196847)
--- Ref (org.apache.spark.OneToOneDependency@29d37757, Hash: 701724503)
--- Ref (List(org.apache.spark.OneToOneDependency@29d37757), Hash: 1255445356)
--- Ref (DAGSchedulerSuiteRDD 1, Hash: 1787987889)
--- Ref (org.apache.spark.OneToOneDependency@3e598df9, Hash: 1046056441)
--- Ref (class scala.Tuple2, Hash: 1240412896)
--- Ref (StorageLevel(false, false, false, false, 1), Hash: 1370224403)
--- Ref (None, Hash: 1353759820)
--- Ref (List(org.apache.spark.OneToOneDependency@3e598df9), Hash: 1618683794)
--- Ref (List(), Hash: 1599566873)
--- Ref (scala.Tuple2, Hash: 550572371)
--- Ref (DAGSchedulerSuiteRDD 2, Hash: 1726715997)
--- Ref (org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$3$$anon$2$UnserializableClass@3887cf88, Hash: 948424584)
**
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67455620

[Test build #24577 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24577/consoleFull) for PR 3518 at commit [`bb5f700`](https://github.com/apache/spark/commit/bb5f700363dc577b84414e25caedafeb7c247de6).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4881] Use SparkConf#getBoolean instead ...
GitHub user sarutak opened a pull request: https://github.com/apache/spark/pull/3733

[SPARK-4881] Use SparkConf#getBoolean instead of get().toBoolean

It's really a minor issue. In ApplicationMaster, there is code like the following:

    val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean

I think the code can be simplified as follows:

    val preserveFiles = sparkConf.getBoolean("spark.yarn.preserve.staging.files", false)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sarutak/spark SPARK-4881

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3733.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3733

commit c63daa0dd76551fc1fea9b0dcfc91f9a73ee2948
Author: Kousuke Saruta saru...@oss.nttdata.co.jp
Date: 2014-12-18T08:33:04Z

    Simplified code
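The difference can be sketched with a toy config class (`Conf` is purely illustrative and is not SparkConf's actual implementation):

```scala
// Toy stand-in for SparkConf, only to illustrate why a typed getter reads
// better than get(...).toBoolean. NOT Spark's actual implementation.
class Conf(settings: Map[String, String]) {
  def get(key: String, default: String): String = settings.getOrElse(key, default)
  // typed getter: the default is supplied as a Boolean, parsing happens once here
  def getBoolean(key: String, default: Boolean): Boolean =
    settings.get(key).map(_.toBoolean).getOrElse(default)
}

val conf = new Conf(Map("spark.yarn.preserve.staging.files" -> "true"))
// verbose form: string default, then an explicit parse at every call site
val a = conf.get("spark.yarn.preserve.staging.files", "false").toBoolean
// simplified form: no parse at the call site
val b = conf.getBoolean("spark.yarn.preserve.staging.files", false)
assert(a == b)
```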
[GitHub] spark pull request: [SPARK-4881] Use SparkConf#getBoolean instead ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3733#issuecomment-67456060

[Test build #24578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24578/consoleFull) for PR 3733 at commit [`c63daa0`](https://github.com/apache/spark/commit/c63daa0dd76551fc1fea9b0dcfc91f9a73ee2948).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4871][SQL] Show sql statement in spark ...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/3718#issuecomment-67456216

Retest this please.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67456529

[Test build #24581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24581/consoleFull) for PR 3556 at commit [`37cfdf5`](https://github.com/apache/spark/commit/37cfdf5effe0de72a86974b65a6ddff87debfffa).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67456526

[Test build #24579 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24579/consoleFull) for PR 3732 at commit [`6ef2b1f`](https://github.com/apache/spark/commit/6ef2b1f1ba89c0ef522720118269a5ba168d1f5c).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67456774

[Test build #24581 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24581/consoleFull) for PR 3556 at commit [`37cfdf5`](https://github.com/apache/spark/commit/37cfdf5effe0de72a86974b65a6ddff87debfffa).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67456777

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24581/ Test FAILed.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67457397

[Test build #24582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24582/consoleFull) for PR 3556 at commit [`620ebe3`](https://github.com/apache/spark/commit/620ebe3df79fce1c8dbdea971eea99971af5b9d9).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67457906

[Test build #24583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24583/consoleFull) for PR 3555 at commit [`1893956`](https://github.com/apache/spark/commit/189395672255daad5eb2cdbd5b51a5948338f9f5).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4861][SQL] Refactory command in spark s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3712#issuecomment-67458887

[Test build #24584 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24584/consoleFull) for PR 3712 at commit [`51a82f2`](https://github.com/apache/spark/commit/51a82f2ae3fe9d28455940d953de7b76306f49b2).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67459431

[Test build #24585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24585/consoleFull) for PR 2956 at commit [`a473241`](https://github.com/apache/spark/commit/a47324118358802fcc6821e77ead77fd37003904).
* This patch **does not merge cleanly**.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user tolgap commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-67461517

@bgreeven I have cloned your branch and am trying to run the MNIST dataset. I can't quite understand how to set the number of output neurons though. The `topology` array seems to only apply to the hidden layers. I have seen some tests of MNIST on your code though, so I was curious how this was done?
[GitHub] spark pull request: [SPARK-4883][Shuffle] Add a name to the direct...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/3734 [SPARK-4883][Shuffle] Add a name to the directoryCleaner thread You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-4883 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3734.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3734 commit 71156d630dfc682caf9069b529b911d301827985 Author: zsxwing zsxw...@gmail.com Date: 2014-12-18T09:21:01Z Add a name to the directoryCleaner thread
[GitHub] spark pull request: [SPARK-4883][Shuffle] Add a name to the direct...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3734#issuecomment-67462082 [Test build #24586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24586/consoleFull) for PR 3734 at commit [`71156d6`](https://github.com/apache/spark/commit/71156d630dfc682caf9069b529b911d301827985). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67464059 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24579/ Test PASSed.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67464054 [Test build #24579 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24579/consoleFull) for PR 3732 at commit [`6ef2b1f`](https://github.com/apache/spark/commit/6ef2b1f1ba89c0ef522720118269a5ba168d1f5c). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `final class Date extends Ordered[Date] with Serializable `
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67464102 [Test build #24577 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24577/consoleFull) for PR 3518 at commit [`bb5f700`](https://github.com/apache/spark/commit/bb5f700363dc577b84414e25caedafeb7c247de6). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class EdgeRef(cur : AnyRef, parent : EdgeRef) `
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67464113 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24577/ Test PASSed.
[GitHub] spark pull request: [SPARK-4881] Use SparkConf#getBoolean instead ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3733#issuecomment-67464681 [Test build #24578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24578/consoleFull) for PR 3733 at commit [`c63daa0`](https://github.com/apache/spark/commit/c63daa0dd76551fc1fea9b0dcfc91f9a73ee2948). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4881] Use SparkConf#getBoolean instead ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3733#issuecomment-67464689 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24578/ Test PASSed.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67464932 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24582/ Test PASSed.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67464925 [Test build #24582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24582/consoleFull) for PR 3556 at commit [`620ebe3`](https://github.com/apache/spark/commit/620ebe3df79fce1c8dbdea971eea99971af5b9d9). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67465486 [Test build #24583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24583/consoleFull) for PR 3555 at commit [`1893956`](https://github.com/apache/spark/commit/189395672255daad5eb2cdbd5b51a5948338f9f5). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67465495 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24583/ Test PASSed.
[GitHub] spark pull request: [SPARK-4861][SQL] Refactory command in spark s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3712#issuecomment-67466375 [Test build #24584 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24584/consoleFull) for PR 3712 at commit [`51a82f2`](https://github.com/apache/spark/commit/51a82f2ae3fe9d28455940d953de7b76306f49b2). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4861][SQL] Refactory command in spark s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3712#issuecomment-67466379 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24584/ Test PASSed.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67466905 [Test build #24587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24587/consoleFull) for PR 3222 at commit [`03a180f`](https://github.com/apache/spark/commit/03a180f66927c41a737bd8706caa6c4686606252). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-2554][SQL] Supporting SumDistinct parti...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3348#issuecomment-67467429 [Test build #24588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24588/consoleFull) for PR 3348 at commit [`fd28e4d`](https://github.com/apache/spark/commit/fd28e4d9e807e677a29451ee361ff040927ffc02). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67467841 Yes, just talking about oversampling now. In 1, if you mean ceil(rdd.count / numBins) then yes that's basically what I've got now. You won't quite get numBins back, yes. The spacing _will_ be even -- except at partition boundaries. You can push around a few points there to amortize the uneven space. I don't even think you need oversampling for that. I'm suggesting 1 as well. I feel like I sound lazy, but, this is a context where approximation is entirely fine since the purpose is, say, exporting something you could plot in a picture, and because the error is so relatively modest in realistic use cases. It doesn't seem worth the complexity or processing. Maybe I should just document the couple caveats here?
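For readers following the thread, the even-stride approximation srowen describes can be sketched in Scala as follows. This is illustrative only, not the PR's actual code; `approxBoundaries` is a hypothetical name, and the real implementation works per-partition on an RDD rather than on a local array:

```scala
// Sketch of the approximation under discussion: after sorting, take every
// ceil(n / numBins)-th element as a bin boundary. Spacing is even; in the
// distributed version a few points near partition boundaries may need to be
// pushed around, as noted above, and you may get slightly fewer than
// numBins boundaries back.
def approxBoundaries(sorted: Array[Double], numBins: Int): Array[Double] = {
  require(numBins > 0 && sorted.nonEmpty, "need data and at least one bin")
  val stride = math.ceil(sorted.length.toDouble / numBins).toInt
  sorted.indices.collect { case i if i % stride == 0 => sorted(i) }.toArray
}
```

For example, with 10 sorted points and numBins = 3 the stride is 4, so indices 0, 4, and 8 are selected -- exactly the "you won't quite get numBins back" behavior mentioned above.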
[GitHub] spark pull request: [SPARK-4871][SQL] Show sql statement in spark ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3718#issuecomment-67468412 [Test build #24580 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24580/consoleFull) for PR 3718 at commit [`4d2038a`](https://github.com/apache/spark/commit/4d2038a3d9727a1ba38c5efba2b01f8faaf65ce8). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4871][SQL] Show sql statement in spark ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3718#issuecomment-67468425 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24580/ Test PASSed.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67468461 [Test build #24585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24585/consoleFull) for PR 2956 at commit [`a473241`](https://github.com/apache/spark/commit/a47324118358802fcc6821e77ead77fd37003904). * This patch **fails PySpark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67468465 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24585/ Test FAILed.
[GitHub] spark pull request: [SPARK-4883][Shuffle] Add a name to the direct...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3734#issuecomment-67471103 [Test build #24586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24586/consoleFull) for PR 3734 at commit [`71156d6`](https://github.com/apache/spark/commit/71156d630dfc682caf9069b529b911d301827985). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait ParquetTest ` * `protected class CaseInsensitiveMap(map: Map[String, String]) extends Map[String, String] ` * ` class HiveThriftServer2Listener(val server: HiveServer2) extends SparkListener `
[GitHub] spark pull request: [SPARK-4883][Shuffle] Add a name to the direct...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3734#issuecomment-67471113 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24586/ Test PASSed.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67472596 @marmbrus Please review again. Thanks.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67473028 @marmbrus Please review again. Thanks.
[GitHub] spark pull request: [SPARK-2554][SQL] Supporting SumDistinct parti...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3348#issuecomment-67474669 [Test build #24588 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24588/consoleFull) for PR 3348 at commit [`fd28e4d`](https://github.com/apache/spark/commit/fd28e4d9e807e677a29451ee361ff040927ffc02). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2554][SQL] Supporting SumDistinct parti...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3348#issuecomment-67474674 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24588/ Test PASSed.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user Lewuathe commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-67475009 I agree with @jkbradley. For now, do not expose an optimizer parameter; only allow one (LBFGS?). Changing the scope of each API should be done carefully. In this case it is a tradeoff between exposing the optimizers publicly and the usability of ANN. ANN currently seems to require LBFGS, so making only it public is the reasonable way.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67475430 [Test build #24587 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24587/consoleFull) for PR 3222 at commit [`03a180f`](https://github.com/apache/spark/commit/03a180f66927c41a737bd8706caa6c4686606252). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AdaGradUpdater(` * `class DBN(val stackedRBM: StackedRBM, val nn: MLP)` * `class MLP(` * `class MomentumUpdater(val momentum: Double) extends Updater ` * `class RBM(` * `class StackedRBM(val innerRBMs: Array[RBM])` * `case class MinstItem(label: Int, data: Array[Int]) ` * `class MinstDatasetReader(labelsFile: String, imagesFile: String)`
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67475437 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24587/ Test PASSed.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user liyezhang556520 commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67483754 jenkins, retest this please
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67484171 [Test build #24589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24589/consoleFull) for PR 2956 at commit [`a473241`](https://github.com/apache/spark/commit/a47324118358802fcc6821e77ead77fd37003904). * This patch **does not merge cleanly**.
[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67486366 Ok, I have addressed (I think) all of those issues, with the exception of modifying GaussianMixtureModel to carry instances of MultivariateGaussian. I do like that idea, but think it would be best to create a new issue around solidifying MultivariateGaussian, then revisit this modification. I'd be more than happy to work on the PR for making MultivariateGaussian public.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67491067 [Test build #24590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24590/consoleFull) for PR 3429 at commit [`9f0aff3`](https://github.com/apache/spark/commit/9f0aff33e862746d3d295a9dbf2629665d80cc22). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67491128 Thank you @liancheng, I've updated the code per your feedback. @marmbrus I think this PR is ready to be merged once Jenkins agrees.
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22044330 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ the same RDDs. For example, the [Shark](http://shark.cs.berkeley.edu) JDBC serve queries. In future releases, in-memory storage systems such as [Tachyon](http://tachyon-project.org) will provide another approach to share RDDs. +## Dynamic Resource Allocation + +Spark 1.2 introduces the ability to dynamically scale the set of cluster resources allocated to +your application up and down based on the workload. This means that your application may give +resources back to the cluster if they are no longer used and request them again later when there +is demand. This feature is particularly useful if multiple applications share resources in your +Spark cluster. If a subset of the resources allocated to an application becomes idle, it can be +returned to the cluster's pool of resources and acquired by other applications. In Spark, dynamic +resource allocation is performed on the granularity of the executor and can be enabled through +`spark.dynamicAllocation.enabled`. + +This feature is currently disabled by default and available only on [YARN](running-on-yarn.html). +A future release will extend this to [standalone mode](spark-standalone.html) and +[Mesos coarse-grained mode](running-on-mesos.html#mesos-run-modes). Note that although Spark on +Mesos already has a similar notion of dynamic resource sharing in fine-grained mode, enabling +dynamic allocation allows your Mesos application to take advantage of coarse-grained low-latency +scheduling while sharing cluster resources efficiently. + +Lastly, it is worth noting that Spark's dynamic resource allocation mechanism is cooperative. +This means if a Spark application enables this feature, other applications on the same cluster +are also expected to do so. 
Otherwise, the cluster's resources will end up being unfairly +distributed to the applications that do not voluntarily give up unused resources they have +acquired. + +### Configuration and Setup + +All configurations used by this feature live under the `spark.dynamicAllocation.*` namespace. +To enable this feature, your application must set `spark.dynamicAllocation.enabled` to `true` and +provide lower and upper bounds for the number of executors through +`spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`. Other relevant +configurations are described on the [configurations page](configuration.html#dynamic-allocation) +and in the subsequent sections in detail. + +Additionally, your application must use an external shuffle service (described below). To enable --- End diff -- It would be nice to add a short clause explaining why this is the case
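The settings named in the quoted documentation can be collected into a `spark-defaults.conf` fragment; the property names come straight from the text above, while the executor bounds are purely illustrative values, not recommendations:

```properties
# Minimal sketch of a Spark 1.2 dynamic-allocation setup (YARN only).
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   2
spark.dynamicAllocation.maxExecutors   20
# Required companion setting: the external shuffle service (see setup steps).
spark.shuffle.service.enabled          true
```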
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22044723 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ +Lastly, it is worth noting that Spark's dynamic resource allocation mechanism is cooperative. --- End diff -- I would possibly rephrase or leave this paragraph out, as there are situations where different `dynamicAllocation.enabled` settings for different applications are reasonable. E.g. a cluster might have some production applications that need a static allocation to cache data and respond to queries as fast as possible, while others might be interactive and have highly varying resource use. YARN is meant to take care of the fairness aspect.
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22044849 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ +Additionally, your application must use an external shuffle service (described below). To enable +this, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is +implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` --- End diff -- Should this be broken out into a separate section for users that don't care about dynamic allocation, but want to learn how to use the external shuffle service?
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67493934 How do you compare accuracy? Perplexity means nothing but perplexity -- topic models may be reliably compared only via an application task (e.g. classification, recommendation...). Should I add the dataset for the perplexity sanity check to the repo? I am about to use 1000 arXiv papers. This dataset is about 20 MB (5.5 MB zipped).
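For context on the metric being debated here: perplexity is conventionally computed from held-out per-token log-likelihood. A minimal sketch (the per-token log-probabilities below are hypothetical, not from any real topic model):

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(-mean per-token log-likelihood)."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Hypothetical per-token log-probabilities assigned by a model
# to a held-out document; lower perplexity = better fit.
log_probs = [math.log(0.01), math.log(0.02), math.log(0.005)]
print(perplexity(log_probs))
```

This illustrates the point in the comment: perplexity measures only how well the model predicts held-out tokens, which need not track usefulness on a downstream task.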
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22044912 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ +this, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is +implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` +in your cluster. To start this service, follow these steps: + +1. Build Spark with the [YARN profile](building-spark.html). Skip this step if you are using a +pre-packaged distribution. +2. Locate the `spark-version-yarn-shuffle.jar`. This should be under +`$SPARK_HOME/network/yarn/target/scala-version` if you are building Spark yourself, and under +`lib` if you are using a distribution. +2. Add this jar to the classpath of all `NodeManager`s in your cluster. +3. In the `yarn-site.xml` on each node, add `spark_shuffle` to `yarn.nodemanager.aux-services`, +then set `yarn.nodemanager.aux-services.spark_shuffle.class` to +`org.apache.spark.yarn.network.YarnShuffleService`. Additionally, set all relevant +`spark.shuffle.service.*` [configurations](configuration.html). +4. Restart all `NodeManager`s in your cluster. + +### Resource Allocation Policy + +On a high level, Spark should relinquish executors when they are no longer used and acquire --- End diff -- Nit: I think this should be "At a high level" or "From a high level"
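The `yarn-site.xml` change described in the setup steps above would look roughly like this. A sketch, not an authoritative config: in particular, the `mapreduce_shuffle` entry is an assumption about a typical cluster, and any aux-services already configured on your NodeManagers must be kept in the list alongside `spark_shuffle`:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.yarn.network.YarnShuffleService</value>
  </property>
</configuration>
```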
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22045044 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ +### Resource Allocation Policy + +On a high level, Spark should relinquish executors when they are no longer used and acquire +executors when they are needed. Since there is no definitive way to predict whether an executor +that is about to be removed will run a task in the near future, or whether a new executor that is +about to be added will actually be idle, we need a set of heuristics to determine when to remove +and request executors. + +#### Request Policy + +A Spark application with dynamic allocation enabled requests additional executors when it has +pending tasks waiting to be scheduled. This condition necessarily implies that the existing set +of executors is insufficient to simultaneously saturate all tasks that have been submitted but +not yet finished. + +Spark requests executors in rounds. The actual request is triggered when there have been pending +tasks for
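The round-based request policy sketched in the quoted docs ramps up exponentially: each round asks for roughly double the executors of the previous round, capped by demand and the configured maximum. A toy model of that idea, not the real `ExecutorAllocationManager` code:

```python
def executors_requested_per_round(pending_tasks, max_executors, rounds):
    """Toy model of round-based executor requests: each round doubles the
    number of executors added, capped by outstanding demand and by the
    configured maximum (spark.dynamicAllocation.maxExecutors)."""
    total, add, requests = 0, 1, []
    for _ in range(rounds):
        if total >= min(pending_tasks, max_executors):
            break  # demand satisfied or cap reached
        add_now = min(add, max_executors - total, pending_tasks - total)
        total += add_now
        requests.append(add_now)
        add *= 2  # exponential ramp-up across rounds
    return requests

print(executors_requested_per_round(pending_tasks=100, max_executors=10, rounds=10))
# -> [1, 2, 4, 3]
```

The final round requests only 3 executors because the cap of 10 binds before the doubled amount (8) can be granted.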
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3731#issuecomment-67494432 Super nice to have documentation at this level of detail. This mostly looks good; I left a few comments.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67494552 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24589/ Test FAILed.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67494546 [Test build #24589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24589/consoleFull) for PR 2956 at commit [`a473241`](https://github.com/apache/spark/commit/a47324118358802fcc6821e77ead77fd37003904). * This patch **fails PySpark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4884]: Improve Partition docs
Github user msiddalingaiah commented on the pull request: https://github.com/apache/spark/pull/3722#issuecomment-67496889 @ash211 Not a problem, I created a JIRA ticket and updated the title/description. Thanks!!
[GitHub] spark pull request: [SPARK-4409][MLlib] Additional Linear Algebra ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3319#issuecomment-67499071 [Test build #24591 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24591/consoleFull) for PR 3319 at commit [`75239f8`](https://github.com/apache/spark/commit/75239f8e5b41a275a0f232108b26cb0e16935bbf). * This patch merges cleanly.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67501444 [Test build #24592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24592/consoleFull) for PR 3222 at commit [`164d5b7`](https://github.com/apache/spark/commit/164d5b74aae31683e6b69d8b0e23f77b25e7d99f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67501660 [Test build #24590 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24590/consoleFull) for PR 3429 at commit [`9f0aff3`](https://github.com/apache/spark/commit/9f0aff33e862746d3d295a9dbf2629665d80cc22). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67501667 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24590/ Test PASSed.
[GitHub] spark pull request: Added setMinCount to Word2Vec.scala
Github user ganonp commented on the pull request: https://github.com/apache/spark/pull/3693#issuecomment-67502378 Oh wow, I just didn't see that the function and everything inside was lining up... Hurts to look at. Thanks for those links and your patience. Spark now makes up about 70% of my job, so I'll definitely be contributing more.
[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/3409#issuecomment-67507324 Looks good.
[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3409
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22050878 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala --- @@ -39,23 +39,37 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) var appName: String = "Spark" var priority = 0 - // Additional memory to allocate to containers - // For now, use driver's memory overhead as our AM container's memory overhead - val amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead", -math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN)) - - val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead", -math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)) - private val isDynamicAllocationEnabled = sparkConf.getBoolean("spark.dynamicAllocation.enabled", false) parseArgs(args.toList) + + val isClusterMode = userClass != null --- End diff -- ClientBase has this same check (`private val isLaunchingDriver = args.userClass != null`). Perhaps we should just make this accessible in the ClientBaseArguments so that ClientBase can just read it from here.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22051045 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala --- @@ -39,23 +39,37 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) var appName: String = "Spark" var priority = 0 - // Additional memory to allocate to containers - // For now, use driver's memory overhead as our AM container's memory overhead - val amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead", -math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN)) - - val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead", -math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)) - private val isDynamicAllocationEnabled = sparkConf.getBoolean("spark.dynamicAllocation.enabled", false) parseArgs(args.toList) + + val isClusterMode = userClass != null + loadEnvironmentArgs() validateArgs() + // Additional memory to allocate to containers. In different modes, we use different configs. + val amMemoryOverheadConf = if (isClusterMode) { +"spark.yarn.driver.memoryOverhead" + } else { +"spark.yarn.am.memoryOverhead" + } + val amMemoryOverhead = sparkConf.getInt(amMemoryOverheadConf, +math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN)) + + val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead", +math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)) + /** Load any default arguments provided through environment variables and Spark properties. */ private def loadEnvironmentArgs(): Unit = { +// In cluster mode, the driver and the AM live in the same JVM, so this does not apply +if (!isClusterMode) { + amMemory = Utils.memoryStringToMb(sparkConf.get("spark.yarn.am.memory", "512m")) +} else { + println("spark.yarn.am.memory is set but does not apply in cluster mode, + --- End diff -- We might as well make it consistent and add a warning about spark.yarn.am.memoryOverhead being set in cluster mode as well.
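The consistency the reviewer asks for could be sketched roughly as follows; this is a hypothetical illustration only, with made-up method and message names, not the actual Spark implementation:

```scala
// Hypothetical sketch: warn uniformly about client-mode-only AM settings
// that are silently ignored in cluster mode. Assumes this lives inside
// ClientArguments, where `isClusterMode` and `sparkConf` are in scope.
private def warnIgnoredClientModeConfigs(): Unit = {
  if (isClusterMode) {
    // Check both the AM memory and AM memory overhead keys the same way
    Seq("spark.yarn.am.memory", "spark.yarn.am.memoryOverhead").foreach { key =>
      if (sparkConf.contains(key)) {
        println(s"$key is set but does not apply in cluster mode.")
      }
    }
  }
}
```

This keeps both warnings in one place, so adding another client-mode-only key later is a one-line change.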
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22051552 --- Diff: docs/running-on-yarn.md --- @@ -22,6 +22,14 @@ Most of the configs are the same for Spark on YARN as for other deployment modes <table class="table"> <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> <tr> + <td><code>spark.yarn.am.memory</code></td> + <td>512m</td> + <td> Amount of memory to use for the Yarn ApplicationMaster in client mode. In cluster mode, use `spark.driver.memory` instead. --- End diff -- Would be nice to specify the format like the spark.executor.memory docs do: "in the same format as JVM memory strings (e.g. 512m, 2g)".
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22051716 --- Diff: docs/running-on-yarn.md --- @@ -92,6 +100,13 @@ Most of the configs are the same for Spark on YARN as for other deployment modes </td> </tr> <tr> + <td><code>spark.yarn.am.memoryOverhead</code></td> + <td>AM memory * 0.07, with minimum of 384</td> --- End diff -- Would be nice to add a comment to spark.yarn.driver.memoryOverhead saying it applies in cluster mode. This config is a bit different from the others, as the memory overhead is purely a YARN thing and doesn't apply in other modes; i.e. there is no existing spark.driver.memoryOverhead. We could potentially just use one config for this. I'm not sure whether that would be more confusing or not, though... @sryza @vanzin @andrewor14 thoughts on that?
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67512122 test this please
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67512322 [Test build #24593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24593/consoleFull) for PR 3471 at commit [`20b9887`](https://github.com/apache/spark/commit/20b9887bb9529f2792123778e6eeca6ba0e51c37). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67512174 This looks good. Kicked Jenkins to run again since the last run was a while ago.
[GitHub] spark pull request: [SPARK-4409][MLlib] Additional Linear Algebra ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3319#issuecomment-67513626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24591/
[GitHub] spark pull request: [SPARK-4409][MLlib] Additional Linear Algebra ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3319#issuecomment-67513619 [Test build #24591 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24591/consoleFull) for PR 3319 at commit [`75239f8`](https://github.com/apache/spark/commit/75239f8e5b41a275a0f232108b26cb0e16935bbf). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67516313 [Test build #24592 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24592/consoleFull) for PR 3222 at commit [`164d5b7`](https://github.com/apache/spark/commit/164d5b74aae31683e6b69d8b0e23f77b25e7d99f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AdaGradUpdater(` * `class DBN(val stackedRBM: StackedRBM, val nn: MLP)` * `class MLP(` * `class MomentumUpdater(val momentum: Double) extends Updater ` * `class RBM(` * `class StackedRBM(val innerRBMs: Array[RBM])` * `case class MinstItem(label: Int, data: Array[Int]) ` * `class MinstDatasetReader(labelsFile: String, imagesFile: String)`
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67516326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24592/
[GitHub] spark pull request: Spark 3883: SSL support for HttpServer and Akk...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/3571#issuecomment-67518861 @vanzin I made some changes, but I'm not sure about using the Spark configuration in this case; at least it may not be so clear. I mean such cases as running executors: `CoarseGrainedExecutorBackend` needs the SSL configuration to connect to the driver and fetch the real application configuration. In other words, it doesn't have any information about the configuration and it doesn't load the properties file. I suppose the same problem would apply to `DriverWrapper`, which is used when the driver is run by the worker.
[GitHub] spark pull request: Spark 3883: SSL support for HttpServer and Akk...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/3571#issuecomment-67519339 It could work if `DriverWrapper` and `CoarseGrainedExecutorBackend` loaded the daemon's configuration file.
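For context, the kind of settings under discussion here is the `spark.ssl.*` namespace that this PR (SPARK-3883) introduced; a purely illustrative fragment, with made-up paths and password:

```
# Illustrative spark-defaults.conf fragment; keys are from the spark.ssl.*
# namespace this PR proposes, values are placeholders.
spark.ssl.enabled            true
spark.ssl.keyStore           /path/to/keystore.jks
spark.ssl.keyStorePassword   changeit
spark.ssl.trustStore         /path/to/truststore.jks
```

The point of the comment above is that processes like `CoarseGrainedExecutorBackend` need these values before they can talk to the driver, so they cannot rely on fetching them over the (not yet secured) connection.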
[GitHub] spark pull request: [SPARK-4884]: Improve Partition docs
Github user ash211 commented on the pull request: https://github.com/apache/spark/pull/3722#issuecomment-67519651 +1
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user oza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22055469 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ the same RDDs. For example, the [Shark](http://shark.cs.berkeley.edu) JDBC server serves queries. In future releases, in-memory storage systems such as [Tachyon](http://tachyon-project.org) will provide another approach to share RDDs. +## Dynamic Resource Allocation + +Spark 1.2 introduces the ability to dynamically scale the set of cluster resources allocated to +your application up and down based on the workload. This means that your application may give +resources back to the cluster if they are no longer used and request them again later when there +is demand. This feature is particularly useful if multiple applications share resources in your +Spark cluster. If a subset of the resources allocated to an application becomes idle, it can be +returned to the cluster's pool of resources and acquired by other applications. In Spark, dynamic +resource allocation is performed on the granularity of the executor and can be enabled through +`spark.dynamicAllocation.enabled`. + +This feature is currently disabled by default and available only on [YARN](running-on-yarn.html). +A future release will extend this to [standalone mode](spark-standalone.html) and +[Mesos coarse-grained mode](running-on-mesos.html#mesos-run-modes). Note that although Spark on +Mesos already has a similar notion of dynamic resource sharing in fine-grained mode, enabling +dynamic allocation allows your Mesos application to take advantage of coarse-grained low-latency +scheduling while sharing cluster resources efficiently. + +Lastly, it is worth noting that Spark's dynamic resource allocation mechanism is cooperative. +This means if a Spark application enables this feature, other applications on the same cluster +are also expected to do so. Otherwise, the cluster's resources will end up being unfairly +distributed to the applications that do not voluntarily give up unused resources they have +acquired. + +### Configuration and Setup + +All configurations used by this feature live under the `spark.dynamicAllocation.*` namespace. +To enable this feature, your application must set `spark.dynamicAllocation.enabled` to `true` and +provide lower and upper bounds for the number of executors through +`spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`. Other relevant +configurations are described on the [configurations page](configuration.html#dynamic-allocation) +and in the subsequent sections in detail. + +Additionally, your application must use an external shuffle service (described below). To enable +this, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is +implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` --- End diff -- +1 to adding how to use the external shuffle service, since we need to enable the external shuffle service to use dynamic allocation.
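Concretely, the minimal setup the quoted documentation describes amounts to a handful of properties; a sketch of a `spark-defaults.conf` fragment, with illustrative bounds:

```
# Enable dynamic allocation with lower/upper executor bounds (values are
# examples only), plus the external shuffle service it requires.
spark.dynamicAllocation.enabled       true
spark.dynamicAllocation.minExecutors  2
spark.dynamicAllocation.maxExecutors  20
spark.shuffle.service.enabled         true
```

The same keys can equally be passed as `--conf` flags to `spark-submit`.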
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67520645 This looks pretty neat; I'll try to review this soon (a little busy right now), but in the meantime you might be interested in #3638 which has some small overlap in the sense that both patches deal with handling of serialization errors; both patches address different issues, though. I'm inclined to merge #3638 first, since it's a bug fix and this is a feature, so that's likely to create a bunch of merge conflicts here. I'll let you know if I do that, and I might be able to help fix the conflicts myself by submitting a PR to your PR.
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user ilganeli commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67523669 Great - thanks, Josh. I'm working on doing a bit more code cleanup in the meantime to minimize touch points within the existing Spark classes.
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67525753 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24593/
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67525739 [Test build #24593 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24593/consoleFull) for PR 3471 at commit [`20b9887`](https://github.com/apache/spark/commit/20b9887bb9529f2792123778e6eeca6ba0e51c37). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22058276 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala --- @@ -0,0 +1,284 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix} +import breeze.linalg.Transpose + +import org.apache.spark.rdd.RDD +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors} +import org.apache.spark.mllib.stat.impl.MultivariateGaussian +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext} +import org.apache.spark.SparkContext.DoubleAccumulatorParam + +/** + * This class performs expectation maximization for multivariate Gaussian + * Mixture Models (GMMs). A GMM represents a composite distribution of + * independent Gaussian distributions with associated mixing weights + * specifying each's contribution to the composite. + * + * Given a set of sample points, this class will maximize the log-likelihood + * for a mixture of k Gaussians, iterating until the log-likelihood changes by + * less than convergenceTol, or until it has reached the max number of iterations. + * While this process is generally guaranteed to converge, it is not guaranteed + * to find a global optimum. + * + * @param k The number of independent Gaussians in the mixture model + * @param convergenceTol The maximum change in log-likelihood at which convergence + * is considered to have occurred. + * @param maxIterations The maximum number of iterations to perform + */ +class GaussianMixtureModelEM private ( +private var k: Int, +private var convergenceTol: Double, +private var maxIterations: Int) extends Serializable { + + // Type aliases for convenience + private type DenseDoubleVector = BreezeVector[Double] + private type DenseDoubleMatrix = BreezeMatrix[Double] + + private type ExpectationSum = ( +Array[Double], // log-likelihood in index 0 +Array[Double], // array of weights +Array[DenseDoubleVector], // array of means +Array[DenseDoubleMatrix]) // array of cov matrices + + // create a zero'd ExpectationSum instance + private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = { +(Array(0.0), + new Array[Double](k), + (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray, + (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray) + } + + // add two ExpectationSum objects (allowed to use modify m1) + // (U, U) => U for aggregation + private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = { +m1._1(0) += m2._1(0) +for (i <- 0 until m1._2.length) { + m1._2(i) += m2._2(i) + m1._3(i) += m2._3(i) + m1._4(i) += m2._4(i) +} +m1 + } + + // compute cluster contributions for each input point + // (U, T) => U for aggregation + private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian]) + (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = { +val k = model._2.length +val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray +val pSum = p.sum +model._1(0) += math.log(pSum) +val xxt = x * new Transpose(x) +for (i <- 0 until k) { + p(i) /= pSum + model._2(i) += p(i) + model._3(i) += x * p(i) + model._4(i) += xxt * p(i) +} +model + } + + // number of samples per cluster to use when initializing Gaussians + private val nSamples = 5 + + // an initializing GMM can be provided rather than using the + // default random starting point + private var initialGmm: Option[GaussianMixtureModel] = None +
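For reference, the sums accumulated in `computeExpectation` in the quoted diff correspond to the standard GMM E-step quantities; a sketch in conventional notation (the `eps` added to each term in the code is a small constant for numerical stability, omitted here):

```latex
% E-step: responsibility of component i for sample x_j
\gamma_{ij} = \frac{w_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)}
                   {\sum_{l=1}^{k} w_l \, \mathcal{N}(x_j \mid \mu_l, \Sigma_l)}
% Accumulated statistics, matching the four ExpectationSum slots:
% model._1: log-likelihood, model._2: weight sums,
% model._3: weighted sums of x, model._4: weighted sums of x x^T
\log L = \sum_j \log \sum_{l=1}^{k} w_l \, \mathcal{N}(x_j \mid \mu_l, \Sigma_l), \qquad
W_i = \sum_j \gamma_{ij}, \qquad
S_i = \sum_j \gamma_{ij}\, x_j, \qquad
Q_i = \sum_j \gamma_{ij}\, x_j x_j^{\top}
```

The M-step (not shown in this excerpt) would then normalize these sums to obtain the updated weights, means, and covariances.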
[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22058461 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala --- @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import org.apache.spark.rdd.RDD +import org.apache.spark.mllib.linalg.Matrix +import org.apache.spark.mllib.linalg.Vector + +/** + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are + * the respective mean and covariance for each Gaussian distribution i=1..k. + * + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is + * the weight for Gaussian i, and weight.sum == 1 + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the + * covariance matrix for Gaussian i + */ +class GaussianMixtureModel( + val weight: Array[Double], --- End diff -- We only use Breeze internally right now; we don't want to expose it as a public API. I really meant using the MultivariateGaussian class which you defined.
[GitHub] spark pull request: [SPARK-4877] Allow user first classes to exten...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/3725#issuecomment-67526925 BTW I'm pretty sure I addressed this as part of #3233, although in a different way.