[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3429#discussion_r22027516

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -33,6 +31,147 @@ import org.apache.spark.sql.catalyst.types.decimal.Decimal

 /* Implicit conversions */
 import scala.collection.JavaConversions._

+/**
+ * 1. The underlying data types in Catalyst and in Hive
+ * In Catalyst:
+ *   Primitive =
+ *     java.lang.String
+ *     int / scala.Int
+ *     boolean / scala.Boolean
+ *     float / scala.Float
+ *     double / scala.Double
+ *     long / scala.Long
+ *     short / scala.Short
+ *     byte / scala.Byte
+ *     org.apache.spark.sql.catalyst.types.decimal.Decimal
+ *     Array[Byte]
+ *     java.sql.Date
+ *     java.sql.Timestamp
+ *   Complicated Types =
--- End diff --

"Complicated" should be "Complex".

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3429#discussion_r22027824

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -119,12 +271,44 @@ private[hive] trait HiveInspectors {
         System.arraycopy(writable.getBytes, 0, temp, 0, temp.length)
         temp
       case poi: WritableConstantDateObjectInspector => poi.getWritableConstantValue.get()
-    case hvoi: HiveVarcharObjectInspector => hvoi.getPrimitiveJavaObject(data).getValue
-    case hdoi: HiveDecimalObjectInspector => HiveShim.toCatalystDecimal(hdoi, data)
-    // org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
-    // if next timestamp is null, so Timestamp object is cloned
-    case ti: TimestampObjectInspector => ti.getPrimitiveJavaObject(data).clone()
-    case pi: PrimitiveObjectInspector => pi.getPrimitiveJavaObject(data)
+    case mi: StandardConstantMapObjectInspector =>
+      // take the value from the map inspector object, rather than the input data
+      mi.getWritableConstantValue.map { case (k, v) =>
+        (unwrap(k, mi.getMapKeyObjectInspector),
+         unwrap(v, mi.getMapValueObjectInspector))
+      }.toMap
+    case li: StandardConstantListObjectInspector =>
+      // take the value from the list inspector object, rather than the input data
+      li.getWritableConstantValue.map(unwrap(_, li.getListElementObjectInspector)).toSeq
+    // if the value is null, we don't care about the object inspector type
+    case _ if data == null => null
+    case poi: VoidObjectInspector => null // always be null for void object inspector
+    case pi: PrimitiveObjectInspector => pi match {
+      // We think HiveVarchar is also a String
+      case hvoi: HiveVarcharObjectInspector if hvoi.preferWritable() =>
+        hvoi.getPrimitiveWritableObject(data).getHiveVarchar.getValue
+      case hvoi: HiveVarcharObjectInspector => hvoi.getPrimitiveJavaObject(data).getValue
+      case x: StringObjectInspector if x.preferWritable() =>
+        x.getPrimitiveWritableObject(data).toString
--- End diff --

I guess we should return a `Writable`, namely a `Text` object, rather than a `String` here? Should we remove the `toString` call?
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3429#discussion_r22027948

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -119,12 +271,44 @@ private[hive] trait HiveInspectors {
         System.arraycopy(writable.getBytes, 0, temp, 0, temp.length)
         temp
       case poi: WritableConstantDateObjectInspector => poi.getWritableConstantValue.get()
-    case hvoi: HiveVarcharObjectInspector => hvoi.getPrimitiveJavaObject(data).getValue
-    case hdoi: HiveDecimalObjectInspector => HiveShim.toCatalystDecimal(hdoi, data)
-    // org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
-    // if next timestamp is null, so Timestamp object is cloned
-    case ti: TimestampObjectInspector => ti.getPrimitiveJavaObject(data).clone()
-    case pi: PrimitiveObjectInspector => pi.getPrimitiveJavaObject(data)
+    case mi: StandardConstantMapObjectInspector =>
+      // take the value from the map inspector object, rather than the input data
+      mi.getWritableConstantValue.map { case (k, v) =>
+        (unwrap(k, mi.getMapKeyObjectInspector),
+         unwrap(v, mi.getMapValueObjectInspector))
+      }.toMap
+    case li: StandardConstantListObjectInspector =>
+      // take the value from the list inspector object, rather than the input data
+      li.getWritableConstantValue.map(unwrap(_, li.getListElementObjectInspector)).toSeq
+    // if the value is null, we don't care about the object inspector type
+    case _ if data == null => null
+    case poi: VoidObjectInspector => null // always be null for void object inspector
+    case pi: PrimitiveObjectInspector => pi match {
+      // We think HiveVarchar is also a String
+      case hvoi: HiveVarcharObjectInspector if hvoi.preferWritable() =>
+        hvoi.getPrimitiveWritableObject(data).getHiveVarchar.getValue
+      case hvoi: HiveVarcharObjectInspector => hvoi.getPrimitiveJavaObject(data).getValue
+      case x: StringObjectInspector if x.preferWritable() =>
+        x.getPrimitiveWritableObject(data).toString
--- End diff --

Oh, I see where I was wrong: we need Catalyst objects rather than Hive objects here.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3429#discussion_r22028090

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveInspectorSuite.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.sql.Date
+import java.util
+
+import org.apache.hadoop.hive.serde2.io.DoubleWritable
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.types.decimal.Decimal
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.hive.ql.udf.UDAFPercentile
+import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, StructObjectInspector, ObjectInspectorFactory}
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions
+import org.apache.hadoop.io.LongWritable
+
+import org.apache.spark.sql.catalyst.expressions.{Literal, Row}
+
+class HiveInspectorSuite extends FunSuite with HiveInspectors {
--- End diff --

A general comment about this test suite: it would be better to use `===` rather than `==` in assertions, to enable friendlier error messages that report the actual data values when tests fail.
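To illustrate the reviewer's point, here is a hypothetical miniature of an `===`-style assertion (ScalaTest's real `===` comes from `org.scalatest.Assertions` and is more sophisticated; `assertEq` is an invented name used only for this sketch). Unlike a bare `assert(a == b)`, its failure message carries the actual and expected values:

```scala
// Hypothetical helper mimicking the style of ScalaTest's `===` failure
// messages; NOT the actual ScalaTest implementation.
def assertEq[A](actual: A, expected: A): Unit =
  if (actual != expected)
    throw new AssertionError(s"$actual did not equal $expected")

assertEq(Seq(1, 2, 3).sum, 6)   // passes silently
// assertEq(Seq(1, 2, 3).sum, 7) would fail with "6 did not equal 7"
```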
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3731#issuecomment-67454633

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24575/ Test PASSed.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
GitHub user adrian-wang opened a pull request: https://github.com/apache/spark/pull/3732

[SPARK-4508] [SQL] build native date type to conform behavior to Hive

Store daysSinceEpoch as an Int value (4 bytes) to represent DateType in the Catalyst row, instead of using java.sql.Date (8 bytes, as a Long). This ensures the same comparison behavior between Hive and Catalyst. Subsumes #3381.

I think there are already some tests in JavaSQLSuite, and for Python this will not affect the datetime class.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adrian-wang/spark datenative

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3732.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3732

commit b9fbdb53b36b7635e5d3bd6c73c4318665dfb41c
Author: Daoyuan Wang daoyuan.w...@intel.com
Date: 2014-11-20T05:02:52Z

    spark native date type

commit 9110ef072bc1aaf4c43bf307ae4e438130863661
Author: Daoyuan Wang daoyuan.w...@intel.com
Date: 2014-11-20T07:17:26Z

    api change

commit 2e167a4a8e71e6827d320106fa5727434100958c
Author: Daoyuan Wang daoyuan.w...@intel.com
Date: 2014-11-20T07:18:17Z

    remove outdated files
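The core idea of the PR, representing a date as an `Int` of days since the Unix epoch instead of an 8-byte `java.sql.Date`, can be sketched roughly as follows (the helper names `toDays`/`fromDays` are illustrative, not the PR's actual API):

```scala
import java.sql.Date
import java.time.LocalDate

// Illustrative helpers (not the PR's actual code): a date stored as an Int
// of days since 1970-01-01 compares with plain Int ordering, matching Hive.
def toDays(d: Date): Int = d.toLocalDate.toEpochDay.toInt
def fromDays(days: Int): Date = Date.valueOf(LocalDate.ofEpochDay(days.toLong))

val d1 = toDays(Date.valueOf("2014-11-20"))
val d2 = toDays(Date.valueOf("2014-12-18"))
assert(d1 < d2)                                      // comparison is just Int comparison
assert(fromDays(d2) == Date.valueOf("2014-12-18"))   // round-trips losslessly
```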
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3731#issuecomment-67454627

[Test build #24575 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24575/consoleFull) for PR 3731 at commit [`b9843f2`](https://github.com/apache/spark/commit/b9843f2c673f30c5111f2a2a29e15dcde00042db).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67454747

[Test build #24576 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24576/consoleFull) for PR 3732 at commit [`2e167a4`](https://github.com/apache/spark/commit/2e167a4a8e71e6827d320106fa5727434100958c).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67454924

In general this LGTM except for some minor styling comments, thanks!
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67455049

[Test build #24576 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24576/consoleFull) for PR 3732 at commit [`2e167a4`](https://github.com/apache/spark/commit/2e167a4a8e71e6827d320106fa5727434100958c).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `final class Date extends Ordered[Date] with Serializable `
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67455053

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24576/ Test FAILed.
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user ilganeli commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67455654

Hi @JoshRosen, I have made the updates we discussed. Example output is shown below for two cases of unserializable RDDs: the first is when an individual RDD is unserializable, the second is when a nested RDD dependency is unserializable.

Case 1:

```scala
val unserializableRdd = new MyRDD(sc, 1, Nil) {
  class UnserializableClass
  val unserializable = new UnserializableClass
}
val trace: Array[SerializedRef] = scheduler.tryToSerializeRdd(unserializableRdd)
```

Depth 0: DAGSchedulerSuiteRDD 0 - Failed to serialize parent.

Un-serializable reference trace for DAGSchedulerSuiteRDD 0:
**
DAGSchedulerSuiteRDD 0:
--- Ref (class scala.Tuple2, Hash: 1240412896)
--- Ref (org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2$$anon$1$UnserializableClass@315df4bb, Hash: 828241083)
--- Ref (StorageLevel(false, false, false, false, 1), Hash: 1370224403)
--- Ref (None, Hash: 1353759820)
--- Ref (List(), Hash: 1599566873)
--- Ref (scala.Tuple2, Hash: 254955665)
--- Ref (DAGSchedulerSuiteRDD 0, Hash: 279781579)
**
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2$$anon$1$UnserializableClass@315df4bb:
--- Ref (class scala.Tuple2, Hash: 1240412896)
--- Ref (StorageLevel(false, false, false, false, 1), Hash: 1370224403)
--- Ref (None, Hash: 1353759820)
--- Ref (List(), Hash: 1599566873)
--- Ref (scala.Tuple2, Hash: 254955665)
--- Ref (DAGSchedulerSuiteRDD 0, Hash: 279781579)
--- Ref (org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2$$anon$1$UnserializableClass@315df4bb, Hash: 828241083)
**

Case 2:

```scala
val baseRdd = new MyRDD(sc, 1, Nil)
val midRdd = new MyRDD(sc, 1, List(new OneToOneDependency(baseRdd)))
val finalRdd = new MyRDD(sc, 1, List(new OneToOneDependency(midRdd))) {
  class UnserializableClass
  val unserializable = new UnserializableClass
}
val trace: Array[SerializedRef] = scheduler.tryToSerializeRdd(finalRdd)
```

Depth 0: DAGSchedulerSuiteRDD 2 - Failed to serialize parent.
Depth 1: DAGSchedulerSuiteRDD 1 - Success
Depth 2: DAGSchedulerSuiteRDD 0 - Success

Un-serializable reference trace for DAGSchedulerSuiteRDD 2:
**
DAGSchedulerSuiteRDD 2:
--- Ref (DAGSchedulerSuiteRDD 0, Hash: 1968196847)
--- Ref (org.apache.spark.OneToOneDependency@29d37757, Hash: 701724503)
--- Ref (List(org.apache.spark.OneToOneDependency@29d37757), Hash: 1255445356)
--- Ref (DAGSchedulerSuiteRDD 1, Hash: 1787987889)
--- Ref (org.apache.spark.OneToOneDependency@3e598df9, Hash: 1046056441)
--- Ref (class scala.Tuple2, Hash: 1240412896)
--- Ref (StorageLevel(false, false, false, false, 1), Hash: 1370224403)
--- Ref (None, Hash: 1353759820)
--- Ref (org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$3$$anon$2$UnserializableClass@3887cf88, Hash: 948424584)
--- Ref (List(org.apache.spark.OneToOneDependency@3e598df9), Hash: 1618683794)
--- Ref (List(), Hash: 1599566873)
--- Ref (scala.Tuple2, Hash: 550572371)
--- Ref (DAGSchedulerSuiteRDD 2, Hash: 1726715997)
**
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$3$$anon$2$UnserializableClass@3887cf88:
--- Ref (DAGSchedulerSuiteRDD 0, Hash: 1968196847)
--- Ref (org.apache.spark.OneToOneDependency@29d37757, Hash: 701724503)
--- Ref (List(org.apache.spark.OneToOneDependency@29d37757), Hash: 1255445356)
--- Ref (DAGSchedulerSuiteRDD 1, Hash: 1787987889)
--- Ref (org.apache.spark.OneToOneDependency@3e598df9, Hash: 1046056441)
--- Ref (class scala.Tuple2, Hash: 1240412896)
--- Ref (StorageLevel(false, false, false, false, 1), Hash: 1370224403)
--- Ref (None, Hash: 1353759820)
--- Ref (List(org.apache.spark.OneToOneDependency@3e598df9), Hash: 1618683794)
--- Ref (List(), Hash: 1599566873)
--- Ref (scala.Tuple2, Hash: 550572371)
--- Ref (DAGSchedulerSuiteRDD 2, Hash: 1726715997)
--- Ref (org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$3$$anon$2$UnserializableClass@3887cf88, Hash: 948424584)
**
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67455620

[Test build #24577 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24577/consoleFull) for PR 3518 at commit [`bb5f700`](https://github.com/apache/spark/commit/bb5f700363dc577b84414e25caedafeb7c247de6).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4881] Use SparkConf#getBoolean instead ...
GitHub user sarutak opened a pull request: https://github.com/apache/spark/pull/3733

[SPARK-4881] Use SparkConf#getBoolean instead of get().toBoolean

It's really a minor issue. In ApplicationMaster, there is code like the following:

    val preserveFiles = sparkConf.get("spark.yarn.preserve.staging.files", "false").toBoolean

I think the code can be simplified as follows:

    val preserveFiles = sparkConf.getBoolean("spark.yarn.preserve.staging.files", false)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sarutak/spark SPARK-4881

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3733.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3733

commit c63daa0dd76551fc1fea9b0dcfc91f9a73ee2948
Author: Kousuke Saruta saru...@oss.nttdata.co.jp
Date: 2014-12-18T08:33:04Z

    Simplified code
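The difference can be sketched with a toy config class (`Conf` is purely illustrative and is not SparkConf's actual implementation):

```scala
// Toy stand-in for SparkConf, only to illustrate why a typed getter reads
// better than get(...).toBoolean. NOT Spark's actual implementation.
class Conf(settings: Map[String, String]) {
  def get(key: String, default: String): String = settings.getOrElse(key, default)
  // typed getter: the default is supplied as a Boolean, parsing happens once here
  def getBoolean(key: String, default: Boolean): Boolean =
    settings.get(key).map(_.toBoolean).getOrElse(default)
}

val conf = new Conf(Map("spark.yarn.preserve.staging.files" -> "true"))
// verbose form: string default, then an explicit parse at every call site
val a = conf.get("spark.yarn.preserve.staging.files", "false").toBoolean
// simplified form: no parse at the call site
val b = conf.getBoolean("spark.yarn.preserve.staging.files", false)
assert(a == b)
```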
[GitHub] spark pull request: [SPARK-4881] Use SparkConf#getBoolean instead ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3733#issuecomment-67456060

[Test build #24578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24578/consoleFull) for PR 3733 at commit [`c63daa0`](https://github.com/apache/spark/commit/c63daa0dd76551fc1fea9b0dcfc91f9a73ee2948).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4871][SQL] Show sql statement in spark ...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/3718#issuecomment-67456216

Retest this please.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67456529

[Test build #24581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24581/consoleFull) for PR 3556 at commit [`37cfdf5`](https://github.com/apache/spark/commit/37cfdf5effe0de72a86974b65a6ddff87debfffa).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67456526

[Test build #24579 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24579/consoleFull) for PR 3732 at commit [`6ef2b1f`](https://github.com/apache/spark/commit/6ef2b1f1ba89c0ef522720118269a5ba168d1f5c).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67456774

[Test build #24581 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24581/consoleFull) for PR 3556 at commit [`37cfdf5`](https://github.com/apache/spark/commit/37cfdf5effe0de72a86974b65a6ddff87debfffa).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67456777

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24581/ Test FAILed.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67457397

[Test build #24582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24582/consoleFull) for PR 3556 at commit [`620ebe3`](https://github.com/apache/spark/commit/620ebe3df79fce1c8dbdea971eea99971af5b9d9).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67457906

[Test build #24583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24583/consoleFull) for PR 3555 at commit [`1893956`](https://github.com/apache/spark/commit/189395672255daad5eb2cdbd5b51a5948338f9f5).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4861][SQL] Refactory command in spark s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3712#issuecomment-67458887

[Test build #24584 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24584/consoleFull) for PR 3712 at commit [`51a82f2`](https://github.com/apache/spark/commit/51a82f2ae3fe9d28455940d953de7b76306f49b2).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67459431

[Test build #24585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24585/consoleFull) for PR 2956 at commit [`a473241`](https://github.com/apache/spark/commit/a47324118358802fcc6821e77ead77fd37003904).
* This patch **does not merge cleanly**.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user tolgap commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-67461517

@bgreeven I have cloned your branch and am trying to run the MNIST dataset. I can't quite understand how to set the number of output neurons though. The `topology` array seems to only apply to the hidden layers. I have seen some tests of MNIST on your code though, so I was curious how this was done?
[GitHub] spark pull request: [SPARK-4883][Shuffle] Add a name to the direct...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/3734 [SPARK-4883][Shuffle] Add a name to the directoryCleaner thread You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-4883 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3734.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3734 commit 71156d630dfc682caf9069b529b911d301827985 Author: zsxwing zsxw...@gmail.com Date: 2014-12-18T09:21:01Z Add a name to the directoryCleaner thread
[GitHub] spark pull request: [SPARK-4883][Shuffle] Add a name to the direct...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3734#issuecomment-67462082 [Test build #24586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24586/consoleFull) for PR 3734 at commit [`71156d6`](https://github.com/apache/spark/commit/71156d630dfc682caf9069b529b911d301827985). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67464059 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24579/ Test PASSed.
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3732#issuecomment-67464054 [Test build #24579 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24579/consoleFull) for PR 3732 at commit [`6ef2b1f`](https://github.com/apache/spark/commit/6ef2b1f1ba89c0ef522720118269a5ba168d1f5c). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `final class Date extends Ordered[Date] with Serializable `
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67464102 [Test build #24577 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24577/consoleFull) for PR 3518 at commit [`bb5f700`](https://github.com/apache/spark/commit/bb5f700363dc577b84414e25caedafeb7c247de6). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class EdgeRef(cur : AnyRef, parent : EdgeRef) `
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67464113 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24577/ Test PASSed.
[GitHub] spark pull request: [SPARK-4881] Use SparkConf#getBoolean instead ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3733#issuecomment-67464681 [Test build #24578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24578/consoleFull) for PR 3733 at commit [`c63daa0`](https://github.com/apache/spark/commit/c63daa0dd76551fc1fea9b0dcfc91f9a73ee2948). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4881] Use SparkConf#getBoolean instead ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3733#issuecomment-67464689 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24578/ Test PASSed.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67464932 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24582/ Test PASSed.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67464925 [Test build #24582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24582/consoleFull) for PR 3556 at commit [`620ebe3`](https://github.com/apache/spark/commit/620ebe3df79fce1c8dbdea971eea99971af5b9d9). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67465486 [Test build #24583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24583/consoleFull) for PR 3555 at commit [`1893956`](https://github.com/apache/spark/commit/189395672255daad5eb2cdbd5b51a5948338f9f5). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67465495 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24583/ Test PASSed.
[GitHub] spark pull request: [SPARK-4861][SQL] Refactory command in spark s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3712#issuecomment-67466375 [Test build #24584 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24584/consoleFull) for PR 3712 at commit [`51a82f2`](https://github.com/apache/spark/commit/51a82f2ae3fe9d28455940d953de7b76306f49b2). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4861][SQL] Refactory command in spark s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3712#issuecomment-67466379 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24584/ Test PASSed.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67466905 [Test build #24587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24587/consoleFull) for PR 3222 at commit [`03a180f`](https://github.com/apache/spark/commit/03a180f66927c41a737bd8706caa6c4686606252). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-2554][SQL] Supporting SumDistinct parti...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3348#issuecomment-67467429 [Test build #24588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24588/consoleFull) for PR 3348 at commit [`fd28e4d`](https://github.com/apache/spark/commit/fd28e4d9e807e677a29451ee361ff040927ffc02). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67467841 Yes, just talking about oversampling now. In 1, if you mean ceil(rdd.count / numBins) then yes that's basically what I've got now. You won't quite get numBins back, yes. The spacing _will_ be even -- except at partition boundaries. You can push around a few points there to amortize the uneven space. I don't even think you need oversampling for that. I'm suggesting 1 as well. I feel like I sound lazy, but, this is a context where approximation is entirely fine since the purpose is, say, exporting something you could plot in a picture, and because the error is so relatively modest in realistic use cases. It doesn't seem worth the complexity or processing. Maybe I should just document the couple caveats here?
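For readers following the thread, the even-stride approximation srowen describes can be sketched in Scala as follows. This is illustrative only, not the PR's actual code; `approxBoundaries` is a hypothetical name, and the real implementation works per-partition on an RDD rather than on a local array:

```scala
// Sketch of the approximation under discussion: after sorting, take every
// ceil(n / numBins)-th element as a bin boundary. Spacing is even; in the
// distributed version a few points near partition boundaries may need to be
// pushed around, as noted above, and you may get slightly fewer than
// numBins boundaries back.
def approxBoundaries(sorted: Array[Double], numBins: Int): Array[Double] = {
  require(numBins > 0 && sorted.nonEmpty, "need data and at least one bin")
  val stride = math.ceil(sorted.length.toDouble / numBins).toInt
  sorted.indices.collect { case i if i % stride == 0 => sorted(i) }.toArray
}
```

For example, with 10 sorted points and numBins = 3 the stride is 4, so indices 0, 4, and 8 are selected -- exactly the "you won't quite get numBins back" behavior mentioned above.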
[GitHub] spark pull request: [SPARK-4871][SQL] Show sql statement in spark ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3718#issuecomment-67468412 [Test build #24580 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24580/consoleFull) for PR 3718 at commit [`4d2038a`](https://github.com/apache/spark/commit/4d2038a3d9727a1ba38c5efba2b01f8faaf65ce8). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4871][SQL] Show sql statement in spark ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3718#issuecomment-67468425 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24580/ Test PASSed.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67468461 [Test build #24585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24585/consoleFull) for PR 2956 at commit [`a473241`](https://github.com/apache/spark/commit/a47324118358802fcc6821e77ead77fd37003904). * This patch **fails PySpark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67468465 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24585/ Test FAILed.
[GitHub] spark pull request: [SPARK-4883][Shuffle] Add a name to the direct...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3734#issuecomment-67471103 [Test build #24586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24586/consoleFull) for PR 3734 at commit [`71156d6`](https://github.com/apache/spark/commit/71156d630dfc682caf9069b529b911d301827985). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait ParquetTest ` * `protected class CaseInsensitiveMap(map: Map[String, String]) extends Map[String, String] ` * ` class HiveThriftServer2Listener(val server: HiveServer2) extends SparkListener `
[GitHub] spark pull request: [SPARK-4883][Shuffle] Add a name to the direct...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3734#issuecomment-67471113 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24586/ Test PASSed.
[GitHub] spark pull request: [SPARK-4693] [SQL] PruningPredicates may be wr...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3556#issuecomment-67472596 @marmbrus Please review again. Thanks.
[GitHub] spark pull request: [SPARK-4692] [SQL] Support ! boolean logic ope...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/3555#issuecomment-67473028 @marmbrus Please review again. Thanks.
[GitHub] spark pull request: [SPARK-2554][SQL] Supporting SumDistinct parti...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3348#issuecomment-67474669 [Test build #24588 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24588/consoleFull) for PR 3348 at commit [`fd28e4d`](https://github.com/apache/spark/commit/fd28e4d9e807e677a29451ee361ff040927ffc02). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2554][SQL] Supporting SumDistinct parti...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3348#issuecomment-67474674 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24588/ Test PASSed.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user Lewuathe commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-67475009 I agree with @jkbradley. For now, do not expose an optimizer parameter; only allow one (LBFGS?). Changing the scope of each API should be done carefully. In this case it is a tradeoff between exposing the optimizers publicly and the usability of ANN. ANN currently seems to require LBFGS, so making only it public is the reasonable way.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67475430 [Test build #24587 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24587/consoleFull) for PR 3222 at commit [`03a180f`](https://github.com/apache/spark/commit/03a180f66927c41a737bd8706caa6c4686606252). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AdaGradUpdater(` * `class DBN(val stackedRBM: StackedRBM, val nn: MLP)` * `class MLP(` * `class MomentumUpdater(val momentum: Double) extends Updater ` * `class RBM(` * `class StackedRBM(val innerRBMs: Array[RBM])` * `case class MinstItem(label: Int, data: Array[Int]) ` * `class MinstDatasetReader(labelsFile: String, imagesFile: String)`
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67475437 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24587/ Test PASSed.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user liyezhang556520 commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67483754 jenkins, retest this please
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67484171 [Test build #24589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24589/consoleFull) for PR 2956 at commit [`a473241`](https://github.com/apache/spark/commit/a47324118358802fcc6821e77ead77fd37003904). * This patch **does not merge cleanly**.
[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs
Github user tgaloppo commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-67486366 Ok, I have addressed (I think) all of those issues, with the exception of modifying GaussianMixtureModel to carry instances of MultivariateGaussian. I do like that idea, but think it would be best to create a new issue around solidifying MultivariateGaussian, then revisit this modification. I'd be more than happy to work on the PR for making MultivariateGaussian public.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67491067 [Test build #24590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24590/consoleFull) for PR 3429 at commit [`9f0aff3`](https://github.com/apache/spark/commit/9f0aff33e862746d3d295a9dbf2629665d80cc22). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67491128 Thank you @liancheng, I've updated the code per your feedback. @marmbrus I think this PR is ready to be merged once Jenkins agrees.
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22044330 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ the same RDDs. For example, the [Shark](http://shark.cs.berkeley.edu) JDBC serve queries. In future releases, in-memory storage systems such as [Tachyon](http://tachyon-project.org) will provide another approach to share RDDs. +## Dynamic Resource Allocation + +Spark 1.2 introduces the ability to dynamically scale the set of cluster resources allocated to +your application up and down based on the workload. This means that your application may give +resources back to the cluster if they are no longer used and request them again later when there +is demand. This feature is particularly useful if multiple applications share resources in your +Spark cluster. If a subset of the resources allocated to an application becomes idle, it can be +returned to the cluster's pool of resources and acquired by other applications. In Spark, dynamic +resource allocation is performed on the granularity of the executor and can be enabled through +`spark.dynamicAllocation.enabled`. + +This feature is currently disabled by default and available only on [YARN](running-on-yarn.html). +A future release will extend this to [standalone mode](spark-standalone.html) and +[Mesos coarse-grained mode](running-on-mesos.html#mesos-run-modes). Note that although Spark on +Mesos already has a similar notion of dynamic resource sharing in fine-grained mode, enabling +dynamic allocation allows your Mesos application to take advantage of coarse-grained low-latency +scheduling while sharing cluster resources efficiently. + +Lastly, it is worth noting that Spark's dynamic resource allocation mechanism is cooperative. +This means if a Spark application enables this feature, other applications on the same cluster +are also expected to do so. 
Otherwise, the cluster's resources will end up being unfairly +distributed to the applications that do not voluntarily give up unused resources they have +acquired. + +### Configuration and Setup + +All configurations used by this feature live under the `spark.dynamicAllocation.*` namespace. +To enable this feature, your application must set `spark.dynamicAllocation.enabled` to `true` and +provide lower and upper bounds for the number of executors through +`spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`. Other relevant +configurations are described on the [configurations page](configuration.html#dynamic-allocation) +and in the subsequent sections in detail. + +Additionally, your application must use an external shuffle service (described below). To enable --- End diff -- It would be nice to add a short clause explaining why this is the case
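The settings named in the quoted documentation can be collected into a `spark-defaults.conf` fragment; the property names come straight from the text above, while the executor bounds are purely illustrative values, not recommendations:

```properties
# Minimal sketch of a Spark 1.2 dynamic-allocation setup (YARN only).
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   2
spark.dynamicAllocation.maxExecutors   20
# Required companion setting: the external shuffle service (see setup steps).
spark.shuffle.service.enabled          true
```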
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22044723 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ +Lastly, it is worth noting that Spark's dynamic resource allocation mechanism is cooperative. --- End diff -- I would possibly rephrase or leave this paragraph out, as there are situations where different `dynamicAllocation.enabled` settings for different applications are reasonable. E.g. a cluster might have some production applications that need a static allocation to cache data and respond to queries as fast as possible, while others might be interactive and have highly varying resource use. YARN is meant to take care of the fairness aspect.
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22044849 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ +Additionally, your application must use an external shuffle service (described below). To enable +this, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is +implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` --- End diff -- Should this be broken out into a separate section for users that don't care about dynamic allocation, but want to learn how to use the external shuffle service?
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67493934 How do you compare accuracy? Perplexity means nothing but perplexity -- topic models may be reliably compared only via an application task (e.g. classification, recommendation...). Should I add the dataset for the perplexity sanity check to the repo? I am about to use 1000 arXiv papers. This dataset is about 20 MB (5.5 MB zipped).
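For context on the metric being debated here: perplexity is conventionally computed from held-out per-token log-likelihood. A minimal sketch (the per-token log-probabilities below are hypothetical, not from any real topic model):

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(-mean per-token log-likelihood)."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Hypothetical per-token log-probabilities assigned by a model
# to a held-out document; lower perplexity = better fit.
log_probs = [math.log(0.01), math.log(0.02), math.log(0.005)]
print(perplexity(log_probs))
```

This illustrates the point in the comment: perplexity measures only how well the model predicts held-out tokens, which need not track usefulness on a downstream task.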
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22044912 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ +this, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is +implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` +in your cluster. To start this service, follow these steps: + +1. Build Spark with the [YARN profile](building-spark.html). Skip this step if you are using a +pre-packaged distribution. +2. Locate the `spark-version-yarn-shuffle.jar`. This should be under +`$SPARK_HOME/network/yarn/target/scala-version` if you are building Spark yourself, and under +`lib` if you are using a distribution. +2. Add this jar to the classpath of all `NodeManager`s in your cluster. +3. In the `yarn-site.xml` on each node, add `spark_shuffle` to `yarn.nodemanager.aux-services`, +then set `yarn.nodemanager.aux-services.spark_shuffle.class` to +`org.apache.spark.yarn.network.YarnShuffleService`. Additionally, set all relevant +`spark.shuffle.service.*` [configurations](configuration.html). +4. Restart all `NodeManager`s in your cluster. + +### Resource Allocation Policy + +On a high level, Spark should relinquish executors when they are no longer used and acquire --- End diff -- Nit: I think this should be "At a high level" or "From a high level"
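The `yarn-site.xml` change described in the setup steps above would look roughly like this. A sketch, not an authoritative config: in particular, the `mapreduce_shuffle` entry is an assumption about a typical cluster, and any aux-services already configured on your NodeManagers must be kept in the list alongside `spark_shuffle`:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.yarn.network.YarnShuffleService</value>
  </property>
</configuration>
```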
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22045044 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ +### Resource Allocation Policy + +On a high level, Spark should relinquish executors when they are no longer used and acquire +executors when they are needed. Since there is no definitive way to predict whether an executor +that is about to be removed will run a task in the near future, or whether a new executor that is +about to be added will actually be idle, we need a set of heuristics to determine when to remove +and request executors. + +#### Request Policy + +A Spark application with dynamic allocation enabled requests additional executors when it has +pending tasks waiting to be scheduled. This condition necessarily implies that the existing set +of executors is insufficient to simultaneously saturate all tasks that have been submitted but +not yet finished. + +Spark requests executors in rounds. The actual request is triggered when there have been pending +tasks for
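The round-based request policy sketched in the quoted docs ramps up exponentially: each round asks for roughly double the executors of the previous round, capped by demand and the configured maximum. A toy model of that idea, not the real `ExecutorAllocationManager` code:

```python
def executors_requested_per_round(pending_tasks, max_executors, rounds):
    """Toy model of round-based executor requests: each round doubles the
    number of executors added, capped by outstanding demand and by the
    configured maximum (spark.dynamicAllocation.maxExecutors)."""
    total, add, requests = 0, 1, []
    for _ in range(rounds):
        if total >= min(pending_tasks, max_executors):
            break  # demand satisfied or cap reached
        add_now = min(add, max_executors - total, pending_tasks - total)
        total += add_now
        requests.append(add_now)
        add *= 2  # exponential ramp-up across rounds
    return requests

print(executors_requested_per_round(pending_tasks=100, max_executors=10, rounds=10))
# -> [1, 2, 4, 3]
```

The final round requests only 3 executors because the cap of 10 binds before the doubled amount (8) can be granted.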
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3731#issuecomment-67494432 Super nice to have documentation at this level of detail. This mostly looks good; I left a few comments.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67494552 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24589/ Test FAILed.
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67494546 [Test build #24589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24589/consoleFull) for PR 2956 at commit [`a473241`](https://github.com/apache/spark/commit/a47324118358802fcc6821e77ead77fd37003904). * This patch **fails PySpark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4884]: Improve Partition docs
Github user msiddalingaiah commented on the pull request: https://github.com/apache/spark/pull/3722#issuecomment-67496889 @ash211 Not a problem, I created a JIRA ticket and updated the title/description. Thanks!!
[GitHub] spark pull request: [SPARK-4409][MLlib] Additional Linear Algebra ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3319#issuecomment-67499071 [Test build #24591 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24591/consoleFull) for PR 3319 at commit [`75239f8`](https://github.com/apache/spark/commit/75239f8e5b41a275a0f232108b26cb0e16935bbf). * This patch merges cleanly.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67501444 [Test build #24592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24592/consoleFull) for PR 3222 at commit [`164d5b7`](https://github.com/apache/spark/commit/164d5b74aae31683e6b69d8b0e23f77b25e7d99f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67501660 [Test build #24590 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24590/consoleFull) for PR 3429 at commit [`9f0aff3`](https://github.com/apache/spark/commit/9f0aff33e862746d3d295a9dbf2629665d80cc22). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4573] [SQL] Add SettableStructObjectIns...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3429#issuecomment-67501667 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24590/ Test PASSed.
[GitHub] spark pull request: Added setMinCount to Word2Vec.scala
Github user ganonp commented on the pull request: https://github.com/apache/spark/pull/3693#issuecomment-67502378 Oh wow, I just didn't see that the function and everything inside was lining up... Hurts to look at. Thanks for those links and your patience. Spark now makes up about 70% of my job, so I'll definitely be contributing more.
[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/3409#issuecomment-67507324 Looks good.
[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3409
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22050878 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala --- @@ -39,23 +39,37 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) var appName: String = "Spark" var priority = 0 - // Additional memory to allocate to containers - // For now, use driver's memory overhead as our AM container's memory overhead - val amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead", -math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN)) - - val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead", -math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)) - private val isDynamicAllocationEnabled = sparkConf.getBoolean("spark.dynamicAllocation.enabled", false) parseArgs(args.toList) + + val isClusterMode = userClass != null --- End diff -- ClientBase has this same check (`private val isLaunchingDriver = args.userClass != null`). Perhaps we should just make this accessible in the ClientBaseArguments so that ClientBase can just read it from here.
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22051045 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala --- @@ -39,23 +39,37 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) var appName: String = "Spark" var priority = 0 - // Additional memory to allocate to containers - // For now, use driver's memory overhead as our AM container's memory overhead - val amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead", -math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN)) - - val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead", -math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)) - private val isDynamicAllocationEnabled = sparkConf.getBoolean("spark.dynamicAllocation.enabled", false) parseArgs(args.toList) + + val isClusterMode = userClass != null + loadEnvironmentArgs() validateArgs() + // Additional memory to allocate to containers. In different modes, we use different configs. + val amMemoryOverheadConf = if (isClusterMode) { +"spark.yarn.driver.memoryOverhead" + } else { +"spark.yarn.am.memoryOverhead" + } + val amMemoryOverhead = sparkConf.getInt(amMemoryOverheadConf, +math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN)) + + val executorMemoryOverhead = sparkConf.getInt("spark.yarn.executor.memoryOverhead", +math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)) + /** Load any default arguments provided through environment variables and Spark properties. */ private def loadEnvironmentArgs(): Unit = { +// In cluster mode, the driver and the AM live in the same JVM, so this does not apply +if (!isClusterMode) { + amMemory = Utils.memoryStringToMb(sparkConf.get("spark.yarn.am.memory", "512m")) +} else { + println("spark.yarn.am.memory is set but does not apply in cluster mode, + --- End diff -- We might as well make it consistent and add a warning about spark.yarn.am.memoryOverhead being set in cluster mode as well.
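The consistency the reviewer asks for could be sketched roughly as follows; this is a hypothetical illustration only, with made-up method and message names, not the actual Spark implementation:

```scala
// Hypothetical sketch: warn uniformly about client-mode-only AM settings
// that are silently ignored in cluster mode. Assumes this lives inside
// ClientArguments, where `isClusterMode` and `sparkConf` are in scope.
private def warnIgnoredClientModeConfigs(): Unit = {
  if (isClusterMode) {
    // Check both the AM memory and AM memory overhead keys the same way
    Seq("spark.yarn.am.memory", "spark.yarn.am.memoryOverhead").foreach { key =>
      if (sparkConf.contains(key)) {
        println(s"$key is set but does not apply in cluster mode.")
      }
    }
  }
}
```

This keeps both warnings in one place, so adding another client-mode-only key later is a one-line change.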
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22051552 --- Diff: docs/running-on-yarn.md --- @@ -22,6 +22,14 @@ Most of the configs are the same for Spark on YARN as for other deployment modes <table class="table"> <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> <tr> + <td><code>spark.yarn.am.memory</code></td> + <td>512m</td> + <td> Amount of memory to use for the Yarn ApplicationMaster in client mode. In cluster mode, use `spark.driver.memory` instead. --- End diff -- Would be nice to specify the format like the spark.executor.memory docs do: "in the same format as JVM memory strings (e.g. 512m, 2g)".
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/3607#discussion_r22051716 --- Diff: docs/running-on-yarn.md --- @@ -92,6 +100,13 @@ Most of the configs are the same for Spark on YARN as for other deployment modes </td> </tr> <tr> + <td><code>spark.yarn.am.memoryOverhead</code></td> + <td>AM memory * 0.07, with minimum of 384</td> --- End diff -- Would be nice to add a comment to spark.yarn.driver.memoryOverhead saying it applies in cluster mode. This config is a bit different from the others, as the memory overhead is purely a YARN thing and doesn't apply in other modes; i.e. there is no existing spark.driver.memoryOverhead. We could potentially just use one config for this. I'm not sure whether that would be more confusing or not, though... @sryza @vanzin @andrewor14 thoughts on that?
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67512122 test this please
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67512322 [Test build #24593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24593/consoleFull) for PR 3471 at commit [`20b9887`](https://github.com/apache/spark/commit/20b9887bb9529f2792123778e6eeca6ba0e51c37). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67512174 This looks good. Kicked Jenkins to run again since the last run was a while ago.
[GitHub] spark pull request: [SPARK-4409][MLlib] Additional Linear Algebra ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3319#issuecomment-67513626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24591/
[GitHub] spark pull request: [SPARK-4409][MLlib] Additional Linear Algebra ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3319#issuecomment-67513619 [Test build #24591 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24591/consoleFull) for PR 3319 at commit [`75239f8`](https://github.com/apache/spark/commit/75239f8e5b41a275a0f232108b26cb0e16935bbf). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67516313 [Test build #24592 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24592/consoleFull) for PR 3222 at commit [`164d5b7`](https://github.com/apache/spark/commit/164d5b74aae31683e6b69d8b0e23f77b25e7d99f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AdaGradUpdater(` * `class DBN(val stackedRBM: StackedRBM, val nn: MLP)` * `class MLP(` * `class MomentumUpdater(val momentum: Double) extends Updater ` * `class RBM(` * `class StackedRBM(val innerRBMs: Array[RBM])` * `case class MinstItem(label: Int, data: Array[Int]) ` * `class MinstDatasetReader(labelsFile: String, imagesFile: String)`
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-67516326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24592/
[GitHub] spark pull request: Spark 3883: SSL support for HttpServer and Akk...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/3571#issuecomment-67518861 @vanzin I made some changes, but I'm not sure about using the Spark configuration in this case; at least it may not be so clear. I mean such cases as running executors: `CoarseGrainedExecutorBackend` needs the SSL configuration to connect to the driver and fetch the real application configuration. In other words, it doesn't have any information about the configuration and it doesn't load the properties file. I suppose the same problem would apply to `DriverWrapper`, which is used when the driver is run by the worker.
[GitHub] spark pull request: Spark 3883: SSL support for HttpServer and Akk...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/3571#issuecomment-67519339 It could work if `DriverWrapper` and `CoarseGrainedExecutorBackend` loaded the daemon's configuration file.
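For context, the kind of settings under discussion here is the `spark.ssl.*` namespace that this PR (SPARK-3883) introduced; a purely illustrative fragment, with made-up paths and password:

```
# Illustrative spark-defaults.conf fragment; keys are from the spark.ssl.*
# namespace this PR proposes, values are placeholders.
spark.ssl.enabled            true
spark.ssl.keyStore           /path/to/keystore.jks
spark.ssl.keyStorePassword   changeit
spark.ssl.trustStore         /path/to/truststore.jks
```

The point of the comment above is that processes like `CoarseGrainedExecutorBackend` need these values before they can talk to the driver, so they cannot rely on fetching them over the (not yet secured) connection.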
[GitHub] spark pull request: [SPARK-4884]: Improve Partition docs
Github user ash211 commented on the pull request: https://github.com/apache/spark/pull/3722#issuecomment-67519651 +1
[GitHub] spark pull request: [SPARK-4140] Document dynamic allocation
Github user oza commented on a diff in the pull request: https://github.com/apache/spark/pull/3731#discussion_r22055469 --- Diff: docs/job-scheduling.md --- @@ -56,6 +56,112 @@ the same RDDs. For example, the [Shark](http://shark.cs.berkeley.edu) JDBC server serves queries. In future releases, in-memory storage systems such as [Tachyon](http://tachyon-project.org) will provide another approach to share RDDs. +## Dynamic Resource Allocation + +Spark 1.2 introduces the ability to dynamically scale the set of cluster resources allocated to +your application up and down based on the workload. This means that your application may give +resources back to the cluster if they are no longer used and request them again later when there +is demand. This feature is particularly useful if multiple applications share resources in your +Spark cluster. If a subset of the resources allocated to an application becomes idle, it can be +returned to the cluster's pool of resources and acquired by other applications. In Spark, dynamic +resource allocation is performed on the granularity of the executor and can be enabled through +`spark.dynamicAllocation.enabled`. + +This feature is currently disabled by default and available only on [YARN](running-on-yarn.html). +A future release will extend this to [standalone mode](spark-standalone.html) and +[Mesos coarse-grained mode](running-on-mesos.html#mesos-run-modes). Note that although Spark on +Mesos already has a similar notion of dynamic resource sharing in fine-grained mode, enabling +dynamic allocation allows your Mesos application to take advantage of coarse-grained low-latency +scheduling while sharing cluster resources efficiently. + +Lastly, it is worth noting that Spark's dynamic resource allocation mechanism is cooperative. +This means if a Spark application enables this feature, other applications on the same cluster +are also expected to do so. Otherwise, the cluster's resources will end up being unfairly +distributed to the applications that do not voluntarily give up unused resources they have +acquired. + +### Configuration and Setup + +All configurations used by this feature live under the `spark.dynamicAllocation.*` namespace. +To enable this feature, your application must set `spark.dynamicAllocation.enabled` to `true` and +provide lower and upper bounds for the number of executors through +`spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`. Other relevant +configurations are described on the [configurations page](configuration.html#dynamic-allocation) +and in the subsequent sections in detail. + +Additionally, your application must use an external shuffle service (described below). To enable +this, set `spark.shuffle.service.enabled` to `true`. In YARN, this external shuffle service is +implemented in `org.apache.spark.yarn.network.YarnShuffleService` that runs in each `NodeManager` --- End diff -- +1 to adding how to use the external shuffle service, since we need to enable the external shuffle service to use dynamic allocation.
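Concretely, the minimal setup the quoted documentation describes amounts to a handful of properties; a sketch of a `spark-defaults.conf` fragment, with illustrative bounds:

```
# Enable dynamic allocation with lower/upper executor bounds (values are
# examples only), plus the external shuffle service it requires.
spark.dynamicAllocation.enabled       true
spark.dynamicAllocation.minExecutors  2
spark.dynamicAllocation.maxExecutors  20
spark.shuffle.service.enabled         true
```

The same keys can equally be passed as `--conf` flags to `spark-submit`.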
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67520645 This looks pretty neat; I'll try to review this soon (a little busy right now), but in the meantime you might be interested in #3638 which has some small overlap in the sense that both patches deal with handling of serialization errors; both patches address different issues, though. I'm inclined to merge #3638 first, since it's a bug fix and this is a feature, so that's likely to create a bunch of merge conflicts here. I'll let you know if I do that, and I might be able to help fix the conflicts myself by submitting a PR to your PR.
[GitHub] spark pull request: [SPARK-3694] RDD and Task serialization debugg...
Github user ilganeli commented on the pull request: https://github.com/apache/spark/pull/3518#issuecomment-67523669 Great - thanks, Josh. I'm working on doing a bit more code cleanup in the meantime to minimize touch points within the existing Spark classes.
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67525753 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24593/
[GitHub] spark pull request: SPARK-3779. yarn spark.yarn.applicationMaster....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3471#issuecomment-67525739 [Test build #24593 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24593/consoleFull) for PR 3471 at commit [`20b9887`](https://github.com/apache/spark/commit/20b9887bb9529f2792123778e6eeca6ba0e51c37). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22058276 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala --- @@ -0,0 +1,284 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix} +import breeze.linalg.Transpose + +import org.apache.spark.rdd.RDD +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors} +import org.apache.spark.mllib.stat.impl.MultivariateGaussian +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext} +import org.apache.spark.SparkContext.DoubleAccumulatorParam + +/** + * This class performs expectation maximization for multivariate Gaussian + * Mixture Models (GMMs). A GMM represents a composite distribution of + * independent Gaussian distributions with associated mixing weights + * specifying each's contribution to the composite. + * + * Given a set of sample points, this class will maximize the log-likelihood + * for a mixture of k Gaussians, iterating until the log-likelihood changes by + * less than convergenceTol, or until it has reached the max number of iterations. + * While this process is generally guaranteed to converge, it is not guaranteed + * to find a global optimum. + * + * @param k The number of independent Gaussians in the mixture model + * @param convergenceTol The maximum change in log-likelihood at which convergence + * is considered to have occurred. + * @param maxIterations The maximum number of iterations to perform + */ +class GaussianMixtureModelEM private ( +private var k: Int, +private var convergenceTol: Double, +private var maxIterations: Int) extends Serializable { + + // Type aliases for convenience + private type DenseDoubleVector = BreezeVector[Double] + private type DenseDoubleMatrix = BreezeMatrix[Double] + + private type ExpectationSum = ( +Array[Double], // log-likelihood in index 0 +Array[Double], // array of weights +Array[DenseDoubleVector], // array of means +Array[DenseDoubleMatrix]) // array of cov matrices + + // create a zero'd ExpectationSum instance + private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = { +(Array(0.0), + new Array[Double](k), + (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray, + (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray) + } + + // add two ExpectationSum objects (allowed to use modify m1) + // (U, U) => U for aggregation + private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = { +m1._1(0) += m2._1(0) +for (i <- 0 until m1._2.length) { + m1._2(i) += m2._2(i) + m1._3(i) += m2._3(i) + m1._4(i) += m2._4(i) +} +m1 + } + + // compute cluster contributions for each input point + // (U, T) => U for aggregation + private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian]) + (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = { +val k = model._2.length +val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray +val pSum = p.sum +model._1(0) += math.log(pSum) +val xxt = x * new Transpose(x) +for (i <- 0 until k) { + p(i) /= pSum + model._2(i) += p(i) + model._3(i) += x * p(i) + model._4(i) += xxt * p(i) +} +model + } + + // number of samples per cluster to use when initializing Gaussians + private val nSamples = 5 + + // an initializing GMM can be provided rather than using the + // default random starting point + private var initialGmm: Option[GaussianMixtureModel] = None +
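For reference, the sums accumulated in `computeExpectation` in the quoted diff correspond to the standard GMM E-step quantities; a sketch in conventional notation (the `eps` added to each term in the code is a small constant for numerical stability, omitted here):

```latex
% E-step: responsibility of component i for sample x_j
\gamma_{ij} = \frac{w_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)}
                   {\sum_{l=1}^{k} w_l \, \mathcal{N}(x_j \mid \mu_l, \Sigma_l)}
% Accumulated statistics, matching the four ExpectationSum slots:
% model._1: log-likelihood, model._2: weight sums,
% model._3: weighted sums of x, model._4: weighted sums of x x^T
\log L = \sum_j \log \sum_{l=1}^{k} w_l \, \mathcal{N}(x_j \mid \mu_l, \Sigma_l), \qquad
W_i = \sum_j \gamma_{ij}, \qquad
S_i = \sum_j \gamma_{ij}\, x_j, \qquad
Q_i = \sum_j \gamma_{ij}\, x_j x_j^{\top}
```

The M-step (not shown in this excerpt) would then normalize these sums to obtain the updated weights, means, and covariances.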
[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r22058461 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala --- @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import org.apache.spark.rdd.RDD +import org.apache.spark.mllib.linalg.Matrix +import org.apache.spark.mllib.linalg.Vector + +/** + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are + * the respective mean and covariance for each Gaussian distribution i=1..k. + * + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is + * the weight for Gaussian i, and weight.sum == 1 + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the + * covariance matrix for Gaussian i + */ +class GaussianMixtureModel( + val weight: Array[Double], --- End diff -- We only use Breeze internally right now; we don't want to expose it as a public API. I really meant using the MultivariateGaussian class which you defined.
[GitHub] spark pull request: [SPARK-4877] Allow user first classes to exten...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/3725#issuecomment-67526925 BTW I'm pretty sure I addressed this as part of #3233, although in a different way.