[GitHub] spark pull request: [SQL] SPARK-6548: Adding stddev to DataFrame f...
Github user dreamquster commented on the pull request: https://github.com/apache/spark/pull/5357#issuecomment-95611837 @yhuai, is this comment OK? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5660#discussion_r28974357

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---

```
@@ -61,39 +62,40 @@ private[hive] case class HiveFunctionWrapper(var functionClassName: String)
   // for Serialization
   def this() = this(null)
+
+  import java.io.{OutputStream, InputStream}
+
+  import com.esotericsoftware.kryo.io.Input
+  import com.esotericsoftware.kryo.io.Output
+  import org.apache.spark.util.Utils._

   @transient
-  private val methodDeSerialize = {
-    val method = classOf[Utilities].getDeclaredMethod(
-      deserializeObjectByKryo,
-      classOf[Kryo],
-      classOf[java.io.InputStream],
-      classOf[Class[_]])
-    method.setAccessible(true)
-
-    method
+  private def deserializeObjectByKryo[T: ClassTag](kryo: Kryo,
+                                                   in: InputStream,
+                                                   clazz: Class[_]): T = {
```

--- End diff --

Don't indent method arguments in this style. Spark uses the following style:

```
def methodName(
    arg1: Type1,
    arg2: Type2): ReturnType = {
  // ...
}
```
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5660#discussion_r28974373

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---

```
@@ -61,39 +62,40 @@ private[hive] case class HiveFunctionWrapper(var functionClassName: String)
   // for Serialization
   def this() = this(null)
+
+  import java.io.{OutputStream, InputStream}
+
+  import com.esotericsoftware.kryo.io.Input
+  import com.esotericsoftware.kryo.io.Output
+  import org.apache.spark.util.Utils._

   @transient
-  private val methodDeSerialize = {
-    val method = classOf[Utilities].getDeclaredMethod(
-      deserializeObjectByKryo,
-      classOf[Kryo],
-      classOf[java.io.InputStream],
-      classOf[Class[_]])
-    method.setAccessible(true)
-
-    method
+  private def deserializeObjectByKryo[T: ClassTag](kryo: Kryo,
+                                                   in: InputStream,
+                                                   clazz: Class[_]): T = {
+    val inp = new Input(in)
+    val t: T = kryo.readObject(inp, clazz).asInstanceOf[T]
+    inp.close()
+    t
   }

   @transient
-  private val methodSerialize = {
-    val method = classOf[Utilities].getDeclaredMethod(
-      serializeObjectByKryo,
-      classOf[Kryo],
-      classOf[Object],
-      classOf[java.io.OutputStream])
-    method.setAccessible(true)
-
-    method
+  private def serializeObjectByKryo(kryo: Kryo,
+      plan: Object,
+      out: OutputStream ) {
```

--- End diff --

Same as above.
[GitHub] spark pull request: [SPARK-7093][SQL] Using newPredicate in Nested...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5665#issuecomment-95633990

[Test build #30844 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30844/consoleFull) for PR 5665 at commit [`a887c02`](https://github.com/apache/spark/commit/a887c02bc16c3d8527e3108090706209717cbe62).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7093][SQL] Using newPredicate in Nested...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5665#issuecomment-95634009 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30844/
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5526#discussion_r28977394

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---

```
@@ -197,3 +233,69 @@ trait InsertableRelation {
 trait CatalystScan {
   def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
 }
+
+/**
+ * ::Experimental::
+ * [[OutputWriter]] is used together with [[FSBasedRelation]] for persisting rows to the
+ * underlying file system. An [[OutputWriter]] instance is created when a new output file is
+ * opened. This instance is used to persist rows to this single output file.
+ */
+@Experimental
+trait OutputWriter {
+  /**
+   * Persists a single row. Invoked on the executor side.
+   */
+  def write(row: Row): Unit
+
+  /**
+   * Closes the [[OutputWriter]]. Invoked on the executor side after all rows are persisted,
+   * before the task output is committed.
+   */
+  def close(): Unit
+}
+
+/**
+ * ::Experimental::
+ * A [[BaseRelation]] that abstracts file system based data sources.
+ *
+ * For the read path, similar to [[PrunedFilteredScan]], it can eliminate unneeded columns and
+ * filter using selected predicates before producing an RDD containing all matching tuples as
+ * [[Row]] objects.
+ *
+ * In addition, when reading from Hive style partitioned tables stored in file systems, it's
+ * able to discover partitioning information from the paths of input directories, and perform
+ * partition pruning before start reading the data.
+ *
+ * For the write path, it provides the ability to write to both non-partitioned and partitioned
+ * tables. Directory layout of the partitioned tables is compatible with Hive.
+ */
+@Experimental
+trait FSBasedRelation extends BaseRelation {
+  /**
+   * Builds an `RDD[Row]` containing all rows within this relation.
+   *
+   * @param requiredColumns Required columns.
+   * @param filters Candidate filters to be pushed down. The actual filter should be the
+   *        conjunction of all `filters`. The pushed down filters are currently purely an
+   *        optimization as they will all be evaluated again. This means it is safe to use
+   *        them with methods that produce false positives such as filtering partitions based
+   *        on a bloom filter.
+   * @param inputPaths Data files to be read. If the underlying relation is partitioned, only
+   *        data files within required partition directories are included.
+   */
+  def buildScan(
```

--- End diff --

This issue has been fixed. Three variants are provided. Developers can override any one of them to implement the read path.
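The three `buildScan` variants mentioned in the comment above could be sketched as follows. This is a hypothetical reconstruction, not the committed API: the names, signatures, and delegation order are assumptions. The one detail grounded in the docs quoted above is that pushed-down `filters` are purely an optimization (they are all re-evaluated afterwards), so a default implementation may safely drop them.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Hypothetical sketch: each narrower variant has a default body delegating to
// a broader one, so a data source only overrides whichever is convenient.
trait FSBasedRelationScans {
  // Most general form: read everything in the given paths.
  def buildScan(inputPaths: Array[String]): RDD[Row]

  // Column pruning: by default, fall back to reading all columns.
  def buildScan(requiredColumns: Array[String], inputPaths: Array[String]): RDD[Row] =
    buildScan(inputPaths)

  // Column pruning plus filter push-down: dropping `filters` is safe because
  // they are re-evaluated after the scan anyway.
  def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter],
      inputPaths: Array[String]): RDD[Row] =
    buildScan(requiredColumns, inputPaths)
}
```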
[GitHub] spark pull request: [SPARK-7092] Update spark scala version to 2.1...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5662#issuecomment-95613732

[Test build #30838 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30838/consoleFull) for PR 5662 at commit [`58cf4f9`](https://github.com/apache/spark/commit/58cf4f96f62d290cc915c47bef08bd8666f2e73c).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95614464 [Test build #30846 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30846/consoleFull) for PR 5526 at commit [`4e93e9b`](https://github.com/apache/spark/commit/4e93e9b04790b24d0df57d8c62c9447abea0f74f).
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95614524

[Test build #30846 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30846/consoleFull) for PR 5526 at commit [`4e93e9b`](https://github.com/apache/spark/commit/4e93e9b04790b24d0df57d8c62c9447abea0f74f).

* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait FSBasedRelationProvider`
  * `abstract class OutputWriter`
  * `abstract class FSBasedRelation extends BaseRelation`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95629047 [Test build #30849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30849/consoleFull) for PR 5660 at commit [`0b522a7`](https://github.com/apache/spark/commit/0b522a77ffe83889e7c6afaa963dd49e5675fe36).
[GitHub] spark pull request: [SPARK-7092] Update spark scala version to 2.1...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5662#issuecomment-95613792 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30838/
[GitHub] spark pull request: [Spark-7090][MLlib] Introduce LDAOptimizer to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5661#issuecomment-95622405 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30843/
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95625891 ok to test
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95626620 [Test build #30848 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30848/consoleFull) for PR 5660 at commit [`0b522a7`](https://github.com/apache/spark/commit/0b522a77ffe83889e7c6afaa963dd49e5675fe36).
[GitHub] spark pull request: [SPARK-7093][SQL] Using newPredicate in Nested...
GitHub user scwf opened a pull request: https://github.com/apache/spark/pull/5665

[SPARK-7093][SQL] Using newPredicate in NestedLoopJoin to enable code generation

Using newPredicate in NestedLoopJoin instead of InterpretedPredicate to enable code generation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/scwf/spark NLP

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5665.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5665

commit a887c02bc16c3d8527e3108090706209717cbe62
Author: scwf wangf...@huawei.com
Date: 2015-04-23T13:55:26Z

    improve for NLP boundCondition
[GitHub] spark pull request: [Spark-7090][MLlib] Introduce LDAOptimizer to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5661#issuecomment-95622391

[Test build #30843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30843/consoleFull) for PR 5661 at commit [`e756ce4`](https://github.com/apache/spark/commit/e756ce4c351a67e92afc0faef42b314c8ab8a31d).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait LDAOptimizer`
  * `class EMLDAOptimizer extends LDAOptimizer`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95627460

Would like to add some background about this change: Spark SQL repackages Hive jars to remove unnecessary shading and dependencies. One of them is Kryo. Hive shades Kryo into the package `org.apache.hive.com.esotericsoftware`. Normally this should be fine. However, MapR replaces Spark SQL's repackaged Hive dependencies with genuine Hive packages, so the reflection call mentioned in this PR can't find Kryo because it has moved into another package. On the other hand, the two reflected methods are actually quite simple. That's why we chose to reimplement them in Spark SQL instead of invoking them via reflection.
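The gist of the change can be sketched like this (a paraphrase of the diffs quoted elsewhere in this thread, written as members of the shim class; the body of `serializeObjectByKryo` is an assumption, since only its signature appears in the quoted diff). Instead of reflectively invoking Hive's private `Utilities` helpers, which resolve Kryo under Hive's shaded package name, the shim calls the unshaded Kryo API directly:

```scala
import java.io.{InputStream, OutputStream}
import scala.reflect.ClassTag
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

// Direct reimplementation of the reflected helper: no getDeclaredMethod /
// setAccessible, so no dependency on Hive's shaded package layout.
private def deserializeObjectByKryo[T: ClassTag](
    kryo: Kryo,
    in: InputStream,
    clazz: Class[_]): T = {
  val input = new Input(in)
  val t = kryo.readObject(input, clazz).asInstanceOf[T]
  input.close()
  t
}

// Body assumed (mirror image of deserialization); only the signature is
// shown in the diff quoted in this thread.
private def serializeObjectByKryo(
    kryo: Kryo,
    plan: Object,
    out: OutputStream): Unit = {
  val output = new Output(out)
  kryo.writeObject(output, plan)
  output.close()
}
```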
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5660#discussion_r28974257

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---

```
@@ -61,39 +62,40 @@ private[hive] case class HiveFunctionWrapper(var functionClassName: String)
   // for Serialization
   def this() = this(null)
+
+  import java.io.{OutputStream, InputStream}
+
+  import com.esotericsoftware.kryo.io.Input
+  import com.esotericsoftware.kryo.io.Output
```

--- End diff --

Move imports to the header of the file.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95636852

@marmbrus @yhuai @rxin Previous comments are addressed. Also added tests (`ignore`d for now) in `FSBasedRelationSuite`. Going to implement all the interfaces.
[GitHub] spark pull request: [SPARK-7093][SQL] Using newPredicate in Nested...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5665#issuecomment-95603340 [Test build #30844 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30844/consoleFull) for PR 5665 at commit [`a887c02`](https://github.com/apache/spark/commit/a887c02bc16c3d8527e3108090706209717cbe62).
[GitHub] spark pull request: [SPARK-6924][YARN] Fix driver hangs in yarn-cl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5663#issuecomment-95603329

[Test build #30840 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30840/consoleFull) for PR 5663 at commit [`5a28319`](https://github.com/apache/spark/commit/5a283199952b70c4d007e9da60d80fd96fb9c2a6).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6924][YARN] Fix driver hangs in yarn-cl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5663#issuecomment-95603345 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30840/
[GitHub] spark pull request: [SPARK-7044] [SQL] Fix the deadlock in script ...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5625#discussion_r28970625

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala ---

```
@@ -145,20 +147,27 @@ case class ScriptTransformation(
     val dataOutputStream = new DataOutputStream(outputStream)
     val outputProjection = new InterpretedProjection(input, child.output)

-    iter
-      .map(outputProjection)
-      .foreach { row =>
+    // Put the write(output to the pipeline) into a single thread
+    // and keep the collector as remain in the main thread.
+    // otherwise it will causes deadlock if the data size greater than
+    // the pipeline / buffer capacity.
+    future {
```

--- End diff --

Thank you @rxin, I think you are right, updated!
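The deadlock the patch above avoids is easy to reproduce with a bounded pipe: a single thread that must both fill the pipe and drain its other end blocks as soon as the pipe's buffer is full. The fix in the patch, producing on a separate thread (`future { ... }` in the Scala 2.10-era API) while the main thread consumes, can be sketched in isolation like this. This is a minimal illustration with `java.io` piped streams, not Spark code:

```scala
import java.io.{PipedInputStream, PipedOutputStream}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// A tiny pipe whose buffer is far smaller than the data pushed through it.
val out = new PipedOutputStream()
val in  = new PipedInputStream(out, 1024)

// Producer on a separate thread: writing 100k bytes into a 1 KB pipe from
// the main thread, before any reads happened, would block forever.
Future {
  try (1 to 100000).foreach(i => out.write(i & 0xff))
  finally out.close()
}

// Consumer stays on the main thread and drains the pipe.
var count = 0
while (in.read() != -1) count += 1
in.close()
// count == 100000
```

The same structure appears in the patch: the script's stdin is the bounded pipe, the child process's stdout is the other end, and the collector must keep reading while a background thread feeds rows in.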
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95628037 add to whitelist
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95646404 [Test build #30851 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30851/consoleFull) for PR 5526 at commit [`3b22c32`](https://github.com/apache/spark/commit/3b22c3291cd920eb02a345dace2556dc23d57efb).
[GitHub] spark pull request: [SPARK-6734] [SQL] Add UDTF.close support in G...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5383#issuecomment-95610869 [Test build #30845 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30845/consoleFull) for PR 5383 at commit [`8953be3`](https://github.com/apache/spark/commit/8953be3eda4b120966f6cc7fb1e02f48c632a90f).
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5526#discussion_r28975672

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---

```
@@ -197,3 +233,69 @@ trait InsertableRelation {
 trait CatalystScan {
   def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
 }
+
+/**
+ * ::Experimental::
+ * [[OutputWriter]] is used together with [[FSBasedRelation]] for persisting rows to the
+ * underlying file system. An [[OutputWriter]] instance is created when a new output file is
+ * opened. This instance is used to persist rows to this single output file.
+ */
+@Experimental
+trait OutputWriter {
+  /**
+   * Persists a single row. Invoked on the executor side.
+   */
+  def write(row: Row): Unit
+
+  /**
+   * Closes the [[OutputWriter]]. Invoked on the executor side after all rows are persisted,
+   * before the task output is committed.
+   */
+  def close(): Unit
+}
+
+/**
+ * ::Experimental::
+ * A [[BaseRelation]] that abstracts file system based data sources.
+ *
+ * For the read path, similar to [[PrunedFilteredScan]], it can eliminate unneeded columns and
+ * filter using selected predicates before producing an RDD containing all matching tuples as
+ * [[Row]] objects.
+ *
+ * In addition, when reading from Hive style partitioned tables stored in file systems, it's
+ * able to discover partitioning information from the paths of input directories, and perform
+ * partition pruning before start reading the data.
+ *
+ * For the write path, it provides the ability to write to both non-partitioned and partitioned
+ * tables. Directory layout of the partitioned tables is compatible with Hive.
+ */
+@Experimental
+trait FSBasedRelation extends BaseRelation {
+  /**
+   * Builds an `RDD[Row]` containing all rows within this relation.
+   *
+   * @param requiredColumns Required columns.
+   * @param filters Candidate filters to be pushed down. The actual filter should be the
+   *        conjunction of all `filters`. The pushed down filters are currently purely an
+   *        optimization as they will all be evaluated again. This means it is safe to use
+   *        them with methods that produce false positives such as filtering partitions based
+   *        on a bloom filter.
+   * @param inputPaths Data files to be read. If the underlying relation is partitioned, only
+   *        data files within required partition directories are included.
+   */
+  def buildScan(
+      requiredColumns: Array[String],
+      filters: Array[Filter],
+      inputPaths: Array[String]): RDD[Row]
+
+  /**
+   * When writing rows to this relation, this method is invoked on the driver side before the
+   * actual write job is issued. It provides an opportunity to configure the write job to be
+   * performed.
+   */
+  def prepareForWrite(conf: Configuration): Unit
+
+  /**
+   * This method is responsible for producing a new [[OutputWriter]] for each newly opened
+   * output file on the executor side.
+   */
+  def newOutputWriter(path: String): OutputWriter
```

--- End diff --

One issue here is about passing the driver side Hadoop configuration to OutputWriters on the executor side. Users may set properties on the Hadoop configuration on the driver side (e.g. `mapreduce.fileoutputcommitter.marksuccessfuljobs`), and we should inherit these settings on the executor side when writing data. A zero-arg constructor plus `init(...)` is a good way to avoid forcing `BaseRelation` to be serializable, but I guess we have to put `Configuration` as an argument of `OutputWriter.init(...)`. This makes the data sources API coupled with the Hadoop API via `Configuration`, but I guess this should be more acceptable compared to forcing `BaseRelation` subclasses to be serializable?
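A hypothetical sketch of the shape being discussed, zero-arg construction on executors plus an `init(...)` carrying the driver-side `Configuration`, might look like this. The method name and parameters are illustrative assumptions, not the committed API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.Row

// OutputWriter implementations would need a zero-arg constructor so that
// executors can instantiate them by class name without serializing the
// BaseRelation that created them.
abstract class OutputWriter {
  // Hypothetical hook: called once per writer on the executor side, receiving
  // the output file path and the re-materialized driver-side Hadoop conf, so
  // driver settings (e.g. committer options) are honored during the write.
  def init(path: String, conf: Configuration): Unit = ()

  def write(row: Row): Unit
  def close(): Unit
}
```

The trade-off named in the comment is visible here: `Configuration` appears in the data sources API surface, but only `OutputWriter` (not `BaseRelation`) has to cross the driver/executor boundary.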
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95640536

[Test build #30850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30850/consoleFull) for PR 5526 at commit [`b63f813`](https://github.com/apache/spark/commit/b63f81375e3e3cdea884c2e7c3f1925294c61c21).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait FSBasedRelationProvider`
  * `abstract class OutputWriter`
  * `abstract class FSBasedRelation extends BaseRelation`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95640578 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30850/ Test FAILed.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95661012

[Test build #30849 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30849/consoleFull) for PR 5660 at commit [`0b522a7`](https://github.com/apache/spark/commit/0b522a77ffe83889e7c6afaa963dd49e5675fe36).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6856] [R] Make RDD information more use...
GitHub user His-name-is-Joof opened a pull request: https://github.com/apache/spark/pull/5667 [SPARK-6856] [R] Make RDD information more useful in SparkR

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/His-name-is-Joof/spark joofspark

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5667.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5667

commit 123be654a155140f7b8d78203daf546d68cef2e8
Author: Jeff Harrison jeffrharri...@gmail.com
Date: 2015-04-23T17:08:10Z

SPARK-6856
[GitHub] spark pull request: [SPARK-7084] improve saveAsTable documentation
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5654#issuecomment-95666418 [Test build #30856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30856/consoleFull) for PR 5654 at commit [`00bc819`](https://github.com/apache/spark/commit/00bc819ba948dd29d78251c7d59089ce1116bc2e).
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95676314

[Test build #30860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30860/consoleFull) for PR 5668 at commit [`b0beb34`](https://github.com/apache/spark/commit/b0beb34d6a77c738660cb161306c947411d70ab5).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-5932][CORE] Use consistent naming for s...
Github user ilganeli commented on a diff in the pull request: https://github.com/apache/spark/pull/5574#discussion_r28989856

--- Diff: network/common/src/main/java/org/apache/spark/network/util/ByteUnit.java ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.network.util;
+
+public enum ByteUnit {
+  BYTE (1),
+  KiB (1024L),
+  MiB ((long) Math.pow(1024L, 2L)),
+  GiB ((long) Math.pow(1024L, 3L)),
+  TiB ((long) Math.pow(1024L, 4L)),
+  PiB ((long) Math.pow(1024L, 5L));
+
+  private ByteUnit(long multiplier) {
+    this.multiplier = multiplier;
+  }
+
+  // Interpret the provided number (d) with suffix (u) as this unit type.
+  // E.g. KiB.interpret(1, MiB) interprets 1MiB as its KiB representation = 1024k
+  public long interpret(long d, ByteUnit u) {
+    return u.toBytes(d) / multiplier;
+  }
+
+  // Convert the provided number (d) interpreted as this unit type to unit type (u).
+  public long convert(long d, ByteUnit u) {
+    return toBytes(d) / u.multiplier;
--- End diff --

Marcelo - I think this could be readily solved if ```toBytes``` returns a double. The max for a double is 1.79769e+308, which is more than we could conceivably ever need, and it would solve the overflow issue (we'd just need to check that the resulting number is less than Long.MAX_VALUE and throw an exception if it's not).
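The overflow being discussed is easy to reproduce: multiplying a long count by a PiB-scale multiplier silently wraps. Below is a hedged sketch of the double-based fix the comment proposes; `ByteUnitSketch`, `toBytesUnchecked`, and `convertChecked` are illustrative names, not the actual patch.

```java
public class ByteUnitSketch {
    enum ByteUnit {
        BYTE(1L),
        KiB(1024L),
        MiB(1024L * 1024L),
        GiB(1024L * 1024L * 1024L),
        TiB(1024L * 1024L * 1024L * 1024L),
        PiB(1024L * 1024L * 1024L * 1024L * 1024L);

        final long multiplier;
        ByteUnit(long multiplier) { this.multiplier = multiplier; }

        // Long arithmetic: a large count times a PiB multiplier wraps silently.
        long toBytesUnchecked(long d) { return d * multiplier; }

        // Double-based variant from the review comment: a double's range
        // (up to ~1.79769e308) comfortably holds any byte count we care about,
        // so overflow becomes detectable instead of silent.
        long convertChecked(long d, ByteUnit u) {
            double bytes = (double) d * multiplier;
            if (bytes > (double) Long.MAX_VALUE) {
                throw new IllegalArgumentException(
                    "Conversion of " + d + " " + name() + " overflows a long byte count");
            }
            return (long) (bytes / u.multiplier);
        }

        // Same semantics as interpret(...) in the diff above.
        long interpret(long d, ByteUnit u) { return u.toBytesUnchecked(d) / multiplier; }
    }
}
```

For example, `KiB.interpret(1, MiB)` yields 1024, while converting ten million PiB to bytes trips the explicit overflow check instead of returning a wrapped value.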
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95676957 [Test build #30864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30864/consoleFull) for PR 2342 at commit [`974a64a`](https://github.com/apache/spark/commit/974a64a5670c8e8a8078b2e81c915e5808424e14).
[GitHub] spark pull request: [SPARK-5894][ML] Add polynomial mapper
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/5245#issuecomment-95678246 @yinxusen I'm not sure whether it is faster or not. That's why I put the new approach side by side. Please help test the performance. Thanks!
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95679048 [Test build #30868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30868/consoleFull) for PR 5668 at commit [`d25bc2a`](https://github.com/apache/spark/commit/d25bc2ab87163cda40f75ae4d7579a848d42fe58).
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95679035 [Test build #30867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30867/consoleFull) for PR 2342 at commit [`8110acf`](https://github.com/apache/spark/commit/8110acf82426a5da3e7925f9f27cb7042c817746).
[GitHub] spark pull request: [SPARK-6856] [R] Make RDD information more use...
Github user His-name-is-Joof commented on the pull request: https://github.com/apache/spark/pull/5667#issuecomment-95679966 How's that? Very new to contributing to large projects in general, so criticism welcome. Excellent bugtracker and starter bugs!
[GitHub] spark pull request: [SPARK-7056] Make the Write Ahead Log pluggabl...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/5645#issuecomment-95678755 @jerryshao @hshreedharan Can you please take a look.
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95680982 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30867/ Test FAILed.
[GitHub] spark pull request: [SPARK-6818][SPARKR] Support column deletion i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5655#issuecomment-95680969

[Test build #30854 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30854/consoleFull) for PR 5655 at commit [`7c66570`](https://github.com/apache/spark/commit/7c66570897da1bc048aa1a3abb95785e7216c302).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95680971

[Test build #30867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30867/consoleFull) for PR 2342 at commit [`8110acf`](https://github.com/apache/spark/commit/8110acf82426a5da3e7925f9f27cb7042c817746).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class ExecutorUIData(`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6818][SPARKR] Support column deletion i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5655#issuecomment-95680983 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30854/ Test PASSed.
[GitHub] spark pull request: [SPARK-7033][SPARKR] Clean usage of split. Use...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5628#issuecomment-95651218 Thanks, @sun-rui - doing a grep, could you also update test_rdd.R's line 124?
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95653046 [Test build #30852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30852/consoleFull) for PR 5666 at commit [`d77990f`](https://github.com/apache/spark/commit/d77990f0b1d51114e32956d8ce08adfa1f6eff30).
[GitHub] spark pull request: [SPARK-6818][SPARKR] Support column deletion i...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/5655#issuecomment-95657442 Jenkins, retest this please
[GitHub] spark pull request: [SPARK-7044] [SQL] Fix the deadlock in script ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5625#issuecomment-95665707 Merging into master and branch-1.3.
[GitHub] spark pull request: [SPARK-7044] [SQL] Fix the deadlock in script ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5625#issuecomment-95666195 Actually this doesn't merge cleanly into 1.3. Do you mind submitting a pull request for that branch? Thanks.
[GitHub] spark pull request: [SPARK-6856] [R] Make RDD information more use...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/5667#issuecomment-95667726 Jenkins, ok to test
[GitHub] spark pull request: [SPARK-7041] Avoid writing empty files in Exte...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/5622#discussion_r28986044

--- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -736,11 +734,16 @@ private[spark] class ExternalSorter[K, V, C](
   val writeStartTime = System.nanoTime
   util.Utils.tryWithSafeFinally {
     for (i <- 0 until numPartitions) {
-      val in = new FileInputStream(partitionWriters(i).fileSegment().file)
-      util.Utils.tryWithSafeFinally {
-        lengths(i) = org.apache.spark.util.Utils.copyStream(in, out, false, transferToEnabled)
-      } {
-        in.close()
+      val file = partitionWriters(i).fileSegment().file
+      if (!file.exists()) {
+        lengths(i) = 0
+      } else {
+        val in = new FileInputStream(file)
+        util.Utils.tryWithSafeFinally {
--- End diff --

Shouldn't we avoid using partial package names like this? Just import spark.util.Utils.
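The guard in the diff above — skip opening a stream when a partition file was never created, recording a zero length instead — can be sketched in plain Java. `PartitionConcat` and `concatPartitions` are illustrative names, not Spark's; the sketch assumes Java 9+ for `InputStream.transferTo`.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;

public class PartitionConcat {
    // Concatenate per-partition spill files into one stream, recording each
    // partition's byte length. A partition that received no records has no
    // file on disk, so we record 0 instead of failing in the FileInputStream
    // constructor with FileNotFoundException.
    static long[] concatPartitions(File[] partitionFiles, OutputStream out) throws IOException {
        long[] lengths = new long[partitionFiles.length];
        for (int i = 0; i < partitionFiles.length; i++) {
            File file = partitionFiles[i];
            if (!file.exists()) {
                lengths[i] = 0L; // empty partition: nothing was ever spilled
            } else {
                try (FileInputStream in = new FileInputStream(file)) {
                    lengths[i] = in.transferTo(out); // copy and count bytes (Java 9+)
                }
            }
        }
        return lengths;
    }
}
```

The try-with-resources block plays the role of `tryWithSafeFinally` here: the input stream is closed even if the copy throws.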
[GitHub] spark pull request: [SPARK-7055][SQL]Use correct ClassLoader for J...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/5633#issuecomment-95671327 Unfortunately we can't run the docker tests on Jenkins and they cause issues with dependencies during the release so we temporarily removed them. I can try running them manually.
[GitHub] spark pull request: [SPARK-2750][WIP]Add Https support for Web UI
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/5664#issuecomment-95671673 This needs to be integrated with the `SSLOptions` configuration added in cfea30037f (#3571), instead of creating its own way of configuring things.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95673967 Hey, we probably do want to do this at some point, but I'm not sure the answer is ByteBuffer. Big changes like this should be proposed in JIRA and discussed before coding begins. Since we are close to the merge deadline for Spark 1.4 it is unlikely this patch will be merged anytime soon.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95674982 Thanks for submitting this. It's our fault to have that JIRA ticket there and suggest nio.ByteBuffer. ByteBuffer is not the right abstraction for this. Its API is pretty clunky (flip, etc.), and it cannot be serialized. We will do something as part of https://issues.apache.org/jira/browse/SPARK-7075
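The "clunky" complaint about `java.nio.ByteBuffer` is concrete: the same buffer must be explicitly flipped between write mode and read mode, and a forgotten `flip()` silently reads the wrong bytes. A small illustration:

```java
import java.nio.ByteBuffer;

public class ByteBufferFlip {
    static int writeThenRead() {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putInt(42);      // writing advances position to 4
        buf.flip();          // limit = 4, position = 0; without this, getInt()
                             // would read the untouched zero bytes at offset 4
        return buf.getInt(); // reads the int back from position 0
    }
}
```

This stateful position/limit dance (plus the fact that `ByteBuffer` is not `Serializable`) is why the comment argues it makes a poor public abstraction for row data.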
[GitHub] spark pull request: SPARK-7063 when lz4 compression is used, it ca...
Github user linlin200605 commented on the pull request: https://github.com/apache/spark/pull/5641#issuecomment-95674803 lz4-1.3.0 needs Java 7 to build the jar, but it does not necessarily require JDK 7 at runtime if there is backward compatibility. I will follow up on that.
[GitHub] spark pull request: [SPARK-7060][SQL] Add alias function to python...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5634#issuecomment-95678712

[Test build #30855 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30855/consoleFull) for PR 5634 at commit [`f157c30`](https://github.com/apache/spark/commit/f157c3096f16f16a57b03eca10bc69cdc90de533).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7060][SQL] Add alias function to python...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5634#issuecomment-95678726 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30855/ Test FAILed.
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2342#discussion_r28990365

--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala ---
@@ -17,20 +17,167 @@
 package org.apache.spark.ui.jobs

-import scala.collection.mutable
-import scala.xml.{NodeSeq, Node}
+import java.util.Date
+
+import scala.collection.mutable.{Buffer, ListBuffer}
+import scala.xml.{NodeSeq, Node, Unparsed}

 import javax.servlet.http.HttpServletRequest

 import org.apache.spark.JobExecutionStatus
 import org.apache.spark.scheduler.StageInfo
 import org.apache.spark.ui.{UIUtils, WebUIPage}
+import org.apache.spark.ui.jobs.UIData.ExecutorUIData

 /** Page showing statistics and stage list for a given job */
 private[ui] class JobPage(parent: JobsTab) extends WebUIPage("job") {
-  private val listener = parent.listener
+  private val STAGES_LEGEND =
+    <div class="legend-area"><svg width="200px" height="85px">
+      <rect x="5px" y="5px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#D5DDF6"></rect>
+      <text x="35px" y="17px">Completed Stage </text>
+      <rect x="5px" y="35px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#FF5475"></rect>
+      <text x="35px" y="47px">Failed Stage</text>
+      <rect x="5px" y="65px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#FDFFCA"></rect>
+      <text x="35px" y="77px">Active Stage</text>
+    </svg></div>.toString.filter(_ != '\n')
+
+  private val EXECUTORS_LEGEND =
+    <div class="legend-area"><svg width="200px" height="55px">
+      <rect x="5px" y="5px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#D5DDF6"></rect>
+      <text x="35px" y="17px">Executor Added</text>
+      <rect x="5px" y="35px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#EBCA59"></rect>
+      <text x="35px" y="47px">Executor Removed</text>
+    </svg></div>.toString.filter(_ != '\n')
+
+  private def makeStageEvent(stageInfos: Seq[StageInfo]): Seq[String] = {
+    stageInfos.map { stage =>
+      val stageId = stage.stageId
+      val attemptId = stage.attemptId
+      val name = stage.name
+      val status = {
--- End diff --

This is very minor, though
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2342#discussion_r28990248 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala --- @@ -17,20 +17,167 @@ package org.apache.spark.ui.jobs -import scala.collection.mutable -import scala.xml.{NodeSeq, Node} +import java.util.Date + +import scala.collection.mutable.{Buffer, ListBuffer} +import scala.xml.{NodeSeq, Node, Unparsed} import javax.servlet.http.HttpServletRequest import org.apache.spark.JobExecutionStatus import org.apache.spark.scheduler.StageInfo import org.apache.spark.ui.{UIUtils, WebUIPage} +import org.apache.spark.ui.jobs.UIData.ExecutorUIData /** Page showing statistics and stage list for a given job */ private[ui] class JobPage(parent: JobsTab) extends WebUIPage(job) { - private val listener = parent.listener + private val STAGES_LEGEND = +div class=legend-areasvg width=200px height=85px + rect x=5px y=5px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#D5DDF6/rect + text x=35px y=17pxCompleted Stage /text + rect x=5px y=35px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#FF5475/rect + text x=35px y=47pxFailed Stage/text + rect x=5px y=65px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#FDFFCA/rect + text x=35px y=77pxActive Stage/text +/svg/div.toString.filter(_ != '\n') + + private val EXECUTORS_LEGEND = +div class=legend-areasvg width=200px height=55px + rect x=5px y=5px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#D5DDF6/rect + text x=35px y=17pxExecutor Added/text + rect x=5px y=35px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#EBCA59/rect + text x=35px y=47pxExecutor Removed/text +/svg/div.toString.filter(_ != '\n') + + private def makeStageEvent(stageInfos: Seq[StageInfo]): Seq[String] = { +stageInfos.map { stage = + val stageId = stage.stageId + val attemptId = stage.attemptId + val name = stage.name + val status = { --- End diff -- It might be better to add a 
private[spark] method called `getStatusString` to `StageInfo`.
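As a rough illustration of the suggestion, a status-string helper could look like the following. This is a hypothetical sketch in Java, not the actual Scala `StageInfo` code; the boolean inputs stand in for the stage's completion and failure state.

```java
// Hypothetical sketch of the suggested getStatusString helper; the real method
// would live on StageInfo and inspect its completion/failure fields.
public class StageStatus {
    public static String getStatusString(boolean completed, boolean failed) {
        if (failed) {
            return "failed";     // a failure takes precedence over completion
        } else if (completed) {
            return "succeeded";  // finished without a failure
        } else {
            return "running";    // still active
        }
    }
}
```

Centralizing this mapping in one method keeps the page-rendering code from re-deriving the status string in several places.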
[GitHub] spark pull request: [SPARK-6122][Core] Upgrade tachyon-client vers...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5354#issuecomment-95680725 It seems like the Jackson dep has to be excluded to get SBT + Hadoop 1.0.4 to work. I think that has to stay then, yeah. I think the httpclient stuff can be cleaned up a small bit but that too is essential. I'm getting worried about how much the divergence between SBT and Maven is causing us to hack the build, making it harder to get the build right for both. For example, these changes aren't necessary at all for Maven. It's exacerbated by trying to support Hadoop 1.x. Still, maybe we kick this can down the road a bit longer to get in this change.
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95671115 [Test build #30859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30859/consoleFull) for PR 2342 at commit [`ee7a7f0`](https://github.com/apache/spark/commit/ee7a7f0c9a9618b05b67f50e5698898b343a059d).
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user steveloughran commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-95675032 There's no obvious reason why the Jenkins build failed; the console says all the tests passed.
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95678594 retest please
[GitHub] spark pull request: [SPARK-6856] [R] Make RDD information more use...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5667#issuecomment-95679774 [Test build #30869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30869/consoleFull) for PR 5667 at commit [`c8c0b80`](https://github.com/apache/spark/commit/c8c0b8095088a8845adc7149a69cee051774c689).
[GitHub] spark pull request: [SPARK-5932][CORE] Use consistent naming for s...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/5574#discussion_r28990766

--- Diff: network/common/src/main/java/org/apache/spark/network/util/ByteUnit.java ---

@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.network.util;
+
+public enum ByteUnit {
+  BYTE (1),
+  KiB (1024L),
+  MiB ((long) Math.pow(1024L, 2L)),
+  GiB ((long) Math.pow(1024L, 3L)),
+  TiB ((long) Math.pow(1024L, 4L)),
+  PiB ((long) Math.pow(1024L, 5L));
+
+  private ByteUnit(long multiplier) {
+    this.multiplier = multiplier;
+  }
+
+  // Interpret the provided number (d) with suffix (u) as this unit type.
+  // E.g. KiB.interpret(1, MiB) interprets 1MiB as its KiB representation = 1024k
+  public long interpret(long d, ByteUnit u) {
+    return u.toBytes(d) / multiplier;
+  }
+
+  // Convert the provided number (d) interpreted as this unit type to unit type (u).
+  public long convert(long d, ByteUnit u) {
+    return toBytes(d) / u.multiplier;

--- End diff --

I saw your comment about using double - I don't think that's a great idea because doubles lose precision as you try to work with values at different orders of magnitude.
Regarding the last paragraph of my comment above, I don't think it's going to be an issue in practice; but the code here can be changed to at least avoid overflows where possible. I checked `j.u.c.TimeUnit`, used in the time functions in this class, and it seems to follow the approach you took, in that when an overflow is inevitable it caps the value at `Long.MAX_VALUE`. So that part is fine.
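For illustration, here is a minimal sketch of the `TimeUnit`-style saturating multiplication described above: when scaling a value up by a unit multiplier would overflow a `long`, the result is capped at `Long.MAX_VALUE` (or `Long.MIN_VALUE`) instead of wrapping around. The class and method names are hypothetical, not part of the actual `ByteUnit` API.

```java
// Hypothetical sketch of TimeUnit-style saturating scaling; not the real ByteUnit API.
public class SaturatingScale {
    // Multiply d by a positive unit multiplier m, capping at Long.MAX_VALUE
    // or Long.MIN_VALUE when the exact product would overflow a long.
    public static long scale(long d, long m) {
        long over = Long.MAX_VALUE / m;
        if (d > over) {
            return Long.MAX_VALUE;   // positive overflow: saturate high
        }
        if (d < -over) {
            return Long.MIN_VALUE;   // negative overflow: saturate low
        }
        return d * m;                // no overflow: exact product
    }
}
```

This mirrors how `java.util.concurrent.TimeUnit` behaves when converting huge values to finer units, e.g. `TimeUnit.DAYS.toNanos(Long.MAX_VALUE)` returns `Long.MAX_VALUE` rather than a wrapped negative number.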
[GitHub] spark pull request: [SPARK-7070][MLLIB] LDA.setBeta should call se...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5649#issuecomment-95680421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30853/ Test PASSed.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/5666 [SPARK-5553][SQL] Replace Array[Byte] with Java NIO ByteBuffer as binary type representation JIRA: https://issues.apache.org/jira/browse/SPARK-5553 This PR attempts to replace Array[Byte] with Java NIO ByteBuffer as SQL binary type representation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 bytebuffer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5666.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5666 commit d77990f0b1d51114e32956d8ce08adfa1f6eff30 Author: Liang-Chi Hsieh vii...@gmail.com Date: 2015-04-23T16:47:43Z Replace Array[Byte] with Java NIO ByteBuffer as binary type representation.
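The PR description doesn't spell out the motivation, but one practical difference between the two representations (stated here as general Java background, not as the PR's rationale) is value semantics: Java arrays compare by reference identity, while `ByteBuffer` defines content-based `equals` and `hashCode`, which matters wherever binary values are compared or used as keys.

```java
import java.nio.ByteBuffer;

// Demonstrates the equality difference between byte[] and ByteBuffer:
// arrays inherit Object.equals (identity), buffers compare remaining contents.
public class BinaryEquality {
    public static boolean arrayEquals(byte[] a, byte[] b) {
        return a.equals(b);  // true only when a and b are the same array instance
    }

    public static boolean bufferEquals(byte[] a, byte[] b) {
        // ByteBuffer.equals compares the bytes between position and limit
        return ByteBuffer.wrap(a).equals(ByteBuffer.wrap(b));
    }
}
```

With `byte[]`, content comparison requires an explicit `java.util.Arrays.equals` call at every comparison site; wrapping in `ByteBuffer` pushes that behavior into the type itself.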
[GitHub] spark pull request: [SPARK-6818][SPARKR] Support column deletion i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5655#issuecomment-95658961 [Test build #30854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30854/consoleFull) for PR 5655 at commit [`7c66570`](https://github.com/apache/spark/commit/7c66570897da1bc048aa1a3abb95785e7216c302).
[GitHub] spark pull request: [SPARK-7060][SQL] Add alias function to python...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5634#issuecomment-95661762 [Test build #30855 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30855/consoleFull) for PR 5634 at commit [`f157c30`](https://github.com/apache/spark/commit/f157c3096f16f16a57b03eca10bc69cdc90de533).
[GitHub] spark pull request: [minor][streaming]fixed scala string interpola...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5653#issuecomment-95665334 I've merged this into master and branch-1.3.
[GitHub] spark pull request: [SPARK-7070][MLLIB] LDA.setBeta should call se...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/5649#issuecomment-95667912 Thanks for catching that! LGTM pending tests
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95673533 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30858/ Test FAILed.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95674382 [Test build #30851 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30851/consoleFull) for PR 5526 at commit [`3b22c32`](https://github.com/apache/spark/commit/3b22c3291cd920eb02a345dace2556dc23d57efb). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait FSBasedRelationProvider ` * `abstract class OutputWriter ` * `abstract class FSBasedRelation extends BaseRelation ` * This patch does not change any dependencies.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95674413 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30851/ Test PASSed.
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95676058 [Test build #30860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30860/consoleFull) for PR 5668 at commit [`b0beb34`](https://github.com/apache/spark/commit/b0beb34d6a77c738660cb161306c947411d70ab5).
[GitHub] spark pull request: [SPARK-7067][SQL] fix bug when use complex nes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5659#issuecomment-95676079 [Test build #30861 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30861/consoleFull) for PR 5659 at commit [`ef6039c`](https://github.com/apache/spark/commit/ef6039c85ccfd396aad46797940a5611bc3325b4).
[GitHub] spark pull request: [SPARK-6752][Streaming] Allow StreamingContext...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5428
[GitHub] spark pull request: [SPARK-4705] Handle multiple app attempts even...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5432#issuecomment-95677173 [Test build #30863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30863/consoleFull) for PR 5432 at commit [`d5a9c37`](https://github.com/apache/spark/commit/d5a9c37a00f3b0b5aa66c5f92c325fdf0ac05bf0).
[GitHub] spark pull request: [SPARK-7058] Include RDD deserialization time ...
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/5635#issuecomment-95678486 LGTM!
[GitHub] spark pull request: [SPARK-7070][MLLIB] LDA.setBeta should call se...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5649#issuecomment-95680404 [Test build #30853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30853/consoleFull) for PR 5649 at commit [`c66023c`](https://github.com/apache/spark/commit/c66023cdd1468aa66e7ffcc6b242ccc3ca80ea7c). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95658337 [Test build #30848 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30848/consoleFull) for PR 5660 at commit [`0b522a7`](https://github.com/apache/spark/commit/0b522a77ffe83889e7c6afaa963dd49e5675fe36). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95658377 [Test build #30852 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30852/consoleFull) for PR 5666 at commit [`d77990f`](https://github.com/apache/spark/commit/d77990f0b1d51114e32956d8ce08adfa1f6eff30). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95658408 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30852/ Test FAILed.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95658372 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30848/ Test PASSed.
[GitHub] spark pull request: [SPARK-7044] [SQL] Fix the deadlock in script ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5625
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95668372 [Test build #30858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30858/consoleFull) for PR 5666 at commit [`68c9b00`](https://github.com/apache/spark/commit/68c9b006606e12d0b7e5a27e74e4469581ca45b8).
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95673520 [Test build #30858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30858/consoleFull) for PR 5666 at commit [`68c9b00`](https://github.com/apache/spark/commit/68c9b006606e12d0b7e5a27e74e4469581ca45b8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987596 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility.
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + /** + * The impurity setting is ignored for GBT models. + * Individual trees are built using impurity Variance. + */ + override def setImpurity(value: String): this.type = { +logWarning("GBTClassifier.setImpurity should NOT be used") +this + } + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = { +logWarning("The 'seed' parameter is currently ignored by Gradient Boosting.") +super.setSeed(value) + } + + // Parameters from GBTParams: + + override def setMaxIter(value: Int): this.type = super.setMaxIter(value) + + override def setLearningRate(value: Double): this.type = super.setLearningRate(value) + + // Parameters for GBTClassifier: + + /** + * Loss function which GBT tries to minimize. (case-insensitive) + * Supported: LogLoss + * (default = LogLoss) + * @group param + */ + val loss: Param[String] = new Param[String](this, "loss", "Loss function which GBT tries to" + + " minimize (case-insensitive). Supported options: LogLoss") + + setDefault(loss -> "logloss") + + /** @group setParam */ + def setLoss(value: String):
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987585 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) --- End diff -- Should it go to shared params? I see the problem with the doc. If we want to put something special, we can put it in the JavaDoc. No strong preference about this. But it makes me wonder whether we should mark shared params final. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987643 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala --- @@ -58,3 +58,43 @@ trait DecisionTreeModel { header + rootNode.subtreeToString(2) } } + +/** + * :: AlphaComponent :: + * + * Abstraction for models which are ensembles of decision trees + * + * TODO: Add support for predicting probabilities and raw predictions + */ +@AlphaComponent +trait TreeEnsembleModel { --- End diff -- Should it be public? Note that adding a method to an interface counts as a breaking change.
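The binary-compatibility concern mengxr raises can be sketched with a toy example (the names below are illustrative, not Spark's actual classes): an abstract method added to a published trait breaks every external implementor, while a method added with a default body keeps old implementations working.

```scala
// Toy sketch of trait evolution. An abstract method added after release would
// break external implementors; a method with a concrete default body does not,
// because implementors simply inherit it.
trait TreeEnsembleModelV1 {
  def trees: Seq[String]            // abstract: implementors must define it
  def numTrees: Int = trees.length  // concrete default: safe to add after release
}

// An "external" implementor written against the original trait still compiles
// after the concrete method was added, since it inherits the default body.
class MyEnsemble extends TreeEnsembleModelV1 {
  def trees: Seq[String] = Seq("t1", "t2", "t3")
}

println(new MyEnsemble().numTrees)  // 3
```

This is one reason APIs sometimes expose abstract classes (where adding a concrete method is safer) rather than traits/interfaces, or keep the type private until it stabilizes.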
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987593 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + /** + * The impurity setting is ignored for GBT models. + * Individual trees are built using impurity Variance. + */ + override def setImpurity(value: String): this.type = { +logWarning("GBTClassifier.setImpurity should NOT be used") +this + } + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = { +logWarning("The 'seed' parameter is currently ignored by Gradient Boosting.") +super.setSeed(value) + } + + // Parameters from GBTParams: + + override def setMaxIter(value: Int): this.type = super.setMaxIter(value) + + override def setLearningRate(value: Double): this.type = super.setLearningRate(value) + + // Parameters for GBTClassifier: + + /** + * Loss function which GBT tries to minimize. (case-insensitive) + * Supported: LogLoss + * (default = LogLoss) + * @group param + */ + val loss: Param[String] = new Param[String](this, "loss", "Loss function which GBT tries to" + --- End diff -- `loss` -> `lossType`? `loss` may be too general. If `loss` appears in another algorithm, they should have similar semantics. For example, if we put
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987606 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala --- @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import scala.collection.mutable + +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Params, ParamMap} +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{RandomForest => OldRandomForest} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, Strategy => OldStrategy} +import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Random_forest Random Forest]] learning algorithm for + * classification. + * It supports both binary and multiclass labels, as well as both continuous and categorical + * features. + */ +@AlphaComponent +final class RandomForestClassifier + extends Predictor[Vector, RandomForestClassifier, RandomForestClassificationModel] + with RandomForestParams with TreeClassifierParams { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + override def setImpurity(value: String): this.type = super.setImpurity(value) + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = super.setSeed(value) + + // Parameters from RandomForestParams: + + override def setNumTrees(value: Int): this.type = super.setNumTrees(value) + + override def setFeaturesPerNode(value: String): this.type = super.setFeaturesPerNode(value) + + override protected def train( + dataset: DataFrame, + paramMap: ParamMap): RandomForestClassificationModel = { +val categoricalFeatures: Map[Int, Int] = + MetadataUtils.getCategoricalFeatures(dataset.schema(paramMap(featuresCol))) +val numClasses: Int = MetadataUtils.getNumClasses(dataset.schema(paramMap(labelCol))) match { + case Some(n: Int) => n + case None => throw new IllegalArgumentException("RandomForestClassifier was given input" + +s" with invalid label column, without the number of classes specified.") + // TODO: Automatically index labels. 
+} +val oldDataset: RDD[LabeledPoint] = extractLabeledPoints(dataset, paramMap) +val strategy = + super.getOldStrategy(categoricalFeatures, numClasses, OldAlgo.Classification, getOldImpurity) +val oldModel = OldRandomForest.trainClassifier( + oldDataset, strategy, getNumTrees, getFeaturesPerNodeStr, getSeed.toInt)
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987621 --- Diff: mllib/src/main/scala/org/apache/spark/ml/impl/tree/treeParams.scala --- @@ -298,3 +302,200 @@ private[ml] object TreeRegressorParams { // These options should be lowercase. val supportedImpurities: Array[String] = Array("variance").map(_.toLowerCase) } + +/** + * :: DeveloperApi :: + * Parameters for Decision Tree-based ensemble algorithms. + * + * Note: Marked as private and DeveloperApi since this may be made public in the future. + */ +@DeveloperApi +private[ml] trait TreeEnsembleParams extends DecisionTreeParams { + + /** + * Fraction of the training data used for learning each decision tree. + * (default = 1.0) + * @group param + */ + final val subsamplingRate: DoubleParam = new DoubleParam(this, "subsamplingRate", +"Fraction of the training data used for learning each decision tree.") + + /** + * Random seed for bootstrapping and choosing feature subsets. + * @group param + */ + final val seed: LongParam = new LongParam(this, "seed", +"Random seed for bootstrapping and choosing feature subsets.") + + setDefault(subsamplingRate -> 1.0, seed -> Utils.random.nextLong()) + + /** @group setParam */ + def setSubsamplingRate(value: Double): this.type = { +require(value > 0.0 && value <= 1.0, + s"Subsampling rate must be in range (0, 1]. Bad rate: $value") +set(subsamplingRate, value) +this + } + + /** @group getParam */ + def getSubsamplingRate: Double = getOrDefault(subsamplingRate) --- End diff -- Most getters should be final.
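The validation and the "getters should be final" point from the quoted diff can be shown in a minimal standalone sketch (outside Spark's `Params` machinery; the class name is hypothetical): the setter rejects values outside (0, 1] via `require`, and the getter is marked final so subclasses cannot redefine its meaning.

```scala
// Minimal sketch of the subsampling-rate parameter discussed above.
class EnsembleParamsSketch {
  private var subsamplingRate: Double = 1.0  // default, as in the diff

  // Fails fast with IllegalArgumentException for values outside (0, 1].
  def setSubsamplingRate(value: Double): this.type = {
    require(value > 0.0 && value <= 1.0,
      s"Subsampling rate must be in range (0, 1]. Bad rate: $value")
    subsamplingRate = value
    this
  }

  // final: subclasses cannot override the getter's semantics.
  final def getSubsamplingRate: Double = subsamplingRate
}

val p = new EnsembleParamsSketch().setSubsamplingRate(0.5)
println(p.getSubsamplingRate)  // 0.5

// Out-of-range values are rejected (require throws IllegalArgumentException).
val rejected =
  try { p.setSubsamplingRate(0.0); false }
  catch { case _: IllegalArgumentException => true }
println(rejected)  // true
```

Returning `this.type` from the setter preserves the concrete subclass type, which is what makes chained calls like `new RandomForestClassifier().setMaxDepth(5).setNumTrees(100)` typecheck in the Java-friendly builder style the PR uses.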
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987591 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + /** + * The impurity setting is ignored for GBT models. + * Individual trees are built using impurity Variance. + */ + override def setImpurity(value: String): this.type = { +logWarning("GBTClassifier.setImpurity should NOT be used") +this + } + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = { +logWarning("The 'seed' parameter is currently ignored by Gradient Boosting.") +super.setSeed(value) + } + + // Parameters from GBTParams: + + override def setMaxIter(value: Int): this.type = super.setMaxIter(value) + + override def setLearningRate(value: Double): this.type = super.setLearningRate(value) + + // Parameters for GBTClassifier: + + /** + * Loss function which GBT tries to minimize. (case-insensitive) + * Supported: LogLoss --- End diff -- Though the values are case insensitive, they should show up consistently in the doc. `logloss` or `logLoss`? The latter looks better to me. Btw, is `log` sufficient?
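The case-insensitivity being discussed can be sketched with a small hypothetical helper (not Spark's actual API): supported values are stored lowercase and user input is lowercased before validation, so `"LogLoss"`, `"logloss"`, and `"LOGLOSS"` all resolve to the same canonical option, while the documentation can display whichever casing reads best.

```scala
// Hypothetical normalization helper for a case-insensitive string param.
val supportedLosses: Array[String] = Array("logloss")  // stored lowercase

def normalizeLoss(value: String): String = {
  val canonical = value.toLowerCase
  require(supportedLosses.contains(canonical), s"Unsupported loss: $value")
  canonical
}

println(normalizeLoss("LogLoss"))                              // logloss
println(normalizeLoss("LOGLOSS") == normalizeLoss("logloss"))  // true
```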
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987598 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + /** + * The impurity setting is ignored for GBT models. + * Individual trees are built using impurity Variance. + */ + override def setImpurity(value: String): this.type = { +logWarning("GBTClassifier.setImpurity should NOT be used") +this + } + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = { +logWarning("The 'seed' parameter is currently ignored by Gradient Boosting.") +super.setSeed(value) + } + + // Parameters from GBTParams: + + override def setMaxIter(value: Int): this.type = super.setMaxIter(value) + + override def setLearningRate(value: Double): this.type = super.setLearningRate(value) + + // Parameters for GBTClassifier: + + /** + * Loss function which GBT tries to minimize. (case-insensitive) + * Supported: LogLoss + * (default = LogLoss) + * @group param + */ + val loss: Param[String] = new Param[String](this, "loss", "Loss function which GBT tries to" + + " minimize (case-insensitive). Supported options: LogLoss") + + setDefault(loss -> "logloss") + + /** @group setParam */ + def setLoss(value: String):
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95675268 I have addressed almost all of your feedback. "Using this vs the approach in #5547. I think a good answer here is to use this vis.js library for the jobs page and then use a custom D3-based approach for the stage page, where we need to be careful about scalability to thousands of events (e.g. thousands of tasks). So with that in mind, I'd propose removing the stage functionality for now from this patch and only having the other pages." O.K. I respect the approach in #5547 for the stage page. The following feedback items are still pending. "It would be good to have a visual line indicating the start of the application." Instead of a visual line, in the current implementation we cannot scroll before the time the application started. "It would be nice if you could use +scroll to zoom, so that we could remove the scroll lock. Is this possible with the library?" vis.js doesn't support that feature directly, so we need some trick. I'll implement this feature when I get a good idea. "It would be nice if I could mouse over a job and then have it highlight the corresponding job on the table below." Instead of this, when we click on an event box on the timeline, we can move to the corresponding row in the jobs table.
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-95675502 Yes, it says it timed out (two comments up)
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5668 [SPARK-7097][SQL]: Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold This PR adds support for better size estimation for partitioned tables, so that only the referred partitions' sizes are taken into consideration when testing against autoBroadcastJoinThreshold and deciding whether to create a broadcast join or a shuffle hash join. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark part_size Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5668.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5668 commit b0beb34d6a77c738660cb161306c947411d70ab5 Author: Yash Datta yash.da...@guavus.com Date: 2015-04-23T17:58:17Z SPARK-7097: Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold
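The sizing rule this PR targets can be modeled with a toy sketch (not Spark internals; names and the partition map are illustrative, though 10 MB is Spark's default for `spark.sql.autoBroadcastJoinThreshold`): only partitions actually referenced by the query contribute to the size estimate compared against the threshold.

```scala
// Toy model of partition-pruned size estimation for the broadcast-join decision.
val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024  // Spark's default: 10 MB

def shouldBroadcast(partitionSizes: Map[String, Long], referred: Set[String]): Boolean = {
  // Sum only the partitions the query actually touches.
  val estimatedSize = referred.iterator.map(p => partitionSizes.getOrElse(p, 0L)).sum
  estimatedSize <= autoBroadcastJoinThreshold
}

val sizes = Map(
  "dt=2015-01" -> 4L * 1024 * 1024,
  "dt=2015-02" -> 8L * 1024 * 1024,
  "dt=2015-03" -> 20L * 1024 * 1024)

// The whole table (32 MB) exceeds the threshold, but a query that touches only
// the 4 MB partition fits under it and can use a broadcast join.
println(shouldBroadcast(sizes, sizes.keySet))       // false
println(shouldBroadcast(sizes, Set("dt=2015-01")))  // true
```

Without partition pruning in the estimate, the second query above would fall back to a shuffle hash join even though the data it actually reads is small, which is exactly the behavior the PR aims to improve.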
[GitHub] spark pull request: [HOTFIX][SQL] Ignore flaky CachedTableSuite te...
Github user marmbrus closed the pull request at: https://github.com/apache/spark/pull/5639