[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4173#issuecomment-71604079
  
[Test build #26150 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26150/consoleFull) for PR 4173 at commit [`16934ee`](https://github.com/apache/spark/commit/16934ee0c9719afeb047e4eacf6e35b5e4aca86d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4217#issuecomment-71604007
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26147/
Test PASSed.





[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4217#issuecomment-71604002
  
[Test build #26147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26147/consoleFull) for PR 4217 at commit [`e8b0b3d`](https://github.com/apache/spark/commit/e8b0b3d18eeea7734148e40d5ad9d5af8e4d05bd).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4217#issuecomment-71603219
  
LGTM pending Jenkins.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3732#issuecomment-71602934
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26146/
Test PASSed.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3732#issuecomment-71602931
  
[Test build #26146 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26146/consoleFull) for PR 3732 at commit [`d6715fc`](https://github.com/apache/spark/commit/d6715fcac2a4c5a32a6b8c11f84567c675de1892).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-5393. Flood of util.RackResolver log mes...

2015-01-26 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4192#discussion_r23591435
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -60,6 +62,9 @@ private[yarn] class YarnAllocator(
 
   import YarnAllocator._
 
+  // RackResolver logs an INFO message whenever it resolves a rack, which is way too often.
+  Logger.getLogger(classOf[RackResolver]).setLevel(Level.WARN)
--- End diff --

It will.  Which isn't ideal, but I don't think it's a big deal.
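
For reference, a minimal sketch of a guard that would avoid clobbering a level the user configured explicitly (hypothetical; the diff above sets the level unconditionally):

```scala
import org.apache.hadoop.yarn.util.RackResolver
import org.apache.log4j.{Level, Logger}

// Only quiet RackResolver when no level has been set for it explicitly:
// Logger.getLevel returns null if this logger has no level of its own.
if (Logger.getLogger(classOf[RackResolver]).getLevel == null) {
  Logger.getLogger(classOf[RackResolver]).setLevel(Level.WARN)
}
```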





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23591427
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -212,6 +284,17 @@ class MatricesSuite extends FunSuite {
 assert(deHorz1(2, 4) === 1.0)
 assert(deHorz1(1, 4) === 0.0)
 
+// containing transposed matrices
+val spHorzT = Matrices.horzcat(Array(spMat1TT, spMat2))
+val spHorz2T = Matrices.horzcat(Array(spMat1TT, deMat2))
+val spHorz3T = Matrices.horzcat(Array(deMat1TT, spMat2))
+val deHorz1T = Matrices.horzcat(Array(deMat1TT, deMat2))
+
+assert(deHorz1T.toBreeze === deHorz1.toBreeze)
--- End diff --

We shouldn't normally need toBreeze, but when I remove toBreeze, even 
though the matrices are exactly the same, I get an error. We shouldn't need to 
add `transpose` support to TestingUtils, because it already compares the output 
of `.toArray` of both matrices, which should be equal.
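
For reference, a minimal sketch of the array-based comparison being described, reusing the matrix names from the test excerpt above (assumes both matrices have the same dimensions):

```scala
// Matrix.toArray returns the values in column-major order, so two matrices of
// equal dimensions with the same toArray output are element-wise equal.
assert(deHorz1T.toArray === deHorz1.toArray)
```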





[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4173#discussion_r23591326
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -0,0 +1,598 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+import scala.collection.JavaConversions._
+
+import java.util.{ArrayList, List => JList}
+
+import com.fasterxml.jackson.core.JsonFactory
+import net.razorvine.pickle.Pickler
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.api.python.SerDeUtil
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.{LogicalRDD, EvaluatePython}
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and 
can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile("...")
+ * }}}
+ *
+ * Once created, it can be manipulated using the various 
domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for 
Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its 
various functions.
+ * {{
+ *   // The following creates a new column that increases everybody's age 
by 10.
+ *   people("age") + 10  // in Scala
+ * }}
+ *
+ * A more concrete example:
+ * {{{
+ *   // To create DataFrame using SQLContext
+ *   val people = sqlContext.parquetFile("...")
+ *   val department = sqlContext.parquetFile("...")
+ *
+ *   people.filter("age" > 30)
+ * .join(department, people("deptId") === department("id"))
+ * .groupby(department("name"), "gender")
+ * .agg(avg(people("salary")), max(people("age")))
+ * }}}
+ */
+// TODO: Improve documentation.
+class DataFrame protected[sql](
+val sqlContext: SQLContext,
+private val baseLogicalPlan: LogicalPlan,
+operatorsEnabled: Boolean)
+  extends DataFrameSpecificApi with RDDApi[Row] {
+
+  protected[sql] def this(sqlContext: Option[SQLContext], plan: 
Option[LogicalPlan]) =
+this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && 
plan.isDefined)
+
+  protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = 
this(sqlContext, plan, true)
+
+  @transient protected[sql] lazy val queryExecution = 
sqlContext.executePlan(baseLogicalPlan)
+
+  @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan 
match {
+// For various commands (like DDL) and queries with side effects, we 
force query optimization to
+// happen right away to let these side effects take place eagerly.
+case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] |_: 
WriteToFile =>
+  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
+case _ =>
+  baseLogicalPlan
+  }
+
+  /**
+   * An implicit conversion function internal to this class for us to 
avoid doing
+   * "new D

[GitHub] spark pull request: SPARK-4687. [WIP] Add an addDirectory API

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3670#issuecomment-71601199
  
[Test build #26149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26149/consoleFull) for PR 3670 at commit [`9899623`](https://github.com/apache/spark/commit/9899623af9b12c5370336519079aa42c3fdbc2db).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3732#issuecomment-71601166
  
[Test build #26148 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26148/consoleFull) for PR 3732 at commit [`024c9a6`](https://github.com/apache/spark/commit/024c9a6d37da9930b2c8d57cf4c01543ef0080e2).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-01-26 Thread adrian-wang
Github user adrian-wang commented on the pull request:

https://github.com/apache/spark/pull/3732#issuecomment-71600826
  
cc @rxin 





[GitHub] spark pull request: [SPARK-5052] Add common/base classes to fix gu...

2015-01-26 Thread elmer-garduno
Github user elmer-garduno commented on the pull request:

https://github.com/apache/spark/pull/3874#issuecomment-71600458
  
Thanks for merging!





[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4173#discussion_r23590914
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -0,0 +1,598 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+import scala.collection.JavaConversions._
+
+import java.util.{ArrayList, List => JList}
+
+import com.fasterxml.jackson.core.JsonFactory
+import net.razorvine.pickle.Pickler
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.api.python.SerDeUtil
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.{LogicalRDD, EvaluatePython}
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and 
can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile("...")
+ * }}}
+ *
+ * Once created, it can be manipulated using the various 
domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for 
Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its 
various functions.
+ * {{
+ *   // The following creates a new column that increases everybody's age 
by 10.
+ *   people("age") + 10  // in Scala
+ * }}
+ *
+ * A more concrete example:
+ * {{{
+ *   // To create DataFrame using SQLContext
+ *   val people = sqlContext.parquetFile("...")
+ *   val department = sqlContext.parquetFile("...")
+ *
+ *   people.filter("age" > 30)
+ * .join(department, people("deptId") === department("id"))
+ * .groupby(department("name"), "gender")
+ * .agg(avg(people("salary")), max(people("age")))
+ * }}}
+ */
+// TODO: Improve documentation.
+class DataFrame protected[sql](
+val sqlContext: SQLContext,
+private val baseLogicalPlan: LogicalPlan,
+operatorsEnabled: Boolean)
+  extends DataFrameSpecificApi with RDDApi[Row] {
+
+  protected[sql] def this(sqlContext: Option[SQLContext], plan: 
Option[LogicalPlan]) =
+this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && 
plan.isDefined)
+
+  protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = 
this(sqlContext, plan, true)
+
+  @transient protected[sql] lazy val queryExecution = 
sqlContext.executePlan(baseLogicalPlan)
+
+  @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan 
match {
+// For various commands (like DDL) and queries with side effects, we 
force query optimization to
+// happen right away to let these side effects take place eagerly.
+case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] |_: 
WriteToFile =>
+  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
+case _ =>
+  baseLogicalPlan
+  }
+
+  /**
+   * An implicit conversion function internal to this class for us to 
avoid doing
+   * "new D

[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/4173#discussion_r23590711
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -0,0 +1,598 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+import scala.collection.JavaConversions._
+
+import java.util.{ArrayList, List => JList}
+
+import com.fasterxml.jackson.core.JsonFactory
+import net.razorvine.pickle.Pickler
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.api.python.SerDeUtil
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.{LogicalRDD, EvaluatePython}
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and 
can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile("...")
+ * }}}
+ *
+ * Once created, it can be manipulated using the various 
domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for 
Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its 
various functions.
+ * {{
+ *   // The following creates a new column that increases everybody's age 
by 10.
+ *   people("age") + 10  // in Scala
+ * }}
+ *
+ * A more concrete example:
+ * {{{
+ *   // To create DataFrame using SQLContext
+ *   val people = sqlContext.parquetFile("...")
+ *   val department = sqlContext.parquetFile("...")
+ *
+ *   people.filter("age" > 30)
+ * .join(department, people("deptId") === department("id"))
+ * .groupby(department("name"), "gender")
+ * .agg(avg(people("salary")), max(people("age")))
+ * }}}
+ */
+// TODO: Improve documentation.
+class DataFrame protected[sql](
+val sqlContext: SQLContext,
+private val baseLogicalPlan: LogicalPlan,
+operatorsEnabled: Boolean)
+  extends DataFrameSpecificApi with RDDApi[Row] {
+
+  protected[sql] def this(sqlContext: Option[SQLContext], plan: 
Option[LogicalPlan]) =
+this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && 
plan.isDefined)
+
+  protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = 
this(sqlContext, plan, true)
+
+  @transient protected[sql] lazy val queryExecution = 
sqlContext.executePlan(baseLogicalPlan)
+
+  @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan 
match {
+// For various commands (like DDL) and queries with side effects, we 
force query optimization to
+// happen right away to let these side effects take place eagerly.
+case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] |_: 
WriteToFile =>
+  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
+case _ =>
+  baseLogicalPlan
+  }
+
+  /**
+   * An implicit conversion function internal to this class for us to 
avoid doing
+   * "new

[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3222#issuecomment-71599365
  
[Test build #26145 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26145/consoleFull) for PR 3222 at commit [`612c7bd`](https://github.com/apache/spark/commit/612c7bdce7208791741f3a9f91e896c9585e832b).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AdaGradUpdater(`
  * `class DBN(val stackedRBM: StackedRBM)`
  * `class MLP(`
  * `class MomentumUpdater(val momentum: Double) extends Updater `
  * `class RBM(`
  * `class StackedRBM(val innerRBMs: Array[RBM])`
  * `case class MinstItem(label: Int, data: Array[Int]) `
  * `class MinstDatasetReader(labelsFile: String, imagesFile: String)`






[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3222#issuecomment-71599372
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26145/
Test PASSed.





[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4217#issuecomment-71597558
  
[Test build #26147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26147/consoleFull) for PR 4217 at commit [`e8b0b3d`](https://github.com/apache/spark/commit/e8b0b3d18eeea7734148e40d5ad9d5af8e4d05bd).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589919
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -212,6 +284,17 @@ class MatricesSuite extends FunSuite {
 assert(deHorz1(2, 4) === 1.0)
 assert(deHorz1(1, 4) === 0.0)
 
+// containing transposed matrices
+val spHorzT = Matrices.horzcat(Array(spMat1TT, spMat2))
+val spHorz2T = Matrices.horzcat(Array(spMat1TT, deMat2))
+val spHorz3T = Matrices.horzcat(Array(deMat1TT, spMat2))
+val deHorz1T = Matrices.horzcat(Array(deMat1TT, deMat2))
+
+assert(deHorz1T.toBreeze === deHorz1.toBreeze)
--- End diff --

Do we need `toBreeze`? Shall we add `transpose` support to `TestingUtils`?





[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/4217#issuecomment-71597388
  
A new commit has been added to address the comments. Thanks.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3732#issuecomment-71597172
  
[Test build #26146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26146/consoleFull) for PR 3732 at commit [`d6715fc`](https://github.com/apache/spark/commit/d6715fcac2a4c5a32a6b8c11f84567c675de1892).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589813
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -92,6 +87,15 @@ sealed trait Matrix extends Serializable {
 * backing array. For example, an operation such as addition or 
subtraction will only be
 * performed on the non-zero values in a `SparseMatrix`. */
   private[mllib] def update(f: Double => Double): Matrix
+
+  /**
+   * Applies a function `f` to all the active elements of dense and sparse 
matrix.
--- End diff --

It may be worth noting that the ordering of the elements is not defined.
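
A sketch of how the Scaladoc might read with that caveat added (wording assumed, not taken from the PR):

```scala
/**
 * Applies a function `f` to all the active elements of the matrix.
 * The ordering in which the elements are visited is not defined.
 */
private[mllib] def foreachActive(f: (Int, Int, Double) => Unit): Unit
```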





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589803
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -161,6 +161,68 @@ class MatricesSuite extends FunSuite {
 assert(deMat1.toArray === deMat2.toArray)
   }
 
+  test("transpose") {
+val dA =
+  new DenseMatrix(4, 3, Array(0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 
0.0, 0.0, 0.0, 3.0))
+val sA = new SparseMatrix(4, 3, Array(0, 1, 3, 4), Array(1, 0, 2, 3), 
Array(1.0, 2.0, 1.0, 3.0))
+
+val dAT = dA.transpose.asInstanceOf[DenseMatrix]
+val sAT = sA.transpose.asInstanceOf[SparseMatrix]
+val dATexpected =
+  new DenseMatrix(3, 4, Array(0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 
0.0, 0.0, 0.0, 3.0))
+val sATexpected =
+  new SparseMatrix(3, 4, Array(0, 1, 2, 3, 4), Array(1, 0, 1, 2), 
Array(2.0, 1.0, 1.0, 3.0))
+
+assert(dAT.toBreeze === dATexpected.toBreeze)
+assert(sAT.toBreeze === sATexpected.toBreeze)
+assert(dA(1, 0) === dAT(0, 1))
+assert(dA(2, 1) === dAT(1, 2))
+assert(sA(1, 0) === sAT(0, 1))
+assert(sA(2, 1) === sAT(1, 2))
+
+assert(!dA.toArray.eq(dAT.toArray), "has to have a new array")
+assert(dA.toArray.eq(dAT.transpose.toArray), "should not copy array")
+
+assert(dAT.toSparse().toBreeze === sATexpected.toBreeze)
+assert(sAT.toDense().toBreeze === dATexpected.toBreeze)
+  }
+
+  test("foreachActive") {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val allValues = Array(1.0, 2.0, 0.0, 0.0, 4.0, 5.0)
+val colPtrs = Array(0, 2, 4)
+val rowIndices = Array(0, 1, 1, 2)
+
+val sp = new SparseMatrix(m, n, colPtrs, rowIndices, values)
+val dn = new DenseMatrix(m, n, allValues)
+
+val dnMap = scala.collection.mutable.Map[(Int, Int), Double]()
--- End diff --

`import scala.collection.mutable` and use `mutable.Map` here.
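
That is, a sketch of the suggested style:

```scala
import scala.collection.mutable

// Import the package once and refer to the collection as mutable.Map,
// instead of spelling out scala.collection.mutable.Map at the use site.
val dnMap = mutable.Map[(Int, Int), Double]()
```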





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589796
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -341,6 +430,37 @@ class SparseMatrix(
 this
   }
 
+  override def transpose: Matrix = {
+val transposedMatrix = new SparseMatrix(numCols, numRows, colPtrs, 
rowIndices, values)
+transposedMatrix.isTrans = !isTransposed
+transposedMatrix
+  }
+
+  private[spark] override def foreachActive(f: (Int, Int, Double) => 
Unit): Unit = {
+if (!isTransposed) {
+  var j = 0
+  while (j < numCols) {
+var idx = colPtrs(j)
+while (idx < colPtrs(j + 1)) {
--- End diff --

Cache `colPtrs(j + 1)`.
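
A minimal sketch of the suggested change, hoisting the column end pointer out of the inner loop condition (names taken from the diff above):

```scala
var j = 0
while (j < numCols) {
  var idx = colPtrs(j)
  val idxEnd = colPtrs(j + 1)  // cached once per column rather than re-read every iteration
  while (idx < idxEnd) {
    f(rowIndices(idx), j, values(idx))
    idx += 1
  }
  j += 1
}
```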





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589806
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -161,6 +161,68 @@ class MatricesSuite extends FunSuite {
 assert(deMat1.toArray === deMat2.toArray)
   }
 
+  test("transpose") {
+val dA =
+  new DenseMatrix(4, 3, Array(0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 
0.0, 0.0, 0.0, 3.0))
+val sA = new SparseMatrix(4, 3, Array(0, 1, 3, 4), Array(1, 0, 2, 3), 
Array(1.0, 2.0, 1.0, 3.0))
+
+val dAT = dA.transpose.asInstanceOf[DenseMatrix]
+val sAT = sA.transpose.asInstanceOf[SparseMatrix]
+val dATexpected =
+  new DenseMatrix(3, 4, Array(0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 
0.0, 0.0, 0.0, 3.0))
+val sATexpected =
+  new SparseMatrix(3, 4, Array(0, 1, 2, 3, 4), Array(1, 0, 1, 2), 
Array(2.0, 1.0, 1.0, 3.0))
+
+assert(dAT.toBreeze === dATexpected.toBreeze)
+assert(sAT.toBreeze === sATexpected.toBreeze)
+assert(dA(1, 0) === dAT(0, 1))
+assert(dA(2, 1) === dAT(1, 2))
+assert(sA(1, 0) === sAT(0, 1))
+assert(sA(2, 1) === sAT(1, 2))
+
+assert(!dA.toArray.eq(dAT.toArray), "has to have a new array")
+assert(dA.toArray.eq(dAT.transpose.toArray), "should not copy array")
+
+assert(dAT.toSparse().toBreeze === sATexpected.toBreeze)
+assert(sAT.toDense().toBreeze === dATexpected.toBreeze)
+  }
+
+  test("foreachActive") {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val allValues = Array(1.0, 2.0, 0.0, 0.0, 4.0, 5.0)
+val colPtrs = Array(0, 2, 4)
+val rowIndices = Array(0, 1, 1, 2)
+
+val sp = new SparseMatrix(m, n, colPtrs, rowIndices, values)
+val dn = new DenseMatrix(m, n, allValues)
+
+val dnMap = scala.collection.mutable.Map[(Int, Int), Double]()
+dn.foreachActive { (i, j, value) =>
+  dnMap.put((i, j), value)
+}
+assert(dnMap.size === 6)
+assert(dnMap.get(0, 0) === Some(1.0))
+assert(dnMap.get(1, 0) === Some(2.0))
+assert(dnMap.get(2, 0) === Some(0.0))
+assert(dnMap.get(0, 1) === Some(0.0))
+assert(dnMap.get(1, 1) === Some(4.0))
+assert(dnMap.get(2, 1) === Some(5.0))
+
+val spMap = scala.collection.mutable.Map[Int, Double]()
+var cnt = 0
+sp.foreachActive { (i, j, value) =>
+  spMap.put(cnt, value)
+  cnt += 1
+}
+assert(spMap.size === 4)
+assert(spMap.get(0) === Some(1.0))
--- End diff --

We are asserting on the ordering of `foreachActive`. This is not part of 
the contract. Shall we use `(i, j)` as the key (similar to `dnMap`)?
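
For illustration, a sketch of the suggested rewrite keyed on position rather than visit order (the expected values follow from how `sp` is constructed above):

```scala
val spMap = scala.collection.mutable.Map[(Int, Int), Double]()
sp.foreachActive { (i, j, value) =>
  spMap.put((i, j), value)
}
assert(spMap.size === 4)
assert(spMap((0, 0)) === 1.0)
assert(spMap((1, 0)) === 2.0)
assert(spMap((1, 1)) === 4.0)
assert(spMap((2, 1)) === 5.0)
```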





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589789
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -281,31 +344,53 @@ class SparseMatrix(
 
   require(values.length == rowIndices.length, "The number of row indices 
and values don't match! " +
 s"values.length: ${values.length}, rowIndices.length: 
${rowIndices.length}")
-  require(colPtrs.length == numCols + 1, "The length of the column indices 
should be the " +
-s"number of columns + 1. Currently, colPointers.length: 
${colPtrs.length}, " +
-s"numCols: $numCols")
+  // The Or statement is for the case when the matrix is transposed
+  require(colPtrs.length == numCols + 1 || colPtrs.length == numRows + 1, 
"The length of the " +
+"column indices should be the number of columns + 1. Currently, 
colPointers.length: " +
+s"${colPtrs.length}, numCols: $numCols")
   require(values.length == colPtrs.last, "The last value of colPtrs must 
equal the number of " +
 s"elements. values.length: ${values.length}, colPtrs.last: 
${colPtrs.last}")
 
   override def toArray: Array[Double] = {
 val arr = new Array[Double](numRows * numCols)
-var j = 0
-while (j < numCols) {
-  var i = colPtrs(j)
-  val indEnd = colPtrs(j + 1)
-  val offset = j * numRows
-  while (i < indEnd) {
-val rowIndex = rowIndices(i)
-arr(offset + rowIndex) = values(i)
+// if statement inside the loop would be expensive
+if (!isTransposed) {
+  var j = 0
--- End diff --

Use `foreachActive`.
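
A sketch of what that could look like, expressed entirely in terms of `foreachActive` so the same code serves both layouts (assumes a column-major destination array, as elsewhere in this file):

```scala
override def toArray: Array[Double] = {
  val arr = new Array[Double](numRows * numCols)
  foreachActive { (i, j, v) =>
    arr(j * numRows + i) = v  // column-major layout
  }
  arr
}
```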





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589792
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -341,6 +430,37 @@ class SparseMatrix(
 this
   }
 
+  override def transpose: Matrix = {
--- End diff --

See my comment on `isTrans`.





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589798
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -341,6 +430,37 @@ class SparseMatrix(
 this
   }
 
+  override def transpose: Matrix = {
+val transposedMatrix = new SparseMatrix(numCols, numRows, colPtrs, 
rowIndices, values)
+transposedMatrix.isTrans = !isTransposed
+transposedMatrix
+  }
+
+  private[spark] override def foreachActive(f: (Int, Int, Double) => 
Unit): Unit = {
+if (!isTransposed) {
+  var j = 0
+  while (j < numCols) {
+var idx = colPtrs(j)
+while (idx < colPtrs(j + 1)) {
+  f(rowIndices(idx), j, values(idx))
+  idx += 1
+}
+j += 1
+  }
+} else {
+  var i = 0
+  while (i < numRows) {
+var idx = colPtrs(i)
+while (idx < colPtrs(i + 1)) {
--- End diff --

Ditto.





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589785
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -114,7 +118,24 @@ class DenseMatrix(val numRows: Int, val numCols: Int, 
val values: Array[Double])
   require(values.length == numRows * numCols, "The number of values 
supplied doesn't match the " +
 s"size of the matrix! values.length: ${values.length}, numRows * 
numCols: ${numRows * numCols}")
 
-  override def toArray: Array[Double] = values
+  override def toArray: Array[Double] = {
+if (!isTransposed) {
+  values
+} else {
+  val transposedValues = new Array[Double](values.length)
--- End diff --

Use `foreachActive` and move this implementation to `Matrix`.





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589805
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala ---
@@ -161,6 +161,68 @@ class MatricesSuite extends FunSuite {
 assert(deMat1.toArray === deMat2.toArray)
   }
 
+  test("transpose") {
+val dA =
+  new DenseMatrix(4, 3, Array(0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 
0.0, 0.0, 0.0, 3.0))
+val sA = new SparseMatrix(4, 3, Array(0, 1, 3, 4), Array(1, 0, 2, 3), 
Array(1.0, 2.0, 1.0, 3.0))
+
+val dAT = dA.transpose.asInstanceOf[DenseMatrix]
+val sAT = sA.transpose.asInstanceOf[SparseMatrix]
+val dATexpected =
+  new DenseMatrix(3, 4, Array(0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 
0.0, 0.0, 0.0, 3.0))
+val sATexpected =
+  new SparseMatrix(3, 4, Array(0, 1, 2, 3, 4), Array(1, 0, 1, 2), 
Array(2.0, 1.0, 1.0, 3.0))
+
+assert(dAT.toBreeze === dATexpected.toBreeze)
+assert(sAT.toBreeze === sATexpected.toBreeze)
+assert(dA(1, 0) === dAT(0, 1))
+assert(dA(2, 1) === dAT(1, 2))
+assert(sA(1, 0) === sAT(0, 1))
+assert(sA(2, 1) === sAT(1, 2))
+
+assert(!dA.toArray.eq(dAT.toArray), "has to have a new array")
+assert(dA.toArray.eq(dAT.transpose.toArray), "should not copy array")
+
+assert(dAT.toSparse().toBreeze === sATexpected.toBreeze)
+assert(sAT.toDense().toBreeze === dATexpected.toBreeze)
+  }
+
+  test("foreachActive") {
+val m = 3
+val n = 2
+val values = Array(1.0, 2.0, 4.0, 5.0)
+val allValues = Array(1.0, 2.0, 0.0, 0.0, 4.0, 5.0)
+val colPtrs = Array(0, 2, 4)
+val rowIndices = Array(0, 1, 1, 2)
+
+val sp = new SparseMatrix(m, n, colPtrs, rowIndices, values)
+val dn = new DenseMatrix(m, n, allValues)
+
+val dnMap = scala.collection.mutable.Map[(Int, Int), Double]()
+dn.foreachActive { (i, j, value) =>
+  dnMap.put((i, j), value)
+}
+assert(dnMap.size === 6)
+assert(dnMap.get(0, 0) === Some(1.0))
--- End diff --

`dnMap(0, 0) === 1.0`





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589782
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -52,10 +58,13 @@ sealed trait Matrix extends Serializable {
   /** Get a deep copy of the matrix. */
   def copy: Matrix
 
+  /** Transpose the Matrix */
--- End diff --

It returns a new matrix instance representing the transposed matrix. We 
should also mention that they share the same underlying data arrays.
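
A sketch of the expanded Scaladoc (wording assumed):

```scala
/**
 * Transpose the Matrix. Returns a new `Matrix` instance for the transposed view;
 * the underlying data arrays are shared with the original matrix, not copied.
 */
def transpose: Matrix
```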





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589786
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -148,6 +178,40 @@ class DenseMatrix(val numRows: Int, val numCols: Int, 
val values: Array[Double])
 this
   }
 
+  override def transpose: Matrix = {
+val transposedMatrix = new DenseMatrix(numCols, numRows, values)
+transposedMatrix.isTrans = !isTransposed
+transposedMatrix
+  }
+
+  private[spark] override def foreachActive(f: (Int, Int, Double) => 
Unit): Unit = {
+if (!isTransposed) {
+  // outer loop over columns
+  var j = 0
+  while (j < numCols) {
+var i = 0
+while (i < numRows) {
+  val ind = index(i, j)
--- End diff --

Do not call `index`. Increment `ind` instead.
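
A minimal sketch of the suggestion for the non-transposed case: the values are laid out column-major, so a single running offset can replace the per-element `index(i, j)` call:

```scala
var ind = 0  // running column-major offset into values
var j = 0
while (j < numCols) {
  var i = 0
  while (i < numRows) {
    f(i, j, values(ind))
    ind += 1
    i += 1
  }
  j += 1
}
```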





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589775
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -257,30 +257,30 @@ private[spark] object BLAS extends Serializable with 
Logging {
 
   /**
* C := alpha * A * B + beta * C
-   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
-   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
* @param alpha a scalar to scale the multiplication A * B.
* @param A the matrix A that will be left multiplied to B. Size of m x 
k.
* @param B the matrix B that will be left multiplied by A. Size of k x 
n.
* @param beta a scalar that can be used to scale matrix C.
-   * @param C the resulting matrix C. Size of m x n.
+   * @param C the resulting matrix C. Size of m x n. C.isTransposed must 
be false. In other words,
+   *  C cannot be the product of a `transpose()` call, or be 
converted from a transposed
--- End diff --

The last sentence is a little misleading. `C` could come from `transpose.transpose`. We can remove the last sentence completely; `C.isTransposed must be false.` is already sufficient.





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589776
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala ---
@@ -257,30 +257,30 @@ private[spark] object BLAS extends Serializable with 
Logging {
 
   /**
* C := alpha * A * B + beta * C
-   * @param transA whether to use the transpose of matrix A (true), or A 
itself (false).
-   * @param transB whether to use the transpose of matrix B (true), or B 
itself (false).
* @param alpha a scalar to scale the multiplication A * B.
* @param A the matrix A that will be left multiplied to B. Size of m x 
k.
* @param B the matrix B that will be left multiplied by A. Size of k x 
n.
* @param beta a scalar that can be used to scale matrix C.
-   * @param C the resulting matrix C. Size of m x n.
+   * @param C the resulting matrix C. Size of m x n. C.isTransposed must 
be false. In other words,
+   *  C cannot be the product of a `transpose()` call, or be 
converted from a transposed
--- End diff --

The last sentence is a little misleading. `C` could come from `transpose.transpose`. We can remove the last sentence completely; `C.isTransposed must be false.` is already sufficient.





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23589781
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -34,6 +34,12 @@ sealed trait Matrix extends Serializable {
   /** Number of columns. */
   def numCols: Int
 
+  /** Flag that keeps track whether the matrix is transposed or not. False 
by default. */
+  private[mllib] var isTrans = false
--- End diff --

We can make `isTranspose` a `val`/`def`, and add a new constructor that takes `isTransposed` as input.
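
A rough sketch of that shape (simplified and assumed; the real classes carry more members and validation):

```scala
trait Matrix extends Serializable {
  def numRows: Int
  def numCols: Int

  /** Flag that keeps track whether the matrix is transposed or not. False by default. */
  def isTransposed: Boolean = false
}

class DenseMatrix(
    val numRows: Int,
    val numCols: Int,
    val values: Array[Double],
    override val isTransposed: Boolean) extends Matrix {

  // Secondary constructor preserving the existing three-argument API.
  def this(numRows: Int, numCols: Int, values: Array[Double]) =
    this(numRows, numCols, values, false)

  // Transposing flips the flag and the dimensions; the values array is shared.
  def transpose: DenseMatrix = new DenseMatrix(numCols, numRows, values, !isTransposed)
}
```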





[GitHub] spark pull request: [SPARK-5264][SQL] support `drop table` DDL com...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4060#issuecomment-71596700
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26144/
Test FAILed.





[GitHub] spark pull request: [SPARK-5264][SQL] support `drop table` DDL com...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4060#issuecomment-71596695
  
[Test build #26144 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26144/consoleFull) for PR 4060 at commit [`dc1c9a0`](https://github.com/apache/spark/commit/dc1c9a0384b53ddf40cad6e74b1f4c070e7284de).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5264][SQL] support `drop table` DDL com...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4060#issuecomment-71596340
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26143/
Test FAILed.





[GitHub] spark pull request: [SPARK-5264][SQL] support `drop table` DDL com...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4060#issuecomment-71596334
  
  [Test build #26143 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26143/consoleFull)
 for   PR 4060 at commit 
[`1e47f8a`](https://github.com/apache/spark/commit/1e47f8a6b1cc445eddf4fa6e4e9e244854bfe2ed).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/4055#discussion_r23589395
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -106,7 +106,21 @@ private[spark] abstract class Task[T](val stageId: 
Int, var partitionId: Int) ex
 if (interruptThread && taskThread != null) {
   taskThread.interrupt()
 }
-  }  
+  }
+
+  override def hashCode(): Int = {
+31 * stageId.hashCode() + partitionId.hashCode()
+  }
+
+  def canEqual(other: Any): Boolean = other.isInstanceOf[Task[_]]
+
+  override def equals(other: Any): Boolean = other match {
+case that: Task[_] =>
+  (that canEqual this) &&
--- End diff --

According to the logic of `DAGScheduler`, I think that if two tasks have the same stage id, they must be of the same type, so we don't need to check type equality here.
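
A sketch of the simplified equality that implies, with task identity derived purely from (stageId, partitionId) and no runtime type check (illustrative only; the field names follow the quoted diff):

    abstract class TaskSketch(val stageId: Int, val partitionId: Int) {
      override def hashCode(): Int = 31 * stageId.hashCode() + partitionId.hashCode()

      // Two tasks are equal iff they belong to the same stage and partition.
      override def equals(other: Any): Boolean = other match {
        case that: TaskSketch =>
          this.stageId == that.stageId && this.partitionId == that.partitionId
        case _ => false
      }
    }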





[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/4055#discussion_r23589410
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -540,6 +540,52 @@ class DAGSchedulerSuite extends 
TestKit(ActorSystem("DAGSchedulerSuite")) with F
 assertDataStructuresEmpty
   }
 
+  test("Make sure mapStage.pendingtasks is set() " +
+"while MapStage.isAvailable is true while stage was retry ") {
+val firstRDD = new MyRDD(sc, 6, Nil)
+val firstShuffleDep = new ShuffleDependency(firstRDD, null)
+val firstShuyffleId = firstShuffleDep.shuffleId
+val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
+val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
+val shuffleId = shuffleDep.shuffleId
--- End diff --

The `shuffleId` is not used.





[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/4055#discussion_r23589337
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala 
---
@@ -65,4 +65,6 @@ private[spark] class ResultTask[T, U](
   override def preferredLocations: Seq[TaskLocation] = preferredLocs
 
   override def toString = "ResultTask(" + stageId + ", " + partitionId + 
")"
+
+  override def canEqual(other: Any): Boolean = 
other.isInstanceOf[ResultTask[T, U]]
--- End diff --

You can try `Seq[Int](1).isInstanceOf[Seq[String]]` in the REPL; it will return true.
`isInstanceOf` can't check generic type parameters because of JVM type erasure.
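
A quick REPL-style illustration (only the outer class is checked at runtime; the element type argument is erased):

    object ErasureDemo extends App {
      val xs: Any = Seq[Int](1)
      // Prints true: the String type argument no longer exists at runtime,
      // so this only tests "is it a Seq?".
      println(xs.isInstanceOf[Seq[String]])
    }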





[GitHub] spark pull request: [SPARK-3974][MLlib] Distributed Block Matrix A...

2015-01-26 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/3200#discussion_r23589329
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala
 ---
@@ -0,0 +1,242 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.linalg.distributed
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark.{Logging, Partitioner}
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.rdd.RDDFunctions._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * A grid partitioner, which stores every block in a separate partition.
+ *
+ * @param numRowBlocks Number of blocks that form the rows of the matrix.
+ * @param numColBlocks Number of blocks that form the columns of the 
matrix.
+ */
+private[mllib] class GridPartitioner(
+val numRowBlocks: Int,
+val numColBlocks: Int,
+val numParts: Int) extends Partitioner {
+  // Having the number of partitions greater than the number of sub 
matrices does not help
+  override val numPartitions = math.min(numParts, numRowBlocks * 
numColBlocks)
+
+  /**
+   * Returns the index of the partition the SubMatrix belongs to. Tries to 
achieve block wise
+   * partitioning.
+   *
+   * @param key The key for the SubMatrix. Can be its position in the grid 
(its column major index)
+   *or a tuple of three integers that are the final row index 
after the multiplication,
+   *the index of the block to multiply with, and the final 
column index after the
+   *multiplication.
+   * @return The index of the partition, which the SubMatrix belongs to.
+   */
+  override def getPartition(key: Any): Int = {
+key match {
+  case (blockRowIndex: Int, blockColIndex: Int) =>
+getBlockId(blockRowIndex, blockColIndex)
+  case (blockRowIndex: Int, innerIndex: Int, blockColIndex: Int) =>
+getBlockId(blockRowIndex, blockColIndex)
+  case _ =>
+throw new IllegalArgumentException(s"Unrecognized key. key: $key")
+}
+  }
+
+  /** Partitions sub-matrices as blocks with neighboring sub-matrices. */
+  private def getBlockId(blockRowIndex: Int, blockColIndex: Int): Int = {
+val totalBlocks = numRowBlocks * numColBlocks
+// Gives the number of blocks that need to be in each partition
+val partitionRatio = math.ceil(totalBlocks * 1.0 / numPartitions).toInt
+// Number of neighboring blocks to take in each row
+val subBlocksPerRow = math.ceil(numRowBlocks * 1.0 / 
partitionRatio).toInt
+// Number of neighboring blocks to take in each column
+val subBlocksPerCol = math.ceil(numColBlocks * 1.0 / 
partitionRatio).toInt
+// Coordinates of the block
+val i = blockRowIndex / subBlocksPerRow
+val j = blockColIndex / subBlocksPerCol
+val blocksPerRow = math.ceil(numRowBlocks * 1.0 / 
subBlocksPerRow).toInt
+j * blocksPerRow + i
+  }
+
+  /** Checks whether the partitioners have the same characteristics */
+  override def equals(obj: Any): Boolean = {
+obj match {
+  case r: GridPartitioner =>
+(this.numRowBlocks == r.numRowBlocks) && (this.numColBlocks == 
r.numColBlocks) &&
+  (this.numPartitions == r.numPartitions)
+  case _ =>
+false
+}
+  }
+}
+
+/**
+ * Represents a distributed matrix in blocks of local matrices.
+ *
+ * @param rdd The RDD of SubMatrices (local matrices) that form this matrix
+ * @param nRows Number of rows of this matrix
+ * @param nCols Number of columns of this matrix
+ * @param numRowBlocks Number of blocks that form the rows of this matrix
+ * @param numColBlocks Number of blocks that form the columns of this 
matrix
+ * @param rowsPerBlock Number of rows that make up each block. The blocks 
forming the fi

[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3222#issuecomment-71594155
  
  [Test build #26145 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26145/consoleFull)
 for   PR 3222 at commit 
[`612c7bd`](https://github.com/apache/spark/commit/612c7bdce7208791741f3a9f91e896c9585e832b).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4337. [YARN] Add ability to cancel pendi...

2015-01-26 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/4141#issuecomment-71593861
  
@lianhuiwang I am not sure I can comment on this PR. I am unfamiliar with 
this piece of code. 





[GitHub] spark pull request: [SPARK-5264][SQL] support `drop table` DDL com...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4060#issuecomment-71593801
  
  [Test build #26144 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26144/consoleFull)
 for   PR 4060 at commit 
[`dc1c9a0`](https://github.com/apache/spark/commit/dc1c9a0384b53ddf40cad6e74b1f4c070e7284de).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5264][SQL] support `drop table` DDL com...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4060#issuecomment-71593453
  
  [Test build #26143 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26143/consoleFull)
 for   PR 4060 at commit 
[`1e47f8a`](https://github.com/apache/spark/commit/1e47f8a6b1cc445eddf4fa6e4e9e244854bfe2ed).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5414] Add SparkFirehoseListener class f...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4210#discussion_r23587232
  
--- Diff: core/src/main/java/org/apache/spark/SparkFirehoseListener.java ---
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.spark.scheduler.*;
+
+/**
+ * Class that allows users to receive all SparkListener events.
+ * Users should override the onEvent method.
+ *
+ * This is a concrete class instead of abstract to enforce
+ * new events get added to both the SparkListener and this adapter
+ * in lockstep.
+ */
+public class SparkFirehoseListener implements SparkListener {
+
+public void onEvent(SparkListenerEvent event) { }
--- End diff --

Is there a use case for having a no-op onEvent?
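
For context, a hypothetical usage sketch of the class as it appears in this diff (not part of the PR itself), where every callback funnels through the single onEvent override:

    import org.apache.spark.SparkFirehoseListener
    import org.apache.spark.scheduler.SparkListenerEvent

    class EventCounter extends SparkFirehoseListener {
      @volatile private var count = 0L

      // Every listener callback is forwarded here by the firehose adapter.
      override def onEvent(event: SparkListenerEvent): Unit = {
        count += 1  // a real listener would pattern match on the concrete event type
      }
    }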





[GitHub] spark pull request: [SPARK-3298][SQL] Add flag control overwrite r...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4175#issuecomment-71589586
  
  [Test build #26141 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26141/consoleFull)
 for   PR 4175 at commit 
[`26c6011`](https://github.com/apache/spark/commit/26c6011084e1540f1dd959f733d60acb52992de9).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3298][SQL] Add flag control overwrite r...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4175#issuecomment-71589589
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26141/
Test PASSed.





[GitHub] spark pull request: [SPARK-5384][mllib] Vectors.sqdist returns inc...

2015-01-26 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/4183#issuecomment-71589499
  
Thanks @mengxr.

I saw a PR from @viirya for the issue, which seems good enough for now.





[GitHub] spark pull request: [SPARK-5422] Add support for sending Graphite ...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4218#issuecomment-71589336
  
  [Test build #26142 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26142/consoleFull)
 for   PR 4218 at commit 
[`ebae393`](https://github.com/apache/spark/commit/ebae39347ab3d187bb5eef943882b6bb4c385113).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5422] Add support for sending Graphite ...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4218#issuecomment-71589339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26142/
Test PASSed.





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/3798#issuecomment-71588806
  
@tdas pinged me saying this needs more consensus on the design before we comment on details, so I will stop :) Anyway, I pointed out a few ways that can hopefully make this PR simpler to understand once we finalize the high-level design.





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23586770
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/rdd/kafka/KafkaRDD.scala ---
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.{FetchRequestBuilder, FetchResponse}
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+/** A batch-oriented interface for consuming from Kafka.
+  * Starting and ending offsets are specified in advance,
+  * so that you can control exactly-once semantics.
+  * For an easy interface to Kafka-managed offsets,
+  *  see {@link org.apache.spark.rdd.kafka.KafkaCluster}
+  * @param kafkaParams Kafka http://kafka.apache.org/documentation.html#configuration";>
+  * configuration parameters.
+  *   Requires "metadata.broker.list" or "bootstrap.servers" to be set 
with Kafka broker(s),
+  *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+  * @param batch Each KafkaRDDPartition in the batch corresponds to a
+  *   range of offsets for a given Kafka topic/partition
+  * @param messageHandler function for translating each message into the 
desired type
+  */
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag](
+sc: SparkContext,
+val kafkaParams: Map[String, String],
+val batch: Array[KafkaRDDPartition],
+messageHandler: MessageAndMetadata[K, V] => R
+  ) extends RDD[R](sc, Nil) with Logging {
+
+  override def getPartitions: Array[Partition] = 
batch.asInstanceOf[Array[Partition]]
+
+  override def getPreferredLocations(thePart: Partition): Seq[String] = {
+val part = thePart.asInstanceOf[KafkaRDDPartition]
+// TODO is additional hostname resolution necessary here
+Seq(part.host)
+  }
+
+  override def compute(thePart: Partition, context: TaskContext): 
Iterator[R] = {
+val part = thePart.asInstanceOf[KafkaRDDPartition]
+if (part.fromOffset >= part.untilOffset) {
+  log.warn("Beginning offset is same or after ending offset " +
+s"skipping ${part.topic} ${part.partition}")
+  Iterator.empty
+} else {
+  new NextIterator[R] {
+context.addTaskCompletionListener{ context => closeIfNeeded() }
+
+log.info(s"Computing topic ${part.topic}, partition 
${part.partition} " +
+  s"offsets ${part.fromOffset} -> ${part.untilOffset}")
+
+val kc = new KafkaCluster(kafkaParams)
+val keyDecoder = 
classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
+  .newInstance(kc.config.props)
+  .asInstanceOf[Decoder[K]]
+val valueDecoder = 
classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
+  .newInstance(kc.config.props)
+  .asInstanceOf[Decoder[V]]
+val consumer = connectLeader
+var requestOffset = part.fromOffset
+var iter: Iterator[MessageAndOffset] = null
+
+// The idea is to use the provided preferred host, except on task 
retry atttempts,
+// to minimize number of kafka metadata requests
+private def connectLeader: SimpleConsumer = {
+  if (context.attemptNumber > 0) {
+kc.connectLeader(part.topic, part.partition).fold(
+  errs => throw new Exception(
+s"Couldn't connect to leader for 

[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23586739
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/rdd/kafka/KafkaRDD.scala ---
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.{FetchRequestBuilder, FetchResponse}
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+/** A batch-oriented interface for consuming from Kafka.
+  * Starting and ending offsets are specified in advance,
+  * so that you can control exactly-once semantics.
+  * For an easy interface to Kafka-managed offsets,
+  *  see {@link org.apache.spark.rdd.kafka.KafkaCluster}
+  * @param kafkaParams Kafka http://kafka.apache.org/documentation.html#configuration";>
+  * configuration parameters.
+  *   Requires "metadata.broker.list" or "bootstrap.servers" to be set 
with Kafka broker(s),
+  *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+  * @param batch Each KafkaRDDPartition in the batch corresponds to a
+  *   range of offsets for a given Kafka topic/partition
+  * @param messageHandler function for translating each message into the 
desired type
+  */
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag](
+sc: SparkContext,
+val kafkaParams: Map[String, String],
+val batch: Array[KafkaRDDPartition],
+messageHandler: MessageAndMetadata[K, V] => R
+  ) extends RDD[R](sc, Nil) with Logging {
+
+  override def getPartitions: Array[Partition] = 
batch.asInstanceOf[Array[Partition]]
+
+  override def getPreferredLocations(thePart: Partition): Seq[String] = {
+val part = thePart.asInstanceOf[KafkaRDDPartition]
+// TODO is additional hostname resolution necessary here
+Seq(part.host)
+  }
+
+  override def compute(thePart: Partition, context: TaskContext): 
Iterator[R] = {
+val part = thePart.asInstanceOf[KafkaRDDPartition]
+if (part.fromOffset >= part.untilOffset) {
+  log.warn("Beginning offset is same or after ending offset " +
+s"skipping ${part.topic} ${part.partition}")
+  Iterator.empty
+} else {
+  new NextIterator[R] {
+context.addTaskCompletionListener{ context => closeIfNeeded() }
+
+log.info(s"Computing topic ${part.topic}, partition 
${part.partition} " +
+  s"offsets ${part.fromOffset} -> ${part.untilOffset}")
+
+val kc = new KafkaCluster(kafkaParams)
+val keyDecoder = 
classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
+  .newInstance(kc.config.props)
+  .asInstanceOf[Decoder[K]]
+val valueDecoder = 
classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
+  .newInstance(kc.config.props)
+  .asInstanceOf[Decoder[V]]
+val consumer = connectLeader
+var requestOffset = part.fromOffset
+var iter: Iterator[MessageAndOffset] = null
+
+// The idea is to use the provided preferred host, except on task 
retry atttempts,
+// to minimize number of kafka metadata requests
+private def connectLeader: SimpleConsumer = {
+  if (context.attemptNumber > 0) {
+kc.connectLeader(part.topic, part.partition).fold(
+  errs => throw new Exception(
+s"Couldn't connect to leader for 

[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23586712
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/rdd/kafka/KafkaRDD.scala ---
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.{FetchRequestBuilder, FetchResponse}
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+/** A batch-oriented interface for consuming from Kafka.
+  * Starting and ending offsets are specified in advance,
+  * so that you can control exactly-once semantics.
+  * For an easy interface to Kafka-managed offsets,
+  *  see {@link org.apache.spark.rdd.kafka.KafkaCluster}
+  * @param kafkaParams Kafka http://kafka.apache.org/documentation.html#configuration";>
+  * configuration parameters.
+  *   Requires "metadata.broker.list" or "bootstrap.servers" to be set 
with Kafka broker(s),
+  *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+  * @param batch Each KafkaRDDPartition in the batch corresponds to a
+  *   range of offsets for a given Kafka topic/partition
+  * @param messageHandler function for translating each message into the 
desired type
+  */
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag](
+sc: SparkContext,
+val kafkaParams: Map[String, String],
+val batch: Array[KafkaRDDPartition],
+messageHandler: MessageAndMetadata[K, V] => R
+  ) extends RDD[R](sc, Nil) with Logging {
+
+  override def getPartitions: Array[Partition] = 
batch.asInstanceOf[Array[Partition]]
+
+  override def getPreferredLocations(thePart: Partition): Seq[String] = {
+val part = thePart.asInstanceOf[KafkaRDDPartition]
+// TODO is additional hostname resolution necessary here
+Seq(part.host)
+  }
+
+  override def compute(thePart: Partition, context: TaskContext): 
Iterator[R] = {
+val part = thePart.asInstanceOf[KafkaRDDPartition]
+if (part.fromOffset >= part.untilOffset) {
+  log.warn("Beginning offset is same or after ending offset " +
+s"skipping ${part.topic} ${part.partition}")
+  Iterator.empty
+} else {
+  new NextIterator[R] {
+context.addTaskCompletionListener{ context => closeIfNeeded() }
+
+log.info(s"Computing topic ${part.topic}, partition 
${part.partition} " +
+  s"offsets ${part.fromOffset} -> ${part.untilOffset}")
+
+val kc = new KafkaCluster(kafkaParams)
+val keyDecoder = 
classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
+  .newInstance(kc.config.props)
+  .asInstanceOf[Decoder[K]]
+val valueDecoder = 
classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
+  .newInstance(kc.config.props)
+  .asInstanceOf[Decoder[V]]
+val consumer = connectLeader
+var requestOffset = part.fromOffset
+var iter: Iterator[MessageAndOffset] = null
+
+// The idea is to use the provided preferred host, except on task 
retry atttempts,
+// to minimize number of kafka metadata requests
+private def connectLeader: SimpleConsumer = {
+  if (context.attemptNumber > 0) {
+kc.connectLeader(part.topic, part.partition).fold(
+  errs => throw new Exception(
+s"Couldn't connect to leader for 

[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23586726
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/rdd/kafka/KafkaRDD.scala ---
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.{FetchRequestBuilder, FetchResponse}
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+/** A batch-oriented interface for consuming from Kafka.
+  * Starting and ending offsets are specified in advance,
+  * so that you can control exactly-once semantics.
+  * For an easy interface to Kafka-managed offsets,
+  *  see {@link org.apache.spark.rdd.kafka.KafkaCluster}
+  * @param kafkaParams Kafka http://kafka.apache.org/documentation.html#configuration";>
+  * configuration parameters.
+  *   Requires "metadata.broker.list" or "bootstrap.servers" to be set 
with Kafka broker(s),
+  *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+  * @param batch Each KafkaRDDPartition in the batch corresponds to a
+  *   range of offsets for a given Kafka topic/partition
+  * @param messageHandler function for translating each message into the 
desired type
+  */
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag](
+sc: SparkContext,
+val kafkaParams: Map[String, String],
+val batch: Array[KafkaRDDPartition],
+messageHandler: MessageAndMetadata[K, V] => R
+  ) extends RDD[R](sc, Nil) with Logging {
+
+  override def getPartitions: Array[Partition] = 
batch.asInstanceOf[Array[Partition]]
+
+  override def getPreferredLocations(thePart: Partition): Seq[String] = {
+val part = thePart.asInstanceOf[KafkaRDDPartition]
+// TODO is additional hostname resolution necessary here
+Seq(part.host)
+  }
+
+  override def compute(thePart: Partition, context: TaskContext): 
Iterator[R] = {
+val part = thePart.asInstanceOf[KafkaRDDPartition]
+if (part.fromOffset >= part.untilOffset) {
+  log.warn("Beginning offset is same or after ending offset " +
+s"skipping ${part.topic} ${part.partition}")
+  Iterator.empty
+} else {
+  new NextIterator[R] {
+context.addTaskCompletionListener{ context => closeIfNeeded() }
+
+log.info(s"Computing topic ${part.topic}, partition 
${part.partition} " +
+  s"offsets ${part.fromOffset} -> ${part.untilOffset}")
+
+val kc = new KafkaCluster(kafkaParams)
+val keyDecoder = 
classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
+  .newInstance(kc.config.props)
+  .asInstanceOf[Decoder[K]]
+val valueDecoder = 
classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
+  .newInstance(kc.config.props)
+  .asInstanceOf[Decoder[V]]
+val consumer = connectLeader
+var requestOffset = part.fromOffset
+var iter: Iterator[MessageAndOffset] = null
+
+// The idea is to use the provided preferred host, except on task 
retry atttempts,
+// to minimize number of kafka metadata requests
+private def connectLeader: SimpleConsumer = {
+  if (context.attemptNumber > 0) {
+kc.connectLeader(part.topic, part.partition).fold(
+  errs => throw new Exception(
+s"Couldn't connect to leader for 

[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23586631
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/rdd/kafka/KafkaCluster.scala ---
@@ -0,0 +1,313 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd.kafka
+
+import scala.util.control.NonFatal
+import scala.collection.mutable.ArrayBuffer
+import java.util.Properties
+import kafka.api._
+import kafka.common.{ErrorMapping, OffsetMetadataAndError, 
TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+
+/**
+  * Convenience methods for interacting with a Kafka cluster.
+  * @param kafkaParams Kafka http://kafka.apache.org/documentation.html#configuration";>
+  * configuration parameters.
+  *   Requires "metadata.broker.list" or "bootstrap.servers" to be set 
with Kafka broker(s),
+  *   NOT zookeeper servers, specified in host1:port1,host2:port2 form
+  */
+class KafkaCluster(val kafkaParams: Map[String, String]) extends 
Serializable {
+  import KafkaCluster.Err
+
+  val seedBrokers: Array[(String, Int)] =
+kafkaParams.get("metadata.broker.list")
+  .orElse(kafkaParams.get("bootstrap.servers"))
+  .getOrElse(throw new Exception("Must specify metadata.broker.list or 
bootstrap.servers"))
+  .split(",").map { hp =>
+val hpa = hp.split(":")
+(hpa(0), hpa(1).toInt)
+  }
+
+  // ConsumerConfig isn't serializable
+  @transient private var _config: ConsumerConfig = null
+
+  def config: ConsumerConfig = this.synchronized {
+if (_config == null) {
+  _config = KafkaCluster.consumerConfig(kafkaParams)
+}
+_config
+  }
+
+  def connect(host: String, port: Int): SimpleConsumer =
+new SimpleConsumer(host, port, config.socketTimeoutMs,
+  config.socketReceiveBufferBytes, config.clientId)
+
+  def connect(hostAndPort: (String, Int)): SimpleConsumer =
+connect(hostAndPort._1, hostAndPort._2)
+
+  def connectLeader(topic: String, partition: Int): Either[Err, 
SimpleConsumer] =
+findLeader(topic, partition).right.map(connect)
+
+  def findLeader(topic: String, partition: Int): Either[Err, (String, 
Int)] = {
+val req = TopicMetadataRequest(TopicMetadataRequest.CurrentVersion,
+  0, config.clientId, Seq(topic))
+val errs = new Err
+withBrokers(seedBrokers, errs) { consumer =>
+  val resp: TopicMetadataResponse = consumer.send(req)
+  resp.topicsMetadata.find(_.topic == topic).flatMap { t =>
+t.partitionsMetadata.find(_.partitionId == partition)
+  }.foreach { partitionMeta =>
+partitionMeta.leader.foreach { leader =>
+  return Right((leader.host, leader.port))
+}
+  }
+}
+Left(errs)
+  }
+
+  def findLeaders(
+topicAndPartitions: Set[TopicAndPartition]
+  ): Either[Err, Map[TopicAndPartition, (String, Int)]] = {
+getPartitionMetadata(topicAndPartitions.map(_.topic)).right.flatMap { 
tms =>
+  val result = tms.flatMap { tm: TopicMetadata =>
+tm.partitionsMetadata.flatMap { pm =>
+  val tp = TopicAndPartition(tm.topic, pm.partitionId)
+  if (topicAndPartitions(tp)) {
+pm.leader.map { l =>
+  tp -> (l.host -> l.port)
+}
+  } else {
+None
+  }
+}
+  }.toMap
+  if (result.keys.size == topicAndPartitions.size) {
+Right(result)
+  } else {
+val missing = topicAndPartitions.diff(result.keys.toSet)
+val err = new Err
+err.append(new Exception(s"Couldn't find leaders for ${missing}"))
+Left(err)
+  }
+}
+  }
+
+  def getPartitions(topics: Set[String]): Either[Err, 
Set[TopicAndPartition]] =
+getPartitionMetadata(topics).right.map { r =>
+  

[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23586598
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/rdd/kafka/KafkaCluster.scala ---
@@ -0,0 +1,313 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd.kafka
+
+import scala.util.control.NonFatal
+import scala.collection.mutable.ArrayBuffer
+import java.util.Properties
+import kafka.api._
+import kafka.common.{ErrorMapping, OffsetMetadataAndError, 
TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+
+/**
+  * Convenience methods for interacting with a Kafka cluster.
+  * @param kafkaParams Kafka http://kafka.apache.org/documentation.html#configuration";>
+  * configuration parameters.
+  *   Requires "metadata.broker.list" or "bootstrap.servers" to be set 
with Kafka broker(s),
+  *   NOT zookeeper servers, specified in host1:port1,host2:port2 form
+  */
+class KafkaCluster(val kafkaParams: Map[String, String]) extends 
Serializable {
+  import KafkaCluster.Err
+
+  val seedBrokers: Array[(String, Int)] =
+kafkaParams.get("metadata.broker.list")
+  .orElse(kafkaParams.get("bootstrap.servers"))
+  .getOrElse(throw new Exception("Must specify metadata.broker.list or 
bootstrap.servers"))
+  .split(",").map { hp =>
+val hpa = hp.split(":")
+(hpa(0), hpa(1).toInt)
+  }
+
+  // ConsumerConfig isn't serializable
+  @transient private var _config: ConsumerConfig = null
+
+  def config: ConsumerConfig = this.synchronized {
+if (_config == null) {
+  _config = KafkaCluster.consumerConfig(kafkaParams)
+}
+_config
+  }
+
+  def connect(host: String, port: Int): SimpleConsumer =
+new SimpleConsumer(host, port, config.socketTimeoutMs,
+  config.socketReceiveBufferBytes, config.clientId)
+
+  def connect(hostAndPort: (String, Int)): SimpleConsumer =
+connect(hostAndPort._1, hostAndPort._2)
+
+  def connectLeader(topic: String, partition: Int): Either[Err, 
SimpleConsumer] =
+findLeader(topic, partition).right.map(connect)
+
+  def findLeader(topic: String, partition: Int): Either[Err, (String, 
Int)] = {
+val req = TopicMetadataRequest(TopicMetadataRequest.CurrentVersion,
+  0, config.clientId, Seq(topic))
+val errs = new Err
+withBrokers(seedBrokers, errs) { consumer =>
+  val resp: TopicMetadataResponse = consumer.send(req)
+  resp.topicsMetadata.find(_.topic == topic).flatMap { t =>
+t.partitionsMetadata.find(_.partitionId == partition)
+  }.foreach { partitionMeta =>
+partitionMeta.leader.foreach { leader =>
+  return Right((leader.host, leader.port))
+}
+  }
+}
+Left(errs)
+  }
+
+  def findLeaders(
+topicAndPartitions: Set[TopicAndPartition]
+  ): Either[Err, Map[TopicAndPartition, (String, Int)]] = {
+getPartitionMetadata(topicAndPartitions.map(_.topic)).right.flatMap { 
tms =>
+  val result = tms.flatMap { tm: TopicMetadata =>
+tm.partitionsMetadata.flatMap { pm =>
+  val tp = TopicAndPartition(tm.topic, pm.partitionId)
+  if (topicAndPartitions(tp)) {
+pm.leader.map { l =>
+  tp -> (l.host -> l.port)
+}
+  } else {
+None
+  }
+}
+  }.toMap
+  if (result.keys.size == topicAndPartitions.size) {
+Right(result)
+  } else {
+val missing = topicAndPartitions.diff(result.keys.toSet)
+val err = new Err
+err.append(new Exception(s"Couldn't find leaders for ${missing}"))
+Left(err)
+  }
+}
+  }
+
+  def getPartitions(topics: Set[String]): Either[Err, 
Set[TopicAndPartition]] =
+getPartitionMetadata(topics).right.map { r =>
+  

[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23586515
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/rdd/kafka/KafkaCluster.scala ---
@@ -0,0 +1,318 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd.kafka
+
+import scala.util.control.NonFatal
+import scala.util.Random
+import scala.collection.mutable.ArrayBuffer
+import java.util.Properties
+import kafka.api._
+import kafka.common.{ErrorMapping, OffsetMetadataAndError, 
TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+
+/**
+  * Convenience methods for interacting with a Kafka cluster.
+  * @param kafkaParams Kafka http://kafka.apache.org/documentation.html#configuration";>
+  * configuration parameters.
+  *   Requires "metadata.broker.list" or "bootstrap.servers" to be set 
with Kafka broker(s),
+  *   NOT zookeeper servers, specified in host1:port1,host2:port2 form
+  */
+class KafkaCluster(val kafkaParams: Map[String, String]) extends 
Serializable {
+  import KafkaCluster.{Err, LeaderOffset}
+
+  val seedBrokers: Array[(String, Int)] =
+kafkaParams.get("metadata.broker.list")
+  .orElse(kafkaParams.get("bootstrap.servers"))
+  .getOrElse(throw new Exception("Must specify metadata.broker.list or 
bootstrap.servers"))
+  .split(",").map { hp =>
+val hpa = hp.split(":")
+(hpa(0), hpa(1).toInt)
+  }
+
+  // ConsumerConfig isn't serializable
+  @transient private var _config: ConsumerConfig = null
+
+  def config: ConsumerConfig = this.synchronized {
+if (_config == null) {
+  _config = KafkaCluster.consumerConfig(kafkaParams)
+}
+_config
+  }
+
+  def connect(host: String, port: Int): SimpleConsumer =
+new SimpleConsumer(host, port, config.socketTimeoutMs,
+  config.socketReceiveBufferBytes, config.clientId)
+
+  def connect(hostAndPort: (String, Int)): SimpleConsumer =
+connect(hostAndPort._1, hostAndPort._2)
+
+  def connectLeader(topic: String, partition: Int): Either[Err, 
SimpleConsumer] =
+findLeader(topic, partition).right.map(connect)
+
+  def findLeader(topic: String, partition: Int): Either[Err, (String, 
Int)] = {
+val req = TopicMetadataRequest(TopicMetadataRequest.CurrentVersion,
+  0, config.clientId, Seq(topic))
+val errs = new Err
+withBrokers(Random.shuffle(seedBrokers), errs) { consumer =>
+  val resp: TopicMetadataResponse = consumer.send(req)
+  resp.topicsMetadata.find(_.topic == topic).flatMap { t =>
+t.partitionsMetadata.find(_.partitionId == partition)
+  }.foreach { partitionMeta =>
+partitionMeta.leader.foreach { leader =>
+  return Right((leader.host, leader.port))
+}
+  }
+}
+Left(errs)
+  }
+
+  def findLeaders(
+topicAndPartitions: Set[TopicAndPartition]
+  ): Either[Err, Map[TopicAndPartition, (String, Int)]] = {
+getPartitionMetadata(topicAndPartitions.map(_.topic)).right.flatMap { 
tms =>
+  val result = tms.flatMap { tm: TopicMetadata =>
+tm.partitionsMetadata.flatMap { pm =>
--- End diff --

this block also





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23586464
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/rdd/kafka/KafkaCluster.scala ---
@@ -0,0 +1,318 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd.kafka
+
+import scala.util.control.NonFatal
+import scala.util.Random
+import scala.collection.mutable.ArrayBuffer
+import java.util.Properties
+import kafka.api._
+import kafka.common.{ErrorMapping, OffsetMetadataAndError, 
TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+
+/**
+  * Convenience methods for interacting with a Kafka cluster.
+  * @param kafkaParams Kafka http://kafka.apache.org/documentation.html#configuration";>
+  * configuration parameters.
+  *   Requires "metadata.broker.list" or "bootstrap.servers" to be set 
with Kafka broker(s),
+  *   NOT zookeeper servers, specified in host1:port1,host2:port2 form
+  */
+class KafkaCluster(val kafkaParams: Map[String, String]) extends 
Serializable {
+  import KafkaCluster.{Err, LeaderOffset}
+
+  val seedBrokers: Array[(String, Int)] =
+kafkaParams.get("metadata.broker.list")
+  .orElse(kafkaParams.get("bootstrap.servers"))
+  .getOrElse(throw new Exception("Must specify metadata.broker.list or 
bootstrap.servers"))
+  .split(",").map { hp =>
+val hpa = hp.split(":")
+(hpa(0), hpa(1).toInt)
+  }
+
+  // ConsumerConfig isn't serializable
+  @transient private var _config: ConsumerConfig = null
+
+  def config: ConsumerConfig = this.synchronized {
+if (_config == null) {
+  _config = KafkaCluster.consumerConfig(kafkaParams)
+}
+_config
+  }
+
+  def connect(host: String, port: Int): SimpleConsumer =
+new SimpleConsumer(host, port, config.socketTimeoutMs,
+  config.socketReceiveBufferBytes, config.clientId)
+
+  def connect(hostAndPort: (String, Int)): SimpleConsumer =
+connect(hostAndPort._1, hostAndPort._2)
+
+  def connectLeader(topic: String, partition: Int): Either[Err, 
SimpleConsumer] =
+findLeader(topic, partition).right.map(connect)
+
+  def findLeader(topic: String, partition: Int): Either[Err, (String, 
Int)] = {
+val req = TopicMetadataRequest(TopicMetadataRequest.CurrentVersion,
+  0, config.clientId, Seq(topic))
+val errs = new Err
+withBrokers(Random.shuffle(seedBrokers), errs) { consumer =>
+  val resp: TopicMetadataResponse = consumer.send(req)
+  resp.topicsMetadata.find(_.topic == topic).flatMap { t =>
--- End diff --

This block of code is pretty hard to understand with so many levels of nesting. Can you rewrite it, maybe by introducing intermediate variables and adding comments to explain what is going on? Overall I feel this PR went slightly overboard with Scala: with no explicit types, intermediate variables, or comments, a lot of the blocks are pretty hard to understand.
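
One possible flattening along those lines, with named intermediates and explicit types (an illustrative rewrite that assumes the surrounding KafkaCluster members from the quoted diff; it is not the code that was merged):

    def findLeader(topic: String, partition: Int): Either[Err, (String, Int)] = {
      val req = TopicMetadataRequest(TopicMetadataRequest.CurrentVersion,
        0, config.clientId, Seq(topic))
      val errs = new Err
      withBrokers(Random.shuffle(seedBrokers), errs) { consumer =>
        val resp: TopicMetadataResponse = consumer.send(req)
        // Pull each step into a named value so the flow reads top to bottom.
        val topicMeta: Option[TopicMetadata] =
          resp.topicsMetadata.find(_.topic == topic)
        val partitionMeta: Option[PartitionMetadata] =
          topicMeta.flatMap(_.partitionsMetadata.find(_.partitionId == partition))
        val leader: Option[(String, Int)] =
          partitionMeta.flatMap(_.leader).map(broker => (broker.host, broker.port))
        leader.foreach(hostAndPort => return Right(hostAndPort))
      }
      Left(errs)
    }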





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4109#issuecomment-71587461
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26140/
Test PASSed.





[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4109#issuecomment-71587457
  
  [Test build #26140 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26140/consoleFull)
 for   PR 4109 at commit 
[`c524770`](https://github.com/apache/spark/commit/c524770af3b9964ff5faee33cc2bd74cd50570c7).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4217#issuecomment-71586983
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26139/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4217#issuecomment-71586980
  
  [Test build #26139 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26139/consoleFull)
 for   PR 4217 at commit 
[`314c424`](https://github.com/apache/spark/commit/314c424b5197a88cf14f6c81a839014b21479831).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4217#discussion_r23585795
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala ---
@@ -371,17 +371,19 @@ object Vectors {
           squaredDistance += score * score
         }
 
-      case (v1: SparseVector, v2: DenseVector) if v1.indices.length / v1.size < 0.5 =>
+      case (v1: SparseVector, v2: DenseVector) =>
         squaredDistance = sqdist(v1, v2)
 
-      case (v1: DenseVector, v2: SparseVector) if v2.indices.length / v2.size < 0.5 =>
+      case (v1: DenseVector, v2: SparseVector) =>
         squaredDistance = sqdist(v2, v1)
 
-      // When a SparseVector is approximately dense, we treat it as a DenseVector
       case (v1, v2) =>
-        squaredDistance = v1.toArray.zip(v2.toArray).foldLeft(0.0){ (distance, elems) =>
-          val score = elems._1 - elems._2
-          distance + score * score
+        var kv = 0
+        val nnzv = v1.size
+        while (kv < nnzv) {
+          var score = v1(kv) - v2(kv)
--- End diff --

`var` -> `val`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4217#discussion_r23585794
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala ---
@@ -371,17 +371,19 @@ object Vectors {
           squaredDistance += score * score
         }
 
-      case (v1: SparseVector, v2: DenseVector) if v1.indices.length / v1.size < 0.5 =>
+      case (v1: SparseVector, v2: DenseVector) =>
         squaredDistance = sqdist(v1, v2)
 
-      case (v1: DenseVector, v2: SparseVector) if v2.indices.length / v2.size < 0.5 =>
+      case (v1: DenseVector, v2: SparseVector) =>
         squaredDistance = sqdist(v2, v1)
 
-      // When a SparseVector is approximately dense, we treat it as a DenseVector
       case (v1, v2) =>
--- End diff --

I would recommend using 

~~~
case (DenseVector(vv1), DenseVector(vv2)) =>
  var i = 0
  val sz = vv1.size
  while (i < sz) {
...
  }
case _ =>
  // throw exception
~~~
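
Filled out, that dense-dense branch might look roughly like the following 
sketch (assuming the surrounding `squaredDistance` accumulator and a 
`DenseVector` extractor for the underlying values array):

~~~
case (DenseVector(vv1), DenseVector(vv2)) =>
  var i = 0
  val sz = vv1.size
  while (i < sz) {
    val score = vv1(i) - vv2(i)       // element-wise difference
    squaredDistance += score * score  // accumulate the squared difference
    i += 1
  }
case _ =>
  throw new IllegalArgumentException(
    s"Do not support vector type ${v1.getClass} and ${v2.getClass}")
~~~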


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-26 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4073#issuecomment-71585503
  
Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4073


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5422] Add support for sending Graphite ...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4218#issuecomment-71584864
  
  [Test build #26142 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26142/consoleFull)
 for   PR 4218 at commit 
[`ebae393`](https://github.com/apache/spark/commit/ebae39347ab3d187bb5eef943882b6bb4c385113).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5422] Add support for sending Graphite ...

2015-01-26 Thread ryan-williams
GitHub user ryan-williams opened a pull request:

https://github.com/apache/spark/pull/4218

[SPARK-5422] Add support for sending Graphite metrics via UDP

Depends on [SPARK-5413](https://issues.apache.org/jira/browse/SPARK-5413) / 
#4209, included here, will rebase once the latter's merged.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ryan-williams/spark udp

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4218.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4218


commit cb582623170294311a6880d8a178c712fbaf0b6c
Author: Ryan Williams 
Date:   2015-01-26T21:50:55Z

bump metrics dependency to v3.1.0

pick up batching of TCP requests to graphite, among other things

commit ebae39347ab3d187bb5eef943882b6bb4c385113
Author: Ryan Williams 
Date:   2015-01-27T03:25:19Z

Add support for sending Graphite metrics via UDP




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-26 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/4073#issuecomment-71584396
  
@mengxr This can also be viewed as a bugfix which prevents overwriting of 
the param `subSamplingRate`, which was hardcoded to 1.0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-5388] Provide a stable application...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4216#issuecomment-71583774
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26138/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-5388] Provide a stable application...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4216#issuecomment-71583770
  
  [Test build #26138 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26138/consoleFull)
 for   PR 4216 at commit 
[`6568ca5`](https://github.com/apache/spark/commit/6568ca534166bdcad7533385ac958d049ca63a28).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `   *   (4) the main class for the child`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4156#issuecomment-71583576
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26137/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3298][SQL] Add flag control overwrite r...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4175#issuecomment-71583562
  
  [Test build #26141 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26141/consoleFull)
 for   PR 4175 at commit 
[`26c6011`](https://github.com/apache/spark/commit/26c6011084e1540f1dd959f733d60acb52992de9).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4156#issuecomment-71583569
  
  [Test build #26137 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26137/consoleFull)
 for   PR 4156 at commit 
[`a403979`](https://github.com/apache/spark/commit/a40397938fa777e3303e3dd800d378af4c9df575).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4109#issuecomment-71582792
  
  [Test build #26140 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26140/consoleFull)
 for   PR 4109 at commit 
[`c524770`](https://github.com/apache/spark/commit/c524770af3b9964ff5faee33cc2bd74cd50570c7).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4217#issuecomment-71582448
  
  [Test build #26139 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26139/consoleFull)
 for   PR 4217 at commit 
[`314c424`](https://github.com/apache/spark/commit/314c424b5197a88cf14f6c81a839014b21479831).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...

2015-01-26 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/4217

[SPARK-5419][Mllib] Fix the logic in Vectors.sqdist

The current implementation of Vectors.sqdist is not efficient because it 
allocates temporary arrays. There is also a bug in the guard 
`v1.indices.length / v1.size < 0.5`. This PR fixes the bug and refactors 
sqdist so that it does not allocate new arrays.
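
For context, a quick illustration of why that guard is a bug: both operands of 
the division are `Int`, so the expression truncates to 0 for any vector that is 
not completely dense, which makes the `< 0.5` density check effectively always true.

~~~
// Integer division makes the sparsity guard meaningless:
val numActive = 3          // v1.indices.length
val size = 100             // v1.size
numActive / size           // 0, so "0 < 0.5" holds for every genuinely sparse vector
numActive.toDouble / size  // 0.03, the density ratio the guard presumably intended
~~~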


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 fix_sqdist

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4217.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4217


commit 314c424b5197a88cf14f6c81a839014b21479831
Author: Liang-Chi Hsieh 
Date:   2015-01-27T02:51:51Z

Fix sqdist bug.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4215#issuecomment-71582038
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26135/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4215#issuecomment-71582029
  
  [Test build #26135 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26135/consoleFull)
 for   PR 4215 at commit 
[`6645af4`](https://github.com/apache/spark/commit/6645af42b65b1e5bc1c37065e8e07bc55c58be1a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4215#issuecomment-71581842
  
  [Test build #26136 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26136/consoleFull)
 for   PR 4215 at commit 
[`a0870af`](https://github.com/apache/spark/commit/a0870af7a16ffce241fac7979e76535f7ef864be).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4215#issuecomment-71581847
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26136/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3962#issuecomment-71581696
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26134/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3962#issuecomment-71581689
  
  [Test build #26134 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26134/consoleFull)
 for   PR 3962 at commit 
[`48d9ebb`](https://github.com/apache/spark/commit/48d9ebb6a9e8c82c5bbd9221dec5d8baca987a82).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...

2015-01-26 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/4109#discussion_r23583664
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -34,6 +34,9 @@ sealed trait Matrix extends Serializable {
   /** Number of columns. */
   def numCols: Int
 
+  /** Flag that keeps track whether the matrix is transposed or not. False 
by default. */
+  private[linalg] var isTransposed = false
--- End diff --

We do modify it in place, in the transpose functions. We don't specify
isTransposed in the constructor.
On Jan 26, 2015 1:38 PM, "Xiangrui Meng"  wrote:

> In mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
> :
>
> > @@ -34,6 +34,9 @@ sealed trait Matrix extends Serializable {
> >/** Number of columns. */
> >def numCols: Int
> >
> > +  /** Flag that keeps track whether the matrix is transposed or not. 
False by default. */
> > +  private[linalg] var isTransposed = false
>
> We should make this public because otherwise users won't be able to
> understand public members like values. This should be a val instead of var
> because we don't modify it in-place.
>
> —
> Reply to this email directly or view it on GitHub
> .
>
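
As a hypothetical sketch of the alternative being discussed (a simplified 
stand-in, not the PR's actual mllib classes): the flag could be a constructor 
`val`, with `transpose` returning a new view over the same data rather than 
mutating the flag.

~~~
class DenseMatrixView(
    val numRows: Int,
    val numCols: Int,
    val values: Array[Double],
    val isTransposed: Boolean = false) {

  // Returns a transposed view sharing the same values array; nothing is
  // modified in place, so isTransposed can stay immutable.
  def transpose: DenseMatrixView =
    new DenseMatrixView(numCols, numRows, values, !isTransposed)
}
~~~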


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5417] Remove redundant executor-id set(...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4213#issuecomment-71580340
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26132/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5417] Remove redundant executor-id set(...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4213#issuecomment-71580332
  
  [Test build #26132 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26132/consoleFull)
 for   PR 4213 at commit 
[`b3e4f7b`](https://github.com/apache/spark/commit/b3e4f7b5a3bf15d8bf5486df8cd8b74ab6b5a142).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Rating(userId: Int, movieId: Int, rating: Float, 
timestamp: Long)`
  * `  case class Movie(movieId: Int, title: String, genres: Seq[String])`
  * `  case class Params(`
  * `class ALS extends Estimator[ALSModel] with ALSParams `
  * `  case class RatingBlock(srcIds: Array[Int], dstIds: Array[Int], 
ratings: Array[Float]) `
  * `  protected case class Keyword(str: String) `
  * `class SqlLexical extends StdLexical `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5416] init Executor.threadPool before E...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4212#issuecomment-71579938
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26131/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5416] init Executor.threadPool before E...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4212#issuecomment-71579925
  
  [Test build #26131 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26131/consoleFull)
 for   PR 4212 at commit 
[`236f2ad`](https://github.com/apache/spark/commit/236f2ad62d6ffd581746ed7c642093b00b904588).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-5388] Provide a stable application...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4216#issuecomment-71579669
  
  [Test build #26138 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26138/consoleFull)
 for   PR 4216 at commit 
[`6568ca5`](https://github.com/apache/spark/commit/6568ca534166bdcad7533385ac958d049ca63a28).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP][SPARK-5388] Provide a stable application...

2015-01-26 Thread andrewor14
GitHub user andrewor14 opened a pull request:

https://github.com/apache/spark/pull/4216

[WIP][SPARK-5388] Provide a stable application submission gateway

The goal is to provide a stable application submission gateway that is not 
inherently based on Akka, which is unstable across versions. This PR targets 
standalone cluster mode, but is implemented in a general enough manner to be 
extended to other modes in the future. Client mode is currently not included in 
the changes here because there are many more Akka messages exchanged there.

As of the changes here, the Master will advertise two ports, 7077 and 
17077. We need to keep around the old one (7077) for client mode and older 
versions of Spark submit. However, all new versions of Spark submit will use 
the REST gateway (17077).

This is still WIP, but comments and feedback are most welcome.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewor14/spark rest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4216.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4216


commit 53e7c0e9a208e70c374793baf693d64fe4e6b8e1
Author: Andrew Or 
Date:   2015-01-15T00:29:52Z

Initial client, server, and all the messages

This commit introduces type-safe schemas for all messages exchanged
in the REST protocol. Each message is expected to contain an ACTION
field that has only one possible value for each message type.

Before the message is sent, we validate that all required fields
are in fact present, and that the value of the action field is the
correct type.

The next step is to actually integrate this in standalone mode.
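
As a purely illustrative sketch of that idea (hypothetical field names, not 
the actual protocol), a message type could carry a single fixed action value 
and validate its required fields before being sent:

~~~
// Hypothetical message type: one legal ACTION value, validated before sending.
case class SubmitDriverRequest(
    action: String = "SUBMIT_DRIVER_REQUEST",
    appName: Option[String] = None,
    mainClass: Option[String] = None) {

  // Fail fast if the action value is wrong or a required field is missing.
  def validate(): Unit = {
    require(action == "SUBMIT_DRIVER_REQUEST", s"Unexpected action: $action")
    require(appName.isDefined, "Missing required field: appName")
    require(mainClass.isDefined, "Missing required field: mainClass")
  }
}
~~~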

commit af9d9cb2933774e33278022a38196b522193a766
Author: Andrew Or 
Date:   2015-01-17T03:12:32Z

Integrate REST protocol in standalone mode

This commit embeds the REST server in the standalone Master and
forces Spark submit to submit applications through the REST client.
This is the first working end-to-end implementation of a stable
submission interface in standalone cluster mode.

commit 6ff088dca37f3f888a94c00a5d533ef8c6a6e6f5
Author: Andrew Or 
Date:   2015-01-20T00:24:05Z

Rename classes to generalize REST protocol

Previously the REST protocol was very explicitly tied to the
standalone mode. This commit frees the protocol from this
restriction.

commit 484bd2172b847433c989d7c450fbbc99dddb1f56
Author: Andrew Or 
Date:   2015-01-20T01:02:25Z

Specify an ordering for fields in SubmitDriverRequestMessage

Previously APP_ARGs, SPARK_PROPERTYs and ENVIRONMENT_VARIABLEs
will appear in the JSON at random places. Now they are grouped
together at the end of the JSON blob.

commit e958caec3e2f5bbd1dd2cf6dc0e02a3ad27c2699
Author: Andrew Or 
Date:   2015-01-20T23:29:56Z

Supported nested values in messages

This is applicable to application arguments, Spark properties, and
environment variables, all of which were previously handled through
parameterized fields, which were cumbersome to parse. Since JSON
naturally supports nesting, we should take advantage of it too.

This commit refactors the code that converts the messages to and
from JSON in a way that subclasses can easily override the conversion
behavior without duplicating code.

commit 544de1dd5d4d2bb35180775ddf655bedee4d44ae
Author: Andrew Or 
Date:   2015-01-21T22:02:33Z

Major clean ups in code and comments

This involves refactoring SparkSubmit a little to put the code
that launches the REST client in the right place. This commit also
adds port retry logic in the REST server, which was previously
missing.

commit 120ab9d33484ccc20f45aee6272daeca5ebcc878
Author: Andrew Or 
Date:   2015-01-21T22:55:35Z

Support kill and request driver status through SparkSubmit

commit b44e103b78b36fadd887dc4b894027a03069b1f7
Author: Andrew Or 
Date:   2015-01-22T01:19:33Z

Implement status requests + fix validation behavior

This commit makes the StandaloneRestServer actually handle status
requests. The existing polling behavior from o.a.s.deploy.Client
is also implemented in the StandaloneRestClient and amended.

Additionally, the validation behavior was confusing before this
commit. Previously the error message would seem to indicate that
the user constructed a malformed message even if the message was
constructed on the server side. This commit ensures that the error
message is different for these two cases.

commit 51c5ca6d8ef448f7b6181c684fa3ee3794f0d6b8
Author: Andrew Or 
Date:   2015-01-22T01:43:43Z

Distinguish client and server side Spark versions

Otherwise it's a little ambiguous what we mean by SPARK_VERSION.

commit 

[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-26 Thread suyanNone
Github user suyanNone commented on the pull request:

https://github.com/apache/spark/pull/4055#issuecomment-71579227
  
@cloud-fan  ZhangLei, SunHongLiang, HanLi, ChenXingYu, blabla...I am 
ZhangLei's classmate in ZJU. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5384][mllib] Vectors.sqdist returns inc...

2015-01-26 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4183#issuecomment-71579048
  
@hhbyyh Great. I created a JIRA for this issue and assigned it to you: 
https://issues.apache.org/jira/browse/SPARK-5419 Thanks!!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...

2015-01-26 Thread saucam
Github user saucam commented on the pull request:

https://github.com/apache/spark/pull/4156#issuecomment-71578360
  
fixed the styling issues. @liancheng thanks for the feedback!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4786][SQL]: Parquet filter pushdown for...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4156#issuecomment-71578336
  
  [Test build #26137 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26137/consoleFull)
 for   PR 4156 at commit 
[`a403979`](https://github.com/apache/spark/commit/a40397938fa777e3303e3dd800d378af4c9df575).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-26 Thread squito
Github user squito commented on the pull request:

https://github.com/apache/spark/pull/3798#issuecomment-71577982
  
@koeninger
I doubt that we want to go this route in this case, but just in case you're 
interested, I think a much better way to handle multiple errors gracefully is 
with [scalactic's `Or`](http://www.scalactic.org/user_guide/OrAndEvery). It's 
much better than `Either` for this case of building up a set of errors to 
report back to the user. And scalactic is a nicely designed, small library 
(e.g. you're not pulling in scalaz). Probably not worth it for this one case, but 
thought you might find it interesting :)
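
A rough sketch of the accumulation style being suggested (assumes scalactic's 
`Good`/`Bad`/`Every` types and the `Accumulation.withGood` helper):

~~~
import org.scalactic._
import Accumulation._

// Each validation returns Or, with failures collected in an Every.
def parseHost(s: String): String Or Every[String] =
  if (s.nonEmpty) Good(s) else Bad(One("empty host"))

def parsePort(s: String): Int Or Every[String] =
  try Good(s.toInt)
  catch { case _: NumberFormatException => Bad(One(s"bad port: $s")) }

// withGood accumulates all failures instead of stopping at the first,
// which is the advantage over chaining Either.
val broker: (String, Int) Or Every[String] =
  withGood(parseHost(""), parsePort("oops")) { (h, p) => (h, p) }
// broker is a Bad containing both error messages.
~~~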


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3298][SQL] Add flag control overwrite r...

2015-01-26 Thread OopsOutOfMemory
Github user OopsOutOfMemory commented on the pull request:

https://github.com/apache/spark/pull/4175#issuecomment-71577816
  
Thanks, squito :) 
I'm working on this issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


