[GitHub] spark pull request: [SPARK-8226][SQL]Add function shiftrightunsign...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7035#issuecomment-118256514
  
  [Test build #36476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36476/console) for PR 7035 at commit [`3e9f5ae`](https://github.com/apache/spark/commit/3e9f5aef20208c7e20e024c20f16745b12f0bea1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class Word2VecModel(JavaVectorTransformer, JavaSaveable, JavaLoader):`
  * `case class CreateNamedStruct(children: Seq[Expression]) extends Expression`
  * `case class Factorial(child: Expression) extends UnaryExpression with ExpectsInputTypes`
  * `case class ShiftLeft(left: Expression, right: Expression) extends BinaryExpression`
  * `case class ShiftRight(left: Expression, right: Expression) extends BinaryExpression`
  * `case class ShiftRightUnsigned(left: Expression, right: Expression) extends BinaryExpression`
  * `case class Md5(child: Expression) extends UnaryExpression with ExpectsInputTypes`
  * `case class Sha1(child: Expression) extends UnaryExpression with ExpectsInputTypes`
  * `case class Crc32(child: Expression) extends UnaryExpression with ExpectsInputTypes`
  * `case class Not(child: Expression) extends UnaryExpression with Predicate with ExpectsInputTypes`
  * `trait StringRegexExpression extends ExpectsInputTypes`
  * `trait CaseConversionExpression extends ExpectsInputTypes`
  * `trait StringComparison extends ExpectsInputTypes`
  * `case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes`
  * `protected[sql] abstract class AtomicType extends DataType`
  * `abstract class NumericType extends AtomicType`
  * `abstract class DataType extends AbstractDataType`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118256360
  
  [Test build #36485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36485/console) for PR 7192 at commit [`738e81d`](https://github.com/apache/spark/commit/738e81dbee65587e85422b434ec2a3b0d684769e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class Word2VecModel(JavaVectorTransformer, JavaSaveable, JavaLoader):`
  * `case class Factorial(child: Expression) extends UnaryExpression with ExpectsInputTypes`






[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118256377
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/7207#discussion_r33843582
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala ---
@@ -82,6 +83,48 @@ class UDFSuite extends QueryTest {
     assert(ctx.sql("SELECT strLenScala('test', 1)").head().getInt(0) === 5)
   }
 
+  test("UDF in a WHERE") {
+    testData.sqlContext.udf.register("oneArgFilter", (n: Int) => { n > 80 })
+
+    val result =
+      testData.sqlContext.sql("SELECT * FROM testData WHERE oneArgFilter(key)")
+    assert(result.count() === 20)
+  }
+
+  test("UDF in a HAVING") {
+    testData.sqlContext.udf.register("havingFilter", (n: Long) => { n > 5 })
+
+    val result =
+      testData.sqlContext.sql("SELECT g, SUM(v) as s FROM groupData GROUP BY g HAVING havingFilter(s)")
--- End diff --

This line exceeds 100 characters, and the last test failure (the Scala style check) is due to this. @spirom, could you add proper line wrapping and indentation here?
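
For reference, a minimal sketch (not necessarily the PR's final fix) of one wrapping that keeps every line under the 100-character limit:

```scala
// Only the Scala line break moves; the SQL string itself is unchanged.
val result = testData.sqlContext.sql(
  "SELECT g, SUM(v) as s FROM groupData GROUP BY g HAVING havingFilter(s)")
```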





[GitHub] spark pull request: [SPARK-8777] [SQL] Add random data generator t...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7176#issuecomment-118254752
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7203#issuecomment-118254748
  
  [Test build #36486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36486/consoleFull) for PR 7203 at commit [`2d0ed15`](https://github.com/apache/spark/commit/2d0ed1578589adf8ad3cbf4bfaff085fb27171df).





[GitHub] spark pull request: [SPARK-8777] [SQL] Add random data generator t...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7176#issuecomment-118254700
  
  [Test build #36474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36474/console) for PR 7176 at commit [`f71634d`](https://github.com/apache/spark/commit/f71634d73470189cfe45a89d2a69ea9c5ffa9e29).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class RandomDataGeneratorSuite extends SparkFunSuite`






[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7203#issuecomment-118253650
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7203#issuecomment-118254038
  
LGTM.





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7203#issuecomment-118253745
  
Merged build started.





[GitHub] spark pull request: [SPARK-8695] [core] [WIP] TreeAggregation shou...

2015-07-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7168#issuecomment-118253071
  
cc @mengxr 





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/7203#discussion_r33843363
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -24,13 +24,18 @@ import org.apache.spark.sql.types.DataType
  * User-defined function.
  * @param dataType  Return type of function.
  */
-case class ScalaUDF(function: AnyRef, dataType: DataType, children: Seq[Expression])
-  extends Expression {
+case class ScalaUDF(
+    function: AnyRef,
+    dataType: DataType,
+    children: Seq[Expression],
+    expectedInputTypes: Seq[DataType] = Nil) extends Expression with ExpectsInputTypes {
--- End diff --

You're right. Fixed.
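
As a minimal sketch of what the new `expectedInputTypes` field buys (illustrative usage, assuming an existing `SQLContext` named `sqlContext`; this is not the PR's test code): with the input types recorded via `ExpectsInputTypes`, the analyzer can insert implicit casts during type coercion.

```scala
// A Double => Double UDF; its expected input type is captured at registration.
sqlContext.udf.register("plusOne", (x: Double) => x + 1.0)

// The Int literal 1 can now be coerced to Double by the analyzer instead of
// failing with a ClassCastException when the UDF runs.
sqlContext.sql("SELECT plusOne(1)").show()
```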





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/7203#discussion_r33843354
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -87,6 +87,7 @@ class UDFRegistration private[sql] (sqlContext: SQLContext) extends Logging {
     (0 to 22).map { x =>
       val types = (1 to x).foldRight("RT")((i, s) => {s"A$i, $s"})
       val typeTags = (1 to x).map(i => s"A${i}: TypeTag").foldLeft("RT: TypeTag")(_ + ", " + _)
+      val inputTypes = (1 to x).foldLeft("Nil")((s, i) => {s"$s :+ ScalaReflection.schemaFor[A$i].dataType"})
--- End diff --

Fixed.
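
To make the quoted template concrete: for `x = 2`, the `foldLeft` above emits this snippet into the generated registration code (an illustrative expansion, not code from the PR itself):

```scala
// Generated source text for a two-argument UDF:
Nil :+ ScalaReflection.schemaFor[A1].dataType :+ ScalaReflection.schemaFor[A2].dataType
```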





[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7099#issuecomment-118251941
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7207#issuecomment-118251827
  
  [Test build #36484 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36484/console) for PR 7207 at commit [`1a3c5ff`](https://github.com/apache/spark/commit/1a3c5ff54c43d60e34e7591e7f175840b0e91513).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class GroupData(g: String, v: Int)`






[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7099#issuecomment-118251812
  
  [Test build #36480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36480/console) for PR 7099 at commit [`072e948`](https://github.com/apache/spark/commit/072e9484eb2750952340ceb10d553dfac6768471).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7207#issuecomment-118251829
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8226][SQL]Add function shiftrightunsign...

2015-07-02 Thread zhichao-li
Github user zhichao-li commented on a diff in the pull request:

https://github.com/apache/spark/pull/7035#discussion_r33843222
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala ---
@@ -521,6 +521,55 @@ case class ShiftRight(left: Expression, right: Expression) extends BinaryExpress
   }
 }
 
+case class ShiftRightUnsigned(left: Expression, right: Expression) extends BinaryExpression {
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    (left.dataType, right.dataType) match {
+      case (NullType, _) | (_, NullType) => return TypeCheckResult.TypeCheckSuccess
+      case (_, IntegerType) => left.dataType match {
+        case LongType | IntegerType | ShortType | ByteType =>
+          return TypeCheckResult.TypeCheckSuccess
+        case _ => // failed
+      }
+      case _ => // failed
+    }
+    TypeCheckResult.TypeCheckFailure(
+      s"ShiftRightUnsigned expects long, integer, short or byte value as first argument and an " +
+        s"integer value as second argument, not (${left.dataType}, ${right.dataType})")
+  }
+
+  override def eval(input: InternalRow): Any = {
+    val valueLeft = left.eval(input)
+    if (valueLeft != null) {
+      val valueRight = right.eval(input)
+      if (valueRight != null) {
+        left.dataType match {
--- End diff --

Hmm, using a pattern match with a typed pattern is indeed a best practice in Scala, rather than type tests and casts. @chenghao-intel, any comments?
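
For readers following along, a small self-contained sketch of the two styles under discussion (illustrative, not the PR's code):

```scala
// Type tests and casts: the check and the conversion are separate steps.
def shiftWithCasts(value: Any, shift: Int): Any =
  if (value.isInstanceOf[Long]) value.asInstanceOf[Long] >>> shift
  else value.asInstanceOf[Int] >>> shift

// Typed patterns: the match tests the type and binds a typed name in one step.
def shiftWithPatterns(value: Any, shift: Int): Any = value match {
  case l: Long => l >>> shift
  case i: Int  => i >>> shift
}
```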





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7207#issuecomment-118251660
  
  [Test build #36484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36484/consoleFull) for PR 7207 at commit [`1a3c5ff`](https://github.com/apache/spark/commit/1a3c5ff54c43d60e34e7591e7f175840b0e91513).





[GitHub] spark pull request: [SPARK-8341] Significant selector feature tran...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6795#discussion_r33843163
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/SignificantSelector.scala ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import scala.collection.mutable
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Model to extract significant indices from a vector.
+ *
+ * A significant index is an index whose value differs across the input vectors.
+ *
+ * For example, HashingTF creates a big sparse vector, and this model converts it to a
+ * smaller vector by dropping the indices whose values are the same for all vectors.
+ *
+ * @param indices array of significant indices.
+ */
+@Experimental
+class SignificantSelectorModel(val indices: Array[Int]) extends VectorTransformer {
+
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  override def transform(vector: Vector): Vector = vector match {
+    case DenseVector(vs) =>
+      Vectors.dense(indices.map(vs))
+
+    case SparseVector(s, ids, vs) =>
+      var sv_idx = 0
+      var new_idx = 0
+      val elements = new mutable.ListBuffer[(Int, Double)]()
+
+      for (idx <- indices) {
+        while (sv_idx < ids.length && ids(sv_idx) < idx) {
+          sv_idx += 1
+        }
+        if (sv_idx < ids.length && ids(sv_idx) == idx) {
+          elements += ((new_idx, vs(sv_idx)))
+          sv_idx += 1
+        }
+        new_idx += 1
+      }
+
+      Vectors.sparse(indices.length, elements)
+
+    case v =>
+      throw new IllegalArgumentException("Unsupported vector type " + v.getClass)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Specialized model for equivalent vectors.
+ */
+@Experimental
+class SignificantSelectorEmptyModel extends SignificantSelectorModel(Array[Int]()) {
+
+  val empty_vector = Vectors.dense(Array[Double]())
+
+  override def transform(vector: Vector): Vector = empty_vector
+}
+
+/**
+ * :: Experimental ::
+ * Creates a significant-indices selector.
+ */
+@Experimental
+class SignificantSelector() {
+
+  /**
+   * Returns a significant vector indices selector.
+   *
+   * @param sources an `RDD[Vector]` containing the vectors.
+   */
+  def fit(sources: RDD[Vector]): SignificantSelectorModel = {
+    val sources_count = sources.count()
+    val significant_indices = sources.flatMap {
+        case DenseVector(vs) =>
+          vs.zipWithIndex
+        case SparseVector(_, ids, vs) =>
+          vs.zip(ids)
+        case v =>
+          throw new IllegalArgumentException("Unsupported vector type " + v.getClass)
+      }
+      .map(e => (e.swap, 1))
+      .reduceByKey(_ + _)
+      .map { case ((idx, value), count) => (idx, (value, count)) }
+      .groupByKey()
+      .mapValues { e =>
+        val values = e.groupBy(_._1)
+        val sum = e.map(_._2).sum
+
+        values.size + (if (sum == sources_count || values.contains(0.0)) 0 else 1)
+      }
+      .filter(_._2 > 1)
+      .keys
+      .collect()
+      .sorted
+
+    if (significant_indices.nonEmpty)
+      new SignificantSelectorModel(significant_indices)
+    else
+      new SignificantSelectorEmptyModel()
--- End diff --

I would prefer not to add a public class if all it's doing is handling a special case. Perhaps we should allow sparse vectors to be empty as well. @mengxr thoughts?
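
A sketch of the alternative being floated, in the context of the quoted file (hedged: this assumes MLlib's constructors were relaxed to accept zero-length vectors, which is presumably the point of the suggestion):

```scala
// With an empty index array, the base model already degenerates on the dense
// path: indices.map(vs) yields a zero-length dense vector.
val model = new SignificantSelectorModel(Array.empty[Int])
model.transform(Vectors.dense(1.0, 2.0, 3.0))  // zero-length vector

// The sparse path would need Vectors.sparse(0, Nil) to be accepted, which is
// exactly the "allow sparse vectors to be empty" change suggested above.
```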



[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118251491
  
  [Test build #36485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36485/consoleFull) for PR 7192 at commit [`738e81d`](https://github.com/apache/spark/commit/738e81dbee65587e85422b434ec2a3b0d684769e).





[GitHub] spark pull request: [SPARK-8341] Significant selector feature tran...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6795#discussion_r33843068
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/SignificantSelectorTest.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.scalatest.FunSuite
+
+class SignificantSelectorTest extends FunSuite with MLlibTestSparkContext {
+  val dv = Vectors.dense(1, 2, 3, 4, 5)
+  val sv1 = Vectors.sparse(5, Seq((0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0)))
+  val sv2 = Vectors.sparse(5, Seq((2, 3.0)))
+
+  test("same result vector") {
+    val vectors = sc.parallelize(List(
+      Vectors.dense(0.0, 1.0, 2.0, 3.0, 4.0),
+      Vectors.dense(4.0, 5.0, 6.0, 7.0, 8.0)
+    ))
+
+    val significant = new SignificantSelector().fit(vectors)
+    assert(significant.transform(dv) == dv)
+    assert(significant.transform(sv1) == sv1)
+    assert(significant.transform(sv2) == sv2)
+  }
+
+  test("shortest result vector") {
+    val vectors = sc.parallelize(List(
+      Vectors.dense(0.0, 2.0, 3.0, 4.0),
+      Vectors.dense(0.0, 2.0, 3.0, 4.0),
+      Vectors.dense(0.0, 2.0, 3.0, 4.0),
+      Vectors.sparse(4, Seq((1, 3.0), (2, 4.0))),
+      Vectors.dense(0.0, 3.0, 5.0, 4.0),
+      Vectors.dense(0.0, 3.0, 7.0, 4.0)
+    ))
+
+    val significant = new SignificantSelector().fit(vectors)
+    assert(significant.transform(dv).toString == "[2.0,3.0,4.0]")
--- End diff --

Test equality of vectors directly instead of comparing their string representations.
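
A minimal sketch of the suggested assertion style within the quoted suite (constructed expected vectors; illustrative only):

```scala
assert(significant.transform(dv) === Vectors.dense(2.0, 3.0, 4.0))
assert(significant.transform(sv1) === Vectors.sparse(3, Seq((0, 2.0), (1, 3.0), (2, 4.0))))
assert(significant.transform(sv2) === Vectors.sparse(3, Seq((1, 3.0))))
```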





[GitHub] spark pull request: [SPARK-8341] Significant selector feature tran...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6795#discussion_r33843067
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/SignificantSelectorTest.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.scalatest.FunSuite
+
+class SignificantSelectorTest extends FunSuite with MLlibTestSparkContext {
+  val dv = Vectors.dense(1, 2, 3, 4, 5)
+  val sv1 = Vectors.sparse(5, Seq((0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0)))
+  val sv2 = Vectors.sparse(5, Seq((2, 3.0)))
+
+  test("same result vector") {
+    val vectors = sc.parallelize(List(
+      Vectors.dense(0.0, 1.0, 2.0, 3.0, 4.0),
+      Vectors.dense(4.0, 5.0, 6.0, 7.0, 8.0)
+    ))
+
+    val significant = new SignificantSelector().fit(vectors)
+    assert(significant.transform(dv) == dv)
+    assert(significant.transform(sv1) == sv1)
+    assert(significant.transform(sv2) == sv2)
+  }
+
+  test("shortest result vector") {
+    val vectors = sc.parallelize(List(
+      Vectors.dense(0.0, 2.0, 3.0, 4.0),
+      Vectors.dense(0.0, 2.0, 3.0, 4.0),
+      Vectors.dense(0.0, 2.0, 3.0, 4.0),
+      Vectors.sparse(4, Seq((1, 3.0), (2, 4.0))),
+      Vectors.dense(0.0, 3.0, 5.0, 4.0),
+      Vectors.dense(0.0, 3.0, 7.0, 4.0)
+    ))
+
+    val significant = new SignificantSelector().fit(vectors)
+    assert(significant.transform(dv).toString == "[2.0,3.0,4.0]")
+    assert(significant.transform(sv1).toString == "(3,[0,1,2],[2.0,3.0,4.0])")
+    assert(significant.transform(sv2).toString == "(3,[1],[3.0])")
+  }
+
+  test("empty result vector") {
+    val vectors = sc.parallelize(List(
+      Vectors.dense(0.0, 2.0, 3.0, 4.0),
+      Vectors.dense(0.0, 2.0, 3.0, 4.0)
+    ))
+
+    val significant = new SignificantSelector().fit(vectors)
+    assert(significant.transform(dv).toString == "[]")
--- End diff --

Same here: test equality of vectors directly instead of comparing their string representations.
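
Sketched for the empty case (illustrative):

```scala
assert(significant.transform(dv) === Vectors.dense(Array.empty[Double]))
```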





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7207#issuecomment-118251389
  
Merged build started.





[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118251388
  
Merged build started.





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7207#issuecomment-118251369
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118251370
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6985#issuecomment-118251165
  
  [Test build #36483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36483/consoleFull) for PR 6985 at commit [`6a20b64`](https://github.com/apache/spark/commit/6a20b64ab169ba31d61cfffa1c6151f34304e8a2).





[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/6985#issuecomment-118251254
  
lgtm





[GitHub] spark pull request: [SPARK-8777] [SQL] Add random data generator t...

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7176#discussion_r33842967
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGeneratorSuite.scala ---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.scalacheck.Prop.{exists, forAll, secure}
+import org.scalatest.prop.Checkers
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.CatalystTypeConverters
+import org.apache.spark.sql.types._
+
+/**
+ * Tests of [[RandomDataGenerator]].
+ */
+class RandomDataGeneratorSuite extends SparkFunSuite with Checkers {
+
+  /**
+   * Tests random data generation for the given type by using it to generate random values then
+   * converting those values into their Catalyst equivalents using CatalystTypeConverters.
+   */
+  def testRandomDataGeneration(dataType: DataType, nullable: Boolean = true): Unit = {
+    val toCatalyst = CatalystTypeConverters.createToCatalystConverter(dataType)
+    val generator = RandomDataGenerator.forType(dataType, nullable).getOrElse {
+      fail(s"Random data generator was not defined for $dataType")
+    }
+    if (nullable) {
+      check(exists(generator) { _ == null })
+    }
+    if (!nullable) {
+      check(forAll(generator) { _ != null })
+    }
+    check(secure(forAll(generator) { v => { toCatalyst(v); true } }))
+  }
+
+  // Basic types:
+  for (
+    dataType <- DataTypeTestUtils.atomicTypes;
+    nullable <- Seq(true, false)
+    if !dataType.isInstanceOf[DecimalType] ||
+      dataType.asInstanceOf[DecimalType].precisionInfo.isEmpty
+  ) {
+    test(s"$dataType (nullable=$nullable)") {
+      testRandomDataGeneration(dataType)
+    }
+  }
+
+  for (
+    arrayType <- DataTypeTestUtils.atomicArrayTypes
+    if RandomDataGenerator.forType(arrayType.elementType, arrayType.containsNull).isDefined
+  ) {
+    test(s"$arrayType") {
+      testRandomDataGeneration(arrayType)
+    }
+  }
+
+  val atomicTypesWithDataGenerators =
+    DataTypeTestUtils.atomicTypes.filter(RandomDataGenerator.forType(_).isDefined)
+
+  // Complex types:
+  for (
+    keyType <- atomicTypesWithDataGenerators;
+    valueType <- atomicTypesWithDataGenerators
+    // Scala's BigDecimal.hashCode can lead to OutOfMemoryError on Scala 2.10 (see SI-6173) and
+    // Spark can hit NumberFormatException errors when converting certain BigDecimals (SPARK-8802).
+    // For these reasons, we don't support generation of maps with decimal keys.
+    if !keyType.isInstanceOf[DecimalType]
+  ) {
+    val mapType = MapType(keyType, valueType)
+    test(s"$mapType") {
+      testRandomDataGeneration(mapType)
+    }
+  }
+
+  for (
+    colOneType <- atomicTypesWithDataGenerators;
--- End diff --

Hmm, to me it is less clear to drop the `;` here, although I don't have a strong preference.





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/7207#issuecomment-118250753
  
ok to test.





[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6985#issuecomment-118250558
  
Merged build started.





[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6985#issuecomment-118250508
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8777] [SQL] Add random data generator t...

2015-07-02 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/7176#issuecomment-118250140
  
LGTM except for minor styling issues.





[GitHub] spark pull request: [SPARK-8776] Increase the default MaxPermSize

2015-07-02 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7196#discussion_r33842849
  
--- Diff: launcher/src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java ---
@@ -194,7 +194,7 @@ private void testCmdBuilder(boolean isDriver) throws Exception {
     if (isDriver) {
--- End diff --

You can remove this condition then?





[GitHub] spark pull request: [SPARK-8777] [SQL] Add random data generator t...

2015-07-02 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/7176#discussion_r33842768
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGeneratorSuite.scala ---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.scalacheck.Prop.{exists, forAll, secure}
+import org.scalatest.prop.Checkers
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.CatalystTypeConverters
+import org.apache.spark.sql.types._
+
+/**
+ * Tests of [[RandomDataGenerator]].
+ */
+class RandomDataGeneratorSuite extends SparkFunSuite with Checkers {
+
+  /**
+   * Tests random data generation for the given type by using it to generate random values then
+   * converting those values into their Catalyst equivalents using CatalystTypeConverters.
+   */
+  def testRandomDataGeneration(dataType: DataType, nullable: Boolean = true): Unit = {
+    val toCatalyst = CatalystTypeConverters.createToCatalystConverter(dataType)
+    val generator = RandomDataGenerator.forType(dataType, nullable).getOrElse {
+      fail(s"Random data generator was not defined for $dataType")
+    }
+    if (nullable) {
+      check(exists(generator) { _ == null })
+    }
+    if (!nullable) {
+      check(forAll(generator) { _ != null })
+    }
+    check(secure(forAll(generator) { v => { toCatalyst(v); true } }))
+  }
+
+  // Basic types:
+  for (
+    dataType <- DataTypeTestUtils.atomicTypes;
+    nullable <- Seq(true, false)
+    if !dataType.isInstanceOf[DecimalType] ||
+      dataType.asInstanceOf[DecimalType].precisionInfo.isEmpty
+  ) {
+    test(s"$dataType (nullable=$nullable)") {
+      testRandomDataGeneration(dataType)
+    }
+  }
+
+  for (
+    arrayType <- DataTypeTestUtils.atomicArrayTypes
+    if RandomDataGenerator.forType(arrayType.elementType, arrayType.containsNull).isDefined
+  ) {
+    test(s"$arrayType") {
+      testRandomDataGeneration(arrayType)
+    }
+  }
+
+  val atomicTypesWithDataGenerators =
+    DataTypeTestUtils.atomicTypes.filter(RandomDataGenerator.forType(_).isDefined)
+
+  // Complex types:
+  for (
+    keyType <- atomicTypesWithDataGenerators;
+    valueType <- atomicTypesWithDataGenerators
+    // Scala's BigDecimal.hashCode can lead to OutOfMemoryError on Scala 2.10 (see SI-6173) and
+    // Spark can hit NumberFormatException errors when converting certain BigDecimals (SPARK-8802).
+    // For these reasons, we don't support generation of maps with decimal keys.
+    if !keyType.isInstanceOf[DecimalType]
+  ) {
+    val mapType = MapType(keyType, valueType)
+    test(s"$mapType") {
+      testRandomDataGeneration(mapType)
+    }
+  }
+
+  for (
+    colOneType <- atomicTypesWithDataGenerators;
--- End diff --

Oh, I didn't notice you were using `()`. You can omit the `;` if you use `{}` instead.
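
For context, a tiny illustration of the rule being described (not from the PR):

```scala
// The parenthesized form needs semicolons between generators...
for (
  x <- Seq(1, 2);
  y <- Seq(3, 4)
) yield (x, y)

// ...while the brace form separates generators with newlines alone.
for {
  x <- Seq(1, 2)
  y <- Seq(3, 4)
} yield (x, y)
```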





[GitHub] spark pull request: [SPARK-8341] Significant selector feature tran...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6795#discussion_r33842684
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/SignificantSelector.scala ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import scala.collection.mutable
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Model to extract significant indices from a vector.
+ *
+ * A significant index is an index whose value differs across the input vectors.
+ *
+ * For example, HashingTF creates a big sparse vector, and this model converts it to a
+ * smaller vector by dropping the indices whose values are the same for all vectors.
+ *
+ * @param indices array of significant indices.
+ */
+@Experimental
+class SignificantSelectorModel(val indices: Array[Int]) extends VectorTransformer {
+
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  override def transform(vector: Vector): Vector = vector match {
+    case DenseVector(vs) =>
+      Vectors.dense(indices.map(vs))
+
+    case SparseVector(s, ids, vs) =>
+      var sv_idx = 0
+      var new_idx = 0
+      val elements = new mutable.ListBuffer[(Int, Double)]()
+
+      for (idx <- indices) {
+        while (sv_idx < ids.length && ids(sv_idx) < idx) {
+          sv_idx += 1
+        }
+        if (sv_idx < ids.length && ids(sv_idx) == idx) {
+          elements += ((new_idx, vs(sv_idx)))
+          sv_idx += 1
+        }
+        new_idx += 1
+      }
+
+      Vectors.sparse(indices.length, elements)
+
+    case v =>
+      throw new IllegalArgumentException("Unsupported vector type " + v.getClass)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Specialized model for equivalent vectors.
+ */
+@Experimental
+class SignificantSelectorEmptyModel extends SignificantSelectorModel(Array[Int]()) {
+
+  val empty_vector = Vectors.dense(Array[Double]())
+
+  override def transform(vector: Vector): Vector = empty_vector
+}
+
+/**
+ * :: Experimental ::
+ * Creates a significant-indices selector.
+ */
+@Experimental
+class SignificantSelector() {
+
+  /**
+   * Returns a significant vector indices selector.
+   *
+   * @param sources an `RDD[Vector]` containing the vectors.
+   */
+  def fit(sources: RDD[Vector]): SignificantSelectorModel = {
+    val sources_count = sources.count()
+    val significant_indices = sources.flatMap {
+        case DenseVector(vs) =>
+          vs.zipWithIndex
+        case SparseVector(_, ids, vs) =>
+          vs.zip(ids)
+        case v =>
+          throw new IllegalArgumentException("Unsupported vector type " + v.getClass)
+      }
+      .map(e => (e.swap, 1))
+      .reduceByKey(_ + _)
+      .map { case ((idx, value), count) => (idx, (value, count)) }
+      .groupByKey()
+      .mapValues { e =>
+        val values = e.groupBy(_._1)
+        val sum = e.map(_._2).sum
+
+        values.size + (if (sum == sources_count || values.contains(0.0)) 0 else 1)
--- End diff --

`SparseVector#size` should give you the total size of the sparse vector, while `SparseVector#numNonzeros` gives you the number of nonzero values.

Also, SparseVectors may contain zero elements (e.g. `Vectors.sparse(1, Seq((0, 0.0)))`); it's just that elements which are not active (in `values`) are assumed to be zero.

I understand what you're doing now, but I think that you should make the handling of sparse/dense more uniform and explicit
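
A small sketch of the distinction drawn above (using the MLlib `Vectors` API as quoted; the exact values are illustrative):

```scala
import org.apache.spark.mllib.linalg.Vectors

// One explicitly stored zero at index 0 and one nonzero at index 2:
val v = Vectors.sparse(3, Seq((0, 0.0), (2, 5.0)))
v.size          // 3 -- the logical length of the vector
v.numNonzeros   // 1 -- explicitly stored zeros are not counted
v.numActives    // 2 -- entries physically present in the index/value arrays
```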

[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118249799
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118249788
  
  [Test build #36481 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36481/console) for PR 7192 at commit [`ec3fb10`](https://github.com/apache/spark/commit/ec3fb10d0cf145bad66e81a412ecc746a0ec5556).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118249489
  
Hi @andrewor14, I rethought this: the closure cleaner may be expensive, but if we can avoid the `$outer` reference, and therefore avoid serializing it, isn't that a kind of speed-up? Sorry if I missed something here.
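
For readers unfamiliar with the `$outer` capture being discussed, a generic sketch (not the PR's code):

```scala
// A closure defined in a method captures `this` (the synthetic $outer field)
// if it touches any member, so serializing the closure drags the entire
// enclosing object along. Copying the member to a local first avoids that.
class Driver(val factor: Int) extends Serializable {
  def scaled: Int => Int = {
    val f = factor  // local copy: the returned closure captures only `f`,
    x => x * f      // not the enclosing Driver instance
  }
}
```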





[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6985#discussion_r33842559
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala
 ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeGenContext, 
GeneratedExpressionCode}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+
+/**
+ * Returns the current date at the start of query evaluation.
+ * All calls of current_date within the same query return the same value.
+ */
+case class CurrentDate() extends LeafExpression {
+  override def foldable: Boolean = true
+  override def nullable: Boolean = false
+
+  override def dataType: DataType = DateType
+
+  override def eval(input: InternalRow): Any = {
+    DateTimeUtils.millisToDays(System.currentTimeMillis())
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
+    val datetimeUtils = "org.apache.spark.sql.catalyst.util.DateTimeUtils"
+    s"""
+      boolean ${ev.isNull} = false;
+      ${ctx.javaType(dataType)} ${ev.primitive} =
+        $datetimeUtils.millisToDays(System.currentTimeMillis());
+    """
+  }
+}
+
+/**
+ * Returns the current timestamp at the start of query evaluation.
+ * All calls of current_timestamp within the same query return the same 
value.
+ */
+case class CurrentTimestamp() extends LeafExpression {
+  override def foldable: Boolean = true
+  override def nullable: Boolean = false
+
+  override def dataType: DataType = TimestampType
+
+  override def eval(input: InternalRow): Any = {
+    System.currentTimeMillis() * 1L
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
--- End diff --

you can remove this one as well.






[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6985#discussion_r33842558
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeFunctions.scala ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodeGenContext, GeneratedExpressionCode}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+
+/**
+ * Returns the current date at the start of query evaluation.
+ * All calls of current_date within the same query return the same value.
+ */
+case class CurrentDate() extends LeafExpression {
+  override def foldable: Boolean = true
+  override def nullable: Boolean = false
+
+  override def dataType: DataType = DateType
+
+  override def eval(input: InternalRow): Any = {
+    DateTimeUtils.millisToDays(System.currentTimeMillis())
+  }
+
+  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
--- End diff --

Actually, you probably don't need the genCode version at all, since this expression will 
always get constant-folded; see the sketch below.
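
A rough sketch of why the codegen path is dead weight here (this assumes the optimizer's 
usual constant-folding behavior and is not the actual rule's source):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

// Because CurrentDate is foldable, constant folding can evaluate it once at
// optimization time and substitute a Literal, so codegen only ever sees the
// Literal, never CurrentDate.genCode.
def foldIfConstant(e: Expression): Expression =
  if (e.foldable) Literal.create(e.eval(null), e.dataType) else e
```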





[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6985#discussion_r33842525
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatetimeExpressionsSuite.scala ---
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.sql.Date
+
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.functions._
+
+class DatetimeExpressionsSuite extends QueryTest {
+  private lazy val ctx = org.apache.spark.sql.test.TestSQLContext
+
+  import ctx.implicits._
+
+  val df1 = Seq((1, 2), (3, 1)).toDF("a", "b")
--- End diff --

Don't do this here: if there is an error during the DataFrame creation, all the test 
cases will disappear, because the suite object can't be constructed.

You can add a `lazy` in front of the `val` here, as shown below.
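
That is, a sketch of the suggested change:

```scala
// Deferring construction means a failure here shows up in the first test
// that touches df1 instead of preventing the whole suite object from being
// constructed (which would make every test case disappear).
lazy val df1 = Seq((1, 2), (3, 1)).toDF("a", "b")
```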






[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118248666
  
  [Test build #36482 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36482/consoleFull)
 for   PR 7192 at commit 
[`2fd6f40`](https://github.com/apache/spark/commit/2fd6f40d4828de0cf4f8c92081fedd0b7eb9d2f6).





[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118248598
  
Merged build started.





[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118248592
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7207#issuecomment-118248560
  
Jenkins, ok to test.






[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7207#issuecomment-118248348
  
Thanks - this looks pretty good for a first patch!

There are just some minor style issues.






[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7207#discussion_r33842386
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala ---
@@ -82,6 +83,48 @@ class UDFSuite extends QueryTest {
     assert(ctx.sql("SELECT strLenScala('test', 1)").head().getInt(0) === 5)
   }
 
+  test("UDF in a WHERE") {
+    testData.sqlContext.udf.register("oneArgFilter", (n:Int) => { n > 80 })
+
+    val result =
+      testData.sqlContext.sql("SELECT * FROM testData WHERE oneArgFilter(key)")
+    assert(result.count() === 20)
+  }
+
+  test("UDF in a HAVING") {
+    testData.sqlContext.udf.register("havingFilter", (n:Long) => { n > 5 })
+
+    val result =
+      testData.sqlContext.sql("SELECT g, SUM(v) as s FROM groupData GROUP BY g HAVING havingFilter(s)")
+    assert(result.count() === 2)
+  }
+
+  test("UDF in a GROUP BY") {
+    testData.sqlContext.udf.register("groupFunction", (n:Int) => { n > 10 })
+
+    val result =
+      testData.sqlContext.sql("SELECT SUM(v) FROM groupData GROUP BY groupFunction(v)")
+    assert(result.count() === 2)
+  }
+
+  test("UDFs everywhere") {
+    ctx.udf.register("groupFunction", (n:Int) => { n > 10 })
--- End diff --

add a space after colon, i.e.
```scala
(n: Int) => { n > 10 }
```





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7207#discussion_r33842367
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala ---
@@ -82,6 +83,48 @@ class UDFSuite extends QueryTest {
     assert(ctx.sql("SELECT strLenScala('test', 1)").head().getInt(0) === 5)
   }
 
+  test("UDF in a WHERE") {
+    testData.sqlContext.udf.register("oneArgFilter", (n:Int) => { n > 80 })
+
+    val result =
+      testData.sqlContext.sql("SELECT * FROM testData WHERE oneArgFilter(key)")
+    assert(result.count() === 20)
+  }
+
+  test("UDF in a HAVING") {
+    testData.sqlContext.udf.register("havingFilter", (n:Long) => { n > 5 })
+
+    val result =
+      testData.sqlContext.sql("SELECT g, SUM(v) as s FROM groupData GROUP BY g HAVING havingFilter(s)")
--- End diff --

We now prefer keeping the dataset close to the test case rather than putting it in TestData.

You can easily do something like this in each test case itself:

```
val df = Seq(("red", 1), ("red", 2), ("blue", 10), ("green", 100),
  ("green", 200)).toDF("g", "v")
df.registerTempTable("groupData")
```
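
Put together, a test following that advice might look like this (a sketch reusing the 
data and the `havingFilter` UDF from the diff above; it assumes the suite's existing 
`ctx` and `import ctx.implicits._`):

```scala
test("UDF in a HAVING") {
  // Data lives inside the test case, so a construction failure only fails
  // this test instead of the whole suite.
  val df = Seq(("red", 1), ("red", 2), ("blue", 10), ("green", 100),
    ("green", 200)).toDF("g", "v")
  df.registerTempTable("groupData")
  ctx.udf.register("havingFilter", (n: Long) => n > 5)

  // Sums are red = 3, blue = 10, green = 300; only blue and green pass.
  val result =
    ctx.sql("SELECT g, SUM(v) AS s FROM groupData GROUP BY g HAVING havingFilter(s)")
  assert(result.count() === 2)
}
```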






[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7207#discussion_r33842348
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala ---
@@ -82,6 +83,48 @@ class UDFSuite extends QueryTest {
     assert(ctx.sql("SELECT strLenScala('test', 1)").head().getInt(0) === 5)
   }
 
+  test("UDF in a WHERE") {
+    testData.sqlContext.udf.register("oneArgFilter", (n:Int) => { n > 80 })
--- End diff --

We now prefer keeping the dataset close to the test case rather than putting it in TestData.

You can easily do something like this in each test case itself:

```
val df = Seq(("red", 1), ("red", 2), ("blue", 10), ("green", 100),
  ("green", 200)).toDF("g", "v")
df.registerTempTable("groupData")
```






[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7203#issuecomment-118247852
  
Looks good. Just two nitpicks.






[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7203#discussion_r33842250
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -24,13 +24,18 @@ import org.apache.spark.sql.types.DataType
  * User-defined function.
  * @param dataType  Return type of function.
  */
-case class ScalaUDF(function: AnyRef, dataType: DataType, children: Seq[Expression])
-  extends Expression {
+case class ScalaUDF(
+    function: AnyRef,
+    dataType: DataType,
+    children: Seq[Expression],
+    expectedInputTypes: Seq[DataType] = Nil) extends Expression with ExpectsInputTypes {
--- End diff --

I think you can just call this `inputTypes` and remove the separate `inputTypes` function; see the sketch below.
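
That is, something along these lines (a sketch of the suggested signature, not the 
merged code): the constructor parameter itself then satisfies the `inputTypes` member 
that `ExpectsInputTypes` requires.

```scala
case class ScalaUDF(
    function: AnyRef,
    dataType: DataType,
    children: Seq[Expression],
    inputTypes: Seq[DataType] = Nil) extends Expression with ExpectsInputTypes
```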





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7203#discussion_r3384
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -87,6 +87,7 @@ class UDFRegistration private[sql] (sqlContext: SQLContext) extends Logging {
     (0 to 22).map { x =>
       val types = (1 to x).foldRight("RT")((i, s) => {s"A$i, $s"})
       val typeTags = (1 to x).map(i => s"A${i}: TypeTag").foldLeft("RT: TypeTag")(_ + ", " + _)
+      val inputTypes = (1 to x).foldLeft("Nil")((s, i) => {s"$s :+ ScalaReflection.schemaFor[A$i].dataType"})
--- End diff --

I think the convention is usually
```scala
a :: b :: c :: Nil
```

rather than 
```scala
Nil :+ a :+ ...
```

Do you mind updating it?
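
For example, the generator could emit the cons-list form with a `foldRight` (a sketch 
adapted from the diff above):

```scala
// For x = 2 this builds the string:
// "ScalaReflection.schemaFor[A1].dataType :: ScalaReflection.schemaFor[A2].dataType :: Nil"
val inputTypes = (1 to x).foldRight("Nil") { (i, s) =>
  s"ScalaReflection.schemaFor[A$i].dataType :: $s"
}
```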





[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118247030
  
  [Test build #36481 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36481/consoleFull)
 for   PR 7192 at commit 
[`ec3fb10`](https://github.com/apache/spark/commit/ec3fb10d0cf145bad66e81a412ecc746a0ec5556).





[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118246922
  
Merged build started.





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7207#issuecomment-118246915
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-8796][SQL] mark child as transient in I...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7192#issuecomment-118246916
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7099#issuecomment-118246828
  
  [Test build #36480 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36480/consoleFull)
 for   PR 7099 at commit 
[`072e948`](https://github.com/apache/spark/commit/072e9484eb2750952340ceb10d553dfac6768471).





[GitHub] spark pull request: [SPARK-8810] [SQL] Added several UDF unit test...

2015-07-02 Thread spirom
GitHub user spirom opened a pull request:

https://github.com/apache/spark/pull/7207

[SPARK-8810] [SQL] Added several UDF unit tests for Spark SQL

One test for each of the GROUP BY, WHERE and HAVING clauses, and one that 
combines all three with an additional UDF in the SELECT. 

(Since this is my first attempt at contributing to SPARK, meta-level 
guidance on anything I've screwed up would be greatly appreciated, whether 
important or minor.)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/spirom/spark udf-test-branch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7207


commit 1a3c5ff54c43d60e34e7591e7f175840b0e91513
Author: Spiro Michaylov 
Date:   2015-07-02T11:34:51Z

Added several UDF unit tests for Spark SQL







[GitHub] spark pull request: [SPARK-7190] [SPARK-7815] unsafe UTF8String

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/7197#issuecomment-118246365
  
This looks pretty good to me overall; I left a few small optimization-related comments.





[GitHub] spark pull request: [SPARK-8226][SQL]Add function shiftrightunsign...

2015-07-02 Thread tarekauel
Github user tarekauel commented on a diff in the pull request:

https://github.com/apache/spark/pull/7035#discussion_r33841943
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala ---
@@ -521,6 +521,55 @@ case class ShiftRight(left: Expression, right: Expression) extends BinaryExpression
   }
 }
 
+case class ShiftRightUnsigned(left: Expression, right: Expression) extends BinaryExpression {
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    (left.dataType, right.dataType) match {
+      case (NullType, _) | (_, NullType) => return TypeCheckResult.TypeCheckSuccess
+      case (_, IntegerType) => left.dataType match {
+        case LongType | IntegerType | ShortType | ByteType =>
+          return TypeCheckResult.TypeCheckSuccess
+        case _ => // failed
+      }
+      case _ => // failed
+    }
+    TypeCheckResult.TypeCheckFailure(
+      s"ShiftRightUnsigned expects long, integer, short or byte value as first argument and an " +
+        s"integer value as second argument, not (${left.dataType}, ${right.dataType})")
+  }
+
+  override def eval(input: InternalRow): Any = {
+    val valueLeft = left.eval(input)
+    if (valueLeft != null) {
+      val valueRight = right.eval(input)
+      if (valueRight != null) {
+        left.dataType match {
--- End diff --

That one is interesting. Have you seen my comment? Could you have a look at this gist: 
https://gist.github.com/tarekauel/6994983b83a51668c5dc? Am I getting something wrong?
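
For reference, the semantics the expression wraps are the JVM's unsigned right shift 
(`>>>`), which zero-fills instead of sign-extending, and whose result depends on the 
operand width, hence the per-type dispatch in `eval`:

```scala
-8 >> 1    // -4: arithmetic shift preserves the sign bit
-8 >>> 1   // 2147483644: zero-fill on a 32-bit Int
-8L >>> 1  // 9223372036854775804: same operation on a 64-bit Long
```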





[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7099#issuecomment-118246123
  
Merged build started.





[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7099#issuecomment-118246087
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7190] [SPARK-7815] unsafe UTF8String

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7197#discussion_r33841830
  
--- Diff: 
unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -201,7 +234,7 @@ public int compare(final UTF8String other) {
   @Override
   public boolean equals(final Object other) {
 if (other instanceof UTF8String) {
-  return Arrays.equals(bytes, ((UTF8String) other).getBytes());
+  return compareTo((UTF8String) other) == 0;
--- End diff --

Since I suspect that string equality comparisons could be a very frequent / 
expensive operation, this might be a case where it would be worthwhile to use a 
fast byte array equality method (see my suggestion upthread on taking the byte 
comparison loop in `matches` and factoring it out into a static method in 
`ByteArrayMethods`).  It might also be faster to express this as a check that 
the strings have the same length and that one string matches at the start of 
the other.
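
A sketch of that fast path (in Scala for brevity, although the real class is Java; 
`size` and `matchAt` are the members from the diff above):

```scala
override def equals(other: Any): Boolean = other match {
  // Equal length plus a match at offset 0 implies byte-for-byte equality,
  // and matchAt can compare eight bytes per iteration via getLong.
  case s: UTF8String => size == s.size && matchAt(s, 0)
  case _ => false
}
```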





[GitHub] spark pull request: [SPARK-7190] [SPARK-7815] unsafe UTF8String

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7197#discussion_r33841774
  
--- Diff: unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java ---
@@ -25,21 +25,30 @@
 public class UTF8StringSuite {
 
   private void checkBasic(String str, int len) throws UnsupportedEncodingException {
-    Assert.assertEquals(UTF8String.fromString(str).length(), len);
-    Assert.assertEquals(UTF8String.fromBytes(str.getBytes("utf8")).length(), len);
+    UTF8String s1 = UTF8String.fromString(str);
+    UTF8String s2 = UTF8String.fromBytes(str.getBytes("utf8"));
+    Assert.assertEquals(s1.length(), len);
+    Assert.assertEquals(s2.length(), len);
 
-    Assert.assertEquals(UTF8String.fromString(str).toString(), str);
-    Assert.assertEquals(UTF8String.fromBytes(str.getBytes("utf8")).toString(), str);
-    Assert.assertEquals(UTF8String.fromBytes(str.getBytes("utf8")), UTF8String.fromString(str));
+    Assert.assertEquals(s1.toString(), str);
+    Assert.assertEquals(s2.toString(), str);
+    Assert.assertEquals(s1, s2);
 
-    Assert.assertEquals(UTF8String.fromString(str).hashCode(),
-      UTF8String.fromBytes(str.getBytes("utf8")).hashCode());
+    Assert.assertEquals(s1.hashCode(), s2.hashCode());
+
+    Assert.assertEquals(s1.compare(s2), 0);
+    Assert.assertEquals(s1.compareTo(s2), 0);
+
+    Assert.assertEquals(s1.contains(s2), true);
+    Assert.assertEquals(s2.contains(s1), true);
+    Assert.assertEquals(s1.startsWith(s1), true);
+    Assert.assertEquals(s1.endsWith(s1), true);
   }
 
   @Test
   public void basicTest() throws UnsupportedEncodingException {
     checkBasic("hello", 5);
-    checkBasic("世 界", 3);
+    checkBasic("大 千 世 界", 7);
--- End diff --

We should probably also add a test case for the empty string.
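
For example (assuming `checkBasic` keeps its current signature; the empty string has 
zero bytes and zero code points):

```scala
checkBasic("", 0)
```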





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7203#issuecomment-118244963
  
  [Test build #36479 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36479/consoleFull)
 for   PR 7203 at commit 
[`dce1efd`](https://github.com/apache/spark/commit/dce1efd4f50e9ba0f356388c04e5bfddc8fc701c).





[GitHub] spark pull request: [SPARK-7190] [SPARK-7815] unsafe UTF8String

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7197#discussion_r33841742
  
--- Diff: unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java ---
@@ -25,21 +25,30 @@
 public class UTF8StringSuite {
 
   private void checkBasic(String str, int len) throws UnsupportedEncodingException {
-    Assert.assertEquals(UTF8String.fromString(str).length(), len);
-    Assert.assertEquals(UTF8String.fromBytes(str.getBytes("utf8")).length(), len);
+    UTF8String s1 = UTF8String.fromString(str);
+    UTF8String s2 = UTF8String.fromBytes(str.getBytes("utf8"));
+    Assert.assertEquals(s1.length(), len);
+    Assert.assertEquals(s2.length(), len);
 
-    Assert.assertEquals(UTF8String.fromString(str).toString(), str);
-    Assert.assertEquals(UTF8String.fromBytes(str.getBytes("utf8")).toString(), str);
-    Assert.assertEquals(UTF8String.fromBytes(str.getBytes("utf8")), UTF8String.fromString(str));
+    Assert.assertEquals(s1.toString(), str);
+    Assert.assertEquals(s2.toString(), str);
+    Assert.assertEquals(s1, s2);
 
-    Assert.assertEquals(UTF8String.fromString(str).hashCode(),
-      UTF8String.fromBytes(str.getBytes("utf8")).hashCode());
+    Assert.assertEquals(s1.hashCode(), s2.hashCode());
+
+    Assert.assertEquals(s1.compare(s2), 0);
+    Assert.assertEquals(s1.compareTo(s2), 0);
+
+    Assert.assertEquals(s1.contains(s2), true);
--- End diff --

I think you can use `assertTrue` here.  As long as you're changing this 
file, you might as well change it to statically import the `assert*` methods.





[GitHub] spark pull request: [SPARK-7190] [SPARK-7815] unsafe UTF8String

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7197#discussion_r33841711
  
--- Diff: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -45,33 +46,29 @@
     6, 6, 6, 6};
 
   public static UTF8String fromBytes(byte[] bytes) {
--- End diff --

Maybe it's not necessary, but should we add a comment to call out the fact 
that this doesn't defensively copy `bytes`?





[GitHub] spark pull request: [SPARK-7190] [SPARK-7815] unsafe UTF8String

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7197#discussion_r33841678
  
--- Diff: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -106,92 +105,126 @@ public int length() {
    * @param until the position after last code point, exclusive.
    */
   public UTF8String substring(final int start, final int until) {
-    if (until <= start || start >= bytes.length) {
+    if (until <= start || start >= size) {
       return UTF8String.fromBytes(new byte[0]);
     }
 
     int i = 0;
     int c = 0;
-    for (; i < bytes.length && c < start; i += numBytes(bytes[i])) {
+    for (; i < size && c < start; i += numBytes(getByte(i))) {
       c += 1;
     }
 
     int j = i;
-    for (; j < bytes.length && c < until; j += numBytes(bytes[i])) {
+    for (; j < size && c < until; j += numBytes(getByte(i))) {
       c += 1;
     }
 
-    return UTF8String.fromBytes(Arrays.copyOfRange(bytes, i, j));
+    byte[] bytes = new byte[j - i];
+    copyMemory(base, offset + i, bytes, BYTE_ARRAY_OFFSET, j - i);
+    return UTF8String.fromBytes(bytes);
   }
 
   public boolean contains(final UTF8String substring) {
-    final byte[] b = substring.getBytes();
-    if (b.length == 0) {
+    if (substring.size == 0) {
       return true;
     }
 
-    for (int i = 0; i <= bytes.length - b.length; i++) {
-      if (bytes[i] == b[0] && startsWith(b, i)) {
+    byte first = substring.getByte(0);
+    for (int i = 0; i <= size - substring.size; i++) {
+      if (getByte(i) == first && matchAt(substring, i)) {
         return true;
       }
     }
     return false;
   }
 
-  private boolean startsWith(final byte[] prefix, int offsetInBytes) {
-    if (prefix.length + offsetInBytes > bytes.length || offsetInBytes < 0) {
+  private long getLong(int i) {
+    return UNSAFE.getLong(base, offset + i);
+  }
+
+  private byte getByte(int i) {
+    return UNSAFE.getByte(base, offset + i);
+  }
+
+  private boolean matchAt(final UTF8String s, int pos) {
+    if (s.size + pos > size || pos < 0) {
       return false;
     }
+
     int i = 0;
-    while (i < prefix.length && prefix[i] == bytes[i + offsetInBytes]) {
-      i++;
+    while (i <= s.size - 8) {
+      if (getLong(pos + i) != s.getLong(i)) {
+        return false;
+      }
+      i += 8;
     }
-    return i == prefix.length;
+    while (i < s.size) {
+      if (getByte(pos + i) != s.getByte(i)) {
+        return false;
+      }
+      i += 1;
+    }
+    return true;
   }
 
   public boolean startsWith(final UTF8String prefix) {
-    return startsWith(prefix.getBytes(), 0);
+    return matchAt(prefix, 0);
   }
 
   public boolean endsWith(final UTF8String suffix) {
-    return startsWith(suffix.getBytes(), bytes.length - suffix.getBytes().length);
+    return matchAt(suffix, size - suffix.size);
   }
 
   public UTF8String toUpperCase() {
+    // this is locale aware
     return UTF8String.fromString(toString().toUpperCase());
   }
 
   public UTF8String toLowerCase() {
+    // this is locale aware
     return UTF8String.fromString(toString().toLowerCase());
   }
 
   @Override
   public String toString() {
     try {
-      return new String(bytes, "utf-8");
+      // this is slow
+      return new String(getBytes(), "utf-8");
--- End diff --

i.e., why not pass `bytes` directly?





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7203#issuecomment-118244289
  
Merged build started.





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7203#issuecomment-118244168
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7190] [SPARK-7815] unsafe UTF8String

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7197#discussion_r33841668
  
--- Diff: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -106,92 +105,126 @@ public int length() {
    * @param until the position after last code point, exclusive.
    */
   public UTF8String substring(final int start, final int until) {
-    if (until <= start || start >= bytes.length) {
+    if (until <= start || start >= size) {
       return UTF8String.fromBytes(new byte[0]);
     }
 
     int i = 0;
     int c = 0;
-    for (; i < bytes.length && c < start; i += numBytes(bytes[i])) {
+    for (; i < size && c < start; i += numBytes(getByte(i))) {
       c += 1;
     }
 
     int j = i;
-    for (; j < bytes.length && c < until; j += numBytes(bytes[i])) {
+    for (; j < size && c < until; j += numBytes(getByte(i))) {
       c += 1;
     }
 
-    return UTF8String.fromBytes(Arrays.copyOfRange(bytes, i, j));
+    byte[] bytes = new byte[j - i];
+    copyMemory(base, offset + i, bytes, BYTE_ARRAY_OFFSET, j - i);
+    return UTF8String.fromBytes(bytes);
   }
 
   public boolean contains(final UTF8String substring) {
-    final byte[] b = substring.getBytes();
-    if (b.length == 0) {
+    if (substring.size == 0) {
       return true;
     }
 
-    for (int i = 0; i <= bytes.length - b.length; i++) {
-      if (bytes[i] == b[0] && startsWith(b, i)) {
+    byte first = substring.getByte(0);
+    for (int i = 0; i <= size - substring.size; i++) {
+      if (getByte(i) == first && matchAt(substring, i)) {
         return true;
       }
     }
     return false;
   }
 
-  private boolean startsWith(final byte[] prefix, int offsetInBytes) {
-    if (prefix.length + offsetInBytes > bytes.length || offsetInBytes < 0) {
+  private long getLong(int i) {
+    return UNSAFE.getLong(base, offset + i);
+  }
+
+  private byte getByte(int i) {
+    return UNSAFE.getByte(base, offset + i);
+  }
+
+  private boolean matchAt(final UTF8String s, int pos) {
+    if (s.size + pos > size || pos < 0) {
      return false;
     }
+
     int i = 0;
-    while (i < prefix.length && prefix[i] == bytes[i + offsetInBytes]) {
-      i++;
+    while (i <= s.size - 8) {
+      if (getLong(pos + i) != s.getLong(i)) {
+        return false;
+      }
+      i += 8;
     }
-    return i == prefix.length;
+    while (i < s.size) {
+      if (getByte(pos + i) != s.getByte(i)) {
+        return false;
+      }
+      i += 1;
+    }
+    return true;
   }
 
   public boolean startsWith(final UTF8String prefix) {
-    return startsWith(prefix.getBytes(), 0);
+    return matchAt(prefix, 0);
   }
 
   public boolean endsWith(final UTF8String suffix) {
-    return startsWith(suffix.getBytes(), bytes.length - suffix.getBytes().length);
+    return matchAt(suffix, size - suffix.size);
   }
 
   public UTF8String toUpperCase() {
+    // this is locale aware
     return UTF8String.fromString(toString().toUpperCase());
   }
 
   public UTF8String toLowerCase() {
+    // this is locale aware
     return UTF8String.fromString(toString().toLowerCase());
   }
 
   @Override
   public String toString() {
     try {
-      return new String(bytes, "utf-8");
+      // this is slow
+      return new String(getBytes(), "utf-8");
--- End diff --

Just curious: why do we need the copy here in `getBytes()`?  Does String 
hold onto the byte array or otherwise manipulate it?





[GitHub] spark pull request: [SPARK-8809][SQL] Remove ConvertNaNs analyzer ...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7206#issuecomment-118243561
  
  [Test build #36477 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36477/consoleFull)
 for   PR 7206 at commit 
[`3d99c33`](https://github.com/apache/spark/commit/3d99c3300f8266051bd3b40a5672c6f4b0892419).





[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6985#issuecomment-118243500
  
  [Test build #36478 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36478/consoleFull)
 for   PR 6985 at commit 
[`27c9f95`](https://github.com/apache/spark/commit/27c9f95d512229b65eed10b0e2b6abc2d6952f11).





[GitHub] spark pull request: [SPARK-8572] [SQL] Type coercion for ScalaUDFs

2015-07-02 Thread piaozhexiu
Github user piaozhexiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/7203#discussion_r33841596
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -126,7 +126,8 @@ class UDFRegistration private[sql] (sqlContext: SQLContext) extends Logging {
    */
   def register[RT: TypeTag](name: String, func: Function0[RT]): UserDefinedFunction = {
     val dataType = ScalaReflection.schemaFor[RT].dataType
-    def builder(e: Seq[Expression]) = ScalaUDF(func, dataType, e)
+    val inputTypes = Seq[DataType]()
--- End diff --

@rxin Thank you for reviewing! I updated the script.





[GitHub] spark pull request: [SPARK-7190] [SPARK-7815] unsafe UTF8String

2015-07-02 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/7197#discussion_r33841582
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java ---
@@ -263,17 +263,17 @@ public Object get(int i) {
   boolean isString = (v >> (OFFSET_BITS * 2)) > 0;
   int offset = (int) ((v >> OFFSET_BITS) & Integer.MAX_VALUE);
   int size = (int) (v & Integer.MAX_VALUE);
-  final byte[] bytes = new byte[size];
-  PlatformDependent.copyMemory(
-baseObject,
-baseOffset + offset,
-bytes,
-PlatformDependent.BYTE_ARRAY_OFFSET,
-size
-  );
   if (isString) {
-return UTF8String.fromBytes(bytes);
+return new UTF8String(baseObject, baseOffset + offset, size);
   } else {
+final byte[] bytes = new byte[size];
+PlatformDependent.copyMemory(
+baseObject,
--- End diff --

IntelliJ did; will fix it.





[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6985#issuecomment-118242472
  
Merged build started.





[GitHub] spark pull request: [SPARK-8809][SQL] Remove ConvertNaNs analyzer ...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7206#issuecomment-118242425
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8809][SQL] Remove ConvertNaNs analyzer ...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7206#issuecomment-118242471
  
Merged build started.





[GitHub] spark pull request: [SPARK-8809][SQL] Remove ConvertNaNs analyzer ...

2015-07-02 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/7206

[SPARK-8809][SQL] Remove ConvertNaNs analyzer rule.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark convertnans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7206


commit 3d99c3300f8266051bd3b40a5672c6f4b0892419
Author: Reynold Xin 
Date:   2015-07-03T05:08:52Z

[SPARK-8809][SQL] Remove ConvertNaNs analyzer rule.







[GitHub] spark pull request: [SPARK-8192] [SPARK-8193] [SQL] udf current_da...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6985#issuecomment-118242437
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7190] [SPARK-7815] unsafe UTF8String

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7197#discussion_r33841512
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java ---
@@ -263,17 +263,17 @@ public Object get(int i) {
   boolean isString = (v >> (OFFSET_BITS * 2)) > 0;
   int offset = (int) ((v >> OFFSET_BITS) & Integer.MAX_VALUE);
   int size = (int) (v & Integer.MAX_VALUE);
-  final byte[] bytes = new byte[size];
-  PlatformDependent.copyMemory(
-baseObject,
-baseOffset + offset,
-bytes,
-PlatformDependent.BYTE_ARRAY_OFFSET,
-size
-  );
   if (isString) {
-return UTF8String.fromBytes(bytes);
+return new UTF8String(baseObject, baseOffset + offset, size);
   } else {
+final byte[] bytes = new byte[size];
+PlatformDependent.copyMemory(
+baseObject,
--- End diff --

This looks over-indented.





[GitHub] spark pull request: [SPARK-8802] [WIP] [SQL] Decimal.apply(BigDeci...

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/7198#issuecomment-118241572
  
It would also be fine to conclude that this is not an issue as long as it 
only happens for contrived BigDecimal values which can't actually appear in 
practice.





[GitHub] spark pull request: [SPARK-8797] [WIP] Fix comparison of NaN value...

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/7194#issuecomment-118241214
  
One subtlety: there can be multiple float / double bit patterns that are NaN, so 
clustered sorting based on the bit patterns is not always sufficient to properly 
implement COUNT DISTINCT over a set of grouping columns that may contain NaN values.
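
A quick illustration of the subtlety, using the standard JVM bit-conversion methods 
(`doubleToRawLongBits` preserves the exact bit pattern, while `doubleToLongBits` 
canonicalizes every NaN to a single pattern):

```scala
import java.lang.Double.{doubleToLongBits, doubleToRawLongBits, longBitsToDouble}

val nan1 = Double.NaN
val nan2 = longBitsToDouble(0x7ff0000000000001L) // a different NaN bit pattern

doubleToRawLongBits(nan1) == doubleToRawLongBits(nan2) // false: raw bits differ
doubleToLongBits(nan1) == doubleToLongBits(nan2)       // true: both canonicalized
nan1 == nan2                                           // false: NaN never equals NaN
```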





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r33841411
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala ---
@@ -526,3 +529,171 @@ case class Logarithm(left: Expression, right: Expression)
     """
   }
 }
+
+case class Round(child: Expression, scale: Expression) extends Expression {
+
+  def this(child: Expression) = {
+    this(child, Literal(0))
+  }
+
+  override def children: Seq[Expression] = Seq(child, scale)
+
+  override def nullable: Boolean = true
+
+  override def foldable: Boolean = child.foldable
+
+  override lazy val dataType: DataType = child.dataType match {
+    case StringType | BinaryType => DoubleType
+    case DecimalType.Fixed(p, s) => DecimalType(p, _scale)
+    case t => t
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case _: NumericType | NullType | BinaryType | StringType => // satisfy requirement
--- End diff --

+1, the implicit casting should be handled by the analyzer, not the
expression itself.

I did that here: 
https://github.com/apache/spark/pull/7175/files#diff-d33f6b266aab79a1708e888dc1a1caf3R725







[GitHub] spark pull request: [SPARK-8777] [SQL] Add random data generator t...

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/7176#issuecomment-118240852
  
Alright, I've backed out the ScalaCheck usage and have replied to the 
review comments above.





[GitHub] spark pull request: [SPARK-8777] [SQL] [DO NOT MERGE] Add random d...

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7176#discussion_r33841373
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.lang.Double.longBitsToDouble
+import java.lang.Float.intBitsToFloat
+
+import scala.util.Random
+
+import org.apache.spark.sql.types._
+
+/**
+ * Random data generators for Spark SQL DataTypes. These generators do not 
generate uniformly random
+ * values; instead, they're biased to return "interesting" values (such as 
maximum / minimum values)
+ * with higher probability.
+ */
+object RandomDataGenerator {
+
+  /**
+   * The conditional probability of a non-null value being drawn from a 
set of "interesting" values
+   * instead of being chosen uniformly at random.
+   */
+  private val PROBABILITY_OF_INTERESTING_VALUE: Float = 0.5f
+
+  /**
+   * The probability of the generated value being null
+   */
+  private val PROBABILITY_OF_NULL: Float = 0.1f
+
+  private val MAX_STR_LEN: Int = 1024
+  private val MAX_ARR_SIZE: Int = 128
+  private val MAX_MAP_SIZE: Int = 128
+
+  /**
+   * Helper function for constructing a biased random number generator 
which returns "interesting"
+   * values with a higher probability.
+   */
+  private def randomNumeric[T](
+  rand: Random,
+  uniformRand: Random => T,
+  interestingValues: Seq[T]): Some[() => T] = {
+val f = () => {
+  if (rand.nextFloat() <= PROBABILITY_OF_INTERESTING_VALUE) {
+interestingValues(rand.nextInt(interestingValues.length))
+  } else {
+uniformRand(rand)
+  }
+}
+Some(f)
+  }
+
+  /**
+   * Returns a function which generates random values for the given 
[[DataType]], or `None` if no
+   * random data generator is defined for that data type. The generated 
values will use an external
+   * representation of the data type; for example, the random generator 
for [[DateType]] will return
+   * instances of [[java.sql.Date]] and the generator for [[StructType]] 
will return a
+   * [[org.apache.spark.Row]].
+   *
+   * @param dataType the type to generate values for
+   * @param nullable whether null values should be generated
+   * @param seed an optional seed for the random number generator
+   * @return a function which can be called to generate random values.
+   */
+  def forType(
+  dataType: DataType,
+  nullable: Boolean = true,
+  seed: Option[Long] = None): Option[() => Any] = {
+val rand = new Random()
+seed.foreach(rand.setSeed)
+
+val valueGenerator: Option[() => Any] = dataType match {
+  case StringType => Some(() => 
rand.nextString(rand.nextInt(MAX_STR_LEN)))
+  case BinaryType => Some(() => {
+val arr = new Array[Byte](rand.nextInt(MAX_STR_LEN))
+rand.nextBytes(arr)
+arr
+  })
+  case BooleanType => Some(() => rand.nextBoolean())
+  case DateType => Some(() => new 
java.sql.Date(rand.nextInt(Int.MaxValue)))
+  case DoubleType => randomNumeric[Double](
+rand, r => longBitsToDouble(r.nextLong()), Seq(Double.MinValue, 
Double.MinPositiveValue,
+  Double.MaxValue, Double.PositiveInfinity, 
Double.NegativeInfinity, Double.NaN, 0.0))
+  case FloatType => randomNumeric[Float](
+rand, r => intBitsToFloat(r.nextInt()), Seq(Float.MinValue, 
Float.MinPositiveValue,
--- End diff --

See comment at 
https://github.com/apache/spark/pull/7176#discussion_r33841371



[GitHub] spark pull request: [SPARK-8777] [SQL] [DO NOT MERGE] Add random d...

2015-07-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7176#discussion_r33841371
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.lang.Double.longBitsToDouble
+import java.lang.Float.intBitsToFloat
+
+import scala.util.Random
+
+import org.apache.spark.sql.types._
+
+/**
+ * Random data generators for Spark SQL DataTypes. These generators do not 
generate uniformly random
+ * values; instead, they're biased to return "interesting" values (such as 
maximum / minimum values)
+ * with higher probability.
+ */
+object RandomDataGenerator {
+
+  /**
+   * The conditional probability of a non-null value being drawn from a 
set of "interesting" values
+   * instead of being chosen uniformly at random.
+   */
+  private val PROBABILITY_OF_INTERESTING_VALUE: Float = 0.5f
+
+  /**
+   * The probability of the generated value being null
+   */
+  private val PROBABILITY_OF_NULL: Float = 0.1f
+
+  private val MAX_STR_LEN: Int = 1024
+  private val MAX_ARR_SIZE: Int = 128
+  private val MAX_MAP_SIZE: Int = 128
+
+  /**
+   * Helper function for constructing a biased random number generator 
which returns "interesting"
+   * values with a higher probability.
+   */
+  private def randomNumeric[T](
+  rand: Random,
+  uniformRand: Random => T,
+  interestingValues: Seq[T]): Some[() => T] = {
+val f = () => {
+  if (rand.nextFloat() <= PROBABILITY_OF_INTERESTING_VALUE) {
+interestingValues(rand.nextInt(interestingValues.length))
+  } else {
+uniformRand(rand)
+  }
+}
+Some(f)
+  }
+
+  /**
+   * Returns a function which generates random values for the given 
[[DataType]], or `None` if no
+   * random data generator is defined for that data type. The generated 
values will use an external
+   * representation of the data type; for example, the random generator 
for [[DateType]] will return
+   * instances of [[java.sql.Date]] and the generator for [[StructType]] 
will return a
+   * [[org.apache.spark.Row]].
+   *
+   * @param dataType the type to generate values for
+   * @param nullable whether null values should be generated
+   * @param seed an optional seed for the random number generator
+   * @return a function which can be called to generate random values.
+   */
+  def forType(
+  dataType: DataType,
+  nullable: Boolean = true,
+  seed: Option[Long] = None): Option[() => Any] = {
+val rand = new Random()
+seed.foreach(rand.setSeed)
+
+val valueGenerator: Option[() => Any] = dataType match {
+  case StringType => Some(() => 
rand.nextString(rand.nextInt(MAX_STR_LEN)))
+  case BinaryType => Some(() => {
+val arr = new Array[Byte](rand.nextInt(MAX_STR_LEN))
+rand.nextBytes(arr)
+arr
+  })
+  case BooleanType => Some(() => rand.nextBoolean())
+  case DateType => Some(() => new 
java.sql.Date(rand.nextInt(Int.MaxValue)))
+  case DoubleType => randomNumeric[Double](
+rand, r => longBitsToDouble(r.nextLong()), Seq(Double.MinValue, 
Double.MinPositiveValue,
--- End diff --

The goal here was to produce doubles that are uniformly distributed over
the range of possible double values (`rand.nextDouble()` just returns doubles
in the range 0.0 to 1.0). Empirically, the number of NaNs produced by this
seems to be quite small, and a back-of-the-envelope calculation backs this up;
I think that

((0x7fffffffffffffff - 0x7ff0000000000001) + (0xffffffffffffffff - 0xfff0000000000001)) / 2^64

works out to roughly a 0.05% chance of producing a NaN through this
method (see
https://www.wolframalpha.com/input/?i=%28%280
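
A quick standalone check of that estimate (my own sketch, not code from the PR):

```scala
import java.lang.Double.longBitsToDouble
import scala.util.Random

// Decode random 64-bit patterns as doubles and count how many are NaN;
// the fraction should land near 2^53 / 2^64, i.e. about 0.049%.
val rand = new Random(42)
val n = 1000000
val nanCount = Iterator.fill(n)(longBitsToDouble(rand.nextLong())).count(_.isNaN)
println(f"NaN fraction: ${nanCount.toDouble / n}%.5f")
```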

[GitHub] spark pull request: [SPARK-8223][SPARK-8224][SQL] shift left and s...

2015-07-02 Thread tarekauel
Github user tarekauel commented on a diff in the pull request:

https://github.com/apache/spark/pull/7178#discussion_r33841320
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala
 ---
@@ -351,6 +351,104 @@ case class Pow(left: Expression, right: Expression)
   }
 }
 
+case class ShiftLeft(left: Expression, right: Expression) extends 
BinaryExpression {
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+(left.dataType, right.dataType) match {
+  case (NullType, _) | (_, NullType) => return 
TypeCheckResult.TypeCheckSuccess
+  case (_, IntegerType) => left.dataType match {
+case LongType | IntegerType | ShortType | ByteType =>
+  return TypeCheckResult.TypeCheckSuccess
+case _ => // failed
+  }
+  case _ => // failed
+}
+TypeCheckResult.TypeCheckFailure(
+s"ShiftLeft expects long, integer, short or byte value as first 
argument and an " +
+  s"integer value as second argument, not (${left.dataType}, 
${right.dataType})")
+  }
+
+  override def eval(input: InternalRow): Any = {
+val valueLeft = left.eval(input)
+if (valueLeft != null) {
+  val valueRight = right.eval(input)
+  if (valueRight != null) {
+valueLeft match {
+  case l: Long => l << valueRight.asInstanceOf[Integer]
+  case i: Integer => i << valueRight.asInstanceOf[Integer]
+  case s: Short => s << valueRight.asInstanceOf[Integer]
+  case b: Byte => b << valueRight.asInstanceOf[Integer]
+}
+  } else {
+null
+  }
+} else {
+  null
+}
+  }
+
+  override def dataType: DataType = {
+left.dataType match {
+  case LongType => LongType
+  case IntegerType | ShortType | ByteType => IntegerType
+  case _ => NullType
+}
+  }
+
+  override protected def genCode(ctx: CodeGenContext, ev: 
GeneratedExpressionCode): String = {
+nullSafeCodeGen(ctx, ev, (result, left, right) => s"$result = $left << 
$right;")
+  }
+}
+
+case class ShiftRight(left: Expression, right: Expression) extends 
BinaryExpression {
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+(left.dataType, right.dataType) match {
+  case (NullType, _) | (_, NullType) => return 
TypeCheckResult.TypeCheckSuccess
+  case (_, IntegerType) => left.dataType match {
+case LongType | IntegerType | ShortType | ByteType =>
+  return TypeCheckResult.TypeCheckSuccess
+case _ => // failed
+  }
+  case _ => // failed
+}
+TypeCheckResult.TypeCheckFailure(
+  s"ShiftRight expects long, integer, short or byte value as first 
argument and an " +
+s"integer value as second argument, not (${left.dataType}, 
${right.dataType})")
+  }
+
+  override def eval(input: InternalRow): Any = {
+val valueLeft = left.eval(input)
+if (valueLeft != null) {
+  val valueRight = right.eval(input)
+  if (valueRight != null) {
+valueLeft match {
+  case l: Long => l >> valueRight.asInstanceOf[Integer]
+  case i: Integer => i >> valueRight.asInstanceOf[Integer]
+  case s: Short => s >> valueRight.asInstanceOf[Integer]
+  case b: Byte => b >> valueRight.asInstanceOf[Integer]
--- End diff --

@chenghao-intel I investigated it a little bit; see the gist:
https://gist.github.com/tarekauel/6994983b83a51668c5dc. The interesting part
is that the match on the value is even faster. Did I do something wrong?
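
For reference, a minimal standalone sketch of that kind of micro-benchmark (a hypothetical harness, not the linked gist):

```scala
import scala.util.Random

object ShiftBenchSketch {
  // Match on the boxed runtime value, as in the eval implementation above.
  def shiftByValueMatch(v: Any, bits: Int): Any = v match {
    case l: Long  => l << bits
    case i: Int   => i << bits
    case s: Short => s << bits
    case b: Byte  => b << bits
  }

  def main(args: Array[String]): Unit = {
    val values: Array[Any] = Array.fill[Any](1000000)(Random.nextInt())
    val start = System.nanoTime()
    var i = 0
    while (i < values.length) { shiftByValueMatch(values(i), 2); i += 1 }
    println(s"match on value: ${(System.nanoTime() - start) / 1e6} ms")
  }
}
```

(Timings from a loop like this are only indicative; JIT warm-up dominates short runs.)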





[GitHub] spark pull request: [SPARK-8279][SQL]Add math function round

2015-07-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/6938#discussion_r33841264
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/math.scala
 ---
@@ -526,3 +529,171 @@ case class Logarithm(left: Expression, right: 
Expression)
 """
   }
 }
+
+case class Round(child: Expression, scale: Expression) extends Expression {
+
+  def this(child: Expression) = {
+this(child, Literal(0))
+  }
+
+  override def children: Seq[Expression] = Seq(child, scale)
+
+  override def nullable: Boolean = true
+
+  override def foldable: Boolean = child.foldable
+
+  override lazy val dataType: DataType = child.dataType match {
+  case StringType | BinaryType => DoubleType
+  case DecimalType.Fixed(p, s) => DecimalType(p, _scale)
+  case t => t
+}
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+child.dataType match {
+  case _: NumericType | NullType | BinaryType | StringType => // 
satisfy requirement
--- End diff --

I think `round` only makes sense for numeric types, and we should support
`BinaryType` and `StringType` by adding type cast rules in `HiveTypeCoercion`
or by using `ExpectsInputTypes`.
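
As a plain-Scala illustration of the numeric semantics under discussion (a sketch of round-half-up, not the catalyst implementation):

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// round(x, scale): half-up rounding on a BigDecimal view of the double.
def round(x: Double, scale: Int): Double =
  JBigDecimal.valueOf(x).setScale(scale, RoundingMode.HALF_UP).doubleValue()

println(round(3.14159, 2)) // 3.14
println(round(2.5, 0))     // 3.0 (half-up, not banker's rounding)
```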





[GitHub] spark pull request: [SPARK-8341] Significant selector feature tran...

2015-07-02 Thread catap
Github user catap commented on a diff in the pull request:

https://github.com/apache/spark/pull/6795#discussion_r33841249
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/SignificantSelector.scala 
---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import scala.collection.mutable
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Model to extract significant indices from vector.
+ *
+ * Significant indices is vector's index that has different value for 
different vectors.
+ *
+ * For example, when you use HashingTF they create big sparse vector,
+ * and this code convert to smallest vector that don't include same values 
indices for all vectors.
+ *
+ * @param indices array of significant indices.
+ */
+@Experimental
+class SignificantSelectorModel(val indices: Array[Int]) extends 
VectorTransformer {
+
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  override def transform(vector: Vector): Vector = vector match {
+case DenseVector(vs) =>
+  Vectors.dense(indices.map(vs))
+
+case SparseVector(s, ids, vs) =>
+  var sv_idx = 0
+  var new_idx = 0
+  val elements = new mutable.ListBuffer[(Int, Double)]()
+  
+  for (idx <- indices) {
+while (sv_idx < ids.length && ids(sv_idx) < idx) {
+  sv_idx += 1
+}
+if (sv_idx < ids.length && ids(sv_idx) == idx) {
+  elements += ((new_idx, vs(sv_idx)))
+  sv_idx += 1
+}
+new_idx += 1
+  }
+  
+  Vectors.sparse(indices.length, elements)
+
+case v =>
+  throw new IllegalArgumentException("Don't support vector type " + 
v.getClass)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Specialized model for equivalent vectors
+ */
+@Experimental
+class SignificantSelectorEmptyModel extends 
SignificantSelectorModel(Array[Int]()) {
+  
+  val empty_vector = Vectors.dense(Array[Double]())
+  
+  override def transform(vector: Vector): Vector = empty_vector
+}
+
+/**
+ * :: Experimental ::
+ * Create Significant selector.
+ */
+@Experimental
+class SignificantSelector() {
+
+  /**
+   * Returns a significant vector indices selector.
+   *
+   * @param sources an `RDD[Vector]` containing the vectors.
+   */
+  def fit(sources: RDD[Vector]): SignificantSelectorModel = {
+val sources_count = sources.count()
+val significant_indices = sources.flatMap {
+case DenseVector(vs) =>
+  vs.zipWithIndex
+case SparseVector(_, ids, vs) =>
+  vs.zip(ids)
+case v =>
+  throw new IllegalArgumentException("Don't support vector type " 
+ v.getClass)
+  }
+  .map(e => (e.swap, 1))
+  .reduceByKey(_ + _)
+  .map { case ((idx, value), count) => (idx, (value, count))}
+  .groupByKey()
+  .mapValues { e =>
+val values = e.groupBy(_._1)
+val sum = e.map(_._2).sum
+
+values.size + (if (sum == sources_count || values.contains(0.0)) 0 
else 1)
+  }
+  .filter(_._2 > 1)
+  .keys
+  .collect()
+  .sorted
+
+if (significant_indices.nonEmpty)
+  new SignificantSelectorModel(significant_indices)
+else
+  new SignificantSelectorEmptyModel()
--- End diff --

No, because I can't create an empty sparse vector, only a dense one.

There are two options here:
 - create a different model for empty indices.
 - check whether the indices are empty on every transform.

I think the first way is better (see the sketch below).
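
A toy, self-contained sketch of the two options (hypothetical types, not the MLlib API):

```scala
trait Transformer { def transform(v: Array[Double]): Array[Double] }

class SelectorModel(val indices: Array[Int]) extends Transformer {
  def transform(v: Array[Double]): Array[Double] = indices.map(i => v(i))
}

// Option 1: a dedicated model for the empty-indices case.
class EmptySelectorModel extends SelectorModel(Array.empty) {
  override def transform(v: Array[Double]): Array[Double] = Array.empty
}

// Option 2 would instead branch on indices.isEmpty inside a single
// SelectorModel.transform, paying the check on every call.
```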

[GitHub] spark pull request: [SPARK-8341] Significant selector feature tran...

2015-07-02 Thread catap
Github user catap commented on a diff in the pull request:

https://github.com/apache/spark/pull/6795#discussion_r33841180
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/SignificantSelector.scala 
---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import scala.collection.mutable
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, 
Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Model to extract significant indices from vector.
+ *
+ * Significant indices is vector's index that has different value for 
different vectors.
+ *
+ * For example, when you use HashingTF they create big sparse vector,
+ * and this code convert to smallest vector that don't include same values 
indices for all vectors.
+ *
+ * @param indices array of significant indices.
+ */
+@Experimental
+class SignificantSelectorModel(val indices: Array[Int]) extends 
VectorTransformer {
+
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  override def transform(vector: Vector): Vector = vector match {
+case DenseVector(vs) =>
+  Vectors.dense(indices.map(vs))
+
+case SparseVector(s, ids, vs) =>
+  var sv_idx = 0
+  var new_idx = 0
+  val elements = new mutable.ListBuffer[(Int, Double)]()
+  
+  for (idx <- indices) {
+while (sv_idx < ids.length && ids(sv_idx) < idx) {
+  sv_idx += 1
+}
+if (sv_idx < ids.length && ids(sv_idx) == idx) {
+  elements += ((new_idx, vs(sv_idx)))
+  sv_idx += 1
+}
+new_idx += 1
+  }
+  
+  Vectors.sparse(indices.length, elements)
+
+case v =>
+  throw new IllegalArgumentException("Don't support vector type " + 
v.getClass)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Specialized model for equivalent vectors
+ */
+@Experimental
+class SignificantSelectorEmptyModel extends 
SignificantSelectorModel(Array[Int]()) {
+  
+  val empty_vector = Vectors.dense(Array[Double]())
+  
+  override def transform(vector: Vector): Vector = empty_vector
+}
+
+/**
+ * :: Experimental ::
+ * Create Significant selector.
+ */
+@Experimental
+class SignificantSelector() {
+
+  /**
+   * Returns a significant vector indices selector.
+   *
+   * @param sources an `RDD[Vector]` containing the vectors.
+   */
+  def fit(sources: RDD[Vector]): SignificantSelectorModel = {
+val sources_count = sources.count()
+val significant_indices = sources.flatMap {
+case DenseVector(vs) =>
+  vs.zipWithIndex
+case SparseVector(_, ids, vs) =>
+  vs.zip(ids)
+case v =>
+  throw new IllegalArgumentException("Don't support vector type " 
+ v.getClass)
+  }
+  .map(e => (e.swap, 1))
+  .reduceByKey(_ + _)
+  .map { case ((idx, value), count) => (idx, (value, count))}
+  .groupByKey()
+  .mapValues { e =>
+val values = e.groupBy(_._1)
+val sum = e.map(_._2).sum
+
+values.size + (if (sum == sources_count || values.contains(0.0)) 0 
else 1)
--- End diff --

Oh, this is a hack for the case when the RDD includes both dense and sparse
vectors.

A sparse vector stores no zero elements (`0d`), so for a sparse vector
`values.size` only counts the distinct non-zero values.

If the RDD contains both a sparse and a dense vector, and the dense one has a
zero element at an index where the sparse vector stores nothing, the selector
would treat them as two different values even though they are not different
in fact.

For example, consider the following code:
```scala
val vectors = sc.paralleliz

[GitHub] spark pull request: [SPARK-8226][SQL]Add function shiftrightunsign...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7035#issuecomment-118239379
  
  [Test build #36476 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36476/consoleFull)
 for   PR 7035 at commit 
[`3e9f5ae`](https://github.com/apache/spark/commit/3e9f5aef20208c7e20e024c20f16745b12f0bea1).





[GitHub] spark pull request: [SPARK-8226][SQL]Add function shiftrightunsign...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7035#issuecomment-118238647
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7099#issuecomment-118238737
  
  [Test build #36475 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36475/console)
 for   PR 7099 at commit 
[`509ae36`](https://github.com/apache/spark/commit/509ae360a539ca8e3d2906c8951d017ef2fc0627).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8226][SQL]Add function shiftrightunsign...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7035#issuecomment-118238687
  
Merged build started.





[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7099#issuecomment-118238745
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8538][SPARK-8539][ML] Linear Regression...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7099#issuecomment-118238286
  
  [Test build #36475 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36475/consoleFull)
 for   PR 7099 at commit 
[`509ae36`](https://github.com/apache/spark/commit/509ae360a539ca8e3d2906c8951d017ef2fc0627).





[GitHub] spark pull request: [SPARK-5016][MLLib] Distribute GMM mixture com...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/7166#issuecomment-118238218
  
I did some [perf 
testing](https://gist.github.com/feynmanliang/70d79c23dffc828939ec) and it 
shows that distributing the Gaussians does yield a significant improvement in 
performance when the number of clusters and the dimensionality of the data are
sufficiently large (>30 dimensions, >10 clusters).

In particular, the "typical" use case of 40 dimensions and 10k clusters 
gains about 15 seconds in runtime when distributing the Gaussians.




