date:20150707

[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7263#issuecomment-119461156
  
  [Test build #36760 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36760/console)
 for   PR 7263 at commit 
[`4dfd418`](https://github.com/apache/spark/commit/4dfd41865f3722ff8f0dd3fe8abb05214859f418).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119460708
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119460686
  
  [Test build #36761 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36761/console)
 for   PR 7279 at commit 
[`d980757`](https://github.com/apache/spark/commit/d98075785c60e2baa50abebda8c500da4d2c09f0).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread rxin

Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7226#issuecomment-119460060
  
looks good otherwise.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119460146
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119460116
  
  [Test build #36758 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36758/console)
 for   PR 6743 at commit 
[`27cfbac`](https://github.com/apache/spark/commit/27cfbac55f276bb3d11cdd6a417f541aa81524a9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7226#discussion_r34120043
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/Interval.scala ---
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types
+
+
+/**
+ * The internal representation of interval type.
+ */
+case class Interval(months: Int, microseconds: Long) extends Serializable
--- End diff --

actually can you make this a java class and move it to 
unsafe/src/main/java/org/apache/spark/unsafe/types ?

Just leave the fields public so we can easily access them in codegen.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34119826
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala 
---
@@ -0,0 +1,434 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteOrder
+
+import scala.collection.JavaConversions._
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.parquet.column.Dictionary
+import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, 
PrimitiveConverter}
+import org.apache.parquet.schema.Type.Repetition
+import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A [[ParentContainerUpdater]] is used by a Parquet converter to set 
converted values to some
+ * corresponding parent container. For example, a converter for a 
`StructType` field may set
+ * converted values to a [[MutableRow]]; or a converter for array elements 
may append converted
+ * values to an [[ArrayBuffer]].
+ */
+private[parquet] trait ParentContainerUpdater {
+  def set(value: Any): Unit = ()
+  def setBoolean(value: Boolean): Unit = set(value)
+  def setByte(value: Byte): Unit = set(value)
+  def setShort(value: Short): Unit = set(value)
+  def setInt(value: Int): Unit = set(value)
+  def setLong(value: Long): Unit = set(value)
+  def setFloat(value: Float): Unit = set(value)
+  def setDouble(value: Double): Unit = set(value)
+}
+
+/** A no-op updater used for root converter (who doesn't have a parent). */
+private[parquet] object NoopUpdater extends ParentContainerUpdater
+
+/**
+ * A [[CatalystRowConverter]] is used to convert Parquet "structs" into 
Spark SQL [[Row]]s.  Since
+ * any Parquet record is also a struct, this converter can also be used as 
root converter.
+ *
+ * When used as a root converter, [[NoopUpdater]] should be used since 
root converters don't have
+ * any "parent" container.
--- End diff --

Will add more explanations here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119458590
  
  [Test build #36761 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36761/consoleFull)
 for   PR 7279 at commit 
[`d980757`](https://github.com/apache/spark/commit/d98075785c60e2baa50abebda8c500da4d2c09f0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34119723
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
+
+  /**
+   * Calculate sequential patterns:
+   * a) find and collect length-one patterns
+   * b) for each length-one patterns and each sequence,
+   *emit (pattern (prefix), suffix sequence) as key-value pairs
+   * c) group by key and then map value iterator to array
+   * d) local PrefixSpan on each prefix
+   * @return sequential patterns
+   */
+  def run(): RDD[(Seq[Int], Int)] = {
+val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne()
+val repartitionedRdd = repartitionSequences(prefixAndCandidates)
+val nextPatterns = getPatternsInLocal(repartitionedRdd)
+val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ 
nextPatterns
+allPatterns
+  }
+
+  /**
+   * Find the patterns that it's length is one
+   * @return length-one patterns and projection table
+   */
+  private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], 
Array[Int])]) = {
+val patternsOneLength = sequences
+  .map(_.distinct)
+  .flatMap(p => p)
+  .map((_, 1))
+  .reduceByKey(_ + _)
+
+val removedElements: Array[Int] = patternsOneLength
+  .filter(_._2 < minSupport)
+  .map(_._1)
+  .collect()
+
+val savedElements = patternsOneLength.filter(_._2 >= minSupport)
+
+val savedElementsArray = savedElements
+  .map(_._1)
+  .collect()
+
+val filteredSequences =
+  if (removedElements.isEmpty) {
+sequences
+  } else {
+sequences.map { p =>
+  p.filter { x => !removedElements.contains(x) }
+}
+  }
+
+val prefixAndCandidates = filteredSequences.flatMap { x =>
+  savedElementsArray.map { y =>
+val sub = getSuffix(y, x)
+(Seq(y), sub)
+  }
+}
+
+(savedElements, prefixAndCandidates)
+  }
+
+  /**
+   * Re-partition the RDD data, to get better balance and performance.
+   * @param data patterns and projected sequences data before re-partition
+   * @return patterns and projected sequences data after re-partition
+   */
+  private def repartitionSequences(
+  data: RDD[(Seq[Int], Array[Int])]): RDD[(Seq[Int], 
Array[Array[Int]])] = {
+val dataRemovedEmptyLine = data.filter(x => x._2.nonEmpty)
+val dataMerged = dataRemovedEmptyLine
+  .groupByKey()
+  .map(x => (x._1, x._2.toArray))
--- End diff --

Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at inf

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34119658
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
+
+  /**
+   * Calculate sequential patterns:
+   * a) find and collect length-one patterns
+   * b) for each length-one patterns and each sequence,
+   *emit (pattern (prefix), suffix sequence) as key-value pairs
+   * c) group by key and then map value iterator to array
+   * d) local PrefixSpan on each prefix
+   * @return sequential patterns
+   */
+  def run(): RDD[(Seq[Int], Int)] = {
+val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne()
+val repartitionedRdd = repartitionSequences(prefixAndCandidates)
+val nextPatterns = getPatternsInLocal(repartitionedRdd)
+val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ 
nextPatterns
+allPatterns
+  }
+
+  /**
+   * Find the patterns that it's length is one
+   * @return length-one patterns and projection table
+   */
+  private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], 
Array[Int])]) = {
+val patternsOneLength = sequences
+  .map(_.distinct)
+  .flatMap(p => p)
+  .map((_, 1))
+  .reduceByKey(_ + _)
+
+val removedElements: Array[Int] = patternsOneLength
+  .filter(_._2 < minSupport)
+  .map(_._1)
+  .collect()
+
+val savedElements = patternsOneLength.filter(_._2 >= minSupport)
+
+val savedElementsArray = savedElements
+  .map(_._1)
+  .collect()
+
+val filteredSequences =
+  if (removedElements.isEmpty) {
+sequences
+  } else {
+sequences.map { p =>
+  p.filter { x => !removedElements.contains(x) }
+}
+  }
+
+val prefixAndCandidates = filteredSequences.flatMap { x =>
+  savedElementsArray.map { y =>
+val sub = getSuffix(y, x)
+(Seq(y), sub)
+  }
+}
+
+(savedElements, prefixAndCandidates)
+  }
+
+  /**
+   * Re-partition the RDD data, to get better balance and performance.
+   * @param data patterns and projected sequences data before re-partition
+   * @return patterns and projected sequences data after re-partition
+   */
+  private def repartitionSequences(
+  data: RDD[(Seq[Int], Array[Int])]): RDD[(Seq[Int], 
Array[Array[Int]])] = {
+val dataRemovedEmptyLine = data.filter(x => x._2.nonEmpty)
--- End diff --

Fixed. Moved the filter to L95. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119458150
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119458113
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34119496
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
+
+  /**
+   * Calculate sequential patterns:
+   * a) find and collect length-one patterns
+   * b) for each length-one patterns and each sequence,
+   *emit (pattern (prefix), suffix sequence) as key-value pairs
+   * c) group by key and then map value iterator to array
+   * d) local PrefixSpan on each prefix
+   * @return sequential patterns
+   */
+  def run(): RDD[(Seq[Int], Int)] = {
+val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne()
+val repartitionedRdd = repartitionSequences(prefixAndCandidates)
+val nextPatterns = getPatternsInLocal(repartitionedRdd)
+val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ 
nextPatterns
+allPatterns
+  }
+
+  /**
+   * Find the patterns that it's length is one
+   * @return length-one patterns and projection table
+   */
+  private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], 
Array[Int])]) = {
+val patternsOneLength = sequences
+  .map(_.distinct)
+  .flatMap(p => p)
+  .map((_, 1))
+  .reduceByKey(_ + _)
+
+val removedElements: Array[Int] = patternsOneLength
+  .filter(_._2 < minSupport)
+  .map(_._1)
+  .collect()
+
+val savedElements = patternsOneLength.filter(_._2 >= minSupport)
+
+val savedElementsArray = savedElements
+  .map(_._1)
+  .collect()
+
+val filteredSequences =
+  if (removedElements.isEmpty) {
+sequences
+  } else {
+sequences.map { p =>
+  p.filter { x => !removedElements.contains(x) }
+}
+  }
+
+val prefixAndCandidates = filteredSequences.flatMap { x =>
+  savedElementsArray.map { y =>
+val sub = getSuffix(y, x)
+(Seq(y), sub)
+  }
+}
+
+(savedElements, prefixAndCandidates)
+  }
+
+  /**
+   * Re-partition the RDD data, to get better balance and performance.
+   * @param data patterns and projected sequences data before re-partition
+   * @return patterns and projected sequences data after re-partition
--- End diff --

Changed the function name to makePrefixProjectedDatabases. It's better than 
old one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spar

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7226#discussion_r34119333
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala ---
@@ -304,6 +304,9 @@ private[sql] object ResolvedDataSource {
   mode: SaveMode,
   options: Map[String, String],
   data: DataFrame): ResolvedDataSource = {
+if (data.schema.map(_.dataType).exists(_.isInstanceOf[IntervalType])) {
+  sys.error("Cannot save interval data type into external storage.")
--- End diff --

when is this called? if it is called during analysis, it'd make more sense 
to throw AnalysisException, since that has better error reporting in Python.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34119347
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
+
+  /**
+   * Calculate sequential patterns:
+   * a) find and collect length-one patterns
+   * b) for each length-one patterns and each sequence,
+   *emit (pattern (prefix), suffix sequence) as key-value pairs
+   * c) group by key and then map value iterator to array
+   * d) local PrefixSpan on each prefix
+   * @return sequential patterns
+   */
+  def run(): RDD[(Seq[Int], Int)] = {
+val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne()
+val repartitionedRdd = repartitionSequences(prefixAndCandidates)
+val nextPatterns = getPatternsInLocal(repartitionedRdd)
+val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ 
nextPatterns
+allPatterns
+  }
+
+  /**
+   * Find the patterns that it's length is one
+   * @return length-one patterns and projection table
+   */
+  private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], 
Array[Int])]) = {
+val patternsOneLength = sequences
+  .map(_.distinct)
+  .flatMap(p => p)
+  .map((_, 1))
+  .reduceByKey(_ + _)
+
+val removedElements: Array[Int] = patternsOneLength
+  .filter(_._2 < minSupport)
+  .map(_._1)
+  .collect()
+
+val savedElements = patternsOneLength.filter(_._2 >= minSupport)
--- End diff --

Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34119217
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
+
+  /**
+   * Calculate sequential patterns:
+   * a) find and collect length-one patterns
+   * b) for each length-one patterns and each sequence,
+   *emit (pattern (prefix), suffix sequence) as key-value pairs
+   * c) group by key and then map value iterator to array
+   * d) local PrefixSpan on each prefix
+   * @return sequential patterns
+   */
+  def run(): RDD[(Seq[Int], Int)] = {
+val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne()
+val repartitionedRdd = repartitionSequences(prefixAndCandidates)
+val nextPatterns = getPatternsInLocal(repartitionedRdd)
+val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ 
nextPatterns
+allPatterns
+  }
+
+  /**
+   * Find the patterns that it's length is one
+   * @return length-one patterns and projection table
+   */
+  private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], 
Array[Int])]) = {
+val patternsOneLength = sequences
+  .map(_.distinct)
+  .flatMap(p => p)
+  .map((_, 1))
+  .reduceByKey(_ + _)
+
+val removedElements: Array[Int] = patternsOneLength
--- End diff --

Fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34119143
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
+
+  /**
+   * Calculate sequential patterns:
+   * a) find and collect length-one patterns
+   * b) for each length-one patterns and each sequence,
+   *emit (pattern (prefix), suffix sequence) as key-value pairs
+   * c) group by key and then map value iterator to array
+   * d) local PrefixSpan on each prefix
+   * @return sequential patterns
+   */
+  def run(): RDD[(Seq[Int], Int)] = {
+val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne()
+val repartitionedRdd = repartitionSequences(prefixAndCandidates)
+val nextPatterns = getPatternsInLocal(repartitionedRdd)
+val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ 
nextPatterns
+allPatterns
+  }
+
+  /**
+   * Find the patterns that it's length is one
+   * @return length-one patterns and projection table
+   */
+  private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], 
Array[Int])]) = {
+val patternsOneLength = sequences
+  .map(_.distinct)
+  .flatMap(p => p)
+  .map((_, 1))
+  .reduceByKey(_ + _)
+
--- End diff --

Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34119063
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
+
+  /**
+   * Calculate sequential patterns:
+   * a) find and collect length-one patterns
+   * b) for each length-one patterns and each sequence,
+   *emit (pattern (prefix), suffix sequence) as key-value pairs
+   * c) group by key and then map value iterator to array
+   * d) local PrefixSpan on each prefix
+   * @return sequential patterns
+   */
+  def run(): RDD[(Seq[Int], Int)] = {
+val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne()
+val repartitionedRdd = repartitionSequences(prefixAndCandidates)
+val nextPatterns = getPatternsInLocal(repartitionedRdd)
+val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ 
nextPatterns
+allPatterns
+  }
+
+  /**
+   * Find the patterns that it's length is one
+   * @return length-one patterns and projection table
+   */
+  private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], 
Array[Int])]) = {
+val patternsOneLength = sequences
+  .map(_.distinct)
+  .flatMap(p => p)
--- End diff --

Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34118963
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
+
+  /**
+   * Calculate sequential patterns:
+   * a) find and collect length-one patterns
+   * b) for each length-one patterns and each sequence,
+   *emit (pattern (prefix), suffix sequence) as key-value pairs
+   * c) group by key and then map value iterator to array
+   * d) local PrefixSpan on each prefix
+   * @return sequential patterns
+   */
+  def run(): RDD[(Seq[Int], Int)] = {
+val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne()
+val repartitionedRdd = repartitionSequences(prefixAndCandidates)
+val nextPatterns = getPatternsInLocal(repartitionedRdd)
+val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ 
nextPatterns
+allPatterns
+  }
+
+  /**
+   * Find the patterns that it's length is one
+   * @return length-one patterns and projection table
+   */
+  private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], 
Array[Int])]) = {
+val patternsOneLength = sequences
--- End diff --

Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34118908
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
+
+  /**
+   * Calculate sequential patterns:
+   * a) find and collect length-one patterns
+   * b) for each length-one patterns and each sequence,
+   *emit (pattern (prefix), suffix sequence) as key-value pairs
+   * c) group by key and then map value iterator to array
+   * d) local PrefixSpan on each prefix
+   * @return sequential patterns
+   */
+  def run(): RDD[(Seq[Int], Int)] = {
--- End diff --

Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7263#issuecomment-119451883
  
  [Test build #36760 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36760/consoleFull)
 for   PR 7263 at commit 
[`4dfd418`](https://github.com/apache/spark/commit/4dfd41865f3722ff8f0dd3fe8abb05214859f418).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34118848
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
+val maxPatternLength: Int = 50) extends java.io.Serializable {
--- End diff --

Fixed, changed maxPatternLength to 10, changed minSupport to 0.1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8883][SQL]Remove the OverrideFunctionRe...

2015-07-07 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7260#discussion_r34118817
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala 
---
@@ -76,7 +76,7 @@ private[hive] class HiveFunctionRegistry(underlying: 
analysis.FunctionRegistry)
   }
 
   override def registerFunction(name: String, builder: FunctionBuilder): 
Unit =
-throw new UnsupportedOperationException
+underlying.registerFunction(name, builder)
--- End diff --

can you add a unit test in hive to make sure function registration works.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34118782
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
+val minSupport: Int = 2,
--- End diff --

Changed these input parameters to private var s, and add getters and 
setters.

But I have a question, why don't we use the public input parameters ?
The following code, the class A is smaller than class B, Why is class B 
code style better ?

class A (var x: Int = 12) {
  def f() = {
println(x)
  }
}

class B (private var x: Int) {
  def this() = this(12)
  def setX(x: Int): this.type = {
this.x=x
this
  }
  def f() = {
println(x)
  }
}

val a = new A()
a.f()
a.x = 15
a.f()

val b = new B()
b.f()
b.setX(15)
b.f()



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7263#issuecomment-119451559
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7263#issuecomment-119451544
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-7078] [SPARK-7079] [WIP] Binary process...

2015-07-07 Thread JoshRosen

Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/6444#discussion_r34118575
  
--- Diff: 
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java
 ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection.unsafe.sort;
+
+import java.io.*;
+
+import com.google.common.io.ByteStreams;
+
+import org.apache.spark.storage.BlockId;
+import org.apache.spark.storage.BlockManager;
+import org.apache.spark.unsafe.PlatformDependent;
+
+/**
+ * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that 
class for a description
+ * of the file format).
+ */
+final class UnsafeSorterSpillReader extends UnsafeSorterIterator {
+
+  private InputStream in;
+  private DataInputStream din;
+
+  // Variables that change with every record read:
+  private int recordLength;
+  private long keyPrefix;
+  private int numRecordsRemaining;
+
+  private final byte[] arr = new byte[1024 * 1024];  // TODO: tune this 
(maybe grow dynamically)?
+  private final Object baseObject = arr;
+  private final long baseOffset = PlatformDependent.BYTE_ARRAY_OFFSET;
+
+  public UnsafeSorterSpillReader(
+  BlockManager blockManager,
+  File file,
+  BlockId blockId) throws IOException {
+assert (file.length() > 0);
+final BufferedInputStream bs = new BufferedInputStream(new 
FileInputStream(file));
+this.in = blockManager.wrapForCompression(blockId, bs);
+this.din = new DataInputStream(this.in);
+numRecordsRemaining = din.readInt();
+  }
+
+  @Override
+  public boolean hasNext() {
+return (numRecordsRemaining > 0);
+  }
+
+  @Override
+  public void loadNext() throws IOException {
+recordLength = din.readInt();
+keyPrefix = din.readLong();
+ByteStreams.readFully(in, arr, 0, recordLength);
+numRecordsRemaining--;
+if (numRecordsRemaining == 0) {
+  in.close();
--- End diff --

Even that isn't foolproof, though, since wrapping of the iterator would 
break that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-7078] [SPARK-7079] [WIP] Binary process...

2015-07-07 Thread JoshRosen

Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/6444#discussion_r34118539
  
--- Diff: 
core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java
 ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection.unsafe.sort;
+
+import java.io.*;
+
+import com.google.common.io.ByteStreams;
+
+import org.apache.spark.storage.BlockId;
+import org.apache.spark.storage.BlockManager;
+import org.apache.spark.unsafe.PlatformDependent;
+
+/**
+ * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that 
class for a description
+ * of the file format).
+ */
+final class UnsafeSorterSpillReader extends UnsafeSorterIterator {
+
+  private InputStream in;
+  private DataInputStream din;
+
+  // Variables that change with every record read:
+  private int recordLength;
+  private long keyPrefix;
+  private int numRecordsRemaining;
+
+  private final byte[] arr = new byte[1024 * 1024];  // TODO: tune this 
(maybe grow dynamically)?
+  private final Object baseObject = arr;
+  private final long baseOffset = PlatformDependent.BYTE_ARRAY_OFFSET;
+
+  public UnsafeSorterSpillReader(
+  BlockManager blockManager,
+  File file,
+  BlockId blockId) throws IOException {
+assert (file.length() > 0);
+final BufferedInputStream bs = new BufferedInputStream(new 
FileInputStream(file));
+this.in = blockManager.wrapForCompression(blockId, bs);
+this.din = new DataInputStream(this.in);
+numRecordsRemaining = din.readInt();
+  }
+
+  @Override
+  public boolean hasNext() {
+return (numRecordsRemaining > 0);
+  }
+
+  @Override
+  public void loadNext() throws IOException {
+recordLength = din.readInt();
+keyPrefix = din.readLong();
+ByteStreams.readFully(in, arr, 0, recordLength);
+numRecordsRemaining--;
+if (numRecordsRemaining == 0) {
+  in.close();
--- End diff --

This reminds me: this kind of cleanup is actually a problem in general for 
this sort code, since we also need to free memory pages, etc. in the `order by 
xxx limit xxx` case.  Until we implement a proper internal iterator interface, 
a stopgap measure might be to define our own Scala iterator subclass which 
implements a close method and to update operators like `Limit` to call 
`close()` if it's available.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...

2015-07-07 Thread MechCoder

Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/7263#discussion_r34118525
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -146,6 +151,55 @@ class Word2VecModel private[ml] (
 wordVectors: feature.Word2VecModel)
   extends Model[Word2VecModel] with Word2VecBase {
 
+
+  /**
+   * Return a map of every word to its Vector representation.
+   */
+  val getVectors: Map[String, Array[Float]] = wordVectors.getVectors
+
+  /**
+   * Java friendly version of getVectors called by the python wrappers.
+   */
+  private[spark] val javaGetVectors: JList[Object] = {
+val (words, vectors) = getVectors.toSeq.unzip
+
+// There are problems with serialization of Arrays which makes this 
conversion
+// necessary.
+val vectorsList = vectors.toList.map(vec => 
Vectors.dense(vec.map(_.toDouble)))
+List(words.toList, 
vectorsList).map(_.asJava.asInstanceOf[Object]).asJava
+  }
+
+  /**
+   * Find "num" no. words closest to similarity to the given word.
+   * Returns a coupled array with the second element giving the similarity 
measure.
+   */
+  def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
--- End diff --

Hmm. I had a similar thought in mind. If I do that I could remove the 
"java-friendly" code below.

@mengxr thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...

2015-07-07 Thread MechCoder

Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/7263#discussion_r34118474
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -146,6 +151,55 @@ class Word2VecModel private[ml] (
 wordVectors: feature.Word2VecModel)
   extends Model[Word2VecModel] with Word2VecBase {
 
+
+  /**
+   * Return a map of every word to its Vector representation.
+   */
+  val getVectors: Map[String, Array[Float]] = wordVectors.getVectors
--- End diff --

Thanks for clarifying. I just thought I was being ignorant.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-7078] [SPARK-7079] [WIP] Binary process...

2015-07-07 Thread JoshRosen

Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/6444#discussion_r34118225
  
--- Diff: 
sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java
 ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import scala.collection.Iterator;
+import scala.math.Ordering;
+
+import com.google.common.annotations.VisibleForTesting;
+
+import org.apache.spark.SparkEnv;
+import org.apache.spark.TaskContext;
+import org.apache.spark.sql.AbstractScalaRowIterator;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.ObjectUnsafeColumnWriter;
+import org.apache.spark.sql.catalyst.expressions.UnsafeColumnWriter;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRowConverter;
+import org.apache.spark.sql.catalyst.util.ObjectPool;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.PlatformDependent;
+import org.apache.spark.util.collection.unsafe.sort.PrefixComparator;
+import org.apache.spark.util.collection.unsafe.sort.RecordComparator;
+import org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter;
+import org.apache.spark.util.collection.unsafe.sort.UnsafeSorterIterator;
+
+final class UnsafeExternalRowSorter {
+
+  /**
+   * If positive, forces records to be spilled to disk at the given 
frequency (measured in numbers
+   * of records). This is only intended to be used in tests.
+   */
+  private int testSpillFrequency = 0;
+
+  private long numRowsInserted = 0;
+
+  private final StructType schema;
+  private final UnsafeRowConverter rowConverter;
+  private final PrefixComputer prefixComputer;
+  private final ObjectPool objPool = new ObjectPool(128);
+  private final UnsafeExternalSorter sorter;
+  private byte[] rowConversionBuffer = new byte[1024 * 8];
+
+  public static abstract class PrefixComputer {
+abstract long computePrefix(InternalRow row);
+  }
+
+  public UnsafeExternalRowSorter(
+  StructType schema,
+  Ordering ordering,
+  PrefixComparator prefixComparator,
+  PrefixComputer prefixComputer) throws IOException {
+this.schema = schema;
+this.rowConverter = new UnsafeRowConverter(schema);
+this.prefixComputer = prefixComputer;
+final SparkEnv sparkEnv = SparkEnv.get();
+final TaskContext taskContext = TaskContext.get();
+sorter = new UnsafeExternalSorter(
+  taskContext.taskMemoryManager(),
+  sparkEnv.shuffleMemoryManager(),
+  sparkEnv.blockManager(),
+  taskContext,
+  new RowComparator(ordering, schema.length(), objPool),
+  prefixComparator,
+  4096,
+  sparkEnv.conf()
+);
+  }
+
+  /**
+   * Forces spills to occur every `frequency` records. Only for use in 
tests.
+   */
+  @VisibleForTesting
+  void setTestSpillFrequency(int frequency) {
+assert frequency > 0 : "Frequency must be positive";
+testSpillFrequency = frequency;
+  }
+
+  @VisibleForTesting
+  void insertRow(InternalRow row) throws IOException {
+final int sizeRequirement = rowConverter.getSizeRequirement(row);
+if (sizeRequirement > rowConversionBuffer.length) {
+  rowConversionBuffer = new byte[sizeRequirement];
+} else {
+  // Zero out the buffer that's used to hold the current row. This is 
necessary in order
+  // to ensure that rows hash properly, since garbage data from the 
previous row could
+  // otherwise end up as padding in this row. As a performance 
optimization, we only zero
+  // out the portion of the buffer that we'll actuall

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7226#issuecomment-119449087
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7226#issuecomment-119449043
  
  [Test build #36757 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36757/console)
 for   PR 7226 at commit 
[`ac348c3`](https://github.com/apache/spark/commit/ac348c3ff9007d1b377eeb3570dfad86e05732aa).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class Interval(months: Int, microseconds: Long) extends 
Serializable`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8389][Streaming][PySpark] Expose KafkaR...

2015-07-07 Thread jerryshao

Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/7185#issuecomment-119447855
  
Hi @amit-ramesh , what I mentioned about getting offsetRanges in 
`transform` function is something like this:

```python
dstream.transform(lambda r: r.offsetRanges())
```

Here `r.offsetRanges` is executed in driver side, if you have follow-up 
transformations inside `transfrom` function which need to use this 
offsetRanges, this offsetRanges will implicitly be sent to executor side. 
That's what I mean about.

Also:
>1. Events in an RDD partition are ordered by Kafka offset
>2. The index of an OffsetRanges object in the getOffsets() list 
corresponds to the partition index in the RDD.

This two assumptions are true as I know. so you could rely on this 
assumptions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34117781
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala 
---
@@ -0,0 +1,434 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteOrder
+
+import scala.collection.JavaConversions._
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.parquet.column.Dictionary
+import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, 
PrimitiveConverter}
+import org.apache.parquet.schema.Type.Repetition
+import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A [[ParentContainerUpdater]] is used by a Parquet converter to set 
converted values to some
+ * corresponding parent container. For example, a converter for a 
`StructType` field may set
+ * converted values to a [[MutableRow]]; or a converter for array elements 
may append converted
+ * values to an [[ArrayBuffer]].
+ */
+private[parquet] trait ParentContainerUpdater {
+  def set(value: Any): Unit = ()
+  def setBoolean(value: Boolean): Unit = set(value)
+  def setByte(value: Byte): Unit = set(value)
+  def setShort(value: Short): Unit = set(value)
+  def setInt(value: Int): Unit = set(value)
+  def setLong(value: Long): Unit = set(value)
+  def setFloat(value: Float): Unit = set(value)
+  def setDouble(value: Double): Unit = set(value)
--- End diff --

No, `DateType` is converted to `Int` and we just call `setInt`. 
`TimestampType` is similar.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34117743
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetThriftCompatibilitySuite.scala
 ---
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteBuffer
+import java.util.{List => JList, Map => JMap}
+
+import scala.collection.JavaConversions._
+
+import org.apache.hadoop.fs.Path
+import org.apache.parquet.hadoop.metadata.CompressionCodecName
+import org.apache.parquet.thrift.ThriftParquetWriter
+
+import org.apache.spark.sql.parquet.test.thrift.{Nested, 
ParquetThriftCompat, Suit}
+import org.apache.spark.sql.test.TestSQLContext
+import org.apache.spark.sql.{Row, SQLContext}
+
+object ParquetThriftCompatibilitySuite {
+  def makeParquetThriftCompat(i: Int): ParquetThriftCompat = {
--- End diff --

Yes. Originally I moved it here because of serialization failure, because I 
used to write the Parquet file using a Spark job. But now I resort to 
`ParquetWriter` classes and the companion object isn't necessary anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34117676
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala 
---
@@ -0,0 +1,434 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteOrder
+
+import scala.collection.JavaConversions._
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.parquet.column.Dictionary
+import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, 
PrimitiveConverter}
+import org.apache.parquet.schema.Type.Repetition
+import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A [[ParentContainerUpdater]] is used by a Parquet converter to set 
converted values to some
+ * corresponding parent container. For example, a converter for a 
`StructType` field may set
+ * converted values to a [[MutableRow]]; or a converter for array elements 
may append converted
+ * values to an [[ArrayBuffer]].
+ */
+private[parquet] trait ParentContainerUpdater {
+  def set(value: Any): Unit = ()
+  def setBoolean(value: Boolean): Unit = set(value)
+  def setByte(value: Byte): Unit = set(value)
+  def setShort(value: Short): Unit = set(value)
+  def setInt(value: Int): Unit = set(value)
+  def setLong(value: Long): Unit = set(value)
+  def setFloat(value: Float): Unit = set(value)
+  def setDouble(value: Double): Unit = set(value)
+}
+
+/** A no-op updater used for root converter (who doesn't have a parent). */
+private[parquet] object NoopUpdater extends ParentContainerUpdater
+
+/**
+ * A [[CatalystRowConverter]] is used to convert Parquet "structs" into 
Spark SQL [[Row]]s.  Since
+ * any Parquet record is also a struct, this converter can also be used as 
root converter.
--- End diff --

See here: 
https://github.com/apache/spark/pull/7231/files#diff-83ef4d5f1029c8bebb49a0c139fa3154R50


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34117556
  
--- Diff: project/MimaExcludes.scala ---
@@ -81,6 +81,32 @@ object MimaExcludes {
   "org.apache.spark.mllib.linalg.Matrix.numNonzeros"),
 ProblemFilters.exclude[MissingMethodProblem](
   "org.apache.spark.mllib.linalg.Matrix.numActives")
+  ) ++ Seq(
+// SPARK-6776 Implement backwards-compatibility rules in 
Catalyst converters
+ProblemFilters.exclude[MissingClassProblem](
+  "org.apache.spark.sql.parquet.CatalystGroupConverter"),
--- End diff --

we should exclude the entire parquet package.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34117535
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetAvroCompatibilitySuite.scala
 ---
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteBuffer
+import java.util.{List => JList, Map => JMap}
+
+import scala.collection.JavaConversions._
+
+import org.apache.hadoop.fs.Path
+import org.apache.parquet.avro.AvroParquetWriter
+
+import org.apache.spark.sql.parquet.test.avro.{Nested, ParquetAvroCompat}
+import org.apache.spark.sql.test.TestSQLContext
+import org.apache.spark.sql.{Row, SQLContext}
+
+object ParquetAvroCompatibilitySuite {
--- End diff --

I'd remove this too, unless it is used somewhere else. It makes it more 
confusing when you first open the test suite file if you don't see the suite.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34117514
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala 
---
@@ -0,0 +1,434 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteOrder
+
+import scala.collection.JavaConversions._
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.parquet.column.Dictionary
+import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, 
PrimitiveConverter}
+import org.apache.parquet.schema.Type.Repetition
+import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A [[ParentContainerUpdater]] is used by a Parquet converter to set 
converted values to some
+ * corresponding parent container. For example, a converter for a 
`StructType` field may set
+ * converted values to a [[MutableRow]]; or a converter for array elements 
may append converted
+ * values to an [[ArrayBuffer]].
+ */
+private[parquet] trait ParentContainerUpdater {
+  def set(value: Any): Unit = ()
+  def setBoolean(value: Boolean): Unit = set(value)
+  def setByte(value: Byte): Unit = set(value)
+  def setShort(value: Short): Unit = set(value)
+  def setInt(value: Int): Unit = set(value)
+  def setLong(value: Long): Unit = set(value)
+  def setFloat(value: Float): Unit = set(value)
+  def setDouble(value: Double): Unit = set(value)
+}
+
+/** A no-op updater used for root converter (who doesn't have a parent). */
+private[parquet] object NoopUpdater extends ParentContainerUpdater
+
+/**
+ * A [[CatalystRowConverter]] is used to convert Parquet "structs" into 
Spark SQL [[Row]]s.  Since
+ * any Parquet record is also a struct, this converter can also be used as 
root converter.
--- End diff --

The root converter is the converter defined in `RowRecordMaterializer`, 
which is used to convert a complete Parquet record to a single Spark SQL `Row`. 
It's always a `CatalystRowConverter`. The root converter's updater is a 
`NoopUpdater`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34117471
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetThriftCompatibilitySuite.scala
 ---
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteBuffer
+import java.util.{List => JList, Map => JMap}
+
+import scala.collection.JavaConversions._
+
+import org.apache.hadoop.fs.Path
+import org.apache.parquet.hadoop.metadata.CompressionCodecName
+import org.apache.parquet.thrift.ThriftParquetWriter
+
+import org.apache.spark.sql.parquet.test.thrift.{Nested, 
ParquetThriftCompat, Suit}
+import org.apache.spark.sql.test.TestSQLContext
+import org.apache.spark.sql.{Row, SQLContext}
+
+object ParquetThriftCompatibilitySuite {
+  def makeParquetThriftCompat(i: Int): ParquetThriftCompat = {
--- End diff --

can't we just make this a function of ParquetThriftCompatibilitySuite to 
remove the companion object?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-11906
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7255#issuecomment-119444300
  
  [Test build #36753 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36753/console)
 for   PR 7255 at commit 
[`3122470`](https://github.com/apache/spark/commit/31224709a69edd36d3a14ce48b20e578cad4ddcd).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8700][ML] Disable feature scaling in Lo...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7080#issuecomment-119443905
  
  [Test build #36754 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36754/console)
 for   PR 7080 at commit 
[`877e6c7`](https://github.com/apache/spark/commit/877e6c777874c3684c62b94a8704c8a1866defd9).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8700][ML] Disable feature scaling in Lo...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7080#issuecomment-119444002
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8886][Documentation]python Style update

2015-07-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7281


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8886][Documentation]python Style update

2015-07-07 Thread rxin

Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7281#issuecomment-119442813
  
Thanks - I've merged this.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8886][Documentation]python Style update

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7281#issuecomment-119442368
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8615][Documentation]Fixed Sample deprec...

2015-07-07 Thread tijoparacka

Github user tijoparacka commented on a diff in the pull request:

https://github.com/apache/spark/pull/7039#discussion_r34116963
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1798,7 +1798,7 @@ DataFrame jdbcDF = sqlContext.load("jdbc", options)
 
 {% highlight python %}
 
-df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", 
dbtable="schema.tablename")
+df = sqlContext.read.format('jdbc').options(url = 
'jdbc:postgresql:dbserver', dbtable='schema.tablename').load()
--- End diff --

@rxin  I have submitted a followup patch with jira id -SPARK-8886.
thanks 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8886][Documentation]python Style update

2015-07-07 Thread tijoparacka

GitHub user tijoparacka opened a pull request:

https://github.com/apache/spark/pull/7281

[SPARK-8886][Documentation]python Style update

Fixed comment given by rxin

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tijoparacka/spark 
modification_for_python_style

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7281.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7281


commit 3de4cd8c5b9b8e2d3d3dba166db898f88cf095d6
Author: Tijo Thomas 
Date:   2015-07-07T17:42:52Z

python Style update

commit 6334e212f91626edaaeb9c8bc03a306feff705ac
Author: Tijo Thomas 
Date:   2015-07-08T11:04:21Z

removed space




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8297] [YARN] Scheduler backend is not n...

2015-07-07 Thread mridulm

Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/7243#issuecomment-119439476
  
@vanzin Oh great, that is a very generous offer !
I will keep it around as long as you need :-) I just want to get it into 
the next release.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34116646
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala 
---
@@ -0,0 +1,434 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteOrder
+
+import scala.collection.JavaConversions._
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.parquet.column.Dictionary
+import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, 
PrimitiveConverter}
+import org.apache.parquet.schema.Type.Repetition
+import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A [[ParentContainerUpdater]] is used by a Parquet converter to set 
converted values to some
+ * corresponding parent container. For example, a converter for a 
`StructType` field may set
+ * converted values to a [[MutableRow]]; or a converter for array elements 
may append converted
+ * values to an [[ArrayBuffer]].
+ */
+private[parquet] trait ParentContainerUpdater {
+  def set(value: Any): Unit = ()
+  def setBoolean(value: Boolean): Unit = set(value)
+  def setByte(value: Byte): Unit = set(value)
+  def setShort(value: Short): Unit = set(value)
+  def setInt(value: Int): Unit = set(value)
+  def setLong(value: Long): Unit = set(value)
+  def setFloat(value: Float): Unit = set(value)
+  def setDouble(value: Double): Unit = set(value)
+}
+
+/** A no-op updater used for root converter (who doesn't have a parent). */
+private[parquet] object NoopUpdater extends ParentContainerUpdater
+
+/**
+ * A [[CatalystRowConverter]] is used to convert Parquet "structs" into 
Spark SQL [[Row]]s.  Since
+ * any Parquet record is also a struct, this converter can also be used as 
root converter.
+ *
+ * When used as a root converter, [[NoopUpdater]] should be used since 
root converters don't have
+ * any "parent" container.
--- End diff --

Not quite understand this one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34116640
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala 
---
@@ -0,0 +1,434 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteOrder
+
+import scala.collection.JavaConversions._
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.parquet.column.Dictionary
+import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, 
PrimitiveConverter}
+import org.apache.parquet.schema.Type.Repetition
+import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A [[ParentContainerUpdater]] is used by a Parquet converter to set 
converted values to some
+ * corresponding parent container. For example, a converter for a 
`StructType` field may set
--- End diff --

I meant "a converter for a field of a `StructType`". 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34116637
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala 
---
@@ -0,0 +1,434 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteOrder
+
+import scala.collection.JavaConversions._
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.parquet.column.Dictionary
+import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, 
PrimitiveConverter}
+import org.apache.parquet.schema.Type.Repetition
+import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A [[ParentContainerUpdater]] is used by a Parquet converter to set 
converted values to some
+ * corresponding parent container. For example, a converter for a 
`StructType` field may set
+ * converted values to a [[MutableRow]]; or a converter for array elements 
may append converted
+ * values to an [[ArrayBuffer]].
+ */
+private[parquet] trait ParentContainerUpdater {
+  def set(value: Any): Unit = ()
+  def setBoolean(value: Boolean): Unit = set(value)
+  def setByte(value: Byte): Unit = set(value)
+  def setShort(value: Short): Unit = set(value)
+  def setInt(value: Int): Unit = set(value)
+  def setLong(value: Long): Unit = set(value)
+  def setFloat(value: Float): Unit = set(value)
+  def setDouble(value: Double): Unit = set(value)
+}
+
+/** A no-op updater used for root converter (who doesn't have a parent). */
+private[parquet] object NoopUpdater extends ParentContainerUpdater
+
+/**
+ * A [[CatalystRowConverter]] is used to convert Parquet "structs" into 
Spark SQL [[Row]]s.  Since
+ * any Parquet record is also a struct, this converter can also be used as 
root converter.
--- End diff --

Which one is the root converter, NoopUpdater or CatalystRowConverter?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34116548
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala 
---
@@ -0,0 +1,434 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteOrder
+
+import scala.collection.JavaConversions._
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.parquet.column.Dictionary
+import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, 
PrimitiveConverter}
+import org.apache.parquet.schema.Type.Repetition
+import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A [[ParentContainerUpdater]] is used by a Parquet converter to set 
converted values to some
+ * corresponding parent container. For example, a converter for a 
`StructType` field may set
+ * converted values to a [[MutableRow]]; or a converter for array elements 
may append converted
+ * values to an [[ArrayBuffer]].
+ */
+private[parquet] trait ParentContainerUpdater {
+  def set(value: Any): Unit = ()
+  def setBoolean(value: Boolean): Unit = set(value)
+  def setByte(value: Byte): Unit = set(value)
+  def setShort(value: Short): Unit = set(value)
+  def setInt(value: Int): Unit = set(value)
+  def setLong(value: Long): Unit = set(value)
+  def setFloat(value: Float): Unit = set(value)
+  def setDouble(value: Double): Unit = set(value)
--- End diff --

Do we need to add setDate and setTimestamp?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...

2015-07-07 Thread yhuai

Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/7231#discussion_r34116504
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala 
---
@@ -0,0 +1,434 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.parquet
+
+import java.nio.ByteOrder
+
+import scala.collection.JavaConversions._
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.parquet.column.Dictionary
+import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, 
PrimitiveConverter}
+import org.apache.parquet.schema.Type.Repetition
+import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A [[ParentContainerUpdater]] is used by a Parquet converter to set 
converted values to some
+ * corresponding parent container. For example, a converter for a 
`StructType` field may set
--- End diff --

a converter for a `StructField`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8879][SQL] Remove EmptyRow class.

2015-07-07 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7277


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...

2015-07-07 Thread shivaram

Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/7280#issuecomment-119435268
  
@viirya The nice thing would be to add a test case based on say a JSON or 
Parquet input file. We can check the file into 
https://github.com/apache/spark/tree/master/R/pkg/inst/test_support and use it 
in a test case 
https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_sparkSQL.R 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7280#issuecomment-119435158
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7280#issuecomment-119435129
  
  [Test build #36750 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36750/console)
 for   PR 7280 at commit 
[`0dcc992`](https://github.com/apache/spark/commit/0dcc9923816114ae6e413af1ca8f195b150e0b5d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8271][SQL]string function: soundex

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7115#issuecomment-119434592
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8271][SQL]string function: soundex

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7115#issuecomment-119434659
  
  [Test build #36759 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36759/consoleFull)
 for   PR 7115 at commit 
[`731`](https://github.com/apache/spark/commit/7318854da623610dfcfc89a64d508993c61f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8271][SQL]string function: soundex

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7115#issuecomment-119434579
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-07 Thread tianyi

Github user tianyi commented on the pull request:

https://github.com/apache/spark/pull/7239#issuecomment-119433617
  
@SaintBacchus yes, I agree that, but I don't understand why we have to 
removed a closed session immediately. If there are more than 200 session alive, 
I think there will be more client connecting soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119433552
  
  [Test build #36758 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36758/consoleFull)
 for   PR 6743 at commit 
[`27cfbac`](https://github.com/apache/spark/commit/27cfbac55f276bb3d11cdd6a417f541aa81524a9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119432827
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119432781
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119430782
  
  [Test build #36756 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36756/console)
 for   PR 7279 at commit 
[`6602052`](https://github.com/apache/spark/commit/66020529cc22d0aa677a547652b53653ce618f38).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119430849
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/7226#discussion_r34115538
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/IntervalType.scala ---
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types
+
+import org.apache.spark.annotation.DeveloperApi
+
+
+/**
+ * :: DeveloperApi ::
+ * The data type representing time intervals.
+ *
+ * Please use the singleton [[DataTypes.IntervalType]].
+ */
+@DeveloperApi
+class IntervalType private() extends DataType {
--- End diff --

I didn't make it extend `AtomicType` here, as I haven't figured out how to 
compare intervals. `30 days` and `1 months` may have different compare result 
in different context. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7226#issuecomment-119428300
  
  [Test build #36757 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36757/consoleFull)
 for   PR 7226 at commit 
[`ac348c3`](https://github.com/apache/spark/commit/ac348c3ff9007d1b377eeb3570dfad86e05732aa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119427827
  
  [Test build #36756 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36756/consoleFull)
 for   PR 7279 at commit 
[`6602052`](https://github.com/apache/spark/commit/66020529cc22d0aa677a547652b53653ce618f38).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119427692
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7226#issuecomment-119427675
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7226#issuecomment-119427691
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119427632
  
  [Test build #36755 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36755/console)
 for   PR 6743 at commit 
[`0036265`](https://github.com/apache/spark/commit/0036265f41df9064fa4b93716a02cee40be8c7cf).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119427670
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119427659
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing

2015-07-07 Thread andrewor14

Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/7279#issuecomment-119427073
  
Thank you @witgo. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-07 Thread SaintBacchus

Github user SaintBacchus commented on the pull request:

https://github.com/apache/spark/pull/7239#issuecomment-119425235
  
@tianyi In my solution all the unfinished sessions will keep in memory.  If 
we don't check after session finish, we have to wait a new client to trigger 
this check.
> do the checking work when a session opened or an execution started


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119424280
  
  [Test build #36755 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36755/consoleFull)
 for   PR 6743 at commit 
[`0036265`](https://github.com/apache/spark/commit/0036265f41df9064fa4b93716a02cee40be8c7cf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119423486
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...

2015-07-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6743#issuecomment-119423495
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8457][ML]NGram Documentation

2015-07-07 Thread jkbradley

Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/7244#issuecomment-119422607
  
Could you please put links for the API docs under the Java and Python 
codetabs?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8457][ML]NGram Documentation

2015-07-07 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/7244#discussion_r34114829
  
--- Diff: docs/ml-features.md ---
@@ -288,6 +288,88 @@ for words_label in wordsDataFrame.select("words", 
"label").take(3):
 
 
 
+## $n$-gram
+
+An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ 
tokens (typically words) for some integer $n$. The `NGram` class can be used to 
transform input features into $n$-grams.
+
+`NGram` takes as input a sequence of strings (e.g. the output of a 
[Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer)).  The 
parameter `n` is used to determine the number of terms in each $n$-gram. The 
output will consist of a sequence of $n$-grams where each $n$-gram is 
represented by a space-delimited string of $n$ consecutive words.  If the input 
sequence contains fewer than `n` strings, no output is produced.
+
+
+
+
+
+
+
+[`NGram`](api/scala/index.html#org.apache.spark.ml.feature.NGram) takes an 
input column name, an output column name, and an optional length parameter n 
(n=2 by default).
+
+{% highlight scala %}
+import org.apache.spark.ml.feature.NGram
+
+val wordDataFrame = sqlContext.createDataFrame(Seq(
+  (0, Array("Hi", "I", "heard", "about", "Spark")),
+  (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
+  (2, Array("Logistic", "regression", "models", "are", "neat"))
+)).toDF("label", "words")
+
+val ngram = new NGram().setInputCol("words").setOutputCol("ngrams")
+val ngramDataFrame = ngram.transform(wordDataFrame)
+ngramDataFrame.select("ngrams", "label").take(3).foreach(println)
--- End diff --

If this looks weird, could you please modify it to be prettier?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8457][ML]NGram Documentation

2015-07-07 Thread jkbradley

Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/7244#discussion_r34114827
  
--- Diff: docs/ml-features.md ---
@@ -288,6 +288,88 @@ for words_label in wordsDataFrame.select("words", 
"label").take(3):
 
 
 
+## $n$-gram
+
+An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ 
tokens (typically words) for some integer $n$. The `NGram` class can be used to 
transform input features into $n$-grams.
+
+`NGram` takes as input a sequence of strings (e.g. the output of a 
[Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer)).  The 
parameter `n` is used to determine the number of terms in each $n$-gram. The 
output will consist of a sequence of $n$-grams where each $n$-gram is 
represented by a space-delimited string of $n$ consecutive words.  If the input 
sequence contains fewer than `n` strings, no output is produced.
--- End diff --

When I wrote
> Could link to this page's section instead (once API doc links are moved 
to codetabs section).

I meant that you could change the Tokenizer link to 
```(ml-features.html#tokenizer)```.  (That was ambiguous.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8233][SQL] misc function: hash

2015-07-07 Thread rxin

Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/6971#issuecomment-119416710
  
Jenkins, ok to test.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7226#discussion_r34114401
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala
 ---
@@ -301,6 +302,16 @@ object CatalystTypeConverters {
   DateTimeUtils.toJavaTimestamp(row.getLong(column))
   }
 
+  private object IntervalConverter extends CatalystTypeConverter[Interval, 
Interval, Array[Byte]] {
+override def toCatalystImpl(scalaValue: Interval): Array[Byte] =
--- End diff --

@adrian-wang also suggested using a single long to encode -- we should 
consider and decide on that as well.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...

2015-07-07 Thread rxin

Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7226#discussion_r34114392
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala
 ---
@@ -301,6 +302,16 @@ object CatalystTypeConverters {
   DateTimeUtils.toJavaTimestamp(row.getLong(column))
   }
 
+  private object IntervalConverter extends CatalystTypeConverter[Interval, 
Interval, Array[Byte]] {
+override def toCatalystImpl(scalaValue: Interval): Array[Byte] =
--- End diff --

Yes that's what I meant - just use an object with int + long.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-07 Thread tianyi

Github user tianyi commented on the pull request:

https://github.com/apache/spark/pull/7239#issuecomment-119416313
  
We already do the checking work when a session opened or an execution 
started.
The number of these two lists should NOT increase in other phase.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...

2015-07-07 Thread viirya

Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/7280#issuecomment-119415856
  
@shivaram I will check it later. For the test case, I don't see that for 
other types. Where do you suggest me to add the test case into? Directly adding 
it in deserialize.R?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-07 Thread tianyi

Github user tianyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/7239#discussion_r34114144
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala
 ---
@@ -179,6 +179,7 @@ object HiveThriftServer2 extends Logging {
 def onSessionClosed(sessionId: String): Unit = {
   sessionList(sessionId).finishTimestamp = System.currentTimeMillis
   onlineSessionNum -= 1
+  trimSessionIfNecessary()
--- End diff --

I think it is not necessary here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-07 Thread tianyi

Github user tianyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/7239#discussion_r34114132
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala
 ---
@@ -206,18 +208,20 @@ object HiveThriftServer2 extends Logging {
   executionList(id).detail = errorMessage
   executionList(id).state = ExecutionState.FAILED
   totalRunning -= 1
+  trimExecutionIfNecessary()
 }
 
 def onStatementFinish(id: String): Unit = {
   executionList(id).finishTimestamp = System.currentTimeMillis
   executionList(id).state = ExecutionState.FINISHED
   totalRunning -= 1
+  trimExecutionIfNecessary()
--- End diff --

I think it is not necessary here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-07 Thread tianyi

Github user tianyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/7239#discussion_r34114126
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala
 ---
@@ -206,18 +208,20 @@ object HiveThriftServer2 extends Logging {
   executionList(id).detail = errorMessage
   executionList(id).state = ExecutionState.FAILED
   totalRunning -= 1
+  trimExecutionIfNecessary()
--- End diff --

I think it is not necessary here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-07 Thread tianyi

Github user tianyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/7239#discussion_r34114118
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala
 ---
@@ -199,6 +200,7 @@ object HiveThriftServer2 extends Logging {
 def onStatementParsed(id: String, executionPlan: String): Unit = {
   executionList(id).executePlan = executionPlan
   executionList(id).state = ExecutionState.COMPILED
+  trimExecutionIfNecessary()
--- End diff --

I think it is not necessary here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...

2015-07-07 Thread tianyi

Github user tianyi commented on the pull request:

https://github.com/apache/spark/pull/7239#issuecomment-119414302
  
I think we should only fix the mistake of removing alive sessions or 
executions in this PR.
If the number of alive objects is more than 200, you should set the 
`spark.sql.thriftserver.ui.retainedStatements` and 
`spark.sql.thriftserver.ui.retainedSessions` to a bigger value. You should also 
remember to give more memory to the driver side.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34113909
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
+val sequences: RDD[Array[Int]],
--- End diff --

OK, It would be fixed in a follow-up PR before 1.5. It's soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8389][Streaming][PySpark] Expose KafkaR...

2015-07-07 Thread amit-ramesh

Github user amit-ramesh commented on the pull request:

https://github.com/apache/spark/pull/7185#issuecomment-119412470
  
@jerryshao
Although the title of SPARK-8337 is worded more broadly, all we really 
wanted is a way to attach Kafka offsets to individual events/messages :).

AFAICT, the implementation of transform appears to be doing executor side 
operations. And if the following two assumptions hold then we should be able to 
achieve what we need:
1. Events in an RDD partition are ordered by Kafka offset
2. The index of an OffsetRanges object in the getOffsets() list corresponds 
to the partition index in the RDD.

If this is not indeed so, it would be great if you could point me to some 
lines that use driver side operations in transform.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...

2015-07-07 Thread zhangjiajin

Github user zhangjiajin commented on a diff in the pull request:

https://github.com/apache/spark/pull/7258#discussion_r34113588
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala 
---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.fpm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ *
+ * A parallel PrefixSpan algorithm to mine sequential pattern.
+ * The PrefixSpan algorithm is described in
+ * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]].
+ *
+ * @param sequences original sequences data
+ * @param minSupport the minimal support level of the sequential pattern, 
any pattern appears
+ *   more than minSupport times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, 
any pattern appears
+ *   less than maxPatternLength will be output
+ *
+ * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining 
Sequential Pattern Mining
+ *   (Wikipedia)]]
+ */
+class Prefixspan(
--- End diff --

Fixed, change Prefixspan to PrefixSpan.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1025 matches

Mail list logo