[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7263#issuecomment-119461156 [Test build #36760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36760/console) for PR 7263 at commit [`4dfd418`](https://github.com/apache/spark/commit/4dfd41865f3722ff8f0dd3fe8abb05214859f418). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119460708 Merged build finished. Test FAILed.
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119460686 [Test build #36761 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36761/console) for PR 7279 at commit [`d980757`](https://github.com/apache/spark/commit/d98075785c60e2baa50abebda8c500da4d2c09f0). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/7226#issuecomment-119460060 looks good otherwise.
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119460146 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119460116 [Test build #36758 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36758/console) for PR 6743 at commit [`27cfbac`](https://github.com/apache/spark/commit/27cfbac55f276bb3d11cdd6a417f541aa81524a9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/7226#discussion_r34120043 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Interval.scala --- @@ -0,0 +1,24 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.types + + +/** + * The internal representation of interval type. + */ +case class Interval(months: Int, microseconds: Long) extends Serializable --- End diff -- actually can you make this a java class and move it to unsafe/src/main/java/org/apache/spark/unsafe/types ? Just leave the fields public so we can easily access them in codegen.
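For readers following the thread, the two-field layout in the quoted diff can be illustrated with a small sketch. A month has no fixed length in microseconds, which is presumably why the months are stored apart from the sub-month part. The companion helper `fromParts` and its constants are hypothetical (not in the PR), and treating a day as exactly 24 hours is an assumption of this sketch.

```scala
// Sketch only: the case class mirrors the one quoted in the diff; the companion
// helper and its constants are hypothetical and not part of the PR.
case class Interval(months: Int, microseconds: Long) extends Serializable

object Interval {
  val MicrosPerSecond: Long = 1000L * 1000L
  val MicrosPerHour: Long = 3600L * MicrosPerSecond
  // Assumption: a day is treated as exactly 24 hours.
  val MicrosPerDay: Long = 24L * MicrosPerHour

  /** Years fold into the month count; days and hours fold into microseconds. */
  def fromParts(years: Int, months: Int, days: Int, hours: Int): Interval =
    Interval(years * 12 + months, days * MicrosPerDay + hours * MicrosPerHour)
}
```

With this layout, "1 year 2 months" and "14 months" compare equal, while no attempt is made to convert months into microseconds.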
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34119826 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala --- @@ -0,0 +1,434 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteOrder + +import scala.collection.JavaConversions._ +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.parquet.column.Dictionary +import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, PrimitiveConverter} +import org.apache.parquet.schema.Type.Repetition +import org.apache.parquet.schema.{GroupType, PrimitiveType, Type} + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A [[ParentContainerUpdater]] is used by a Parquet converter to set converted values to some + * corresponding parent container. 
For example, a converter for a `StructType` field may set + * converted values to a [[MutableRow]]; or a converter for array elements may append converted + * values to an [[ArrayBuffer]]. + */ +private[parquet] trait ParentContainerUpdater { + def set(value: Any): Unit = () + def setBoolean(value: Boolean): Unit = set(value) + def setByte(value: Byte): Unit = set(value) + def setShort(value: Short): Unit = set(value) + def setInt(value: Int): Unit = set(value) + def setLong(value: Long): Unit = set(value) + def setFloat(value: Float): Unit = set(value) + def setDouble(value: Double): Unit = set(value) +} + +/** A no-op updater used for root converter (who doesn't have a parent). */ +private[parquet] object NoopUpdater extends ParentContainerUpdater + +/** + * A [[CatalystRowConverter]] is used to convert Parquet "structs" into Spark SQL [[Row]]s. Since + * any Parquet record is also a struct, this converter can also be used as root converter. + * + * When used as a root converter, [[NoopUpdater]] should be used since root converters don't have + * any "parent" container. --- End diff -- Will add more explanations here.
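The updater pattern described in the quoted scaladoc might look roughly like the sketch below. The trait is a trimmed copy of the one in the diff (visibility modifier dropped so it compiles on its own); `ArrayElementUpdater` and `RowFieldUpdater` are hypothetical names invented for illustration, and a plain `Array[Any]` stands in for a mutable row.

```scala
import scala.collection.mutable.ArrayBuffer

// Trimmed copy of the trait quoted in the diff: typed setters fall back to set().
trait ParentContainerUpdater {
  def set(value: Any): Unit = ()
  def setInt(value: Int): Unit = set(value)
  def setLong(value: Long): Unit = set(value)
}

// Root converters have no parent container, so their updater does nothing.
object NoopUpdater extends ParentContainerUpdater

// Hypothetical updater: appends each converted array element to a buffer.
final class ArrayElementUpdater(buffer: ArrayBuffer[Any]) extends ParentContainerUpdater {
  override def set(value: Any): Unit = buffer += value
}

// Hypothetical updater: writes a converted field into a fixed slot of a row,
// with Array[Any] standing in for a mutable row.
final class RowFieldUpdater(row: Array[Any], ordinal: Int) extends ParentContainerUpdater {
  override def set(value: Any): Unit = row(ordinal) = value
}
```

The point of the indirection is that a converter only needs to know its updater, not what kind of container its parent is filling.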
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119458590 [Test build #36761 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36761/consoleFull) for PR 7279 at commit [`d980757`](https://github.com/apache/spark/commit/d98075785c60e2baa50abebda8c500da4d2c09f0).
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34119723 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. 
+ * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { + + /** + * Calculate sequential patterns: + * a) find and collect length-one patterns + * b) for each length-one patterns and each sequence, + *emit (pattern (prefix), suffix sequence) as key-value pairs + * c) group by key and then map value iterator to array + * d) local PrefixSpan on each prefix + * @return sequential patterns + */ + def run(): RDD[(Seq[Int], Int)] = { +val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne() +val repartitionedRdd = repartitionSequences(prefixAndCandidates) +val nextPatterns = getPatternsInLocal(repartitionedRdd) +val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ nextPatterns +allPatterns + } + + /** + * Find the patterns that it's length is one + * @return length-one patterns and projection table + */ + private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], Array[Int])]) = { +val patternsOneLength = sequences + .map(_.distinct) + .flatMap(p => p) + .map((_, 1)) + .reduceByKey(_ + _) + +val removedElements: Array[Int] = patternsOneLength + .filter(_._2 < minSupport) + .map(_._1) + .collect() + +val savedElements = patternsOneLength.filter(_._2 >= minSupport) + +val savedElementsArray = savedElements + .map(_._1) + .collect() + +val filteredSequences = + if (removedElements.isEmpty) { +sequences + } else { +sequences.map { p => + p.filter { x => !removedElements.contains(x) } +} + } + +val 
prefixAndCandidates = filteredSequences.flatMap { x => + savedElementsArray.map { y => +val sub = getSuffix(y, x) +(Seq(y), sub) + } +} + +(savedElements, prefixAndCandidates) + } + + /** + * Re-partition the RDD data, to get better balance and performance. + * @param data patterns and projected sequences data before re-partition + * @return patterns and projected sequences data after re-partition + */ + private def repartitionSequences( + data: RDD[(Seq[Int], Array[Int])]): RDD[(Seq[Int], Array[Array[Int]])] = { +val dataRemovedEmptyLine = data.filter(x => x._2.nonEmpty) +val dataMerged = dataRemovedEmptyLine + .groupByKey() + .map(x => (x._1, x._2.toArray)) --- End diff -- Fixed.
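The grouping step under review (drop empty suffixes, group by prefix, materialize each group as an array) can be tried without a cluster. The sketch below uses local Scala collections as a stand-in for the RDD, so `groupBy` replaces `groupByKey`; the object and method names are illustrative and not from the PR.

```scala
// Sketch with local collections standing in for the RDD; names are illustrative.
object GroupProjected {
  /** Drops empty suffixes, then gathers all suffixes that share a prefix. */
  def groupByPrefix(data: Seq[(Seq[Int], Array[Int])]): Map[Seq[Int], Array[Array[Int]]] =
    data
      .filter { case (_, suffix) => suffix.nonEmpty }
      .groupBy { case (prefix, _) => prefix }
      .map { case (prefix, pairs) => prefix -> pairs.map(_._2).toArray }
}
```

Keying on `Seq[Int]` rather than `Array[Int]` matters here, since arrays compare by reference and would not group together.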
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34119658 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. 
+ * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { + + /** + * Calculate sequential patterns: + * a) find and collect length-one patterns + * b) for each length-one patterns and each sequence, + *emit (pattern (prefix), suffix sequence) as key-value pairs + * c) group by key and then map value iterator to array + * d) local PrefixSpan on each prefix + * @return sequential patterns + */ + def run(): RDD[(Seq[Int], Int)] = { +val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne() +val repartitionedRdd = repartitionSequences(prefixAndCandidates) +val nextPatterns = getPatternsInLocal(repartitionedRdd) +val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ nextPatterns +allPatterns + } + + /** + * Find the patterns that it's length is one + * @return length-one patterns and projection table + */ + private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], Array[Int])]) = { +val patternsOneLength = sequences + .map(_.distinct) + .flatMap(p => p) + .map((_, 1)) + .reduceByKey(_ + _) + +val removedElements: Array[Int] = patternsOneLength + .filter(_._2 < minSupport) + .map(_._1) + .collect() + +val savedElements = patternsOneLength.filter(_._2 >= minSupport) + +val savedElementsArray = savedElements + .map(_._1) + .collect() + +val filteredSequences = + if (removedElements.isEmpty) { +sequences + } else { +sequences.map { p => + p.filter { x => !removedElements.contains(x) } +} + } + +val 
prefixAndCandidates = filteredSequences.flatMap { x => + savedElementsArray.map { y => +val sub = getSuffix(y, x) +(Seq(y), sub) + } +} + +(savedElements, prefixAndCandidates) + } + + /** + * Re-partition the RDD data, to get better balance and performance. + * @param data patterns and projected sequences data before re-partition + * @return patterns and projected sequences data after re-partition + */ + private def repartitionSequences( + data: RDD[(Seq[Int], Array[Int])]): RDD[(Seq[Int], Array[Array[Int]])] = { +val dataRemovedEmptyLine = data.filter(x => x._2.nonEmpty) --- End diff -- Fixed. Moved the filter to L95.
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119458150 Merged build started.
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119458113 Merged build triggered.
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34119496 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. 
+ * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { + + /** + * Calculate sequential patterns: + * a) find and collect length-one patterns + * b) for each length-one patterns and each sequence, + *emit (pattern (prefix), suffix sequence) as key-value pairs + * c) group by key and then map value iterator to array + * d) local PrefixSpan on each prefix + * @return sequential patterns + */ + def run(): RDD[(Seq[Int], Int)] = { +val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne() +val repartitionedRdd = repartitionSequences(prefixAndCandidates) +val nextPatterns = getPatternsInLocal(repartitionedRdd) +val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ nextPatterns +allPatterns + } + + /** + * Find the patterns that it's length is one + * @return length-one patterns and projection table + */ + private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], Array[Int])]) = { +val patternsOneLength = sequences + .map(_.distinct) + .flatMap(p => p) + .map((_, 1)) + .reduceByKey(_ + _) + +val removedElements: Array[Int] = patternsOneLength + .filter(_._2 < minSupport) + .map(_._1) + .collect() + +val savedElements = patternsOneLength.filter(_._2 >= minSupport) + +val savedElementsArray = savedElements + .map(_._1) + .collect() + +val filteredSequences = + if (removedElements.isEmpty) { +sequences + } else { +sequences.map { p => + p.filter { x => !removedElements.contains(x) } +} + } + +val 
prefixAndCandidates = filteredSequences.flatMap { x => + savedElementsArray.map { y => +val sub = getSuffix(y, x) +(Seq(y), sub) + } +} + +(savedElements, prefixAndCandidates) + } + + /** + * Re-partition the RDD data, to get better balance and performance. + * @param data patterns and projected sequences data before re-partition + * @return patterns and projected sequences data after re-partition --- End diff -- Changed the function name to makePrefixProjectedDatabases. It's better than the old one.
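As an illustration of what a prefix-projected database contains, here is a local sketch of the projection step. `getSuffix` is not shown in the quoted diff, so `suffixAfter` below encodes an assumed behaviour (everything after the first occurrence of the item), and all names are hypothetical.

```scala
object PrefixProjection {
  /** Assumed behaviour: the part of `sequence` after the first occurrence of
    * `item`, or an empty array if `item` does not occur. */
  def suffixAfter(item: Int, sequence: Array[Int]): Array[Int] = {
    val idx = sequence.indexOf(item)
    if (idx == -1) Array.empty[Int] else sequence.drop(idx + 1)
  }

  /** Pairs each frequent item, as a length-one prefix, with its suffix in each sequence. */
  def project(frequentItems: Seq[Int], sequences: Seq[Array[Int]]): Seq[(Seq[Int], Array[Int])] =
    for {
      sequence <- sequences
      item <- frequentItems
    } yield (Seq(item), suffixAfter(item, sequence))
}
```

Grouping the resulting pairs by prefix gives one projected database per frequent item, which is what the renamed function appears to build.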
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/7226#discussion_r34119333 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala --- @@ -304,6 +304,9 @@ private[sql] object ResolvedDataSource { mode: SaveMode, options: Map[String, String], data: DataFrame): ResolvedDataSource = { +if (data.schema.map(_.dataType).exists(_.isInstanceOf[IntervalType])) { + sys.error("Cannot save interval data type into external storage.") --- End diff -- when is this called? if it is called during analysis, it'd make more sense to throw AnalysisException, since that has better error reporting in Python.
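The suggestion amounts to raising a dedicated, typed exception instead of the generic `RuntimeException` that `sys.error` throws. A minimal sketch of that shape follows; every type in it is a stand-in defined here so the block runs without Spark (`AnalysisError` is hypothetical and is not Spark's `AnalysisException`).

```scala
// Stand-in types so the sketch runs without Spark; none of these are Spark's classes.
sealed trait DataType
case object IntegerType extends DataType
case object IntervalType extends DataType
final case class StructField(name: String, dataType: DataType)

// Hypothetical analysis-time exception carrying a user-facing message.
final class AnalysisError(message: String) extends Exception(message)

object SaveChecks {
  /** Rejects schemas that contain an interval column before any write starts. */
  def checkSavable(schema: Seq[StructField]): Unit =
    schema.find(_.dataType == IntervalType).foreach { field =>
      throw new AnalysisError(
        s"Cannot save interval data type into external storage (column '${field.name}').")
    }
}
```

A caller can then catch the specific exception type and report it cleanly, rather than pattern-matching on the message of a generic runtime error.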
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34119347 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. 
+ * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { + + /** + * Calculate sequential patterns: + * a) find and collect length-one patterns + * b) for each length-one patterns and each sequence, + *emit (pattern (prefix), suffix sequence) as key-value pairs + * c) group by key and then map value iterator to array + * d) local PrefixSpan on each prefix + * @return sequential patterns + */ + def run(): RDD[(Seq[Int], Int)] = { +val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne() +val repartitionedRdd = repartitionSequences(prefixAndCandidates) +val nextPatterns = getPatternsInLocal(repartitionedRdd) +val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ nextPatterns +allPatterns + } + + /** + * Find the patterns that it's length is one + * @return length-one patterns and projection table + */ + private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], Array[Int])]) = { +val patternsOneLength = sequences + .map(_.distinct) + .flatMap(p => p) + .map((_, 1)) + .reduceByKey(_ + _) + +val removedElements: Array[Int] = patternsOneLength + .filter(_._2 < minSupport) + .map(_._1) + .collect() + +val savedElements = patternsOneLength.filter(_._2 >= minSupport) --- End diff -- Fixed.
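The length-one step in the quoted diff (deduplicate items within each sequence, count, keep counts of at least `minSupport`) can be checked on a small input with local collections. This is a sketch standing in for the RDD pipeline, with `groupBy` replacing `reduceByKey`; the object and method names are illustrative. Note that the quoted code keeps items with a count `>= minSupport`, whereas its scaladoc says "more than minSupport times".

```scala
object LengthOnePatterns {
  /** Counts, for each item, the number of sequences containing it, and keeps
    * the items whose count is at least `minSupport`. */
  def frequentItems(sequences: Seq[Array[Int]], minSupport: Int): Map[Int, Int] =
    sequences
      .flatMap(_.distinct.toSeq) // an item counts once per sequence
      .groupBy(identity)
      .map { case (item, hits) => item -> hits.size }
      .filter { case (_, count) => count >= minSupport }
}
```

For the input `Seq(Array(1, 2, 2), Array(1, 3), Array(2, 1))` with `minSupport = 2`, item 1 occurs in three sequences and item 2 in two, so both survive, while item 3 is dropped.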
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34119217 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. 
+ * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { + + /** + * Calculate sequential patterns: + * a) find and collect length-one patterns + * b) for each length-one patterns and each sequence, + *emit (pattern (prefix), suffix sequence) as key-value pairs + * c) group by key and then map value iterator to array + * d) local PrefixSpan on each prefix + * @return sequential patterns + */ + def run(): RDD[(Seq[Int], Int)] = { +val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne() +val repartitionedRdd = repartitionSequences(prefixAndCandidates) +val nextPatterns = getPatternsInLocal(repartitionedRdd) +val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ nextPatterns +allPatterns + } + + /** + * Find the patterns that it's length is one + * @return length-one patterns and projection table + */ + private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], Array[Int])]) = { +val patternsOneLength = sequences + .map(_.distinct) + .flatMap(p => p) + .map((_, 1)) + .reduceByKey(_ + _) + +val removedElements: Array[Int] = patternsOneLength --- End diff -- Fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34119143 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. 
+ * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { + + /** + * Calculate sequential patterns: + * a) find and collect length-one patterns + * b) for each length-one patterns and each sequence, + *emit (pattern (prefix), suffix sequence) as key-value pairs + * c) group by key and then map value iterator to array + * d) local PrefixSpan on each prefix + * @return sequential patterns + */ + def run(): RDD[(Seq[Int], Int)] = { +val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne() +val repartitionedRdd = repartitionSequences(prefixAndCandidates) +val nextPatterns = getPatternsInLocal(repartitionedRdd) +val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ nextPatterns +allPatterns + } + + /** + * Find the patterns that it's length is one + * @return length-one patterns and projection table + */ + private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], Array[Int])]) = { +val patternsOneLength = sequences + .map(_.distinct) + .flatMap(p => p) + .map((_, 1)) + .reduceByKey(_ + _) + --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34119063 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. 
+ * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { + + /** + * Calculate sequential patterns: + * a) find and collect length-one patterns + * b) for each length-one patterns and each sequence, + *emit (pattern (prefix), suffix sequence) as key-value pairs + * c) group by key and then map value iterator to array + * d) local PrefixSpan on each prefix + * @return sequential patterns + */ + def run(): RDD[(Seq[Int], Int)] = { +val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne() +val repartitionedRdd = repartitionSequences(prefixAndCandidates) +val nextPatterns = getPatternsInLocal(repartitionedRdd) +val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ nextPatterns +allPatterns + } + + /** + * Find the patterns that it's length is one + * @return length-one patterns and projection table + */ + private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], Array[Int])]) = { +val patternsOneLength = sequences + .map(_.distinct) + .flatMap(p => p) --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34118963 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. 
+ * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { + + /** + * Calculate sequential patterns: + * a) find and collect length-one patterns + * b) for each length-one patterns and each sequence, + *emit (pattern (prefix), suffix sequence) as key-value pairs + * c) group by key and then map value iterator to array + * d) local PrefixSpan on each prefix + * @return sequential patterns + */ + def run(): RDD[(Seq[Int], Int)] = { +val (patternsOneLength, prefixAndCandidates) = findPatternsLengthOne() +val repartitionedRdd = repartitionSequences(prefixAndCandidates) +val nextPatterns = getPatternsInLocal(repartitionedRdd) +val allPatterns = patternsOneLength.map(x => (Seq(x._1), x._2)) ++ nextPatterns +allPatterns + } + + /** + * Find the patterns that it's length is one + * @return length-one patterns and projection table + */ + private def findPatternsLengthOne(): (RDD[(Int, Int)], RDD[(Seq[Int], Array[Int])]) = { +val patternsOneLength = sequences --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
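For readers following the review, the length-one counting step quoted in the diffs above (`map(_.distinct)`, `flatMap`, `map((_, 1))`, `reduceByKey(_ + _)`) can be sketched on plain Scala collections. This is only an illustration, not the PR's code: `LengthOneSketch`, `countLengthOne` and `frequentItems` are hypothetical names, `groupBy` stands in for `reduceByKey`, and the `>= minSupport` threshold is an assumption, since the filtering code is not shown in the quoted diff.

```scala
object LengthOneSketch {
  // For each item, count the number of sequences that contain it at least once.
  def countLengthOne(sequences: Seq[Array[Int]]): Map[Int, Int] =
    sequences
      .map(_.distinct)      // an item counts at most once per sequence
      .flatMap(_.toSeq)     // flatten all items into one collection
      .groupBy(identity)    // local stand-in for reduceByKey(_ + _)
      .map { case (item, occurrences) => item -> occurrences.size }

  // Keep only items whose count reaches the (assumed) absolute threshold.
  def frequentItems(sequences: Seq[Array[Int]], minSupport: Int): Map[Int, Int] =
    countLengthOne(sequences).filter { case (_, count) => count >= minSupport }

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1, 2, 2, 3), Array(1, 3), Array(2, 4))
    println(frequentItems(data, minSupport = 2))
  }
}
```

The `distinct` call matters: without it, the repeated `2` in the first sequence would be counted twice and the result would no longer be a per-sequence support count.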
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34118908 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. 
+ * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { + + /** + * Calculate sequential patterns: + * a) find and collect length-one patterns + * b) for each length-one patterns and each sequence, + *emit (pattern (prefix), suffix sequence) as key-value pairs + * c) group by key and then map value iterator to array + * d) local PrefixSpan on each prefix + * @return sequential patterns + */ + def run(): RDD[(Seq[Int], Int)] = { --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
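Steps b) and c) of the `run()` doc comment above can be sketched on local collections. The names `ProjectionSketch`, `suffixAfter` and `project` are hypothetical, and taking the suffix after the first occurrence of the item is an assumption about how the projection works; the PR's `repartitionSequences` body is not shown in this message.

```scala
object ProjectionSketch {
  // Suffix of `sequence` after the first occurrence of `item`, if the item is present.
  def suffixAfter(item: Int, sequence: Array[Int]): Option[Array[Int]] = {
    val idx = sequence.indexOf(item)
    if (idx < 0) None else Some(sequence.drop(idx + 1))
  }

  // Step b: emit (prefix, suffix) pairs; step c: group the suffixes by prefix.
  def project(
      frequentItems: Seq[Int],
      sequences: Seq[Array[Int]]): Map[Seq[Int], Seq[Seq[Int]]] = {
    val pairs = for {
      item <- frequentItems
      s <- sequences
      suffix <- suffixAfter(item, s).toList
      if suffix.nonEmpty // empty suffixes cannot extend the prefix
    } yield (Seq(item), suffix.toSeq)
    pairs.groupBy(_._1).map { case (prefix, kvs) => prefix -> kvs.map(_._2) }
  }

  def main(args: Array[String]): Unit = {
    val sequences = Seq(Array(1, 2, 3), Array(2, 1), Array(3))
    println(project(Seq(1, 2), sequences))
  }
}
```

Each group in the result corresponds to one prefix, which is what would allow step d) to run a local PrefixSpan per prefix.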
[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7263#issuecomment-119451883 [Test build #36760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36760/consoleFull) for PR 7263 at commit [`4dfd418`](https://github.com/apache/spark/commit/4dfd41865f3722ff8f0dd3fe8abb05214859f418).
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34118848 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. + * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, +val maxPatternLength: Int = 50) extends java.io.Serializable { --- End diff -- Fixed, changed maxPatternLength to 10, changed minSupport to 0.1. 
[GitHub] spark pull request: [SPARK-8883][SQL]Remove the OverrideFunctionRe...
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7260#discussion_r34118817

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala ---
@@ -76,7 +76,7 @@ private[hive] class HiveFunctionRegistry(underlying: analysis.FunctionRegistry)
   }

   override def registerFunction(name: String, builder: FunctionBuilder): Unit =
-    throw new UnsupportedOperationException
+    underlying.registerFunction(name, builder)
--- End diff --

Can you add a unit test in hive to make sure function registration works?
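A test along the lines requested above might register a function through the wrapping registry and check that it can be looked up afterwards. The sketch below uses a deliberately simplified, hypothetical `Registry` trait rather than Spark's real `FunctionRegistry` and `FunctionBuilder` types, whose full interfaces are not shown in this thread; only the delegation pattern from the diff is mirrored.

```scala
import scala.collection.mutable

object RegistryDelegationSketch {
  // Hypothetical stand-in for a function builder: maps arguments to a result.
  type Builder = Seq[Int] => Int

  // Hypothetical, minimal registry interface (not Spark's FunctionRegistry).
  trait Registry {
    def registerFunction(name: String, builder: Builder): Unit
    def lookupFunction(name: String): Option[Builder]
  }

  class SimpleRegistry extends Registry {
    private val builders = mutable.Map.empty[String, Builder]
    override def registerFunction(name: String, builder: Builder): Unit =
      builders(name) = builder
    override def lookupFunction(name: String): Option[Builder] = builders.get(name)
  }

  // Mirrors the change in the diff: registration is delegated instead of throwing.
  class WrappingRegistry(underlying: Registry) extends Registry {
    override def registerFunction(name: String, builder: Builder): Unit =
      underlying.registerFunction(name, builder)
    override def lookupFunction(name: String): Option[Builder] =
      underlying.lookupFunction(name)
  }

  def main(args: Array[String]): Unit = {
    val wrapper = new WrappingRegistry(new SimpleRegistry)
    wrapper.registerFunction("total", _.sum)
    println(wrapper.lookupFunction("total").map(_(Seq(1, 2, 3))))
  }
}
```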
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34118782 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. + * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], +val minSupport: Int = 2, --- End diff -- Changed these input parameters to private var s, and add getters and setters. But I have a question, why don't we use the public input parameters ? 
In the following code, class A is shorter than class B. Why is class B's code style better?

    class A(var x: Int = 12) {
      def f() = { println(x) }
    }

    class B(private var x: Int) {
      def this() = this(12)
      def setX(x: Int): this.type = {
        this.x = x
        this
      }
      def f() = { println(x) }
    }

    val a = new A()
    a.f()
    a.x = 15
    a.f()

    val b = new B()
    b.f()
    b.setX(15)
    b.f()
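One commonly cited reason for the setter style, offered here as background rather than as the reviewers' stated rationale, is that a setter returning `this.type` supports chained configuration and gives a single place to validate values, while a public `var` accepts any assignment. The sketch below is hypothetical (`MinerParams` and its validation rules are not from the PR); the defaults 0.1 and 10 simply echo the values mentioned earlier in this thread.

```scala
// Hypothetical parameter holder in the style of class B above.
class MinerParams(private var minSupport: Double, private var maxPatternLength: Int) {
  def this() = this(0.1, 10)

  def getMinSupport: Double = minSupport
  def getMaxPatternLength: Int = maxPatternLength

  // Validation lives in the setter, so an invalid value is rejected immediately.
  def setMinSupport(value: Double): this.type = {
    require(value >= 0.0 && value <= 1.0, s"minSupport must be in [0, 1], got $value")
    minSupport = value
    this
  }

  def setMaxPatternLength(value: Int): this.type = {
    require(value >= 1, s"maxPatternLength must be positive, got $value")
    maxPatternLength = value
    this
  }
}

object MinerParamsDemo {
  def main(args: Array[String]): Unit = {
    // Returning this.type lets calls be chained.
    val params = new MinerParams().setMinSupport(0.5).setMaxPatternLength(5)
    println(s"${params.getMinSupport}, ${params.getMaxPatternLength}")
  }
}
```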
[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7263#issuecomment-119451559 Merged build started.
[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7263#issuecomment-119451544 Merged build triggered.
[GitHub] spark pull request: [SPARK-7078] [SPARK-7079] [WIP] Binary process...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/6444#discussion_r34118575 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection.unsafe.sort; + +import java.io.*; + +import com.google.common.io.ByteStreams; + +import org.apache.spark.storage.BlockId; +import org.apache.spark.storage.BlockManager; +import org.apache.spark.unsafe.PlatformDependent; + +/** + * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that class for a description + * of the file format). + */ +final class UnsafeSorterSpillReader extends UnsafeSorterIterator { + + private InputStream in; + private DataInputStream din; + + // Variables that change with every record read: + private int recordLength; + private long keyPrefix; + private int numRecordsRemaining; + + private final byte[] arr = new byte[1024 * 1024]; // TODO: tune this (maybe grow dynamically)? 
+ private final Object baseObject = arr; + private final long baseOffset = PlatformDependent.BYTE_ARRAY_OFFSET; + + public UnsafeSorterSpillReader( + BlockManager blockManager, + File file, + BlockId blockId) throws IOException { +assert (file.length() > 0); +final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file)); +this.in = blockManager.wrapForCompression(blockId, bs); +this.din = new DataInputStream(this.in); +numRecordsRemaining = din.readInt(); + } + + @Override + public boolean hasNext() { +return (numRecordsRemaining > 0); + } + + @Override + public void loadNext() throws IOException { +recordLength = din.readInt(); +keyPrefix = din.readLong(); +ByteStreams.readFully(in, arr, 0, recordLength); +numRecordsRemaining--; +if (numRecordsRemaining == 0) { + in.close(); --- End diff -- Even that isn't foolproof, though, since wrapping of the iterator would break that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
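The concern above can be illustrated with a small, hypothetical example: once an iterator that has a `close()` method is wrapped, for instance by `map`, the result is a plain `Iterator`, and a downstream type check no longer finds `close()`. `ClosableSource` and `closeIfPossible` are made-up names for this sketch and are not part of Spark.

```scala
object WrappingSketch {
  // Hypothetical iterator with an explicit close() method.
  final class ClosableSource(n: Int) extends Iterator[Int] with java.io.Closeable {
    private var i = 0
    var closed = false
    override def hasNext: Boolean = !closed && i < n
    override def next(): Int = { i += 1; i }
    override def close(): Unit = { closed = true }
  }

  // Returns true only if the iterator itself exposes close() and was closed.
  def closeIfPossible(iter: Iterator[_]): Boolean = iter match {
    case c: java.io.Closeable =>
      c.close()
      true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    val source = new ClosableSource(3)
    val wrapped = source.map(_ * 2) // plain Iterator: close() is no longer visible
    println(closeIfPossible(wrapped))
    println(source.closed)
  }
}
```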
[GitHub] spark pull request: [SPARK-7078] [SPARK-7079] [WIP] Binary process...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/6444#discussion_r34118539 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection.unsafe.sort; + +import java.io.*; + +import com.google.common.io.ByteStreams; + +import org.apache.spark.storage.BlockId; +import org.apache.spark.storage.BlockManager; +import org.apache.spark.unsafe.PlatformDependent; + +/** + * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that class for a description + * of the file format). + */ +final class UnsafeSorterSpillReader extends UnsafeSorterIterator { + + private InputStream in; + private DataInputStream din; + + // Variables that change with every record read: + private int recordLength; + private long keyPrefix; + private int numRecordsRemaining; + + private final byte[] arr = new byte[1024 * 1024]; // TODO: tune this (maybe grow dynamically)? 
+ private final Object baseObject = arr; + private final long baseOffset = PlatformDependent.BYTE_ARRAY_OFFSET; + + public UnsafeSorterSpillReader( + BlockManager blockManager, + File file, + BlockId blockId) throws IOException { +assert (file.length() > 0); +final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file)); +this.in = blockManager.wrapForCompression(blockId, bs); +this.din = new DataInputStream(this.in); +numRecordsRemaining = din.readInt(); + } + + @Override + public boolean hasNext() { +return (numRecordsRemaining > 0); + } + + @Override + public void loadNext() throws IOException { +recordLength = din.readInt(); +keyPrefix = din.readLong(); +ByteStreams.readFully(in, arr, 0, recordLength); +numRecordsRemaining--; +if (numRecordsRemaining == 0) { + in.close(); --- End diff -- This reminds me: this kind of cleanup is actually a problem in general for this sort code, since we also need to free memory pages, etc. in the `order by xxx limit xxx` case. Until we implement a proper internal iterator interface, a stopgap measure might be to define our own Scala iterator subclass which implements a close method and to update operators like `Limit` to call `close()` if it's available. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
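The stopgap described above could look roughly like the following. This is a minimal sketch under stated assumptions, not Spark code: `CloseableIterator`, `PagedSource` and `limit` are hypothetical names, and `limit` is only a local stand-in for an operator such as `Limit`.

```scala
object LimitSketch {
  // Hypothetical Scala iterator subtype that adds a close() method.
  trait CloseableIterator[A] extends Iterator[A] {
    def close(): Unit
  }

  // Toy source standing in for a sorter that holds files or memory pages.
  final class PagedSource(n: Int) extends CloseableIterator[Int] {
    private var i = 0
    private var open = true
    def isOpen: Boolean = open
    override def hasNext: Boolean = open && i < n
    override def next(): Int = { i += 1; i }
    override def close(): Unit = { open = false }
  }

  // Takes at most n elements, then calls close() if the input provides it,
  // even when the input has not been fully consumed.
  def limit[A](input: Iterator[A], n: Int): List[A] = {
    val out = List.newBuilder[A]
    var taken = 0
    try {
      while (taken < n && input.hasNext) {
        out += input.next()
        taken += 1
      }
    } finally {
      input match {
        case c: CloseableIterator[_] => c.close()
        case _ => ()
      }
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val source = new PagedSource(100)
    println(limit(source, 3))
    println(source.isOpen)
  }
}
```

The `try`/`finally` is a design choice of this sketch: it releases the source even if reading fails partway through, which closing only after the last record would not do.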
[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/7263#discussion_r34118525 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala --- @@ -146,6 +151,55 @@ class Word2VecModel private[ml] ( wordVectors: feature.Word2VecModel) extends Model[Word2VecModel] with Word2VecBase { + + /** + * Return a map of every word to its Vector representation. + */ + val getVectors: Map[String, Array[Float]] = wordVectors.getVectors + + /** + * Java friendly version of getVectors called by the python wrappers. + */ + private[spark] val javaGetVectors: JList[Object] = { +val (words, vectors) = getVectors.toSeq.unzip + +// There are problems with serialization of Arrays which makes this conversion +// necessary. +val vectorsList = vectors.toList.map(vec => Vectors.dense(vec.map(_.toDouble))) +List(words.toList, vectorsList).map(_.asJava.asInstanceOf[Object]).asJava + } + + /** + * Find "num" no. words closest to similarity to the given word. + * Returns a coupled array with the second element giving the similarity measure. + */ + def findSynonyms(word: String, num: Int): Array[(String, Double)] = { --- End diff -- Hmm. I had a similar thought in mind. If I do that I could remove the "java-friendly" code below. @mengxr thoughts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
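As background for the `findSynonyms` doc comment in the diff above, ranking words by similarity over a word-to-vector map can be sketched with cosine similarity. This is a generic illustration under stated assumptions, not the PR's implementation, whose method body is not shown in this message; `SynonymSketch`, `cosine` and the choice of cosine similarity itself are assumptions made for the example.

```scala
object SynonymSketch {
  // Cosine similarity between two vectors of equal length.
  def cosine(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
    val normA = math.sqrt(a.map(x => x.toDouble * x).sum)
    val normB = math.sqrt(b.map(x => x.toDouble * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Returns up to `num` (word, similarity) pairs, most similar first,
  // excluding the query word itself.
  def findSynonyms(
      vectors: Map[String, Array[Float]],
      word: String,
      num: Int): Array[(String, Double)] = {
    val target = vectors.getOrElse(
      word, throw new NoSuchElementException(s"$word not in vocabulary"))
    vectors.toSeq
      .filter { case (other, _) => other != word }
      .map { case (other, vec) => (other, cosine(target, vec)) }
      .sortBy { case (_, similarity) => -similarity }
      .take(num)
      .toArray
  }

  def main(args: Array[String]): Unit = {
    val vectors = Map(
      "x" -> Array(1f, 0f),
      "y" -> Array(2f, 0f),
      "z" -> Array(0f, 1f))
    findSynonyms(vectors, "x", 2).foreach(println)
  }
}
```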
[GitHub] spark pull request: [SPARK-8874] [ML] Add missing methods in Word2...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/7263#discussion_r34118474 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala --- @@ -146,6 +151,55 @@ class Word2VecModel private[ml] ( wordVectors: feature.Word2VecModel) extends Model[Word2VecModel] with Word2VecBase { + + /** + * Return a map of every word to its Vector representation. + */ + val getVectors: Map[String, Array[Float]] = wordVectors.getVectors --- End diff -- Thanks for clarifying. I just thought I was being ignorant. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7078] [SPARK-7079] [WIP] Binary process...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/6444#discussion_r34118225 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution; + +import java.io.IOException; +import java.util.Arrays; + +import scala.collection.Iterator; +import scala.math.Ordering; + +import com.google.common.annotations.VisibleForTesting; + +import org.apache.spark.SparkEnv; +import org.apache.spark.TaskContext; +import org.apache.spark.sql.AbstractScalaRowIterator; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.ObjectUnsafeColumnWriter; +import org.apache.spark.sql.catalyst.expressions.UnsafeColumnWriter; +import org.apache.spark.sql.catalyst.expressions.UnsafeRow; +import org.apache.spark.sql.catalyst.expressions.UnsafeRowConverter; +import org.apache.spark.sql.catalyst.util.ObjectPool; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.PlatformDependent; +import org.apache.spark.util.collection.unsafe.sort.PrefixComparator; +import org.apache.spark.util.collection.unsafe.sort.RecordComparator; +import org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter; +import org.apache.spark.util.collection.unsafe.sort.UnsafeSorterIterator; + +final class UnsafeExternalRowSorter { + + /** + * If positive, forces records to be spilled to disk at the given frequency (measured in numbers + * of records). This is only intended to be used in tests. 
+ */ + private int testSpillFrequency = 0; + + private long numRowsInserted = 0; + + private final StructType schema; + private final UnsafeRowConverter rowConverter; + private final PrefixComputer prefixComputer; + private final ObjectPool objPool = new ObjectPool(128); + private final UnsafeExternalSorter sorter; + private byte[] rowConversionBuffer = new byte[1024 * 8]; + + public static abstract class PrefixComputer { +abstract long computePrefix(InternalRow row); + } + + public UnsafeExternalRowSorter( + StructType schema, + Ordering ordering, + PrefixComparator prefixComparator, + PrefixComputer prefixComputer) throws IOException { +this.schema = schema; +this.rowConverter = new UnsafeRowConverter(schema); +this.prefixComputer = prefixComputer; +final SparkEnv sparkEnv = SparkEnv.get(); +final TaskContext taskContext = TaskContext.get(); +sorter = new UnsafeExternalSorter( + taskContext.taskMemoryManager(), + sparkEnv.shuffleMemoryManager(), + sparkEnv.blockManager(), + taskContext, + new RowComparator(ordering, schema.length(), objPool), + prefixComparator, + 4096, + sparkEnv.conf() +); + } + + /** + * Forces spills to occur every `frequency` records. Only for use in tests. + */ + @VisibleForTesting + void setTestSpillFrequency(int frequency) { +assert frequency > 0 : "Frequency must be positive"; +testSpillFrequency = frequency; + } + + @VisibleForTesting + void insertRow(InternalRow row) throws IOException { +final int sizeRequirement = rowConverter.getSizeRequirement(row); +if (sizeRequirement > rowConversionBuffer.length) { + rowConversionBuffer = new byte[sizeRequirement]; +} else { + // Zero out the buffer that's used to hold the current row. This is necessary in order + // to ensure that rows hash properly, since garbage data from the previous row could + // otherwise end up as padding in this row. As a performance optimization, we only zero + // out the portion of the buffer that we'll actuall
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7226#issuecomment-119449087 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7226#issuecomment-119449043 [Test build #36757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36757/console) for PR 7226 at commit [`ac348c3`](https://github.com/apache/spark/commit/ac348c3ff9007d1b377eeb3570dfad86e05732aa). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Interval(months: Int, microseconds: Long) extends Serializable`
[GitHub] spark pull request: [SPARK-8389][Streaming][PySpark] Expose KafkaR...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/7185#issuecomment-119447855

Hi @amit-ramesh, what I meant about getting offsetRanges in the `transform` function is something like this:

```python
dstream.transform(lambda r: r.offsetRanges())
```

Here `r.offsetRanges` is executed on the driver side. If you have follow-up transformations inside the `transform` function that need these offsetRanges, they will implicitly be sent to the executor side. That is what I meant.

Also:

> 1. Events in an RDD partition are ordered by Kafka offset
> 2. The index of an OffsetRanges object in the getOffsets() list corresponds to the partition index in the RDD.

As far as I know, these two assumptions are true, so you can rely on them.
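To make the two assumptions concrete, here is a rough sketch in plain Python with no Spark dependency. `OffsetRange` below is a hypothetical stand-in namedtuple, not the PySpark class, `label_with_offsets` is an illustrative helper, and partitions are modelled as plain lists. It also assumes the end offset of a range is exclusive.

```python
from collections import namedtuple

# Hypothetical stand-in for a Kafka offset range; not the PySpark class.
OffsetRange = namedtuple(
    "OffsetRange", ["topic", "partition", "from_offset", "until_offset"]
)


def label_with_offsets(partitions, offset_ranges):
    """Pair every event with its Kafka offset.

    Relies on both assumptions from the comment above:
    offset_ranges[i] describes partitions[i], and the events inside a
    partition are ordered by offset.
    """
    labelled = []
    for index, events in enumerate(partitions):
        rng = offset_ranges[index]
        # Assumption: until_offset is exclusive.
        expected = rng.until_offset - rng.from_offset
        if len(events) != expected:
            raise ValueError(
                "partition %d has %d events, expected %d"
                % (index, len(events), expected)
            )
        for position, event in enumerate(events):
            labelled.append(
                (rng.topic, rng.partition, rng.from_offset + position, event)
            )
    return labelled


ranges = [OffsetRange("clicks", 0, 100, 102), OffsetRange("clicks", 1, 40, 41)]
partitions = [["a", "b"], ["c"]]
result = label_with_offsets(partitions, ranges)
```

If either assumption did not hold, the computed offsets would silently be wrong, which is why the sketch at least checks the event count per partition.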
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34117781 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala --- @@ -0,0 +1,434 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteOrder + +import scala.collection.JavaConversions._ +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.parquet.column.Dictionary +import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, PrimitiveConverter} +import org.apache.parquet.schema.Type.Repetition +import org.apache.parquet.schema.{GroupType, PrimitiveType, Type} + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A [[ParentContainerUpdater]] is used by a Parquet converter to set converted values to some + * corresponding parent container. 
For example, a converter for a `StructType` field may set + * converted values to a [[MutableRow]]; or a converter for array elements may append converted + * values to an [[ArrayBuffer]]. + */ +private[parquet] trait ParentContainerUpdater { + def set(value: Any): Unit = () + def setBoolean(value: Boolean): Unit = set(value) + def setByte(value: Byte): Unit = set(value) + def setShort(value: Short): Unit = set(value) + def setInt(value: Int): Unit = set(value) + def setLong(value: Long): Unit = set(value) + def setFloat(value: Float): Unit = set(value) + def setDouble(value: Double): Unit = set(value) --- End diff -- No, `DateType` is converted to `Int` and we just call `setInt`. `TimestampType` is similar.
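As a rough illustration of why no dedicated date setter is needed, the sketch below turns a date into a plain integer and a timestamp into a plain long. It assumes the integer is a day count since the Unix epoch and the long is a microsecond count since the epoch; the comment above does not spell out the encoding, so treat both as assumptions. The helper names `date_to_days` and `timestamp_to_micros` are hypothetical.

```python
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_days(d):
    """Assumed encoding: number of days since 1970-01-01."""
    return (d - EPOCH_DATE).days


def timestamp_to_micros(ts):
    """Assumed encoding: microseconds since 1970-01-01T00:00:00Z."""
    delta = ts - EPOCH_TS
    # Integer arithmetic avoids floating-point rounding on large values.
    return (delta.days * 86400 + delta.seconds) * 1000000 + delta.microseconds


days = date_to_days(date(1971, 1, 1))
micros = timestamp_to_micros(
    datetime(1970, 1, 1, 0, 0, 1, 500000, tzinfo=timezone.utc)
)
```

Once the value is an ordinary integer, the converter can hand it to the generic integer setter, which is the point made in the reply.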
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34117743 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetThriftCompatibilitySuite.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteBuffer +import java.util.{List => JList, Map => JMap} + +import scala.collection.JavaConversions._ + +import org.apache.hadoop.fs.Path +import org.apache.parquet.hadoop.metadata.CompressionCodecName +import org.apache.parquet.thrift.ThriftParquetWriter + +import org.apache.spark.sql.parquet.test.thrift.{Nested, ParquetThriftCompat, Suit} +import org.apache.spark.sql.test.TestSQLContext +import org.apache.spark.sql.{Row, SQLContext} + +object ParquetThriftCompatibilitySuite { + def makeParquetThriftCompat(i: Int): ParquetThriftCompat = { --- End diff -- Yes. Originally I moved it here because of serialization failure, because I used to write the Parquet file using a Spark job. But now I resort to `ParquetWriter` classes and the companion object isn't necessary anymore. 
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34117676 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala --- @@ -0,0 +1,434 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteOrder + +import scala.collection.JavaConversions._ +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.parquet.column.Dictionary +import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, PrimitiveConverter} +import org.apache.parquet.schema.Type.Repetition +import org.apache.parquet.schema.{GroupType, PrimitiveType, Type} + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A [[ParentContainerUpdater]] is used by a Parquet converter to set converted values to some + * corresponding parent container. 
For example, a converter for a `StructType` field may set + * converted values to a [[MutableRow]]; or a converter for array elements may append converted + * values to an [[ArrayBuffer]]. + */ +private[parquet] trait ParentContainerUpdater { + def set(value: Any): Unit = () + def setBoolean(value: Boolean): Unit = set(value) + def setByte(value: Byte): Unit = set(value) + def setShort(value: Short): Unit = set(value) + def setInt(value: Int): Unit = set(value) + def setLong(value: Long): Unit = set(value) + def setFloat(value: Float): Unit = set(value) + def setDouble(value: Double): Unit = set(value) +} + +/** A no-op updater used for root converter (who doesn't have a parent). */ +private[parquet] object NoopUpdater extends ParentContainerUpdater + +/** + * A [[CatalystRowConverter]] is used to convert Parquet "structs" into Spark SQL [[Row]]s. Since + * any Parquet record is also a struct, this converter can also be used as root converter. --- End diff -- See here: https://github.com/apache/spark/pull/7231/files#diff-83ef4d5f1029c8bebb49a0c139fa3154R50
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34117556 --- Diff: project/MimaExcludes.scala --- @@ -81,6 +81,32 @@ object MimaExcludes { "org.apache.spark.mllib.linalg.Matrix.numNonzeros"), ProblemFilters.exclude[MissingMethodProblem]( "org.apache.spark.mllib.linalg.Matrix.numActives") + ) ++ Seq( +// SPARK-6776 Implement backwards-compatibility rules in Catalyst converters +ProblemFilters.exclude[MissingClassProblem]( + "org.apache.spark.sql.parquet.CatalystGroupConverter"), --- End diff -- we should exclude the entire parquet package.
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34117535 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetAvroCompatibilitySuite.scala --- @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteBuffer +import java.util.{List => JList, Map => JMap} + +import scala.collection.JavaConversions._ + +import org.apache.hadoop.fs.Path +import org.apache.parquet.avro.AvroParquetWriter + +import org.apache.spark.sql.parquet.test.avro.{Nested, ParquetAvroCompat} +import org.apache.spark.sql.test.TestSQLContext +import org.apache.spark.sql.{Row, SQLContext} + +object ParquetAvroCompatibilitySuite { --- End diff -- I'd remove this too, unless it is used somewhere else. It makes it more confusing when you first open the test suite file if you don't see the suite.
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34117514 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala --- @@ -0,0 +1,434 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteOrder + +import scala.collection.JavaConversions._ +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.parquet.column.Dictionary +import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, PrimitiveConverter} +import org.apache.parquet.schema.Type.Repetition +import org.apache.parquet.schema.{GroupType, PrimitiveType, Type} + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A [[ParentContainerUpdater]] is used by a Parquet converter to set converted values to some + * corresponding parent container. 
For example, a converter for a `StructType` field may set + * converted values to a [[MutableRow]]; or a converter for array elements may append converted + * values to an [[ArrayBuffer]]. + */ +private[parquet] trait ParentContainerUpdater { + def set(value: Any): Unit = () + def setBoolean(value: Boolean): Unit = set(value) + def setByte(value: Byte): Unit = set(value) + def setShort(value: Short): Unit = set(value) + def setInt(value: Int): Unit = set(value) + def setLong(value: Long): Unit = set(value) + def setFloat(value: Float): Unit = set(value) + def setDouble(value: Double): Unit = set(value) +} + +/** A no-op updater used for root converter (who doesn't have a parent). */ +private[parquet] object NoopUpdater extends ParentContainerUpdater + +/** + * A [[CatalystRowConverter]] is used to convert Parquet "structs" into Spark SQL [[Row]]s. Since + * any Parquet record is also a struct, this converter can also be used as root converter. --- End diff -- The root converter is the converter defined in `RowRecordMaterializer`, which is used to convert a complete Parquet record to a single Spark SQL `Row`. It's always a `CatalystRowConverter`. The root converter's updater is a `NoopUpdater`.
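The updater pattern from the quoted diff can be sketched in Python as follows. This is only an illustration of the idea, not the Scala code under review: the typed setters fall back to `set`, the root converter gets a no-op updater, and `RowFieldUpdater` and `ArrayElementUpdater` are hypothetical examples of updaters a nested converter might be given.

```python
class ParentContainerUpdater:
    """Typed setters fall back to set(); set() does nothing by default."""

    def set(self, value):
        pass

    def set_boolean(self, value):
        self.set(value)

    def set_int(self, value):
        self.set(value)

    def set_long(self, value):
        self.set(value)

    def set_double(self, value):
        self.set(value)


class NoopUpdater(ParentContainerUpdater):
    """Updater for the root converter, which has no parent to write into."""


class RowFieldUpdater(ParentContainerUpdater):
    """Hypothetical updater writing into one slot of a mutable row (a list)."""

    def __init__(self, row, ordinal):
        self.row = row
        self.ordinal = ordinal

    def set(self, value):
        self.row[self.ordinal] = value


class ArrayElementUpdater(ParentContainerUpdater):
    """Hypothetical updater appending converted elements to a buffer."""

    def __init__(self, buffer):
        self.buffer = buffer

    def set(self, value):
        self.buffer.append(value)


row = [None, None]
RowFieldUpdater(row, 0).set_int(42)
RowFieldUpdater(row, 1).set_double(1.5)

elements = []
ArrayElementUpdater(elements).set_long(7)

# The root converter's updater accepts values and discards them.
NoopUpdater().set_int(99)
```

Overriding only `set` is enough for containers that store boxed values; a container with primitive slots could override the typed setters instead to avoid boxing.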
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34117471 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetThriftCompatibilitySuite.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteBuffer +import java.util.{List => JList, Map => JMap} + +import scala.collection.JavaConversions._ + +import org.apache.hadoop.fs.Path +import org.apache.parquet.hadoop.metadata.CompressionCodecName +import org.apache.parquet.thrift.ThriftParquetWriter + +import org.apache.spark.sql.parquet.test.thrift.{Nested, ParquetThriftCompat, Suit} +import org.apache.spark.sql.test.TestSQLContext +import org.apache.spark.sql.{Row, SQLContext} + +object ParquetThriftCompatibilitySuite { + def makeParquetThriftCompat(i: Int): ParquetThriftCompat = { --- End diff -- can't we just make this a function of ParquetThriftCompatibilitySuite to remove the companion object?
[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7255#issuecomment-11906 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8851][YARN] In Yarn client mode, Client...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7255#issuecomment-119444300 [Test build #36753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36753/console) for PR 7255 at commit [`3122470`](https://github.com/apache/spark/commit/31224709a69edd36d3a14ce48b20e578cad4ddcd). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8700][ML] Disable feature scaling in Lo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7080#issuecomment-119443905 [Test build #36754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36754/console) for PR 7080 at commit [`877e6c7`](https://github.com/apache/spark/commit/877e6c777874c3684c62b94a8704c8a1866defd9). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8700][ML] Disable feature scaling in Lo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7080#issuecomment-119444002 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8886][Documentation]python Style update
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7281
[GitHub] spark pull request: [SPARK-8886][Documentation]python Style update
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/7281#issuecomment-119442813 Thanks - I've merged this.
[GitHub] spark pull request: [SPARK-8886][Documentation]python Style update
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7281#issuecomment-119442368 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-8615][Documentation]Fixed Sample deprec...
Github user tijoparacka commented on a diff in the pull request: https://github.com/apache/spark/pull/7039#discussion_r34116963 --- Diff: docs/sql-programming-guide.md --- @@ -1798,7 +1798,7 @@ DataFrame jdbcDF = sqlContext.load("jdbc", options) {% highlight python %} -df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename") +df = sqlContext.read.format('jdbc').options(url = 'jdbc:postgresql:dbserver', dbtable='schema.tablename').load() --- End diff -- @rxin I have submitted a follow-up patch under JIRA ID SPARK-8886. Thanks.
[GitHub] spark pull request: [SPARK-8886][Documentation]python Style update
GitHub user tijoparacka opened a pull request: https://github.com/apache/spark/pull/7281 [SPARK-8886][Documentation]python Style update Fixed comment given by rxin You can merge this pull request into a Git repository by running: $ git pull https://github.com/tijoparacka/spark modification_for_python_style Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7281.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7281 commit 3de4cd8c5b9b8e2d3d3dba166db898f88cf095d6 Author: Tijo Thomas Date: 2015-07-07T17:42:52Z python Style update commit 6334e212f91626edaaeb9c8bc03a306feff705ac Author: Tijo Thomas Date: 2015-07-08T11:04:21Z removed space
[GitHub] spark pull request: [SPARK-8297] [YARN] Scheduler backend is not n...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/7243#issuecomment-119439476 @vanzin Oh great, that is a very generous offer! I will keep it around as long as you need :-) I just want to get it into the next release.
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34116646 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala --- @@ -0,0 +1,434 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteOrder + +import scala.collection.JavaConversions._ +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.parquet.column.Dictionary +import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, PrimitiveConverter} +import org.apache.parquet.schema.Type.Repetition +import org.apache.parquet.schema.{GroupType, PrimitiveType, Type} + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A [[ParentContainerUpdater]] is used by a Parquet converter to set converted values to some + * corresponding parent container. 
For example, a converter for a `StructType` field may set + * converted values to a [[MutableRow]]; or a converter for array elements may append converted + * values to an [[ArrayBuffer]]. + */ +private[parquet] trait ParentContainerUpdater { + def set(value: Any): Unit = () + def setBoolean(value: Boolean): Unit = set(value) + def setByte(value: Byte): Unit = set(value) + def setShort(value: Short): Unit = set(value) + def setInt(value: Int): Unit = set(value) + def setLong(value: Long): Unit = set(value) + def setFloat(value: Float): Unit = set(value) + def setDouble(value: Double): Unit = set(value) +} + +/** A no-op updater used for root converter (who doesn't have a parent). */ +private[parquet] object NoopUpdater extends ParentContainerUpdater + +/** + * A [[CatalystRowConverter]] is used to convert Parquet "structs" into Spark SQL [[Row]]s. Since + * any Parquet record is also a struct, this converter can also be used as root converter. + * + * When used as a root converter, [[NoopUpdater]] should be used since root converters don't have + * any "parent" container. --- End diff -- I don't quite understand this one.
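The relationship the quoted comment describes (a converter pushes its result through an updater, and the root converter gets a no-op updater because nothing sits above it) can be sketched in Python. This is only an illustration of the pattern, not the code in the pull request: `RowFieldUpdater`, `ArrayElementUpdater`, and `StructConverter` are hypothetical names, and only a few of the typed setters are mirrored.

```python
class ParentContainerUpdater:
    """Pushes a converted value into a parent container; does nothing by default."""

    def set(self, value):
        pass

    # Typed setters fall back to the generic set(), as in the quoted trait.
    def set_boolean(self, value):
        self.set(value)

    def set_int(self, value):
        self.set(value)

    def set_double(self, value):
        self.set(value)


class NoopUpdater(ParentContainerUpdater):
    """Updater handed to a root converter, which has no parent to update."""


class RowFieldUpdater(ParentContainerUpdater):
    """Hypothetical updater that writes into one slot of a parent row."""

    def __init__(self, row, ordinal):
        self.row = row
        self.ordinal = ordinal

    def set(self, value):
        self.row[self.ordinal] = value


class ArrayElementUpdater(ParentContainerUpdater):
    """Hypothetical updater that appends to a parent buffer."""

    def __init__(self, buffer):
        self.buffer = buffer

    def set(self, value):
        self.buffer.append(value)


class StructConverter:
    """Hypothetical struct converter: collects fields, then hands the row upward."""

    def __init__(self, num_fields, updater):
        self.updater = updater
        self.current_row = [None] * num_fields
        self.field_updaters = [
            RowFieldUpdater(self.current_row, i) for i in range(num_fields)
        ]

    def end(self):
        # Pass a copy of the finished row to whatever container sits above.
        self.updater.set(list(self.current_row))


# The root converter has no parent, so it is given the no-op updater.
root = StructConverter(2, NoopUpdater())
# A nested struct stored in field 1 of the root row fills that slot when it ends.
nested = StructConverter(1, root.field_updaters[1])

root.field_updaters[0].set_int(7)
nested.field_updaters[0].set_boolean(True)
nested.end()
root.end()

# An array converter would append elements instead of setting a slot.
elements = []
ArrayElementUpdater(elements).set_double(1.5)
```

In this sketch the converter is the thing that can be "root"; the no-op updater is just what a root converter is constructed with, so `root.end()` has nowhere to push its row and the caller reads `root.current_row` directly.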
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34116640 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala --- @@ -0,0 +1,434 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteOrder + +import scala.collection.JavaConversions._ +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.parquet.column.Dictionary +import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, PrimitiveConverter} +import org.apache.parquet.schema.Type.Repetition +import org.apache.parquet.schema.{GroupType, PrimitiveType, Type} + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A [[ParentContainerUpdater]] is used by a Parquet converter to set converted values to some + * corresponding parent container. For example, a converter for a `StructType` field may set --- End diff -- I meant "a converter for a field of a `StructType`". 
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34116637 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala --- @@ -0,0 +1,434 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteOrder + +import scala.collection.JavaConversions._ +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.parquet.column.Dictionary +import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, PrimitiveConverter} +import org.apache.parquet.schema.Type.Repetition +import org.apache.parquet.schema.{GroupType, PrimitiveType, Type} + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A [[ParentContainerUpdater]] is used by a Parquet converter to set converted values to some + * corresponding parent container. 
For example, a converter for a `StructType` field may set + * converted values to a [[MutableRow]]; or a converter for array elements may append converted + * values to an [[ArrayBuffer]]. + */ +private[parquet] trait ParentContainerUpdater { + def set(value: Any): Unit = () + def setBoolean(value: Boolean): Unit = set(value) + def setByte(value: Byte): Unit = set(value) + def setShort(value: Short): Unit = set(value) + def setInt(value: Int): Unit = set(value) + def setLong(value: Long): Unit = set(value) + def setFloat(value: Float): Unit = set(value) + def setDouble(value: Double): Unit = set(value) +} + +/** A no-op updater used for root converter (who doesn't have a parent). */ +private[parquet] object NoopUpdater extends ParentContainerUpdater + +/** + * A [[CatalystRowConverter]] is used to convert Parquet "structs" into Spark SQL [[Row]]s. Since + * any Parquet record is also a struct, this converter can also be used as root converter. --- End diff -- Which one is the root converter, NoopUpdater or CatalystRowConverter?
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34116548 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala --- @@ -0,0 +1,434 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteOrder + +import scala.collection.JavaConversions._ +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.parquet.column.Dictionary +import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, PrimitiveConverter} +import org.apache.parquet.schema.Type.Repetition +import org.apache.parquet.schema.{GroupType, PrimitiveType, Type} + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A [[ParentContainerUpdater]] is used by a Parquet converter to set converted values to some + * corresponding parent container. 
For example, a converter for a `StructType` field may set + * converted values to a [[MutableRow]]; or a converter for array elements may append converted + * values to an [[ArrayBuffer]]. + */ +private[parquet] trait ParentContainerUpdater { + def set(value: Any): Unit = () + def setBoolean(value: Boolean): Unit = set(value) + def setByte(value: Byte): Unit = set(value) + def setShort(value: Short): Unit = set(value) + def setInt(value: Int): Unit = set(value) + def setLong(value: Long): Unit = set(value) + def setFloat(value: Float): Unit = set(value) + def setDouble(value: Double): Unit = set(value) --- End diff -- Do we need to add setDate and setTimestamp?
[GitHub] spark pull request: [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] R...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/7231#discussion_r34116504 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystRowConverter.scala --- @@ -0,0 +1,434 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import java.nio.ByteOrder + +import scala.collection.JavaConversions._ +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer + +import org.apache.parquet.column.Dictionary +import org.apache.parquet.io.api.{Binary, Converter, GroupConverter, PrimitiveConverter} +import org.apache.parquet.schema.Type.Repetition +import org.apache.parquet.schema.{GroupType, PrimitiveType, Type} + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A [[ParentContainerUpdater]] is used by a Parquet converter to set converted values to some + * corresponding parent container. For example, a converter for a `StructType` field may set --- End diff -- a converter for a `StructField`? 
[GitHub] spark pull request: [SPARK-8879][SQL] Remove EmptyRow class.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7277
[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/7280#issuecomment-119435268 @viirya It would be nice to add a test case based on, say, a JSON or Parquet input file. We can check the file into https://github.com/apache/spark/tree/master/R/pkg/inst/test_support and use it in a test case in https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_sparkSQL.R
[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7280#issuecomment-119435158 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7280#issuecomment-119435129 [Test build #36750 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36750/console) for PR 7280 at commit [`0dcc992`](https://github.com/apache/spark/commit/0dcc9923816114ae6e413af1ca8f195b150e0b5d). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8271][SQL]string function: soundex
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7115#issuecomment-119434592 Merged build started.
[GitHub] spark pull request: [SPARK-8271][SQL]string function: soundex
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7115#issuecomment-119434659 [Test build #36759 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36759/consoleFull) for PR 7115 at commit [`731`](https://github.com/apache/spark/commit/7318854da623610dfcfc89a64d508993c61f).
[GitHub] spark pull request: [SPARK-8271][SQL]string function: soundex
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7115#issuecomment-119434579 Merged build triggered.
[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...
Github user tianyi commented on the pull request: https://github.com/apache/spark/pull/7239#issuecomment-119433617 @SaintBacchus yes, I agree with that, but I don't understand why we have to remove a closed session immediately. If more than 200 sessions are alive, I think more clients will be connecting soon.
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119433552 [Test build #36758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36758/consoleFull) for PR 6743 at commit [`27cfbac`](https://github.com/apache/spark/commit/27cfbac55f276bb3d11cdd6a417f541aa81524a9).
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119432827 Merged build started.
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119432781 Merged build triggered.
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119430782 [Test build #36756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36756/console) for PR 7279 at commit [`6602052`](https://github.com/apache/spark/commit/66020529cc22d0aa677a547652b53653ce618f38). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119430849 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/7226#discussion_r34115538 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/IntervalType.scala --- @@ -0,0 +1,37 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.types + +import org.apache.spark.annotation.DeveloperApi + + +/** + * :: DeveloperApi :: + * The data type representing time intervals. + * + * Please use the singleton [[DataTypes.IntervalType]]. + */ +@DeveloperApi +class IntervalType private() extends DataType { --- End diff -- I didn't make it extend `AtomicType` here, as I haven't figured out how to compare intervals. `30 days` and `1 month` may compare differently in different contexts.
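The ordering problem mentioned above can be seen with plain calendar arithmetic. The following is a minimal sketch in Python's standard library, not Spark code: `add_months` and `compare_one_month_to_30_days` are hypothetical helpers, and the day-clamping rule in `add_months` is an assumption made for illustration.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Add calendar months to a date, clamping the day to the target month's length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def compare_one_month_to_30_days(start):
    """Return -1, 0, or 1 as `1 month` is shorter than, equal to, or longer than `30 days`."""
    month_end = add_months(start, 1)
    days_end = start + timedelta(days=30)
    return (month_end > days_end) - (month_end < days_end)


# The same two intervals order differently depending on the anchor date.
feb = compare_one_month_to_30_days(date(2015, 2, 1))  # February 2015 has 28 days
apr = compare_one_month_to_30_days(date(2015, 4, 1))  # April has 30 days
jul = compare_one_month_to_30_days(date(2015, 7, 1))  # July has 31 days
```

Since the result depends on the start date, an interval made of months and days has no single total order without extra context, which is consistent with the hesitation about extending `AtomicType`.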
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7226#issuecomment-119428300 [Test build #36757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36757/consoleFull) for PR 7226 at commit [`ac348c3`](https://github.com/apache/spark/commit/ac348c3ff9007d1b377eeb3570dfad86e05732aa).
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119427827 [Test build #36756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36756/consoleFull) for PR 7279 at commit [`6602052`](https://github.com/apache/spark/commit/66020529cc22d0aa677a547652b53653ce618f38).
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119427692 Merged build started.
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7226#issuecomment-119427675 Merged build triggered.
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7226#issuecomment-119427691 Merged build started.
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119427632 [Test build #36755 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36755/console) for PR 6743 at commit [`0036265`](https://github.com/apache/spark/commit/0036265f41df9064fa4b93716a02cee40be8c7cf). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119427670 Merged build triggered.
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119427659 Merged build finished. Test FAILed.
[GitHub] spark pull request: [WIP] [SPARK-7292] Cheap checkpointing
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/7279#issuecomment-119427073 Thank you @witgo. :)
[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...
Github user SaintBacchus commented on the pull request: https://github.com/apache/spark/pull/7239#issuecomment-119425235 @tianyi In my solution, all unfinished sessions are kept in memory. If we don't check after a session finishes, we have to wait for a new client to trigger the check. > do the checking work when a session opened or an execution started
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119424280 [Test build #36755 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36755/consoleFull) for PR 6743 at commit [`0036265`](https://github.com/apache/spark/commit/0036265f41df9064fa4b93716a02cee40be8c7cf).
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119423486 Merged build triggered.
[GitHub] spark pull request: [SPARK-6797][SPARKR] Add support for YARN clus...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6743#issuecomment-119423495 Merged build started.
[GitHub] spark pull request: [SPARK-8457][ML]NGram Documentation
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/7244#issuecomment-119422607 Could you please put links for the API docs under the Java and Python codetabs?
[GitHub] spark pull request: [SPARK-8457][ML]NGram Documentation
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7244#discussion_r34114829 --- Diff: docs/ml-features.md --- @@ -288,6 +288,88 @@ for words_label in wordsDataFrame.select("words", "label").take(3): +## $n$-gram + +An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ tokens (typically words) for some integer $n$. The `NGram` class can be used to transform input features into $n$-grams. + +`NGram` takes as input a sequence of strings (e.g. the output of a [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer)). The parameter `n` is used to determine the number of terms in each $n$-gram. The output will consist of a sequence of $n$-grams where each $n$-gram is represented by a space-delimited string of $n$ consecutive words. If the input sequence contains fewer than `n` strings, no output is produced. + + + + + + + +[`NGram`](api/scala/index.html#org.apache.spark.ml.feature.NGram) takes an input column name, an output column name, and an optional length parameter n (n=2 by default). + +{% highlight scala %} +import org.apache.spark.ml.feature.NGram + +val wordDataFrame = sqlContext.createDataFrame(Seq( + (0, Array("Hi", "I", "heard", "about", "Spark")), + (1, Array("I", "wish", "Java", "could", "use", "case", "classes")), + (2, Array("Logistic", "regression", "models", "are", "neat")) +)).toDF("label", "words") + +val ngram = new NGram().setInputCol("words").setOutputCol("ngrams") +val ngramDataFrame = ngram.transform(wordDataFrame) +ngramDataFrame.select("ngrams", "label").take(3).foreach(println) --- End diff -- If this looks weird, could you please modify it to be prettier?
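The n-gram behaviour described in the documentation under review can be illustrated without Spark. The following is a minimal plain-Python sketch, not the `NGram` transformer itself; the function name `ngrams` is hypothetical, and the default `n=2` mirrors the default stated in the diff.

```python
def ngrams(tokens, n=2):
    """Return space-delimited strings of n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when len(tokens) < n, so short inputs produce no output,
    # matching the behaviour described in the documentation text.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = ["Hi", "I", "heard", "about", "Spark"]
print(ngrams(words))       # ['Hi I', 'I heard', 'heard about', 'about Spark']
print(ngrams(words, n=6))  # []
```

A five-token input yields four bigrams, and asking for 6-grams from the same input yields an empty list.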
[GitHub] spark pull request: [SPARK-8457][ML]NGram Documentation
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7244#discussion_r34114827 --- Diff: docs/ml-features.md --- @@ -288,6 +288,88 @@ for words_label in wordsDataFrame.select("words", "label").take(3): +## $n$-gram + +An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ tokens (typically words) for some integer $n$. The `NGram` class can be used to transform input features into $n$-grams. + +`NGram` takes as input a sequence of strings (e.g. the output of a [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer)). The parameter `n` is used to determine the number of terms in each $n$-gram. The output will consist of a sequence of $n$-grams where each $n$-gram is represented by a space-delimited string of $n$ consecutive words. If the input sequence contains fewer than `n` strings, no output is produced. --- End diff -- When I wrote > Could link to this page's section instead (once API doc links are moved to codetabs section). I meant that you could change the Tokenizer link to ```(ml-features.html#tokenizer)```. (That was ambiguous.)
[GitHub] spark pull request: [SPARK-8233][SQL] misc function: hash
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/6971#issuecomment-119416710 Jenkins, ok to test.
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/7226#discussion_r34114401 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala --- @@ -301,6 +302,16 @@ object CatalystTypeConverters { DateTimeUtils.toJavaTimestamp(row.getLong(column)) } + private object IntervalConverter extends CatalystTypeConverter[Interval, Interval, Array[Byte]] { +override def toCatalystImpl(scalaValue: Interval): Array[Byte] = --- End diff -- @adrian-wang also suggested using a single long to encode -- we should consider and decide on that as well.
[GitHub] spark pull request: [SPARK-8753][SQL] Create an IntervalType data ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/7226#discussion_r34114392 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala --- @@ -301,6 +302,16 @@ object CatalystTypeConverters { DateTimeUtils.toJavaTimestamp(row.getLong(column)) } + private object IntervalConverter extends CatalystTypeConverter[Interval, Interval, Array[Byte]] { +override def toCatalystImpl(scalaValue: Interval): Array[Byte] = --- End diff -- Yes that's what I meant - just use an object with int + long.
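To make the two representations under discussion concrete, here is a small Python sketch contrasting an object holding an int plus a long with a byte-array encoding like the `Array[Byte]` in the diff. The thread does not say what the two fields mean; the field names `months` and `microseconds`, the big-endian layout, and the helper names are assumptions for illustration only and are not taken from the PR.

```python
import struct
from dataclasses import dataclass

# Assumed layout: big-endian 4-byte int followed by 8-byte long, 12 bytes total.
LAYOUT = struct.Struct(">iq")


@dataclass(frozen=True)
class Interval:
    months: int        # hypothetical meaning of the "int" part
    microseconds: int  # hypothetical meaning of the "long" part


def to_bytes(interval):
    """Encode the interval as a fixed-width byte string."""
    return LAYOUT.pack(interval.months, interval.microseconds)


def from_bytes(data):
    """Decode a byte string produced by to_bytes back into an Interval."""
    return Interval(*LAYOUT.unpack(data))


iv = Interval(months=14, microseconds=3_600_000_000)
assert from_bytes(to_bytes(iv)) == iv
```

Keeping the value as a two-field object avoids the pack and unpack step on every access, which is one plausible reading of why an object with int + long is preferred over a byte array here.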
[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...
Github user tianyi commented on the pull request: https://github.com/apache/spark/pull/7239#issuecomment-119416313 We already do the checking work when a session opened or an execution started. The size of these two lists should NOT increase in any other phase.
[GitHub] spark pull request: [SPARK-8840][SparkR] Add float coercion on Spa...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/7280#issuecomment-119415856 @shivaram I will check it later. As for the test case, I don't see one for the other types. Where do you suggest I add it? Directly in deserialize.R?
[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...
Github user tianyi commented on a diff in the pull request: https://github.com/apache/spark/pull/7239#discussion_r34114144 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala --- @@ -179,6 +179,7 @@ object HiveThriftServer2 extends Logging { def onSessionClosed(sessionId: String): Unit = { sessionList(sessionId).finishTimestamp = System.currentTimeMillis onlineSessionNum -= 1 + trimSessionIfNecessary() --- End diff -- I think it is not necessary here.
[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...
Github user tianyi commented on a diff in the pull request: https://github.com/apache/spark/pull/7239#discussion_r34114132 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala --- @@ -206,18 +208,20 @@ object HiveThriftServer2 extends Logging { executionList(id).detail = errorMessage executionList(id).state = ExecutionState.FAILED totalRunning -= 1 + trimExecutionIfNecessary() } def onStatementFinish(id: String): Unit = { executionList(id).finishTimestamp = System.currentTimeMillis executionList(id).state = ExecutionState.FINISHED totalRunning -= 1 + trimExecutionIfNecessary() --- End diff -- I think it is not necessary here.
[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...
Github user tianyi commented on a diff in the pull request: https://github.com/apache/spark/pull/7239#discussion_r34114126 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala --- @@ -206,18 +208,20 @@ object HiveThriftServer2 extends Logging { executionList(id).detail = errorMessage executionList(id).state = ExecutionState.FAILED totalRunning -= 1 + trimExecutionIfNecessary() --- End diff -- I think it is not necessary here.
[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...
Github user tianyi commented on a diff in the pull request: https://github.com/apache/spark/pull/7239#discussion_r34114118 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala --- @@ -199,6 +200,7 @@ object HiveThriftServer2 extends Logging { def onStatementParsed(id: String, executionPlan: String): Unit = { executionList(id).executePlan = executionPlan executionList(id).state = ExecutionState.COMPILED + trimExecutionIfNecessary() --- End diff -- I think it is not necessary here.
[GitHub] spark pull request: [SPARK-8839][SQL]ThriftServer2 will remove ses...
Github user tianyi commented on the pull request: https://github.com/apache/spark/pull/7239#issuecomment-119414302 I think this PR should only fix the mistake of removing alive sessions or executions. If the number of alive objects is more than 200, you should set `spark.sql.thriftserver.ui.retainedStatements` and `spark.sql.thriftserver.ui.retainedSessions` to a bigger value. You should also remember to give more memory to the driver side.
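The policy being argued for in this thread (trim only finished entries, never alive ones, once a retained limit is exceeded) can be sketched in a few lines of plain Python. This is not the code from the PR; the function name `trim_finished`, the dictionary layout, and the use of `None` to mark an alive entry are all hypothetical.

```python
def trim_finished(entries, retained):
    """Drop the oldest finished entries until at most `retained` remain.

    `entries` maps an id to its finish timestamp, or to None while the
    session or execution is still alive. Dict insertion order is taken as
    age order. Alive entries are never removed, so the result can still
    exceed `retained` when too many entries are alive.
    """
    excess = len(entries) - retained
    if excess <= 0:
        return entries
    finished_ids = [key for key, finished in entries.items() if finished is not None]
    for key in finished_ids[:excess]:
        del entries[key]
    return entries


sessions = {"s1": 100, "s2": None, "s3": 200, "s4": None}
print(trim_finished(sessions, retained=2))  # {'s2': None, 's4': None}
```

When more entries are alive than the limit allows, the sketch keeps them all, which corresponds to the advice above to raise the retained limits and give the driver more memory.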
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34113909 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. + * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( +val sequences: RDD[Array[Int]], --- End diff -- OK, it will be fixed soon, in a follow-up PR before 1.5.
[GitHub] spark pull request: [SPARK-8389][Streaming][PySpark] Expose KafkaR...
Github user amit-ramesh commented on the pull request: https://github.com/apache/spark/pull/7185#issuecomment-119412470 @jerryshao Although the title of SPARK-8337 is worded more broadly, all we really wanted is a way to attach Kafka offsets to individual events/messages :). AFAICT, the implementation of transform appears to be doing executor side operations. And if the following two assumptions hold then we should be able to achieve what we need: 1. Events in an RDD partition are ordered by Kafka offset 2. The index of an OffsetRanges object in the getOffsets() list corresponds to the partition index in the RDD. If this is not indeed so, it would be great if you could point me to some lines that use driver side operations in transform.
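If the two assumptions listed in this comment hold, attaching an offset to each event reduces to simple arithmetic on the matching offset range. The sketch below shows that arithmetic in plain Python. The `OffsetRange` tuple and `attach_offsets` function are hypothetical stand-ins, not the PySpark API from the PR, and the sketch additionally assumes that offsets within a range are contiguous.

```python
from collections import namedtuple

# Hypothetical stand-in for one entry of the offset list mentioned above.
OffsetRange = namedtuple("OffsetRange", ["topic", "partition", "from_offset", "until_offset"])


def attach_offsets(partition_index, events, offset_ranges):
    """Pair each event of one RDD partition with its Kafka offset.

    Relies on assumption 2 (list index equals RDD partition index) to pick
    the range, and on assumption 1 (events ordered by offset) plus contiguous
    offsets to compute each event's offset from its position.
    """
    rng = offset_ranges[partition_index]
    events = list(events)
    if len(events) != rng.until_offset - rng.from_offset:
        raise ValueError("event count does not match the offset range")
    return [(rng.from_offset + i, event) for i, event in enumerate(events)]


ranges = [OffsetRange("clicks", 0, 40, 43), OffsetRange("clicks", 1, 7, 9)]
print(attach_offsets(1, ["x", "y"], ranges))  # [(7, 'x'), (8, 'y')]
```

In a Spark job a function of this shape could be applied per partition, for example through `RDD.mapPartitionsWithIndex`, which passes the partition index and an iterator of its elements.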
[GitHub] spark pull request: [SPARK-6487][MLlib] Add sequential pattern min...
Github user zhangjiajin commented on a diff in the pull request: https://github.com/apache/spark/pull/7258#discussion_r34113588 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/Prefixspan.scala --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.fpm + +import org.apache.spark.rdd.RDD + +/** + * + * A parallel PrefixSpan algorithm to mine sequential pattern. + * The PrefixSpan algorithm is described in + * [[http://web.engr.illinois.edu/~hanj/pdf/span01.pdf]]. + * + * @param sequences original sequences data + * @param minSupport the minimal support level of the sequential pattern, any pattern appears + * more than minSupport times will be output + * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears + * less than maxPatternLength will be output + * + * @see [[https://en.wikipedia.org/wiki/Sequential_Pattern_Mining Sequential Pattern Mining + * (Wikipedia)]] + */ +class Prefixspan( --- End diff -- Fixed: changed Prefixspan to PrefixSpan.
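For readers unfamiliar with the algorithm named in the Scaladoc, the following is a minimal single-machine sketch of PrefixSpan over sequences of integers, written in plain Python. It is not the parallel implementation from the PR. The function name `prefix_span` is hypothetical, and two readings of the Scaladoc are assumed: a pattern is kept when it occurs in at least `min_support` sequences, and patterns are at most `max_pattern_length` items long.

```python
def prefix_span(sequences, min_support, max_pattern_length):
    """Return {pattern tuple: number of supporting sequences}."""
    results = {}

    def grow(prefix, projected):
        # `projected` lists (sequence index, start of the remaining suffix)
        # for every sequence that contains `prefix` as a subsequence.
        if len(prefix) >= max_pattern_length:
            return
        first_hit = {}
        for seq_idx, start in projected:
            seen = set()
            seq = sequences[seq_idx]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    # Count each sequence once per item, at its first occurrence.
                    seen.add(item)
                    first_hit.setdefault(item, []).append((seq_idx, pos + 1))
        for item, hits in sorted(first_hit.items()):
            if len(hits) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(hits)
                grow(pattern, hits)

    grow((), [(i, 0) for i in range(len(sequences))])
    return results


data = [[1, 2, 3], [1, 3], [2, 3]]
print(prefix_span(data, min_support=2, max_pattern_length=2))
# {(1,): 2, (1, 3): 2, (2,): 2, (2, 3): 2, (3,): 3}
```

Each recursive call only scans the suffixes that follow the current prefix, which is the projection idea that gives the algorithm its name.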