[GitHub] spark pull request: [SQL] SPARK-6548: Adding stddev to DataFrame f...
Github user dreamquster commented on the pull request: https://github.com/apache/spark/pull/5357#issuecomment-95611837 @yhuai, is this comment OK? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5660#discussion_r28974357

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---

```
@@ -61,39 +62,40 @@ private[hive] case class HiveFunctionWrapper(var functionClassName: String)
   // for Serialization
   def this() = this(null)
+
+  import java.io.{OutputStream, InputStream}
+
+  import com.esotericsoftware.kryo.io.Input
+  import com.esotericsoftware.kryo.io.Output
+  import org.apache.spark.util.Utils._

   @transient
-  private val methodDeSerialize = {
-    val method = classOf[Utilities].getDeclaredMethod(
-      deserializeObjectByKryo,
-      classOf[Kryo],
-      classOf[java.io.InputStream],
-      classOf[Class[_]])
-    method.setAccessible(true)
-
-    method
+  private def deserializeObjectByKryo[T: ClassTag](kryo: Kryo,
+                                                   in: InputStream,
+                                                   clazz: Class[_]): T = {
```

--- End diff --

Don't indent method arguments in this style. Spark uses the following style:

```
def methodName(
    arg1: Type1,
    arg2: Type2): ReturnType = {
  // ...
}
```
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5660#discussion_r28974373

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---

```
@@ -61,39 +62,40 @@ private[hive] case class HiveFunctionWrapper(var functionClassName: String)
   // for Serialization
   def this() = this(null)
+
+  import java.io.{OutputStream, InputStream}
+
+  import com.esotericsoftware.kryo.io.Input
+  import com.esotericsoftware.kryo.io.Output
+  import org.apache.spark.util.Utils._

   @transient
-  private val methodDeSerialize = {
-    val method = classOf[Utilities].getDeclaredMethod(
-      deserializeObjectByKryo,
-      classOf[Kryo],
-      classOf[java.io.InputStream],
-      classOf[Class[_]])
-    method.setAccessible(true)
-
-    method
+  private def deserializeObjectByKryo[T: ClassTag](kryo: Kryo,
+                                                   in: InputStream,
+                                                   clazz: Class[_]): T = {
+    val inp = new Input(in)
+    val t: T = kryo.readObject(inp, clazz).asInstanceOf[T]
+    inp.close()
+    t
   }

   @transient
-  private val methodSerialize = {
-    val method = classOf[Utilities].getDeclaredMethod(
-      serializeObjectByKryo,
-      classOf[Kryo],
-      classOf[Object],
-      classOf[java.io.OutputStream])
-    method.setAccessible(true)
-
-    method
+  private def serializeObjectByKryo(kryo: Kryo,
+      plan: Object,
+      out: OutputStream ) {
```

--- End diff --

Same as above.
[GitHub] spark pull request: [SPARK-7093][SQL] Using newPredicate in Nested...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5665#issuecomment-95633990

[Test build #30844 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30844/consoleFull) for PR 5665 at commit [`a887c02`](https://github.com/apache/spark/commit/a887c02bc16c3d8527e3108090706209717cbe62).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7093][SQL] Using newPredicate in Nested...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5665#issuecomment-95634009 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30844/
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5526#discussion_r28977394

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---

```
@@ -197,3 +233,69 @@ trait InsertableRelation {
 trait CatalystScan {
   def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
 }
+
+/**
+ * ::Experimental::
+ * [[OutputWriter]] is used together with [[FSBasedRelation]] for persisting rows to the
+ * underlying file system. An [[OutputWriter]] instance is created when a new output file is
+ * opened. This instance is used to persist rows to this single output file.
+ */
+@Experimental
+trait OutputWriter {
+  /**
+   * Persists a single row. Invoked on the executor side.
+   */
+  def write(row: Row): Unit
+
+  /**
+   * Closes the [[OutputWriter]]. Invoked on the executor side after all rows are persisted,
+   * before the task output is committed.
+   */
+  def close(): Unit
+}
+
+/**
+ * ::Experimental::
+ * A [[BaseRelation]] that abstracts file system based data sources.
+ *
+ * For the read path, similar to [[PrunedFilteredScan]], it can eliminate unneeded columns and
+ * filter using selected predicates before producing an RDD containing all matching tuples as
+ * [[Row]] objects.
+ *
+ * In addition, when reading from Hive style partitioned tables stored in file systems, it's
+ * able to discover partitioning information from the paths of input directories, and perform
+ * partition pruning before start reading the data.
+ *
+ * For the write path, it provides the ability to write to both non-partitioned and partitioned
+ * tables. Directory layout of the partitioned tables is compatible with Hive.
+ */
+@Experimental
+trait FSBasedRelation extends BaseRelation {
+  /**
+   * Builds an `RDD[Row]` containing all rows within this relation.
+   *
+   * @param requiredColumns Required columns.
+   * @param filters Candidate filters to be pushed down. The actual filter should be the
+   *        conjunction of all `filters`. The pushed down filters are currently purely an
+   *        optimization as they will all be evaluated again. This means it is safe to use
+   *        them with methods that produce false positives such as filtering partitions based
+   *        on a bloom filter.
+   * @param inputPaths Data files to be read. If the underlying relation is partitioned, only
+   *        data files within required partition directories are included.
+   */
+  def buildScan(
```

--- End diff --

This issue has been fixed. Three variants are provided. Developers can override any one of them to implement the read path.
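The three `buildScan` variants mentioned in the comment above could be sketched as follows. This is a hypothetical reconstruction, not the committed API: the names, signatures, and delegation order are assumptions. The one detail grounded in the docs quoted above is that pushed-down `filters` are purely an optimization (they are all re-evaluated afterwards), so a default implementation may safely drop them.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Hypothetical sketch: each narrower variant has a default body delegating to
// a broader one, so a data source only overrides whichever is convenient.
trait FSBasedRelationScans {
  // Most general form: read everything in the given paths.
  def buildScan(inputPaths: Array[String]): RDD[Row]

  // Column pruning: by default, fall back to reading all columns.
  def buildScan(requiredColumns: Array[String], inputPaths: Array[String]): RDD[Row] =
    buildScan(inputPaths)

  // Column pruning plus filter push-down: dropping `filters` is safe because
  // they are re-evaluated after the scan anyway.
  def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter],
      inputPaths: Array[String]): RDD[Row] =
    buildScan(requiredColumns, inputPaths)
}
```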
[GitHub] spark pull request: [SPARK-7092] Update spark scala version to 2.1...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5662#issuecomment-95613732

[Test build #30838 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30838/consoleFull) for PR 5662 at commit [`58cf4f9`](https://github.com/apache/spark/commit/58cf4f96f62d290cc915c47bef08bd8666f2e73c).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95614464 [Test build #30846 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30846/consoleFull) for PR 5526 at commit [`4e93e9b`](https://github.com/apache/spark/commit/4e93e9b04790b24d0df57d8c62c9447abea0f74f).
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95614524

[Test build #30846 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30846/consoleFull) for PR 5526 at commit [`4e93e9b`](https://github.com/apache/spark/commit/4e93e9b04790b24d0df57d8c62c9447abea0f74f).

* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait FSBasedRelationProvider`
  * `abstract class OutputWriter`
  * `abstract class FSBasedRelation extends BaseRelation`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95629047 [Test build #30849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30849/consoleFull) for PR 5660 at commit [`0b522a7`](https://github.com/apache/spark/commit/0b522a77ffe83889e7c6afaa963dd49e5675fe36).
[GitHub] spark pull request: [SPARK-7092] Update spark scala version to 2.1...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5662#issuecomment-95613792 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30838/
[GitHub] spark pull request: [Spark-7090][MLlib] Introduce LDAOptimizer to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5661#issuecomment-95622405 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30843/
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95625891 ok to test
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95626620 [Test build #30848 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30848/consoleFull) for PR 5660 at commit [`0b522a7`](https://github.com/apache/spark/commit/0b522a77ffe83889e7c6afaa963dd49e5675fe36).
[GitHub] spark pull request: [SPARK-7093][SQL] Using newPredicate in Nested...
GitHub user scwf opened a pull request: https://github.com/apache/spark/pull/5665

[SPARK-7093][SQL] Using newPredicate in NestedLoopJoin to enable code generation

Using newPredicate in NestedLoopJoin instead of InterpretedPredicate to enable code generation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/scwf/spark NLP

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5665.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5665

commit a887c02bc16c3d8527e3108090706209717cbe62
Author: scwf wangf...@huawei.com
Date: 2015-04-23T13:55:26Z

    improve for NLP boundCondition
[GitHub] spark pull request: [Spark-7090][MLlib] Introduce LDAOptimizer to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5661#issuecomment-95622391

[Test build #30843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30843/consoleFull) for PR 5661 at commit [`e756ce4`](https://github.com/apache/spark/commit/e756ce4c351a67e92afc0faef42b314c8ab8a31d).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait LDAOptimizer`
  * `class EMLDAOptimizer extends LDAOptimizer`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95627460

Would like to add some background about this change: Spark SQL repackages Hive jars to remove unnecessary shading and dependencies. One of them is Kryo. Hive shades Kryo into the package `org.apache.hive.com.esotericsoftware`. Normally this should be fine. However, MapR replaces Spark SQL's repackaged Hive dependencies with genuine Hive packages, so the reflection call mentioned in this PR can't find Kryo because it has moved into another package. On the other hand, the two reflected methods are actually quite simple. That's why we chose to reimplement them in Spark SQL instead of invoking them via reflection.
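The gist of the change can be sketched like this (a paraphrase of the diffs quoted elsewhere in this thread, written as members of the shim class; the body of `serializeObjectByKryo` is an assumption, since only its signature appears in the quoted diff). Instead of reflectively invoking Hive's private `Utilities` helpers, which resolve Kryo under Hive's shaded package name, the shim calls the unshaded Kryo API directly:

```scala
import java.io.{InputStream, OutputStream}
import scala.reflect.ClassTag
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

// Direct reimplementation of the reflected helper: no getDeclaredMethod /
// setAccessible, so no dependency on Hive's shaded package layout.
private def deserializeObjectByKryo[T: ClassTag](
    kryo: Kryo,
    in: InputStream,
    clazz: Class[_]): T = {
  val input = new Input(in)
  val t = kryo.readObject(input, clazz).asInstanceOf[T]
  input.close()
  t
}

// Body assumed (mirror image of deserialization); only the signature is
// shown in the diff quoted in this thread.
private def serializeObjectByKryo(
    kryo: Kryo,
    plan: Object,
    out: OutputStream): Unit = {
  val output = new Output(out)
  kryo.writeObject(output, plan)
  output.close()
}
```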
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5660#discussion_r28974257

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala ---

```
@@ -61,39 +62,40 @@ private[hive] case class HiveFunctionWrapper(var functionClassName: String)
   // for Serialization
   def this() = this(null)
+
+  import java.io.{OutputStream, InputStream}
+
+  import com.esotericsoftware.kryo.io.Input
+  import com.esotericsoftware.kryo.io.Output
```

--- End diff --

Move imports to the header of the file.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95636852

@marmbrus @yhuai @rxin Previous comments are addressed. Also added tests (`ignore`d for now) in `FSBasedRelationSuite`. Going to implement all the interfaces.
[GitHub] spark pull request: [SPARK-7093][SQL] Using newPredicate in Nested...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5665#issuecomment-95603340 [Test build #30844 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30844/consoleFull) for PR 5665 at commit [`a887c02`](https://github.com/apache/spark/commit/a887c02bc16c3d8527e3108090706209717cbe62).
[GitHub] spark pull request: [SPARK-6924][YARN] Fix driver hangs in yarn-cl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5663#issuecomment-95603329

[Test build #30840 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30840/consoleFull) for PR 5663 at commit [`5a28319`](https://github.com/apache/spark/commit/5a283199952b70c4d007e9da60d80fd96fb9c2a6).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6924][YARN] Fix driver hangs in yarn-cl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5663#issuecomment-95603345 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30840/
[GitHub] spark pull request: [SPARK-7044] [SQL] Fix the deadlock in script ...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5625#discussion_r28970625

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala ---

```
@@ -145,20 +147,27 @@ case class ScriptTransformation(
     val dataOutputStream = new DataOutputStream(outputStream)
     val outputProjection = new InterpretedProjection(input, child.output)

-    iter
-      .map(outputProjection)
-      .foreach { row =>
+    // Put the write(output to the pipeline) into a single thread
+    // and keep the collector as remain in the main thread.
+    // otherwise it will causes deadlock if the data size greater than
+    // the pipeline / buffer capacity.
+    future {
```

--- End diff --

Thank you @rxin, I think you are right, updated!
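The deadlock the patch above avoids is easy to reproduce with a bounded pipe: a single thread that must both fill the pipe and drain its other end blocks as soon as the pipe's buffer is full. The fix in the patch, producing on a separate thread (`future { ... }` in the Scala 2.10-era API) while the main thread consumes, can be sketched in isolation like this. This is a minimal illustration with `java.io` piped streams, not Spark code:

```scala
import java.io.{PipedInputStream, PipedOutputStream}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// A tiny pipe whose buffer is far smaller than the data pushed through it.
val out = new PipedOutputStream()
val in  = new PipedInputStream(out, 1024)

// Producer on a separate thread: writing 100k bytes into a 1 KB pipe from
// the main thread, before any reads happened, would block forever.
Future {
  try (1 to 100000).foreach(i => out.write(i & 0xff))
  finally out.close()
}

// Consumer stays on the main thread and drains the pipe.
var count = 0
while (in.read() != -1) count += 1
in.close()
// count == 100000
```

The same structure appears in the patch: the script's stdin is the bounded pipe, the child process's stdout is the other end, and the collector must keep reading while a background thread feeds rows in.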
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95628037 add to whitelist
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95646404 [Test build #30851 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30851/consoleFull) for PR 5526 at commit [`3b22c32`](https://github.com/apache/spark/commit/3b22c3291cd920eb02a345dace2556dc23d57efb).
[GitHub] spark pull request: [SPARK-6734] [SQL] Add UDTF.close support in G...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5383#issuecomment-95610869 [Test build #30845 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30845/consoleFull) for PR 5383 at commit [`8953be3`](https://github.com/apache/spark/commit/8953be3eda4b120966f6cc7fb1e02f48c632a90f).
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5526#discussion_r28975672

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---

```
@@ -197,3 +233,69 @@ trait InsertableRelation {
 trait CatalystScan {
   def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
 }
+
+/**
+ * ::Experimental::
+ * [[OutputWriter]] is used together with [[FSBasedRelation]] for persisting rows to the
+ * underlying file system. An [[OutputWriter]] instance is created when a new output file is
+ * opened. This instance is used to persist rows to this single output file.
+ */
+@Experimental
+trait OutputWriter {
+  /**
+   * Persists a single row. Invoked on the executor side.
+   */
+  def write(row: Row): Unit
+
+  /**
+   * Closes the [[OutputWriter]]. Invoked on the executor side after all rows are persisted,
+   * before the task output is committed.
+   */
+  def close(): Unit
+}
+
+/**
+ * ::Experimental::
+ * A [[BaseRelation]] that abstracts file system based data sources.
+ *
+ * For the read path, similar to [[PrunedFilteredScan]], it can eliminate unneeded columns and
+ * filter using selected predicates before producing an RDD containing all matching tuples as
+ * [[Row]] objects.
+ *
+ * In addition, when reading from Hive style partitioned tables stored in file systems, it's
+ * able to discover partitioning information from the paths of input directories, and perform
+ * partition pruning before start reading the data.
+ *
+ * For the write path, it provides the ability to write to both non-partitioned and partitioned
+ * tables. Directory layout of the partitioned tables is compatible with Hive.
+ */
+@Experimental
+trait FSBasedRelation extends BaseRelation {
+  /**
+   * Builds an `RDD[Row]` containing all rows within this relation.
+   *
+   * @param requiredColumns Required columns.
+   * @param filters Candidate filters to be pushed down. The actual filter should be the
+   *        conjunction of all `filters`. The pushed down filters are currently purely an
+   *        optimization as they will all be evaluated again. This means it is safe to use
+   *        them with methods that produce false positives such as filtering partitions based
+   *        on a bloom filter.
+   * @param inputPaths Data files to be read. If the underlying relation is partitioned, only
+   *        data files within required partition directories are included.
+   */
+  def buildScan(
+      requiredColumns: Array[String],
+      filters: Array[Filter],
+      inputPaths: Array[String]): RDD[Row]
+
+  /**
+   * When writing rows to this relation, this method is invoked on the driver side before the
+   * actual write job is issued. It provides an opportunity to configure the write job to be
+   * performed.
+   */
+  def prepareForWrite(conf: Configuration): Unit
+
+  /**
+   * This method is responsible for producing a new [[OutputWriter]] for each newly opened
+   * output file on the executor side.
+   */
+  def newOutputWriter(path: String): OutputWriter
```

--- End diff --

One issue here is about passing the driver side Hadoop configuration to OutputWriters on the executor side. Users may set properties on the Hadoop configuration on the driver side (e.g. `mapreduce.fileoutputcommitter.marksuccessfuljobs`), and we should inherit these settings on the executor side when writing data. A zero-arg constructor plus `init(...)` is a good way to avoid forcing `BaseRelation` to be serializable, but I guess we have to put `Configuration` as an argument of `OutputWriter.init(...)`. This makes the data sources API coupled with the Hadoop API via `Configuration`, but I guess this should be more acceptable compared to forcing `BaseRelation` subclasses to be serializable?
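A hypothetical sketch of the shape being discussed, zero-arg construction on executors plus an `init(...)` carrying the driver-side `Configuration`, might look like this. The method name and parameters are illustrative assumptions, not the committed API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.Row

// OutputWriter implementations would need a zero-arg constructor so that
// executors can instantiate them by class name without serializing the
// BaseRelation that created them.
abstract class OutputWriter {
  // Hypothetical hook: called once per writer on the executor side, receiving
  // the output file path and the re-materialized driver-side Hadoop conf, so
  // driver settings (e.g. committer options) are honored during the write.
  def init(path: String, conf: Configuration): Unit = ()

  def write(row: Row): Unit
  def close(): Unit
}
```

The trade-off named in the comment is visible here: `Configuration` appears in the data sources API surface, but only `OutputWriter` (not `BaseRelation`) has to cross the driver/executor boundary.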
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95640536

[Test build #30850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30850/consoleFull) for PR 5526 at commit [`b63f813`](https://github.com/apache/spark/commit/b63f81375e3e3cdea884c2e7c3f1925294c61c21).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait FSBasedRelationProvider`
  * `abstract class OutputWriter`
  * `abstract class FSBasedRelation extends BaseRelation`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95640578 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30850/ Test FAILed.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95661012

[Test build #30849 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30849/consoleFull) for PR 5660 at commit [`0b522a7`](https://github.com/apache/spark/commit/0b522a77ffe83889e7c6afaa963dd49e5675fe36).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6856] [R] Make RDD information more use...
GitHub user His-name-is-Joof opened a pull request: https://github.com/apache/spark/pull/5667 [SPARK-6856] [R] Make RDD information more useful in SparkR

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/His-name-is-Joof/spark joofspark

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5667.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5667

commit 123be654a155140f7b8d78203daf546d68cef2e8
Author: Jeff Harrison jeffrharri...@gmail.com
Date: 2015-04-23T17:08:10Z

SPARK-6856
[GitHub] spark pull request: [SPARK-7084] improve saveAsTable documentation
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5654#issuecomment-95666418 [Test build #30856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30856/consoleFull) for PR 5654 at commit [`00bc819`](https://github.com/apache/spark/commit/00bc819ba948dd29d78251c7d59089ce1116bc2e).
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95676314

[Test build #30860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30860/consoleFull) for PR 5668 at commit [`b0beb34`](https://github.com/apache/spark/commit/b0beb34d6a77c738660cb161306c947411d70ab5).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-5932][CORE] Use consistent naming for s...
Github user ilganeli commented on a diff in the pull request: https://github.com/apache/spark/pull/5574#discussion_r28989856

--- Diff: network/common/src/main/java/org/apache/spark/network/util/ByteUnit.java ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.network.util;
+
+public enum ByteUnit {
+  BYTE (1),
+  KiB (1024L),
+  MiB ((long) Math.pow(1024L, 2L)),
+  GiB ((long) Math.pow(1024L, 3L)),
+  TiB ((long) Math.pow(1024L, 4L)),
+  PiB ((long) Math.pow(1024L, 5L));
+
+  private ByteUnit(long multiplier) {
+    this.multiplier = multiplier;
+  }
+
+  // Interpret the provided number (d) with suffix (u) as this unit type.
+  // E.g. KiB.interpret(1, MiB) interprets 1MiB as its KiB representation = 1024k
+  public long interpret(long d, ByteUnit u) {
+    return u.toBytes(d) / multiplier;
+  }
+
+  // Convert the provided number (d) interpreted as this unit type to unit type (u).
+  public long convert(long d, ByteUnit u) {
+    return toBytes(d) / u.multiplier;
--- End diff --

Marcelo - I think this could be readily solved if ```toBytes``` returns a double. The max for a double is 1.79769e+308, which is more than we could conceivably ever need, and it would solve the overflow issue (we'd just need to check that the resulting number is less than Long.MAX_VALUE and throw an exception if it's not).
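The overflow being discussed is easy to reproduce: multiplying a long count by a PiB-scale multiplier silently wraps. Below is a hedged sketch of the double-based fix the comment proposes; `ByteUnitSketch`, `toBytesUnchecked`, and `convertChecked` are illustrative names, not the actual patch.

```java
public class ByteUnitSketch {
    enum ByteUnit {
        BYTE(1L),
        KiB(1024L),
        MiB(1024L * 1024L),
        GiB(1024L * 1024L * 1024L),
        TiB(1024L * 1024L * 1024L * 1024L),
        PiB(1024L * 1024L * 1024L * 1024L * 1024L);

        final long multiplier;
        ByteUnit(long multiplier) { this.multiplier = multiplier; }

        // Long arithmetic: a large count times a PiB multiplier wraps silently.
        long toBytesUnchecked(long d) { return d * multiplier; }

        // Double-based variant from the review comment: a double's range
        // (up to ~1.79769e308) comfortably holds any byte count we care about,
        // so overflow becomes detectable instead of silent.
        long convertChecked(long d, ByteUnit u) {
            double bytes = (double) d * multiplier;
            if (bytes > (double) Long.MAX_VALUE) {
                throw new IllegalArgumentException(
                    "Conversion of " + d + " " + name() + " overflows a long byte count");
            }
            return (long) (bytes / u.multiplier);
        }

        // Same semantics as interpret(...) in the diff above.
        long interpret(long d, ByteUnit u) { return u.toBytesUnchecked(d) / multiplier; }
    }
}
```

For example, `KiB.interpret(1, MiB)` yields 1024, while converting ten million PiB to bytes trips the explicit overflow check instead of returning a wrapped value.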
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95676957 [Test build #30864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30864/consoleFull) for PR 2342 at commit [`974a64a`](https://github.com/apache/spark/commit/974a64a5670c8e8a8078b2e81c915e5808424e14).
[GitHub] spark pull request: [SPARK-5894][ML] Add polynomial mapper
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/5245#issuecomment-95678246 @yinxusen I'm not sure whether it is faster or not. That's why I put the new approach side by side. Please help test the performance. Thanks!
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95679048 [Test build #30868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30868/consoleFull) for PR 5668 at commit [`d25bc2a`](https://github.com/apache/spark/commit/d25bc2ab87163cda40f75ae4d7579a848d42fe58).
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95679035 [Test build #30867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30867/consoleFull) for PR 2342 at commit [`8110acf`](https://github.com/apache/spark/commit/8110acf82426a5da3e7925f9f27cb7042c817746).
[GitHub] spark pull request: [SPARK-6856] [R] Make RDD information more use...
Github user His-name-is-Joof commented on the pull request: https://github.com/apache/spark/pull/5667#issuecomment-95679966 How's that? Very new to contributing to large projects in general, so criticism welcome. Excellent bugtracker and starter bugs!
[GitHub] spark pull request: [SPARK-7056] Make the Write Ahead Log pluggabl...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/5645#issuecomment-95678755 @jerryshao @hshreedharan Can you please take a look.
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95680982 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30867/ Test FAILed.
[GitHub] spark pull request: [SPARK-6818][SPARKR] Support column deletion i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5655#issuecomment-95680969

[Test build #30854 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30854/consoleFull) for PR 5655 at commit [`7c66570`](https://github.com/apache/spark/commit/7c66570897da1bc048aa1a3abb95785e7216c302).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95680971

[Test build #30867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30867/consoleFull) for PR 2342 at commit [`8110acf`](https://github.com/apache/spark/commit/8110acf82426a5da3e7925f9f27cb7042c817746).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class ExecutorUIData(`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6818][SPARKR] Support column deletion i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5655#issuecomment-95680983 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30854/ Test PASSed.
[GitHub] spark pull request: [SPARK-7033][SPARKR] Clean usage of split. Use...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5628#issuecomment-95651218 Thanks, @sun-rui - doing a grep, could you also update test_rdd.R's line 124?
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95653046 [Test build #30852 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30852/consoleFull) for PR 5666 at commit [`d77990f`](https://github.com/apache/spark/commit/d77990f0b1d51114e32956d8ce08adfa1f6eff30).
[GitHub] spark pull request: [SPARK-6818][SPARKR] Support column deletion i...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/5655#issuecomment-95657442 Jenkins, retest this please
[GitHub] spark pull request: [SPARK-7044] [SQL] Fix the deadlock in script ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5625#issuecomment-95665707 Merging into master and branch-1.3.
[GitHub] spark pull request: [SPARK-7044] [SQL] Fix the deadlock in script ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5625#issuecomment-95666195 Actually this doesn't merge cleanly into 1.3. Do you mind submitting a pull request for that branch? Thanks.
[GitHub] spark pull request: [SPARK-6856] [R] Make RDD information more use...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/5667#issuecomment-95667726 Jenkins, ok to test
[GitHub] spark pull request: [SPARK-7041] Avoid writing empty files in Exte...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/5622#discussion_r28986044

--- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -736,11 +734,16 @@ private[spark] class ExternalSorter[K, V, C](
   val writeStartTime = System.nanoTime
   util.Utils.tryWithSafeFinally {
     for (i <- 0 until numPartitions) {
-      val in = new FileInputStream(partitionWriters(i).fileSegment().file)
-      util.Utils.tryWithSafeFinally {
-        lengths(i) = org.apache.spark.util.Utils.copyStream(in, out, false, transferToEnabled)
-      } {
-        in.close()
+      val file = partitionWriters(i).fileSegment().file
+      if (!file.exists()) {
+        lengths(i) = 0
+      } else {
+        val in = new FileInputStream(file)
+        util.Utils.tryWithSafeFinally {
--- End diff --

Shouldn't we avoid using partial package names like this? Just import spark.util.Utils.
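The guard in the diff above — skip opening a stream when a partition file was never created, recording a zero length instead — can be sketched in plain Java. `PartitionConcat` and `concatPartitions` are illustrative names, not Spark's; the sketch assumes Java 9+ for `InputStream.transferTo`.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;

public class PartitionConcat {
    // Concatenate per-partition spill files into one stream, recording each
    // partition's byte length. A partition that received no records has no
    // file on disk, so we record 0 instead of failing in the FileInputStream
    // constructor with FileNotFoundException.
    static long[] concatPartitions(File[] partitionFiles, OutputStream out) throws IOException {
        long[] lengths = new long[partitionFiles.length];
        for (int i = 0; i < partitionFiles.length; i++) {
            File file = partitionFiles[i];
            if (!file.exists()) {
                lengths[i] = 0L; // empty partition: nothing was ever spilled
            } else {
                try (FileInputStream in = new FileInputStream(file)) {
                    lengths[i] = in.transferTo(out); // copy and count bytes (Java 9+)
                }
            }
        }
        return lengths;
    }
}
```

The try-with-resources block plays the role of `tryWithSafeFinally` here: the input stream is closed even if the copy throws.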
[GitHub] spark pull request: [SPARK-7055][SQL]Use correct ClassLoader for J...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/5633#issuecomment-95671327 Unfortunately we can't run the docker tests on Jenkins and they cause issues with dependencies during the release so we temporarily removed them. I can try running them manually.
[GitHub] spark pull request: [SPARK-2750][WIP]Add Https support for Web UI
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/5664#issuecomment-95671673 This needs to be integrated with the `SSLOptions` configuration added in cfea30037f (#3571), instead of creating its own way of configuring things.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95673967 Hey, we probably do want to do this at some point, but I'm not sure the answer is ByteBuffer. Big changes like this should be proposed in JIRA and discussed before coding begins. Since we are close to the merge deadline for Spark 1.4 it is unlikely this patch will be merged anytime soon.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95674982 Thanks for submitting this. It's our fault to have that JIRA ticket there and suggest nio.ByteBuffer. ByteBuffer is not the right abstraction for this. Its API is pretty clunky (flip, etc.), and it cannot be serialized. We will do something as part of https://issues.apache.org/jira/browse/SPARK-7075
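The "clunky" complaint about `java.nio.ByteBuffer` is concrete: the same buffer must be explicitly flipped between write mode and read mode, and a forgotten `flip()` silently reads the wrong bytes. A small illustration:

```java
import java.nio.ByteBuffer;

public class ByteBufferFlip {
    static int writeThenRead() {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putInt(42);      // writing advances position to 4
        buf.flip();          // limit = 4, position = 0; without this, getInt()
                             // would read the untouched zero bytes at offset 4
        return buf.getInt(); // reads the int back from position 0
    }
}
```

This stateful position/limit dance (plus the fact that `ByteBuffer` is not `Serializable`) is why the comment argues it makes a poor public abstraction for row data.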
[GitHub] spark pull request: SPARK-7063 when lz4 compression is used, it ca...
Github user linlin200605 commented on the pull request: https://github.com/apache/spark/pull/5641#issuecomment-95674803 lz4-1.3.0 needs Java 7 to build the jar, but it does not necessarily require JDK 7 at runtime if there is backward compatibility. I will follow up on that.
[GitHub] spark pull request: [SPARK-7060][SQL] Add alias function to python...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5634#issuecomment-95678712

[Test build #30855 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30855/consoleFull) for PR 5634 at commit [`f157c30`](https://github.com/apache/spark/commit/f157c3096f16f16a57b03eca10bc69cdc90de533).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7060][SQL] Add alias function to python...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5634#issuecomment-95678726 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30855/ Test FAILed.
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2342#discussion_r28990365

--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala ---
@@ -17,20 +17,167 @@
 package org.apache.spark.ui.jobs

-import scala.collection.mutable
-import scala.xml.{NodeSeq, Node}
+import java.util.Date
+
+import scala.collection.mutable.{Buffer, ListBuffer}
+import scala.xml.{NodeSeq, Node, Unparsed}

 import javax.servlet.http.HttpServletRequest

 import org.apache.spark.JobExecutionStatus
 import org.apache.spark.scheduler.StageInfo
 import org.apache.spark.ui.{UIUtils, WebUIPage}
+import org.apache.spark.ui.jobs.UIData.ExecutorUIData

 /** Page showing statistics and stage list for a given job */
 private[ui] class JobPage(parent: JobsTab) extends WebUIPage("job") {
-  private val listener = parent.listener
+  private val STAGES_LEGEND =
+    <div class="legend-area"><svg width="200px" height="85px">
+      <rect x="5px" y="5px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#D5DDF6"></rect>
+      <text x="35px" y="17px">Completed Stage </text>
+      <rect x="5px" y="35px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#FF5475"></rect>
+      <text x="35px" y="47px">Failed Stage</text>
+      <rect x="5px" y="65px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#FDFFCA"></rect>
+      <text x="35px" y="77px">Active Stage</text>
+    </svg></div>.toString.filter(_ != '\n')
+
+  private val EXECUTORS_LEGEND =
+    <div class="legend-area"><svg width="200px" height="55px">
+      <rect x="5px" y="5px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#D5DDF6"></rect>
+      <text x="35px" y="17px">Executor Added</text>
+      <rect x="5px" y="35px" width="20px" height="15px"
+        rx="2px" ry="2px" stroke="#97B0F8" fill="#EBCA59"></rect>
+      <text x="35px" y="47px">Executor Removed</text>
+    </svg></div>.toString.filter(_ != '\n')
+
+  private def makeStageEvent(stageInfos: Seq[StageInfo]): Seq[String] = {
+    stageInfos.map { stage =>
+      val stageId = stage.stageId
+      val attemptId = stage.attemptId
+      val name = stage.name
+      val status = {
--- End diff --

This is very minor, though
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2342#discussion_r28990248 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala --- @@ -17,20 +17,167 @@ package org.apache.spark.ui.jobs -import scala.collection.mutable -import scala.xml.{NodeSeq, Node} +import java.util.Date + +import scala.collection.mutable.{Buffer, ListBuffer} +import scala.xml.{NodeSeq, Node, Unparsed} import javax.servlet.http.HttpServletRequest import org.apache.spark.JobExecutionStatus import org.apache.spark.scheduler.StageInfo import org.apache.spark.ui.{UIUtils, WebUIPage} +import org.apache.spark.ui.jobs.UIData.ExecutorUIData /** Page showing statistics and stage list for a given job */ private[ui] class JobPage(parent: JobsTab) extends WebUIPage(job) { - private val listener = parent.listener + private val STAGES_LEGEND = +div class=legend-areasvg width=200px height=85px + rect x=5px y=5px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#D5DDF6/rect + text x=35px y=17pxCompleted Stage /text + rect x=5px y=35px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#FF5475/rect + text x=35px y=47pxFailed Stage/text + rect x=5px y=65px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#FDFFCA/rect + text x=35px y=77pxActive Stage/text +/svg/div.toString.filter(_ != '\n') + + private val EXECUTORS_LEGEND = +div class=legend-areasvg width=200px height=55px + rect x=5px y=5px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#D5DDF6/rect + text x=35px y=17pxExecutor Added/text + rect x=5px y=35px width=20px height=15px +rx=2px ry=2px stroke=#97B0F8 fill=#EBCA59/rect + text x=35px y=47pxExecutor Removed/text +/svg/div.toString.filter(_ != '\n') + + private def makeStageEvent(stageInfos: Seq[StageInfo]): Seq[String] = { +stageInfos.map { stage = + val stageId = stage.stageId + val attemptId = stage.attemptId + val name = stage.name + val status = { --- End diff -- It might be better to add a 
private[spark] method called `getStatusString` to `StageInfo`.
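As a rough illustration of the suggestion, a status-string helper could look like the following. This is a hypothetical sketch in Java, not the actual Scala `StageInfo` code; the boolean inputs stand in for the stage's completion and failure state.

```java
// Hypothetical sketch of the suggested getStatusString helper; the real method
// would live on StageInfo and inspect its completion/failure fields.
public class StageStatus {
    public static String getStatusString(boolean completed, boolean failed) {
        if (failed) {
            return "failed";     // a failure takes precedence over completion
        } else if (completed) {
            return "succeeded";  // finished without a failure
        } else {
            return "running";    // still active
        }
    }
}
```

Centralizing this mapping in one method keeps the page-rendering code from re-deriving the status string in several places.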
[GitHub] spark pull request: [SPARK-6122][Core] Upgrade tachyon-client vers...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5354#issuecomment-95680725 It seems like the Jackson dep has to be excluded to get SBT + Hadoop 1.0.4 to work. I think that has to stay then, yeah. I think the httpclient stuff can be cleaned up a small bit but that too is essential. I'm getting worried about how much the divergence between SBT and Maven is causing us to hack the build, making it harder to get the build right for both. For example, these changes aren't necessary at all for Maven. It's exacerbated by trying to support Hadoop 1.x. Still, maybe we kick this can down the road a bit longer to get in this change.
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95671115 [Test build #30859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30859/consoleFull) for PR 2342 at commit [`ee7a7f0`](https://github.com/apache/spark/commit/ee7a7f0c9a9618b05b67f50e5698898b343a059d).
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user steveloughran commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-95675032 There's no obvious reason why the Jenkins build failed; the console says all the tests passed.
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95678594 retest please
[GitHub] spark pull request: [SPARK-6856] [R] Make RDD information more use...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5667#issuecomment-95679774 [Test build #30869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30869/consoleFull) for PR 5667 at commit [`c8c0b80`](https://github.com/apache/spark/commit/c8c0b8095088a8845adc7149a69cee051774c689).
[GitHub] spark pull request: [SPARK-5932][CORE] Use consistent naming for s...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/5574#discussion_r28990766

--- Diff: network/common/src/main/java/org/apache/spark/network/util/ByteUnit.java ---

@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.network.util;
+
+public enum ByteUnit {
+  BYTE (1),
+  KiB (1024L),
+  MiB ((long) Math.pow(1024L, 2L)),
+  GiB ((long) Math.pow(1024L, 3L)),
+  TiB ((long) Math.pow(1024L, 4L)),
+  PiB ((long) Math.pow(1024L, 5L));
+
+  private ByteUnit(long multiplier) {
+    this.multiplier = multiplier;
+  }
+
+  // Interpret the provided number (d) with suffix (u) as this unit type.
+  // E.g. KiB.interpret(1, MiB) interprets 1MiB as its KiB representation = 1024k
+  public long interpret(long d, ByteUnit u) {
+    return u.toBytes(d) / multiplier;
+  }
+
+  // Convert the provided number (d) interpreted as this unit type to unit type (u).
+  public long convert(long d, ByteUnit u) {
+    return toBytes(d) / u.multiplier;

--- End diff --

I saw your comment about using double - I don't think that's a great idea because doubles lose precision as you try to work with values at different orders of magnitude.
Regarding the last paragraph of my comment above, I don't think it's going to be an issue in practice; but the code here can be changed to at least avoid overflows where possible. I checked `j.u.c.TimeUnit`, used in the time functions in this class, and it seems to follow the approach you took, in that when an overflow is inevitable it caps the value at `Long.MAX_VALUE`. So that part is fine.
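For illustration, here is a minimal sketch of the `TimeUnit`-style saturating multiplication described above: when scaling a value up by a unit multiplier would overflow a `long`, the result is capped at `Long.MAX_VALUE` (or `Long.MIN_VALUE`) instead of wrapping around. The class and method names are hypothetical, not part of the actual `ByteUnit` API.

```java
// Hypothetical sketch of TimeUnit-style saturating scaling; not the real ByteUnit API.
public class SaturatingScale {
    // Multiply d by a positive unit multiplier m, capping at Long.MAX_VALUE
    // or Long.MIN_VALUE when the exact product would overflow a long.
    public static long scale(long d, long m) {
        long over = Long.MAX_VALUE / m;
        if (d > over) {
            return Long.MAX_VALUE;   // positive overflow: saturate high
        }
        if (d < -over) {
            return Long.MIN_VALUE;   // negative overflow: saturate low
        }
        return d * m;                // no overflow: exact product
    }
}
```

This mirrors how `java.util.concurrent.TimeUnit` behaves when converting huge values to finer units, e.g. `TimeUnit.DAYS.toNanos(Long.MAX_VALUE)` returns `Long.MAX_VALUE` rather than a wrapped negative number.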
[GitHub] spark pull request: [SPARK-7070][MLLIB] LDA.setBeta should call se...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5649#issuecomment-95680421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30853/ Test PASSed.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/5666 [SPARK-5553][SQL] Replace Array[Byte] with Java NIO ByteBuffer as binary type representation JIRA: https://issues.apache.org/jira/browse/SPARK-5553 This PR attempts to replace Array[Byte] with Java NIO ByteBuffer as SQL binary type representation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 bytebuffer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5666.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5666 commit d77990f0b1d51114e32956d8ce08adfa1f6eff30 Author: Liang-Chi Hsieh vii...@gmail.com Date: 2015-04-23T16:47:43Z Replace Array[Byte] with Java NIO ByteBuffer as binary type representation.
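The PR description doesn't spell out the motivation, but one practical difference between the two representations (stated here as general Java background, not as the PR's rationale) is value semantics: Java arrays compare by reference identity, while `ByteBuffer` defines content-based `equals` and `hashCode`, which matters wherever binary values are compared or used as keys.

```java
import java.nio.ByteBuffer;

// Demonstrates the equality difference between byte[] and ByteBuffer:
// arrays inherit Object.equals (identity), buffers compare remaining contents.
public class BinaryEquality {
    public static boolean arrayEquals(byte[] a, byte[] b) {
        return a.equals(b);  // true only when a and b are the same array instance
    }

    public static boolean bufferEquals(byte[] a, byte[] b) {
        // ByteBuffer.equals compares the bytes between position and limit
        return ByteBuffer.wrap(a).equals(ByteBuffer.wrap(b));
    }
}
```

With `byte[]`, content comparison requires an explicit `java.util.Arrays.equals` call at every comparison site; wrapping in `ByteBuffer` pushes that behavior into the type itself.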
[GitHub] spark pull request: [SPARK-6818][SPARKR] Support column deletion i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5655#issuecomment-95658961 [Test build #30854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30854/consoleFull) for PR 5655 at commit [`7c66570`](https://github.com/apache/spark/commit/7c66570897da1bc048aa1a3abb95785e7216c302).
[GitHub] spark pull request: [SPARK-7060][SQL] Add alias function to python...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5634#issuecomment-95661762 [Test build #30855 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30855/consoleFull) for PR 5634 at commit [`f157c30`](https://github.com/apache/spark/commit/f157c3096f16f16a57b03eca10bc69cdc90de533).
[GitHub] spark pull request: [minor][streaming]fixed scala string interpola...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5653#issuecomment-95665334 I've merged this into master and branch-1.3.
[GitHub] spark pull request: [SPARK-7070][MLLIB] LDA.setBeta should call se...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/5649#issuecomment-95667912 Thanks for catching that! LGTM pending tests
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95673533 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30858/ Test FAILed.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95674382 [Test build #30851 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30851/consoleFull) for PR 5526 at commit [`3b22c32`](https://github.com/apache/spark/commit/3b22c3291cd920eb02a345dace2556dc23d57efb). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait FSBasedRelationProvider ` * `abstract class OutputWriter ` * `abstract class FSBasedRelation extends BaseRelation ` * This patch does not change any dependencies.
[GitHub] spark pull request: [SQL] [WIP] Partitioning support for the data ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5526#issuecomment-95674413 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30851/ Test PASSed.
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5668#issuecomment-95676058 [Test build #30860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30860/consoleFull) for PR 5668 at commit [`b0beb34`](https://github.com/apache/spark/commit/b0beb34d6a77c738660cb161306c947411d70ab5).
[GitHub] spark pull request: [SPARK-7067][SQL] fix bug when use complex nes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5659#issuecomment-95676079 [Test build #30861 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30861/consoleFull) for PR 5659 at commit [`ef6039c`](https://github.com/apache/spark/commit/ef6039c85ccfd396aad46797940a5611bc3325b4).
[GitHub] spark pull request: [SPARK-6752][Streaming] Allow StreamingContext...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5428
[GitHub] spark pull request: [SPARK-4705] Handle multiple app attempts even...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5432#issuecomment-95677173 [Test build #30863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30863/consoleFull) for PR 5432 at commit [`d5a9c37`](https://github.com/apache/spark/commit/d5a9c37a00f3b0b5aa66c5f92c325fdf0ac05bf0).
[GitHub] spark pull request: [SPARK-7058] Include RDD deserialization time ...
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/5635#issuecomment-95678486 LGTM!
[GitHub] spark pull request: [SPARK-7070][MLLIB] LDA.setBeta should call se...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5649#issuecomment-95680404 [Test build #30853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30853/consoleFull) for PR 5649 at commit [`c66023c`](https://github.com/apache/spark/commit/c66023cdd1468aa66e7ffcc6b242ccc3ca80ea7c). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95658337 [Test build #30848 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30848/consoleFull) for PR 5660 at commit [`0b522a7`](https://github.com/apache/spark/commit/0b522a77ffe83889e7c6afaa963dd49e5675fe36). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95658377 [Test build #30852 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30852/consoleFull) for PR 5666 at commit [`d77990f`](https://github.com/apache/spark/commit/d77990f0b1d51114e32956d8ce08adfa1f6eff30). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95658408 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30852/ Test FAILed.
[GitHub] spark pull request: [SPARK-6505][SQL]Remove the reflection call in...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5660#issuecomment-95658372 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30848/ Test PASSed.
[GitHub] spark pull request: [SPARK-7044] [SQL] Fix the deadlock in script ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5625
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95668372 [Test build #30858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30858/consoleFull) for PR 5666 at commit [`68c9b00`](https://github.com/apache/spark/commit/68c9b006606e12d0b7e5a27e74e4469581ca45b8).
[GitHub] spark pull request: [SPARK-5553][SQL] Replace Array[Byte] with Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5666#issuecomment-95673520 [Test build #30858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30858/consoleFull) for PR 5666 at commit [`68c9b00`](https://github.com/apache/spark/commit/68c9b006606e12d0b7e5a27e74e4469581ca45b8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987596 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility.
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + /** + * The impurity setting is ignored for GBT models. + * Individual trees are built using impurity Variance. + */ + override def setImpurity(value: String): this.type = { +logWarning("GBTClassifier.setImpurity should NOT be used") +this + } + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = { +logWarning("The 'seed' parameter is currently ignored by Gradient Boosting.") +super.setSeed(value) + } + + // Parameters from GBTParams: + + override def setMaxIter(value: Int): this.type = super.setMaxIter(value) + + override def setLearningRate(value: Double): this.type = super.setLearningRate(value) + + // Parameters for GBTClassifier: + + /** + * Loss function which GBT tries to minimize. (case-insensitive) + * Supported: LogLoss + * (default = LogLoss) + * @group param + */ + val loss: Param[String] = new Param[String](this, "loss", "Loss function which GBT tries to" + + " minimize (case-insensitive). Supported options: LogLoss") + + setDefault(loss -> "logloss") + + /** @group setParam */ + def setLoss(value: String):
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987585 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) --- End diff -- Should it go to shared params? I see the problem with the doc. If we want to put something special, we can put it in the JavaDoc. No strong preference about this. But it makes me wonder whether we should mark shared params final. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987643 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala --- @@ -58,3 +58,43 @@ trait DecisionTreeModel { header + rootNode.subtreeToString(2) } } + +/** + * :: AlphaComponent :: + * + * Abstraction for models which are ensembles of decision trees + * + * TODO: Add support for predicting probabilities and raw predictions + */ +@AlphaComponent +trait TreeEnsembleModel { --- End diff -- Should it be public? Note that adding a method to an interface counts as a breaking change.
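The binary-compatibility concern mengxr raises can be sketched with a toy example (the names below are illustrative, not Spark's actual classes): an abstract method added to a published trait breaks every external implementor, while a method added with a default body keeps old implementations working.

```scala
// Toy sketch of trait evolution. An abstract method added after release would
// break external implementors; a method with a concrete default body does not,
// because implementors simply inherit it.
trait TreeEnsembleModelV1 {
  def trees: Seq[String]            // abstract: implementors must define it
  def numTrees: Int = trees.length  // concrete default: safe to add after release
}

// An "external" implementor written against the original trait still compiles
// after the concrete method was added, since it inherits the default body.
class MyEnsemble extends TreeEnsembleModelV1 {
  def trees: Seq[String] = Seq("t1", "t2", "t3")
}

println(new MyEnsemble().numTrees)  // 3
```

This is one reason APIs sometimes expose abstract classes (where adding a concrete method is safer) rather than traits/interfaces, or keep the type private until it stabilizes.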
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987593 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + /** + * The impurity setting is ignored for GBT models. + * Individual trees are built using impurity Variance. + */ + override def setImpurity(value: String): this.type = { +logWarning("GBTClassifier.setImpurity should NOT be used") +this + } + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = { +logWarning("The 'seed' parameter is currently ignored by Gradient Boosting.") +super.setSeed(value) + } + + // Parameters from GBTParams: + + override def setMaxIter(value: Int): this.type = super.setMaxIter(value) + + override def setLearningRate(value: Double): this.type = super.setLearningRate(value) + + // Parameters for GBTClassifier: + + /** + * Loss function which GBT tries to minimize. (case-insensitive) + * Supported: LogLoss + * (default = LogLoss) + * @group param + */ + val loss: Param[String] = new Param[String](this, "loss", "Loss function which GBT tries to" + --- End diff -- `loss` -> `lossType`? `loss` may be too general. If `loss` appears in another algorithm, they should have similar semantics. For example, if we put
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987606 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala --- @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import scala.collection.mutable + +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Params, ParamMap} +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{RandomForest => OldRandomForest} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, Strategy => OldStrategy} +import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Random_forest Random Forest]] learning algorithm for + * classification. + * It supports both binary and multiclass labels, as well as both continuous and categorical + * features. + */ +@AlphaComponent +final class RandomForestClassifier + extends Predictor[Vector, RandomForestClassifier, RandomForestClassificationModel] + with RandomForestParams with TreeClassifierParams { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + override def setImpurity(value: String): this.type = super.setImpurity(value) + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = super.setSeed(value) + + // Parameters from RandomForestParams: + + override def setNumTrees(value: Int): this.type = super.setNumTrees(value) + + override def setFeaturesPerNode(value: String): this.type = super.setFeaturesPerNode(value) + + override protected def train( + dataset: DataFrame, + paramMap: ParamMap): RandomForestClassificationModel = { +val categoricalFeatures: Map[Int, Int] = + MetadataUtils.getCategoricalFeatures(dataset.schema(paramMap(featuresCol))) +val numClasses: Int = MetadataUtils.getNumClasses(dataset.schema(paramMap(labelCol))) match { + case Some(n: Int) => n + case None => throw new IllegalArgumentException("RandomForestClassifier was given input" + +s" with invalid label column, without the number of classes specified.") + // TODO: Automatically index labels. 
+} +val oldDataset: RDD[LabeledPoint] = extractLabeledPoints(dataset, paramMap) +val strategy = + super.getOldStrategy(categoricalFeatures, numClasses, OldAlgo.Classification, getOldImpurity) +val oldModel = OldRandomForest.trainClassifier( + oldDataset, strategy, getNumTrees, getFeaturesPerNodeStr, getSeed.toInt)
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987621 --- Diff: mllib/src/main/scala/org/apache/spark/ml/impl/tree/treeParams.scala --- @@ -298,3 +302,200 @@ private[ml] object TreeRegressorParams { // These options should be lowercase. val supportedImpurities: Array[String] = Array("variance").map(_.toLowerCase) } + +/** + * :: DeveloperApi :: + * Parameters for Decision Tree-based ensemble algorithms. + * + * Note: Marked as private and DeveloperApi since this may be made public in the future. + */ +@DeveloperApi +private[ml] trait TreeEnsembleParams extends DecisionTreeParams { + + /** + * Fraction of the training data used for learning each decision tree. + * (default = 1.0) + * @group param + */ + final val subsamplingRate: DoubleParam = new DoubleParam(this, "subsamplingRate", +"Fraction of the training data used for learning each decision tree.") + + /** + * Random seed for bootstrapping and choosing feature subsets. + * @group param + */ + final val seed: LongParam = new LongParam(this, "seed", +"Random seed for bootstrapping and choosing feature subsets.") + + setDefault(subsamplingRate -> 1.0, seed -> Utils.random.nextLong()) + + /** @group setParam */ + def setSubsamplingRate(value: Double): this.type = { +require(value > 0.0 && value <= 1.0, + s"Subsampling rate must be in range (0, 1]. Bad rate: $value") +set(subsamplingRate, value) +this + } + + /** @group getParam */ + def getSubsamplingRate: Double = getOrDefault(subsamplingRate) --- End diff -- Most getters should be final.
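The validation and the "getters should be final" point from the quoted diff can be shown in a minimal standalone sketch (outside Spark's `Params` machinery; the class name is hypothetical): the setter rejects values outside (0, 1] via `require`, and the getter is marked final so subclasses cannot redefine its meaning.

```scala
// Minimal sketch of the subsampling-rate parameter discussed above.
class EnsembleParamsSketch {
  private var subsamplingRate: Double = 1.0  // default, as in the diff

  // Fails fast with IllegalArgumentException for values outside (0, 1].
  def setSubsamplingRate(value: Double): this.type = {
    require(value > 0.0 && value <= 1.0,
      s"Subsampling rate must be in range (0, 1]. Bad rate: $value")
    subsamplingRate = value
    this
  }

  // final: subclasses cannot override the getter's semantics.
  final def getSubsamplingRate: Double = subsamplingRate
}

val p = new EnsembleParamsSketch().setSubsamplingRate(0.5)
println(p.getSubsamplingRate)  // 0.5

// Out-of-range values are rejected (require throws IllegalArgumentException).
val rejected =
  try { p.setSubsamplingRate(0.0); false }
  catch { case _: IllegalArgumentException => true }
println(rejected)  // true
```

Returning `this.type` from the setter preserves the concrete subclass type, which is what makes chained calls like `new RandomForestClassifier().setMaxDepth(5).setNumTrees(100)` typecheck in the Java-friendly builder style the PR uses.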
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987591 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + /** + * The impurity setting is ignored for GBT models. + * Individual trees are built using impurity Variance. + */ + override def setImpurity(value: String): this.type = { +logWarning("GBTClassifier.setImpurity should NOT be used") +this + } + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = { +logWarning("The 'seed' parameter is currently ignored by Gradient Boosting.") +super.setSeed(value) + } + + // Parameters from GBTParams: + + override def setMaxIter(value: Int): this.type = super.setMaxIter(value) + + override def setLearningRate(value: Double): this.type = super.setLearningRate(value) + + // Parameters for GBTClassifier: + + /** + * Loss function which GBT tries to minimize. (case-insensitive) + * Supported: LogLoss --- End diff -- Though the values are case insensitive, they should show up consistently in the doc. `logloss` or `logLoss`? The latter looks better to me. Btw, is `log` sufficient?
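The case-insensitivity being discussed can be sketched with a small hypothetical helper (not Spark's actual API): supported values are stored lowercase and user input is lowercased before validation, so `"LogLoss"`, `"logloss"`, and `"LOGLOSS"` all resolve to the same canonical option, while the documentation can display whichever casing reads best.

```scala
// Hypothetical normalization helper for a case-insensitive string param.
val supportedLosses: Array[String] = Array("logloss")  // stored lowercase

def normalizeLoss(value: String): String = {
  val canonical = value.toLowerCase
  require(supportedLosses.contains(canonical), s"Unsupported loss: $value")
  canonical
}

println(normalizeLoss("LogLoss"))                              // logloss
println(normalizeLoss("LOGLOSS") == normalizeLoss("logloss"))  // true
```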
[GitHub] spark pull request: [SPARK-6113] [ml] Tree ensembles for Pipelines...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5626#discussion_r28987598 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.classification + +import com.github.fommil.netlib.BLAS.{getInstance => blas} + +import org.apache.spark.Logging +import org.apache.spark.annotation.AlphaComponent +import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor} +import org.apache.spark.ml.impl.tree._ +import org.apache.spark.ml.param.{Param, Params, ParamMap} +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.ml.tree.{DecisionTreeModel, TreeEnsembleModel} +import org.apache.spark.ml.util.MetadataUtils +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.{GradientBoostedTrees => OldGBT} +import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} +import org.apache.spark.mllib.tree.loss.{Loss => OldLoss, LogLoss => OldLogLoss} +import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame + + +/** + * :: AlphaComponent :: + * + * [[http://en.wikipedia.org/wiki/Gradient_boosting Gradient-Boosted Trees (GBTs)]] + * learning algorithm for classification. + * It supports binary labels, as well as both continuous and categorical features. + * Note: Multiclass labels are not currently supported. + */ +@AlphaComponent +final class GBTClassifier + extends Predictor[Vector, GBTClassifier, GBTClassificationModel] + with GBTParams with TreeClassifierParams with Logging { + + // Override parameter setters from parent trait for Java API compatibility. 
+ + // Parameters from TreeClassifierParams: + + override def setMaxDepth(value: Int): this.type = super.setMaxDepth(value) + + override def setMaxBins(value: Int): this.type = super.setMaxBins(value) + + override def setMinInstancesPerNode(value: Int): this.type = +super.setMinInstancesPerNode(value) + + override def setMinInfoGain(value: Double): this.type = super.setMinInfoGain(value) + + override def setMaxMemoryInMB(value: Int): this.type = super.setMaxMemoryInMB(value) + + override def setCacheNodeIds(value: Boolean): this.type = super.setCacheNodeIds(value) + + override def setCheckpointInterval(value: Int): this.type = super.setCheckpointInterval(value) + + /** + * The impurity setting is ignored for GBT models. + * Individual trees are built using impurity Variance. + */ + override def setImpurity(value: String): this.type = { +logWarning("GBTClassifier.setImpurity should NOT be used") +this + } + + // Parameters from TreeEnsembleParams: + + override def setSubsamplingRate(value: Double): this.type = super.setSubsamplingRate(value) + + override def setSeed(value: Long): this.type = { +logWarning("The 'seed' parameter is currently ignored by Gradient Boosting.") +super.setSeed(value) + } + + // Parameters from GBTParams: + + override def setMaxIter(value: Int): this.type = super.setMaxIter(value) + + override def setLearningRate(value: Double): this.type = super.setLearningRate(value) + + // Parameters for GBTClassifier: + + /** + * Loss function which GBT tries to minimize. (case-insensitive) + * Supported: LogLoss + * (default = LogLoss) + * @group param + */ + val loss: Param[String] = new Param[String](this, "loss", "Loss function which GBT tries to" + + " minimize (case-insensitive). Supported options: LogLoss") + + setDefault(loss -> "logloss") + + /** @group setParam */ + def setLoss(value: String):
[GitHub] spark pull request: [SPARK-3468][WebUI] Timeline-View feature
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/2342#issuecomment-95675268 I have addressed almost all of your feedback. "Using this vs the approach in #5547. I think a good answer here is to use this vis.js library for the jobs page and then use a custom D3-based approach for the stage page, where we need to be careful about scalability to thousands of events (e.g. thousands of tasks). So with that in mind, I'd propose removing the stage functionality for now from this patch and only having the other pages." O.K. I respect the approach in #5547 for the stage page. The following feedback items are still pending. "It would be good to have a visual line indicating the start of the application." Instead of a visual line, in the current implementation we cannot scroll before the time the application started. "It would be nice if you could use +scroll to zoom, so that we could remove the scroll lock. Is this possible with the library?" vis.js doesn't support that feature directly, so we need some trick. I'll implement this feature when I get a good idea. "It would be nice if I could mouse over a job and then have it highlight the corresponding job on the table below." Instead of this, when we click on an event box on the timeline, we can move to the corresponding row in the jobs table.
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-95675502 Yes, it says it timed out (two comments up)
[GitHub] spark pull request: [SPARK-7097][SQL]: Partitioned tables should o...
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/5668 [SPARK-7097][SQL]: Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold This PR adds support for better size estimation for partitioned tables, so that only the referred partitions' sizes are taken into consideration when testing against autoBroadcastJoinThreshold and deciding whether to create a broadcast join or a shuffle hash join. You can merge this pull request into a Git repository by running: $ git pull https://github.com/saucam/spark part_size Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5668.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5668 commit b0beb34d6a77c738660cb161306c947411d70ab5 Author: Yash Datta yash.da...@guavus.com Date: 2015-04-23T17:58:17Z SPARK-7097: Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold
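The sizing rule this PR targets can be modeled with a toy sketch (not Spark internals; names and the partition map are illustrative, though 10 MB is Spark's default for `spark.sql.autoBroadcastJoinThreshold`): only partitions actually referenced by the query contribute to the size estimate compared against the threshold.

```scala
// Toy model of partition-pruned size estimation for the broadcast-join decision.
val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024  // Spark's default: 10 MB

def shouldBroadcast(partitionSizes: Map[String, Long], referred: Set[String]): Boolean = {
  // Sum only the partitions the query actually touches.
  val estimatedSize = referred.iterator.map(p => partitionSizes.getOrElse(p, 0L)).sum
  estimatedSize <= autoBroadcastJoinThreshold
}

val sizes = Map(
  "dt=2015-01" -> 4L * 1024 * 1024,
  "dt=2015-02" -> 8L * 1024 * 1024,
  "dt=2015-03" -> 20L * 1024 * 1024)

// The whole table (32 MB) exceeds the threshold, but a query that touches only
// the 4 MB partition fits under it and can use a broadcast join.
println(shouldBroadcast(sizes, sizes.keySet))       // false
println(shouldBroadcast(sizes, Set("dt=2015-01")))  // true
```

Without partition pruning in the estimate, the second query above would fall back to a shuffle hash join even though the data it actually reads is small, which is exactly the behavior the PR aims to improve.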
[GitHub] spark pull request: [HOTFIX][SQL] Ignore flaky CachedTableSuite te...
Github user marmbrus closed the pull request at: https://github.com/apache/spark/pull/5639