[jira] [Created] (SPARK-2745) Add Java friendly methods to Duration class
Tathagata Das created SPARK-2745: Summary: Add Java friendly methods to Duration class Key: SPARK-2745 URL: https://issues.apache.org/jira/browse/SPARK-2745 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
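As a sketch of what "Java friendly" could mean here (the object name Durations and the factory-method names below are assumptions, not confirmed by this ticket): plain factory methods are callable from Java without Scala-specific constructor conventions.

{code}
import org.apache.spark.streaming.Duration

// Hypothetical factory object; a Java caller could write Durations.seconds(1)
// instead of constructing new Duration(1000) directly.
object Durations {
  def milliseconds(count: Long): Duration = new Duration(count)
  def seconds(count: Long): Duration = new Duration(count * 1000)
  def minutes(count: Long): Duration = new Duration(count * 60 * 1000)
}
{code}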
[jira] [Resolved] (SPARK-2260) Spark submit standalone-cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2260. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1538 [https://github.com/apache/spark/pull/1538] Spark submit standalone-cluster mode is broken -- Key: SPARK-2260 URL: https://issues.apache.org/jira/browse/SPARK-2260 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.1.0 Well, it is technically not officially supported... but we should still fix it. In particular, important configs such as spark.master and the application jar are not propagated to the worker nodes properly, due to obvious missing pieces in the code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2560) Create Spark SQL syntax reference
[ https://issues.apache.org/jira/browse/SPARK-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2560: Priority: Critical (was: Major) Target Version/s: 1.1.0 Create Spark SQL syntax reference - Key: SPARK-2560 URL: https://issues.apache.org/jira/browse/SPARK-2560 Project: Spark Issue Type: Documentation Components: SQL Reporter: Nicholas Chammas Priority: Critical Does Spark SQL support {{LEN()}}? How about {{LIMIT}}? And what about {{MY FAVOURITE SYNTAX}}? Right now there is no reference page to document this. [Hive has one.|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select] Spark SQL should have one, too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2179) Public API for DataTypes and Schema
[ https://issues.apache.org/jira/browse/SPARK-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2179. - Resolution: Fixed Fix Version/s: 1.1.0 Public API for DataTypes and Schema --- Key: SPARK-2179 URL: https://issues.apache.org/jira/browse/SPARK-2179 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Critical Fix For: 1.1.0 We want something like the following: * Expose DataType in the SQL package and lock down all the internal details (TypeTags, etc) * Programmatic API for viewing the schema of an RDD as a StructType * Method that creates a schema RDD given (RDD[A], StructType, A => Row) -- This message was sent by Atlassian JIRA (v6.2#6252)
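To make the third bullet concrete, here is a rough sketch of how such an API could look from user code. The method name applySchema and the exact import locations are assumptions based on the ticket's description, not a confirmed signature (sc is a SparkContext in scope):

{code}
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
// An explicit schema built from the public DataType API.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)))
// The A => Row function from the third bullet, applied to raw text lines.
val rowRDD = sc.textFile("people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
// Pair the RDD[Row] with the StructType to get a schema RDD.
val people = sqlContext.applySchema(rowRDD, schema)
{code}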
[jira] [Updated] (SPARK-2543) Allow user to set maximum Kryo buffer size
[ https://issues.apache.org/jira/browse/SPARK-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2543: --- Summary: Allow user to set maximum Kryo buffer size (was: Resizable serialization buffers for kryo) Allow user to set maximum Kryo buffer size -- Key: SPARK-2543 URL: https://issues.apache.org/jira/browse/SPARK-2543 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Kryo supports resizing serialization output buffers with the maxBufferSize parameter of KryoOutput. I suggest we expose this through the config spark.kryoserializer.buffer.max.mb For pull request see: https://github.com/apache/spark/pull/735 -- This message was sent by Atlassian JIRA (v6.2#6252)
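For illustration, setting the key named in this ticket could look like the following; the 512 MB value is an arbitrary example, not a recommendation:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Allow Kryo to grow its output buffer up to 512 MB before failing a serialization.
  .set("spark.kryoserializer.buffer.max.mb", "512")
val sc = new SparkContext(conf)
{code}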
[jira] [Updated] (SPARK-2178) createSchemaRDD is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2178: Target Version/s: 1.2.0 (was: 1.1.0) createSchemaRDD is not thread safe -- Key: SPARK-2178 URL: https://issues.apache.org/jira/browse/SPARK-2178 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust This is because implicit type tags are not thread safe. We could fix this with compile time macros (which could also make the conversion a lot faster). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2543) Allow user to set maximum Kryo buffer size
[ https://issues.apache.org/jira/browse/SPARK-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2543. Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Fixed via this pull request: https://github.com/apache/spark/pull/735/files Allow user to set maximum Kryo buffer size -- Key: SPARK-2543 URL: https://issues.apache.org/jira/browse/SPARK-2543 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Fix For: 1.1.0 Kryo supports resizing serialization output buffers with the maxBufferSize parameter of KryoOutput. I suggest we expose this through the config spark.kryoserializer.buffer.max.mb For pull request see: https://github.com/apache/spark/pull/735 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2746) Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user
Reynold Xin created SPARK-2746: -- Summary: Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user Key: SPARK-2746 URL: https://issues.apache.org/jira/browse/SPARK-2746 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical dev/run-tests always sets SBT_MAVEN_PROFILES, which is desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2746) Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user
[ https://issues.apache.org/jira/browse/SPARK-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079039#comment-14079039 ] Apache Spark commented on SPARK-2746: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1655 Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user - Key: SPARK-2746 URL: https://issues.apache.org/jira/browse/SPARK-2746 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical dev/run-tests always sets SBT_MAVEN_PROFILES, which is desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2747) git diff --dirstat can miss sql changes and not run Hive tests
Reynold Xin created SPARK-2747: -- Summary: git diff --dirstat can miss sql changes and not run Hive tests Key: SPARK-2747 URL: https://issues.apache.org/jira/browse/SPARK-2747 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical dev/run-tests uses git diff --dirstat master to check whether sql is changed. However, --dirstat won't show sql if sql's change is negligible (e.g. 1k loc change in core, and only 1 loc change in hive). We should use git diff --name-only master instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2747) git diff --dirstat can miss sql changes and not run Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079073#comment-14079073 ] Apache Spark commented on SPARK-2747: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1656 git diff --dirstat can miss sql changes and not run Hive tests -- Key: SPARK-2747 URL: https://issues.apache.org/jira/browse/SPARK-2747 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical dev/run-tests uses git diff --dirstat master to check whether sql is changed. However, --dirstat won't show sql if sql's change is negligible (e.g. 1k loc change in core, and only 1 loc change in hive). We should use git diff --name-only master instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
[ https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079114#comment-14079114 ] Apache Spark commented on SPARK-2641: - User 'kjsingh' has created a pull request for this issue: https://github.com/apache/spark/pull/1657 Spark submit doesn't pick up executor instances from properties file Key: SPARK-2641 URL: https://issues.apache.org/jira/browse/SPARK-2641 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kanwaljit Singh When running spark-submit in Yarn cluster mode, we provide a properties file using the --properties-file option. spark.executor.instances=5 spark.executor.memory=2120m spark.executor.cores=3 The submitted job picks up the cores and memory, but not the correct number of instances. I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments: // Use properties file as fallback for values which have a direct analog to // arguments in this script. master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull) executorMemory = Option(executorMemory) .getOrElse(defaultProperties.get("spark.executor.memory").orNull) executorCores = Option(executorCores) .getOrElse(defaultProperties.get("spark.executor.cores").orNull) totalExecutorCores = Option(totalExecutorCores) .getOrElse(defaultProperties.get("spark.cores.max").orNull) name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull) jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull) Along with these defaults, we should also set a default for instances: numExecutors = Option(numExecutors).getOrElse(defaultProperties.get("spark.executor.instances").orNull) PS: spark.executor.instances is also not mentioned on http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
Sean Owen created SPARK-2748: Summary: Loss of precision for small arguments to Math.exp, Math.log Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
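The effect is easy to reproduce in a Scala REPL:

{code}
val p = 1e-20
math.log(1.0 + p)  // 0.0, because 1.0 + p rounds to exactly 1.0 in double precision
math.log1p(p)      // ~1.0e-20, correct to within machine precision
math.exp(p) - 1.0  // 0.0, for the same reason
math.expm1(p)      // ~1.0e-20
{code}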
[jira] [Commented] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
[ https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079207#comment-14079207 ] Apache Spark commented on SPARK-2748: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1659 Loss of precision for small arguments to Math.exp, Math.log --- Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
[ https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079208#comment-14079208 ] Sean Owen commented on SPARK-2748: -- PR: https://github.com/apache/spark/pull/1659 See also: https://github.com/apache/spark/pull/1652 Loss of precision for small arguments to Math.exp, Math.log --- Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
Sean Owen created SPARK-2749: Summary: Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep Key: SPARK-2749 URL: https://issues.apache.org/jira/browse/SPARK-2749 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor The Maven-based builds in the build matrix have been failing for a few days: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ On inspection, it looks like the Spark SQL Java tests don't compile: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull I confirmed it by repeating the command against master: mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it in, in most places. However, this module doesn't depend on com.novocode:junit-interface. Adding the junit:junit dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via com.novocode:junit-interface, since that is a bit SBT/Scala-specific (and I am not even sure it's needed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
[ https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079231#comment-14079231 ] Apache Spark commented on SPARK-2749: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1660 Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep --- Key: SPARK-2749 URL: https://issues.apache.org/jira/browse/SPARK-2749 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor The Maven-based builds in the build matrix have been failing for a few days: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ On inspection, it looks like the Spark SQL Java tests don't compile: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull I confirmed it by repeating the command against master: mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it in, in most places. However, this module doesn't depend on com.novocode:junit-interface. Adding the junit:junit dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via com.novocode:junit-interface, since that is a bit SBT/Scala-specific (and I am not even sure it's needed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079260#comment-14079260 ] RJ Nowling commented on SPARK-2308: --- Thanks for the clarification. :) I'll run the additional tests to try to answer those questions. I'll also work on trying to implement MiniBatch KMeans as a flag for the current KMeans implementation -- that would be a nicer API. Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2750) Add Https support for Web UI
WangTaoTheTonic created SPARK-2750: -- Summary: Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: WangTaoTheTonic Now I try to add https support for the web UI using Jetty SSL integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. Users can choose to use HTTP and/or HTTPS to access them. We add some configuration items here; for example, SPARK_MASTER_WEBUI_PORT in system envs claims the https port for the master UI. Different items would be added to control the access way of different processes in system envs, JVM properties and launch args. 2. Users choose the access way according to their configuration. If an http port is assigned, then we start the http service for the web UI. If an https port is assigned, we start https. If both are assigned, we start both. If neither is assigned, we start the http service at the default port, the same as now. 3. We should add some configuration items to state some args for SSL authentication, like keystore location and keystore password in 1-way SSL, and truststore location in 2-way. Users can also choose to switch between 1-way and 2-way. Now I have nearly implemented the functions mentioned above. Here are some questions: We know there are some hyperlinks between Master and Worker, and some from Master to the Spark UI (Driver UI). Now the links are their http addresses. So if we add https to them, what should we do when users click the links? Situations: 1. Master http port to a Worker which opens an https port only. 2. Master https port to a Worker which opens an http port only. Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD
[ https://issues.apache.org/jira/browse/SPARK-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079300#comment-14079300 ] Erik Erlandson commented on SPARK-2315: --- Updated the PR with a proper lazy-transform implementation: http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/ drop, dropRight and dropWhile which take RDD input and return RDD - Key: SPARK-2315 URL: https://issues.apache.org/jira/browse/SPARK-2315 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Erik Erlandson Labels: features Last time I loaded in a text file, I found myself wanting to just skip the first element as it was a header. I wrote candidate methods drop, dropRight and dropWhile to satisfy this kind of need: val txt = sc.textFile("text_with_header.txt") val data = txt.drop(1) -- This message was sent by Atlassian JIRA (v6.2#6252)
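For comparison, the usual workaround today, without the proposed drop(), is to skip the first element of the first partition, which for a text file is where the header lives:

{code}
val txt = sc.textFile("text_with_header.txt")
val data = txt.mapPartitionsWithIndex { (i, iter) =>
  if (i == 0) iter.drop(1) else iter  // drop the header line in partition 0 only
}
{code}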
[jira] [Updated] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-2750: --- Description: Now I try to add https support for the web UI using Jetty SSL integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. Users can choose to use HTTP and/or HTTPS to access them. We add some configuration items here; for example, SPARK_MASTER_WEBUI_HTTPS_PORT in system envs claims the https port for the master UI. Different items would be added to control the access way of different processes in system envs, JVM properties and launch args. 2. Users choose the access way according to their configuration. If an http port is assigned, then we start the http service for the web UI. If an https port is assigned, we start https. If both are assigned, we start both. If neither is assigned, we start the http service at the default port, the same as now. 3. We should add some configuration items to state some args for SSL authentication, like keystore location and keystore password in 1-way SSL, and truststore location in 2-way. Users can also choose to switch between 1-way and 2-way. Now I have nearly implemented the functions mentioned above. Here are some questions: We know there are some hyperlinks between Master and Worker, and some from Master to the Spark UI (Driver UI). Now the links are their http addresses. So if we add https to them, what should we do when users click the links? Situations: 1. Master http port to a Worker which opens an https port only. 2. Master https port to a Worker which opens an http port only. Any feedback is welcome!
was: Now I try to add https support for the web UI using Jetty SSL integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. Users can choose to use HTTP and/or HTTPS to access them. We add some configuration items here; for example, SPARK_MASTER_WEBUI_PORT in system envs claims the https port for the master UI. Different items would be added to control the access way of different processes in system envs, JVM properties and launch args. 2. Users choose the access way according to their configuration. If an http port is assigned, then we start the http service for the web UI. If an https port is assigned, we start https. If both are assigned, we start both. If neither is assigned, we start the http service at the default port, the same as now. 3. We should add some configuration items to state some args for SSL authentication, like keystore location and keystore password in 1-way SSL, and truststore location in 2-way. Users can also choose to switch between 1-way and 2-way. Now I have nearly implemented the functions mentioned above. Here are some questions: We know there are some hyperlinks between Master and Worker, and some from Master to the Spark UI (Driver UI). Now the links are their http addresses. So if we add https to them, what should we do when users click the links? Situations: 1. Master http port to a Worker which opens an https port only. 2. Master https port to a Worker which opens an http port only. Any feedback is welcome!
Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: WangTaoTheTonic Labels: https, ssl, webui Original Estimate: 96h Remaining Estimate: 96h Now I try to add https support for the web UI using Jetty SSL integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. Users can choose to use HTTP and/or HTTPS to access them. We add some configuration items here; for example, SPARK_MASTER_WEBUI_HTTPS_PORT in system envs claims the https port for the master UI. Different items would be added to control the access way of different processes in system envs, JVM properties and launch args. 2. Users choose the access way according to their configuration. If an http port is assigned, then we start the http service for the web UI. If an https port is assigned, we start https. If both are assigned, we start both. If neither is assigned, we start the http service at the default port, the same as now. 3. We should add some configuration items to state some args for SSL authentication, like keystore location and keystore password in 1-way SSL, and truststore location in 2-way. Users can also choose to switch between 1-way and 2-way. Now I have nearly implemented the functions mentioned above. Here are some questions: We know there are some hyperlinks between Master and Worker, and some from Master to the Spark UI (Driver UI). Now the links are their http addresses. So if we add https to them, what should we do when users click the links? Situations: 1. Master http port to a Worker which opens an https port only. 2. Master https port to a Worker which opens an http port only. Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2752) spark sql cli should not exit when getting an exception
wangfei created SPARK-2752: -- Summary: spark sql cli should not exit when getting an exception Key: SPARK-2752 URL: https://issues.apache.org/jira/browse/SPARK-2752 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: wangfei Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
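The ticket carries no description, but the title suggests a change along these lines: catch per-statement failures in the CLI loop and report them instead of letting them terminate the process. A hedged sketch, where execute is a hypothetical stand-in for the real statement runner:

{code}
def repl(execute: String => Unit): Unit = {
  var line = Console.readLine("spark-sql> ")
  while (line != null) {
    try execute(line)
    catch { case e: Exception => Console.err.println("Error: " + e.getMessage) } // report, don't exit
    line = Console.readLine("spark-sql> ")
  }
}
{code}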
[jira] [Commented] (SPARK-2752) spark sql cli should not exit when getting an exception
[ https://issues.apache.org/jira/browse/SPARK-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079370#comment-14079370 ] Apache Spark commented on SPARK-2752: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/1661 spark sql cli should not exit when getting an exception -- Key: SPARK-2752 URL: https://issues.apache.org/jira/browse/SPARK-2752 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: wangfei Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2753) Is the --archives option in yarn-cluster mode supposed to uncompress files?
[ https://issues.apache.org/jira/browse/SPARK-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] José Manuel Abuín Mosquera updated SPARK-2753: -- Description: Hi all, this is the first issue I've submitted. I googled and searched into the Spark code and arrived here. When passing a tar.gz or a .zip file as the argument to --archives, Spark uploads it to the distributed cache, but it is not uncompressing it. According to the documentation, it is supposed to uncompress it; is this a bug? The launching command is: /opt/spark-1.0.1/bin/spark-submit --class ProlnatSpark --master yarn-cluster --num-executors 32 --driver-library-path /opt/hadoop/hadoop-2.2.0/lib/native/ --driver-memory 390m --executor-memory 890m --executor-cores 1 --archives=Diccionarios.tar.gz --verbose ProlnatSpark.jar Wikipedias/WikipediaPlain.txt saidaWikipediaSpark Neither /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala nor /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala seems to uncompress the files. I hope this helps, thank you very much :) was: Hi all, this is the first issue I've submitted. I googled and searched into the Spark code and arrived here. When passing a tar.gz or a .zip file as the argument to --archives, Spark uploads it to the distributed cache, but it is not uncompressing it. According to the documentation, it is supposed to uncompress it; is this a bug? The launching command is: /opt/spark-1.0.1/bin/spark-submit --class ProlnatSpark --master yarn-cluster --num-executors 32 --driver-library-path /opt/hadoop/hadoop-2.2.0/lib/native/ --driver-memory 390m --executor-memory 890m --executor-cores 1 --archives=Diccionarios.tar.gz --verbose ProlnatSpark.jar Wikipedias/WikipediaPlain.txt saidaWikipediaSpark Neither /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala nor /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala seems to uncompress the files. I hope this helps, than you very much :) Is the --archives option in yarn-cluster mode supposed to uncompress files? - Key: SPARK-2753 URL: https://issues.apache.org/jira/browse/SPARK-2753 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Environment: CentOS release 6.5 (64 bits) and Hadoop 2.2.0 Reporter: José Manuel Abuín Mosquera Labels: archives, cache, distributed, yarn Hi all, this is the first issue I've submitted. I googled and searched into the Spark code and arrived here. When passing a tar.gz or a .zip file as the argument to --archives, Spark uploads it to the distributed cache, but it is not uncompressing it. According to the documentation, it is supposed to uncompress it; is this a bug? The launching command is: /opt/spark-1.0.1/bin/spark-submit --class ProlnatSpark --master yarn-cluster --num-executors 32 --driver-library-path /opt/hadoop/hadoop-2.2.0/lib/native/ --driver-memory 390m --executor-memory 890m --executor-cores 1 --archives=Diccionarios.tar.gz --verbose ProlnatSpark.jar Wikipedias/WikipediaPlain.txt saidaWikipediaSpark Neither /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala nor /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala seems to uncompress the files. I hope this helps, thank you very much :) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
[ https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2748: - Target Version/s: 1.1.0 Loss of precision for small arguments to Math.exp, Math.log --- Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
[ https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2748: - Assignee: Sean Owen Loss of precision for small arguments to Math.exp, Math.log --- Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2521. Resolution: Fixed Fix Version/s: 1.1.0 Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.1.0 Currently (as of Spark 1.0.1), Spark sends the RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send the RDD object multiple times to the executors. This is especially bad when a closure references some variable that is very large. The current design led to users having to explicitly broadcast large variables. The patch uses broadcast to send RDD objects and the closures to executors, and uses Akka to only send a reference to the broadcast RDD/closure along with the partition-specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses the Hadoop input, because the JobConf is large. The user-facing impact of the change includes: Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations. Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput. In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently). A simple way to test this: {code} val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a); sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count {code} Numbers on 3 r3.8xlarge instances on EC2. master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s; with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2747) git diff --dirstat can miss sql changes and not run Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2747. Resolution: Fixed Fix Version/s: 1.1.0 git diff --dirstat can miss sql changes and not run Hive tests -- Key: SPARK-2747 URL: https://issues.apache.org/jira/browse/SPARK-2747 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0 dev/run-tests uses git diff --dirstat master to check whether sql is changed. However, --dirstat won't show sql if sql's change is negligible (e.g. 1k loc change in core, and only 1 loc change in hive). We should use git diff --name-only master instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079487#comment-14079487 ] Anand Avati commented on SPARK-2707: [~helena_e] - it turned out to be more than just a timeout issue. As described in SPARK-1812 and https://groups.google.com/forum/#!topic/akka-user/cI4CEKEJvfs, this is because of protobuf version mismatch. The combination of https://github.com/avati/spark/commit/f8b5e96fca20c13308cb2a9a6c18049bcdd0a7ba and https://github.com/avati/spark/commit/722aee26399b9bf4b725d17f5cfcfad99464af35 is making akka-2.3 work for me. Upgrade to Akka 2.3 --- Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2754) Document standalone-cluster mode now that it's working
Andrew Or created SPARK-2754: Summary: Document standalone-cluster mode now that it's working Key: SPARK-2754 URL: https://issues.apache.org/jira/browse/SPARK-2754 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.1 Reporter: Andrew Or Fix For: 1.1.0 This was previously broken before SPARK-2260, so we (attempted to) remove all documentation related to this mode. We should add it back now that we have fixed it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2744) The configuration spark.history.retainedApplications is invalid
[ https://issues.apache.org/jira/browse/SPARK-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079497#comment-14079497 ] Marcelo Vanzin commented on SPARK-2744: --- Are you sure that option means what you think it means? The History Server will list all applications. It will just retain a max number of them *in memory*. That option does not control how many applications are shown; it controls how much memory the HS will need. The configuration spark.history.retainedApplications is invalid - Key: SPARK-2744 URL: https://issues.apache.org/jira/browse/SPARK-2744 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Labels: historyserver When I set it in spark-env.sh like this: export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.ui.port=5678 -Dspark.history.retainedApplications=1", the history server web UI shows more than one application -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2755) TorrentBroadcast cannot broadcast very large objects
Xiangrui Meng created SPARK-2755: Summary: TorrentBroadcast cannot broadcast very large objects Key: SPARK-2755 URL: https://issues.apache.org/jira/browse/SPARK-2755 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Xiangrui Meng TorrentBroadcast uses `Utils.serialize` to serialize an object into Array[Byte]. So it cannot handle data of size greater than Int.MaxValue bytes. Instead of serializing the object into Array[Byte] directly, we can use the stream version implemented in HttpBroadcast. -- This message was sent by Atlassian JIRA (v6.2#6252)
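A minimal sketch of the stream-and-chunk idea, assuming nothing about Spark's actual implementation: serialize through an OutputStream that accumulates fixed-size chunks, so no single Array[Byte] is ever bounded by Int.MaxValue.

{code}
import java.io.{ObjectOutputStream, OutputStream}
import scala.collection.mutable.ArrayBuffer

// Accumulates fixed-size chunks; no single byte array needs to exceed blockSize.
class ChunkedOutputStream(blockSize: Int) extends OutputStream {
  private val chunks = ArrayBuffer[Array[Byte]]()
  private var current = new Array[Byte](blockSize)
  private var pos = 0
  override def write(b: Int): Unit = {
    if (pos == blockSize) { chunks += current; current = new Array[Byte](blockSize); pos = 0 }
    current(pos) = b.toByte
    pos += 1
  }
  def toChunks: Array[Array[Byte]] = (chunks :+ current.take(pos)).toArray
}

// Serialize directly into chunks instead of one contiguous Array[Byte].
def blockify(obj: AnyRef, blockSize: Int = 4 * 1024 * 1024): Array[Array[Byte]] = {
  val out = new ChunkedOutputStream(blockSize)
  val oos = new ObjectOutputStream(out)
  oos.writeObject(obj)
  oos.close()
  out.toChunks
}
{code}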
[jira] [Resolved] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1630. --- Resolution: Won't Fix Based on some discussion in https://github.com/apache/spark/pull/1551, we've decided to hold off on fixing this: this issue only affects users that are calling private APIs and the fix adds complexity and could mask bugs in other parts of the code. PythonRDDs don't handle nulls gracefully Key: SPARK-1630 URL: https://issues.apache.org/jira/browse/SPARK-1630 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1 Reporter: Kalpit Shah Assignee: Davies Liu Original Estimate: 2h Remaining Estimate: 2h If PythonRDDs receive a null element in iterators, they currently throw a NullPointerException. It would be better to log a DEBUG message and skip writing NULL elements. Here are the 2 stack traces: 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main] java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88) - Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile. : java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280) at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2316: - Fix Version/s: 1.1.0 StorageStatusListener should avoid O(blocks) operations --- Key: SPARK-2316 URL: https://issues.apache.org/jira/browse/SPARK-2316 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Andrew Or Priority: Critical Fix For: 1.1.0 In the case where jobs are frequently causing dropped blocks the storage status listener can bottleneck. This is slow for a few reasons, one being that we use Scala collection operations, the other being that we operations that are O(number of blocks). I think using a few indices here could make this much faster. {code} at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82) at org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56) at org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67) - locked 0xa27ebe30 (a org.apache.spark.ui.storage.StorageListener) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
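One way to read the "few indices" suggestion: keep blocks indexed by RDD id so each event touches only that RDD's blocks instead of re-grouping every block. A simplified sketch with stand-in types (block id as String, size as Long), not the actual listener code:

{code}
import scala.collection.mutable

class IndexedStorageStatus {
  private val blocksByRdd = mutable.HashMap[Int, mutable.HashMap[String, Long]]()

  // O(1) per block event, instead of O(all blocks) re-grouping.
  def updateBlock(rddId: Int, blockId: String, memSize: Long): Unit = {
    val blocks = blocksByRdd.getOrElseUpdate(rddId, mutable.HashMap())
    if (memSize > 0) blocks(blockId) = memSize else blocks.remove(blockId)
  }

  // Touches only this RDD's blocks.
  def rddMemUsage(rddId: Int): Long =
    blocksByRdd.get(rddId).map(_.values.sum).getOrElse(0L)
}
{code}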
[jira] [Updated] (SPARK-2736) Create Pyspark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2736: -- Assignee: Kan Zhang (was: Josh Rosen) Create Pyspark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2736) Create Pyspark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-2736: - Assignee: Josh Rosen Create Pyspark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Josh Rosen Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2544) Improve ALS algorithm resource usage
[ https://issues.apache.org/jira/browse/SPARK-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2544. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 929 [https://github.com/apache/spark/pull/929] Improve ALS algorithm resource usage Key: SPARK-2544 URL: https://issues.apache.org/jira/browse/SPARK-2544 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.1.0 The following problems exist in ALS: 1. The dependency chains of the products and users RDDs are too long. 2. The shuffle files are too large. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079704#comment-14079704 ] Apache Spark commented on SPARK-2341: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1663 loadLibSVMFile doesn't handle regression datasets - Key: SPARK-2341 URL: https://issues.apache.org/jira/browse/SPARK-2341 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Eustache Assignee: Sean Owen Priority: Minor Labels: easyfix Many datasets exist in LibSVM format for regression tasks [1] but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded but in multiclass mode: each target value is interpreted as a class name! The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine. [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2736) Create Pyspark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079761#comment-14079761 ] Michael Armbrust commented on SPARK-2736: - Another thing to consider is that Avro would be an ideal fit for SchemaRDDs and then we could reuse the java/python bridge code that is already there. Create Pyspark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2746) Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user
[ https://issues.apache.org/jira/browse/SPARK-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2746: --- Description: dev/run-tests always sets SBT_MAVEN_PROFILES, which is not desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} was: dev/run-tests always sets SBT_MAVEN_PROFILES, which is desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user - Key: SPARK-2746 URL: https://issues.apache.org/jira/browse/SPARK-2746 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0 dev/run-tests always sets SBT_MAVEN_PROFILES, which is not desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2746) Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user
[ https://issues.apache.org/jira/browse/SPARK-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2746. Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user - Key: SPARK-2746 URL: https://issues.apache.org/jira/browse/SPARK-2746 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0 dev/run-tests always sets SBT_MAVEN_PROFILES, which is not desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags
[ https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079775#comment-14079775 ] Apache Spark commented on SPARK-2664: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/1665 Deal with `--conf` options in spark-submit that relate to flags --- Key: SPARK-2664 URL: https://issues.apache.org/jira/browse/SPARK-2664 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sandy Ryza Priority: Blocker If someone sets a spark conf that relates to an existing flag `--master`, we should set it correctly like we do with the defaults file. Otherwise it can have confusing semantics. I noticed this after merging it, otherwise I would have mentioned it in the review. I think it's as simple as modifying loadDefaults to check the user-supplied options also. We might change it to loadUserProperties since it's no longer just the defaults file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2735) Remove deprecation in jekyll for pygment in _config.yml
[ https://issues.apache.org/jira/browse/SPARK-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079778#comment-14079778 ] Apache Spark commented on SPARK-2735: - User 'RAbraham' has created a pull request for this issue: https://github.com/apache/spark/pull/1666 Remove deprecation in jekyll for pygment in _config.yml --- Key: SPARK-2735 URL: https://issues.apache.org/jira/browse/SPARK-2735 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Rajiv Abraham Priority: Trivial Original Estimate: 1h Remaining Estimate: 1h NOTE: Creating this issue for the patch I am submitting soon. This will be my first pull request. So please let me know if I have missed something Change: Remove following deprecation warning in 'jekyll build' for pygments. Deprecation: The 'pygments' configuration option has been renamed to 'highlighter'. Please update your config file accordingly. The allowed values are 'rouge', 'pygments' or null. Reference: https://github.com/mmistakes/hpstr-jekyll-theme/issues/25. Rajiv -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079795#comment-14079795 ] Brock Noland commented on SPARK-2741: - https://github.com/apache/spark/pull/1667 Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Patrick Wendell Attachments: SPARK-2741.patch The current spark assembly contains Hive. This conflicts with Hive + Spark, which is attempting to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079804#comment-14079804 ] Apache Spark commented on SPARK-2741: - User 'brockn' has created a pull request for this issue: https://github.com/apache/spark/pull/1667 Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Patrick Wendell Attachments: SPARK-2741.patch The current spark assembly contains Hive. This conflicts with Hive + Spark, which is attempting to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2711) Create a ShuffleMemoryManager that allocates across spilling collections in the same task
[ https://issues.apache.org/jira/browse/SPARK-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2711: - Priority: Critical (was: Major) Create a ShuffleMemoryManager that allocates across spilling collections in the same task - Key: SPARK-2711 URL: https://issues.apache.org/jira/browse/SPARK-2711 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Matei Zaharia Priority: Critical Right now if there are two ExternalAppendOnlyMaps, they don't compete correctly for memory. This can happen e.g. in a task that is both reducing data from its parent RDD and writing it out to files for a future shuffle, for instance if you do rdd.groupByKey(...).map(...).groupByKey(...) (another key). -- This message was sent by Atlassian JIRA (v6.2#6252)
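A toy version of the allocation policy such a manager could use, purely illustrative and not the eventual implementation: each active consumer may claim up to an equal share of a fixed pool, so two spilling collections cannot starve each other.

{code}
import scala.collection.mutable

class ShuffleMemoryPool(maxBytes: Long) {
  private val claimed = mutable.HashMap[Long, Long]() // consumer id -> bytes held

  // Grant at most the consumer's remaining fair share: maxBytes / active consumers.
  def tryToAcquire(id: Long, numBytes: Long): Long = synchronized {
    claimed.getOrElseUpdate(id, 0L)
    val fairShare = maxBytes / claimed.size
    val granted = math.min(numBytes, math.max(0L, fairShare - claimed(id)))
    claimed(id) += granted
    granted
  }

  def releaseAll(id: Long): Unit = synchronized { claimed.remove(id) }
}
{code}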
[jira] [Commented] (SPARK-2523) For partitioned Hive tables, partition-specific ObjectInspectors should be used.
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079881#comment-14079881 ] Apache Spark commented on SPARK-2523: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/1669 For partitioned Hive tables, partition-specific ObjectInspectors should be used. Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.1.0 In HiveTableScan.scala, a single ObjectInspector was created for all of the partition-based records, which probably causes a ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2024) Add saveAsSequenceFile to PySpark
[ https://issues.apache.org/jira/browse/SPARK-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2024. --- Resolution: Fixed Fix Version/s: 1.1.0 Add saveAsSequenceFile to PySpark - Key: SPARK-2024 URL: https://issues.apache.org/jira/browse/SPARK-2024 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Matei Zaharia Assignee: Kan Zhang Fix For: 1.1.0 After SPARK-1416 we will be able to read SequenceFiles from Python, but writing them remains to be implemented. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2103) Java + Kafka + Spark Streaming NoSuchMethodError in java.lang.Object.<init>
[ https://issues.apache.org/jira/browse/SPARK-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2103: - Target Version/s: 1.1.0 Java + Kafka + Spark Streaming NoSuchMethodError in java.lang.Object.<init> --- Key: SPARK-2103 URL: https://issues.apache.org/jira/browse/SPARK-2103 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: Sean Owen This has come up a few times, from user venki-kratos: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-in-KafkaReciever-td2209.html and I ran into it a few weeks ago: http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3ccamassdlzs6ihctxepusphryxxa-wp26zgbxx83sm6niro0q...@mail.gmail.com%3E and yesterday user mpieck: {quote} When I use the createStream method from the example class like this: KafkaUtils.createStream(jssc, "zookeeper:port", "test", topicMap); everything is working fine, but when I explicitly specify the message decoder classes used in this method with another overloaded createStream method: KafkaUtils.createStream(jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, props, topicMap, StorageLevels.MEMORY_AND_DISK_2); the application stops with an error: 14/06/10 22:28:06 ERROR kafka.KafkaReceiver: Error receiving data java.lang.NoSuchMethodException: java.lang.Object.<init>(kafka.utils.VerifiableProperties) at java.lang.Class.getConstructor0(Unknown Source) at java.lang.Class.getConstructor(Unknown Source) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:108) at org.apache.spark.streaming.dstream.NetworkReceiver.start(NetworkInputDStream.scala:126) {quote} Something is making it try to instantiate java.lang.Object as if it's a Decoder class. I suspect that the problem is to do with https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala#L148 {code} implicit val keyCmd: Manifest[U] = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[U]] implicit val valueCmd: Manifest[T] = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[T]] {code} ... where U and T are key/value Decoder types. I don't know enough Scala to fully understand this, but is it possible this causes the reflective call later to lose the type and try to instantiate Object? The AnyRef made me wonder. I am sorry to say I don't have a PR to suggest at this point. -- This message was sent by Atlassian JIRA (v6.2#6252)
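The suspicion above is easy to reproduce outside Spark. A standalone Scala sketch (the FakeDecoder class is a hypothetical stand-in for a Kafka decoder):
{code}
// Casting Manifest[AnyRef] down to Manifest[T] does not change the
// runtime class it carries, so reflection later resolves java.lang.Object
// instead of the intended decoder class.
object ManifestErasureDemo extends App {
  class FakeDecoder(props: String) // stand-in for a Kafka decoder class

  val real = implicitly[Manifest[FakeDecoder]]
  val cast = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[FakeDecoder]]

  println(real.runtimeClass) // class ManifestErasureDemo$FakeDecoder
  println(cast.runtimeClass) // class java.lang.Object

  // The reflective constructor lookup then fails the same way as the
  // reported error: NoSuchMethodException: java.lang.Object.<init>(...)
  cast.runtimeClass.getConstructor(classOf[String])
}
{code}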
[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2419: - Target Version/s: 1.1.0 Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. -- This message was sent by Atlassian JIRA (v6.2#6252)
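On the socketStream point, the function the guide needs to explain has this shape; a minimal Scala sketch, where the line-reading converter and the commented-out call are illustrative:
{code}
// socketStream lets the user supply the InputStream => Iterator function
// that turns raw socket bytes into records.
import java.io.{BufferedReader, InputStream, InputStreamReader}
import org.apache.spark.storage.StorageLevel

def lineIterator(in: InputStream): Iterator[String] = {
  val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}

// assuming an existing StreamingContext named ssc:
// val stream = ssc.socketStream("localhost", 9999, lineIterator, StorageLevel.MEMORY_AND_DISK_SER_2)
{code}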
[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2419: - Description: This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. - New Flume and Kinesis stuff. was: This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. - New Flume and Kinesis stuff. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2736) Create Pyspark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kan Zhang updated SPARK-2736: - Summary: Create Pyspark RDD from Apache Avro File (was: Ceeate Pyspark RDD from Apache Avro File) Create Pyspark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
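For background on the mechanism the PR builds on: PySpark's Hadoop input methods accept a pluggable org.apache.spark.api.python.Converter that turns JVM-side records into plain Java types that can be shipped to Python. A rough Scala sketch of such a converter, with field handling deliberately simplified (this is not the PR's code):
{code}
// Simplified sketch of an Avro-to-Java converter for PySpark. Nested and
// complex Avro types would need real handling; this flattens top-level
// fields only.
import scala.collection.JavaConverters._
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroWrapper
import org.apache.spark.api.python.Converter

class AvroRecordToMapConverter extends Converter[Any, java.util.Map[String, Any]] {
  override def convert(obj: Any): java.util.Map[String, Any] = {
    val record = obj.asInstanceOf[AvroWrapper[GenericRecord]].datum()
    val out = new java.util.HashMap[String, Any]()
    for (field <- record.getSchema.getFields.asScala) {
      out.put(field.name(), record.get(field.name()))
    }
    out
  }
}
{code}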
[jira] [Commented] (SPARK-2381) streaming receiver crashed, but seems nothing happened
[ https://issues.apache.org/jira/browse/SPARK-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079978#comment-14079978 ] Tathagata Das commented on SPARK-2381: -- Any updates on this? If not, then I am inclined to close this JIRA. streaming receiver crashed, but seems nothing happened - Key: SPARK-2381 URL: https://issues.apache.org/jira/browse/SPARK-2381 Project: Spark Issue Type: Bug Components: Streaming Reporter: sunsc When we submit a streaming job, if the receivers don't start normally the application should stop itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2736) Create PySpark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kan Zhang updated SPARK-2736: - Summary: Create PySpark RDD from Apache Avro File (was: Create Pyspark RDD from Apache Avro File) Create PySpark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2012) PySpark StatCounter with numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079984#comment-14079984 ] Jeremy Freeman commented on SPARK-2012: --- [~davies] cool, that definitely makes sense to me. Shall I put together a PR done that way? PySpark StatCounter with numpy arrays - Key: SPARK-2012 URL: https://issues.apache.org/jira/browse/SPARK-2012 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Reporter: Jeremy Freeman Priority: Minor In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy arrays just as with an RDD of scalars, which was very useful (e.g. for computing stats on a set of vectors in ML analyses). In 1.0.0 this broke because the added functionality for computing the minimum and maximum, as implemented, doesn't work on arrays. I have a PR ready that re-enables this functionality by having StatCounter use the numpy element-wise functions maximum and minimum, which work on both numpy arrays and scalars (and I've added new tests for this capability). However, I realize this adds a dependency on NumPy outside of MLLib. If that's not ok, maybe it'd be worth adding this functionality as a util within PySpark MLLib? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2012) PySpark StatCounter with numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080001#comment-14080001 ] Davies Liu commented on SPARK-2012: --- Yes, plz! PySpark StatCounter with numpy arrays - Key: SPARK-2012 URL: https://issues.apache.org/jira/browse/SPARK-2012 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Reporter: Jeremy Freeman Priority: Minor In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy arrays just as with an RDD of scalars, which was very useful (e.g. for computing stats on a set of vectors in ML analyses). In 1.0.0 this broke because the added functionality for computing the minimum and maximum, as implemented, doesn't work on arrays. I have a PR ready that re-enables this functionality by having StatCounter use the numpy element-wise functions maximum and minimum, which work on both numpy arrays and scalars (and I've added new tests for this capability). However, I realize this adds a dependency on NumPy outside of MLLib. If that's not ok, maybe it'd be worth adding this functionality as a util within PySpark MLLib? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-546: Component/s: Streaming Spark Core Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1730) Make receiver store data reliably to avoid data-loss on executor failures
[ https://issues.apache.org/jira/browse/SPARK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1730: - Assignee: Hari Shreedharan (was: Tathagata Das) Make receiver store data reliably to avoid data-loss on executor failures - Key: SPARK-1730 URL: https://issues.apache.org/jira/browse/SPARK-1730 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2419: - Priority: Major (was: Critical) Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. - New Flume and Kinesis stuff. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2419: - Priority: Critical (was: Major) Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. - New Flume and Kinesis stuff. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2463) Creating multiple StreamingContexts from shell generates duplicate Streaming tabs in UI
[ https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2463: - Target Version/s: 1.2.0 Creating multiple StreamingContexts from shell generates duplicate Streaming tabs in UI --- Key: SPARK-2463 URL: https://issues.apache.org/jira/browse/SPARK-2463 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.0.1 Reporter: Nicholas Chammas Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}}. There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1312) Batch should read based on the batch interval provided in the StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1312: - Target Version/s: 1.2.0 Assignee: Tathagata Das Batch should read based on the batch interval provided in the StreamingContext -- Key: SPARK-1312 URL: https://issues.apache.org/jira/browse/SPARK-1312 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0 Reporter: Sanjay Awatramani Assignee: Tathagata Das Priority: Minor Labels: sliding, streaming, window This problem primarily affects sliding window operations in spark streaming. Consider the following scenario: - a DStream is created from any source. (I've checked with file and socket) - No actions are applied to this DStream - Sliding Window operation is applied to this DStream and an action is applied to the sliding window. In this case, Spark will not even read the input stream in the batch in which the sliding interval isn't a multiple of the batch interval. Put another way, it won't read the input when it doesn't have to apply the window function. This is happening because all transformations in Spark are lazy. How to fix this or work around it (see line 3):
{code}
JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(1 * 60 * 1000));
JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
inputStream.print(); // This is the workaround
JavaDStream<String> objWindow = inputStream.window(new Duration(windowLen*60*1000), new Duration(slideInt*60*1000));
objWindow.dstream().saveAsTextFiles("/Output", "");
{code}
The Window operations example in the streaming guide implies that Spark will read the stream in every batch, which is not happening because of the lazy transformations. Wherever sliding window would be used, in most of the cases, no actions will be taken on the pre-window batch, hence my gut feeling was that Streaming would read every batch if any actions are being taken in the windowed stream. As per Tathagata, "Ideally every batch should read based on the batch interval provided in the StreamingContext." Refer to the original thread on http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Window-operations-do-not-work-as-documented-tp2999.html for more details, including Tathagata's conclusion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080035#comment-14080035 ] Ted Malaska commented on SPARK-2447: The build is fixed and the pull request is updated. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080055#comment-14080055 ] Ted Malaska commented on SPARK-2447: OK, had a status meeting with TD.
1. 2447 will be pushed past 1.1
2. Focus on these tasks:
2.1. Java
2.2. More unit testing
2.3. Partitioned Put
2.4. Partitioned Sorted Get
2.5. BulkCheckPut
2.6. BulkLoad
Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
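To ground the task list, here is a rough Scala sketch of the bulkPut pattern under discussion, written against the HBase 0.96-era client API; the method shape and column names are illustrative assumptions, not the proposed interface:
{code}
// One HBase connection per partition, client-side buffering enabled, and
// an explicit flush before closing -- the throughput-oriented shape the
// issue describes.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

def bulkPut(rdd: RDD[(String, String)], tableName: String): Unit = {
  rdd.foreachPartition { rows =>
    val table = new HTable(HBaseConfiguration.create(), tableName)
    table.setAutoFlush(false)                 // buffer puts client-side
    rows.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
    table.flushCommits()                      // push the buffered puts
    table.close()
  }
}
{code}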
[jira] [Updated] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
[ https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1642: - Target Version/s: 1.2.0 (was: 1.1.0) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083 --- Key: SPARK-1642 URL: https://issues.apache.org/jira/browse/SPARK-1642 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor This will add support for SSL encryption between Flume AvroSink and Spark Streaming. It is based on FLUME-2083 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2447: - Target Version/s: 1.2.0 (was: 1.1.0) Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-944: Target Version/s: 1.2.0 (was: 1.1.0) Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2492) KafkaReceiver minor changes to align with Kafka 0.8
[ https://issues.apache.org/jira/browse/SPARK-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2492: - Target Version/s: 1.1.0 Fix Version/s: (was: 1.1.0) KafkaReceiver minor changes to align with Kafka 0.8 Key: SPARK-2492 URL: https://issues.apache.org/jira/browse/SPARK-2492 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Minor Update to delete Zookeeper metadata when Kafka's parameter auto.offset.reset is set to "smallest", which is aligned with Kafka 0.8's ConsoleConsumer. Also use the API Kafka offers rather than using zkClient directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2507) Compile error of streaming project with 2.0.0-cdh4.6.0
[ https://issues.apache.org/jira/browse/SPARK-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080111#comment-14080111 ] Tathagata Das commented on SPARK-2507: -- This was solved in PR https://github.com/apache/spark/pull/153 Compile error of streaming project with 2.0.0-cdh4.6.0 -- Key: SPARK-2507 URL: https://issues.apache.org/jira/browse/SPARK-2507 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0 Environment: RedHat 5.3 2.0.0-cdh4.6.0 enable yarn java version 1.6.0_45 Reporter: James Z.M. Gao Priority: Minor Hi, When compiling with {quote} ./make-distribution.sh --hadoop 2.0.0-cdh4.6.0 --with-yarn --tgz {quote} I get the following errors in the streaming Java API: {quote} Version is 0.9.0-incubating Making spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz Hadoop version set to 2.0.0-cdh4.6.0 YARN enabled [info] Loading project definition from /root/spark-source/project/project [info] Loading project definition from /root/spark-source/project [info] Set current project to root (in build file:/root/spark-source/) [info] Compiling 1 Scala source to /root/spark-source/streaming/target/scala-2.10/classes... [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:57: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:60: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def cache(): JavaPairDStream[K, V] = dstream.cache() [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:63: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def persist(): JavaPairDStream[K, V] = dstream.persist() [error] ^ .. [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, (com.google.common.base.Optional[V], W))] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,(com.google.common.base.Optional[V], W)] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)} [error] ^ [error] 44 errors found [error] (streaming/compile:compile) Compilation failed {quote} Here is a simple PR to fix this problem: https://github.com/apache/spark/pull/153 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2756) Decision Tree bugs
Joseph K. Bradley created SPARK-2756: Summary: Decision Tree bugs Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
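To illustrate Bug 1, here is a toy Scala sketch of the two flat-index layouts; the array shapes are hypothetical, and only the coordinate-order mismatch matters:
{code}
// Writer and reader agree on the arithmetic but not on what the last two
// coordinates mean, so aggregate cells get scrambled.
object IndexMismatchSketch {
  val numFeatures = 3; val numBins = 4; val numClasses = 5

  // updateBinForUnorderedFeature wrote at (node, feature, featureValue, binIndex):
  def writeIndex(node: Int, feature: Int, featureValue: Int, bin: Int): Int =
    ((node * numFeatures + feature) * numBins + featureValue) * numClasses + bin

  // the rest of the code read at (node, feature, binIndex, label):
  def readIndex(node: Int, feature: Int, bin: Int, label: Int): Int =
    ((node * numFeatures + feature) * numBins + bin) * numClasses + label

  // A count stored for (featureValue = v, bin = b) is read back as if it
  // were (bin = v, label = b) -- the wrong cell unless the two happen to coincide.
}
{code}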
[jira] [Updated] (SPARK-2756) Decision Tree bugs
[ https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2756: - Assignee: Joseph K. Bradley Decision Tree bugs -- Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2756) Decision Tree bugs
[ https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080134#comment-14080134 ] Apache Spark commented on SPARK-2756: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/1673 Decision Tree bugs -- Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2756) Decision Tree bugs
[ https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080136#comment-14080136 ] Joseph K. Bradley commented on SPARK-2756: -- Submitted [https://github.com/apache/spark/pull/1673] with bug fixes. Decision Tree bugs -- Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2756) Decision Tree bugs
[ https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-2756: - Comment: was deleted (was: Submitted [https://github.com/apache/spark/pull/1673] with bug fixes.) Decision Tree bugs -- Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2507) Compile error of streaming project with 2.0.0-cdh4.6.0
[ https://issues.apache.org/jira/browse/SPARK-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2507: - Affects Version/s: 1.0.2 1.0.1 Compile error of streaming project with 2.0.0-cdh4.6.0 -- Key: SPARK-2507 URL: https://issues.apache.org/jira/browse/SPARK-2507 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.0.1, 1.0.2 Environment: RedHat 5.3 2.0.0-cdh4.6.0 enable yarn java version 1.6.0_45 Reporter: James Z.M. Gao Priority: Minor Hi, When compiling with {quote} ./make-distribution.sh --hadoop 2.0.0-cdh4.6.0 --with-yarn --tgz {quote} I get the following errors in the streaming Java API: {quote} Version is 0.9.0-incubating Making spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz Hadoop version set to 2.0.0-cdh4.6.0 YARN enabled [info] Loading project definition from /root/spark-source/project/project [info] Loading project definition from /root/spark-source/project [info] Set current project to root (in build file:/root/spark-source/) [info] Compiling 1 Scala source to /root/spark-source/streaming/target/scala-2.10/classes... [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:57: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:60: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def cache(): JavaPairDStream[K, V] = dstream.cache() [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:63: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def persist(): JavaPairDStream[K, V] = dstream.persist() [error] ^ .. [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, (com.google.common.base.Optional[V], W))] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,(com.google.common.base.Optional[V], W)] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)} [error] ^ [error] 44 errors found [error] (streaming/compile:compile) Compilation failed {quote} Here is a simple PR to fix this problem: https://github.com/apache/spark/pull/153 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2507) Compile error of streaming project with 2.0.0-cdh4.6.0
[ https://issues.apache.org/jira/browse/SPARK-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-2507. -- Resolution: Fixed Compile error of streaming project with 2.0.0-cdh4.6.0 -- Key: SPARK-2507 URL: https://issues.apache.org/jira/browse/SPARK-2507 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.0.1, 1.0.2 Environment: RedHat 5.3 2.0.0-cdh4.6.0 enable yarn java version 1.6.0_45 Reporter: James Z.M. Gao Priority: Minor Hi, When compiling with {quote} ./make-distribution.sh --hadoop 2.0.0-cdh4.6.0 --with-yarn --tgz {quote} I get the following errors in the streaming Java API: {quote} Version is 0.9.0-incubating Making spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz Hadoop version set to 2.0.0-cdh4.6.0 YARN enabled [info] Loading project definition from /root/spark-source/project/project [info] Loading project definition from /root/spark-source/project [info] Set current project to root (in build file:/root/spark-source/) [info] Compiling 1 Scala source to /root/spark-source/streaming/target/scala-2.10/classes... [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:57: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:60: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def cache(): JavaPairDStream[K, V] = dstream.cache() [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:63: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def persist(): JavaPairDStream[K, V] = dstream.persist() [error] ^ .. [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, (com.google.common.base.Optional[V], W))] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,(com.google.common.base.Optional[V], W)] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)} [error] ^ [error] 44 errors found [error] (streaming/compile:compile) Compilation failed {quote} Here is a simple PR to fix this problem: https://github.com/apache/spark/pull/153 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2507) Compile error of streaming project with 2.0.0-cdh4.6.0
[ https://issues.apache.org/jira/browse/SPARK-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2507: - Target Version/s: 1.1.0 (was: 0.9.0, 0.9.1, 1.0.0) Compile error of streaming project with 2.0.0-cdh4.6.0 -- Key: SPARK-2507 URL: https://issues.apache.org/jira/browse/SPARK-2507 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.0.1, 1.0.2 Environment: RedHat 5.3 2.0.0-cdh4.6.0 enable yarn java version 1.6.0_45 Reporter: James Z.M. Gao Priority: Minor Hi, When compiling with {quote} ./make-distribution.sh --hadoop 2.0.0-cdh4.6.0 --with-yarn --tgz {quote} I get the following errors in the streaming Java API: {quote} Version is 0.9.0-incubating Making spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz Hadoop version set to 2.0.0-cdh4.6.0 YARN enabled [info] Loading project definition from /root/spark-source/project/project [info] Loading project definition from /root/spark-source/project [info] Set current project to root (in build file:/root/spark-source/) [info] Compiling 1 Scala source to /root/spark-source/streaming/target/scala-2.10/classes... [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:57: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:60: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def cache(): JavaPairDStream[K, V] = dstream.cache() [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:63: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def persist(): JavaPairDStream[K, V] = dstream.persist() [error] ^ .. [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, (com.google.common.base.Optional[V], W))] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,(com.google.common.base.Optional[V], W)] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)} [error] ^ [error] 44 errors found [error] (streaming/compile:compile) Compilation failed {quote} Here is a simple PR to fix this problem: https://github.com/apache/spark/pull/153 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2757) Add Mima test for Spark Sink after 1.1.0 is released
Hari Shreedharan created SPARK-2757: --- Summary: Add Mima test for Spark Sink after 1.1.0 is released Key: SPARK-2757 URL: https://issues.apache.org/jira/browse/SPARK-2757 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Hari Shreedharan Fix For: 1.2.0 We are adding it in 1.1.0, so it is excluded from Mima right now. Once we release 1.1.0, we should add it to Mima so we do binary compatibility checks. -- This message was sent by Atlassian JIRA (v6.2#6252)
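For context, Spark tracks known binary-compatibility exceptions in project/MimaExcludes.scala with entries like the following; this is a hypothetical example of the format only, since the sink module may simply be left out of the set of projects Mima checks until 1.1.0 ships:
{code}
// Hypothetical example of a MimaExcludes entry; the class name here is
// illustrative, not the actual exclusion used for the sink module.
import com.typesafe.tools.mima.core.{MissingClassProblem, ProblemFilters}

object SinkExcludesSketch {
  val excludes = Seq(
    ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streaming.flume.sink.SparkSink")
  )
}
{code}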
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080166#comment-14080166 ] Ted Malaska commented on SPARK-2447: Hey Matei, Let's do a WebEx or something in the near future. I would love to get more of your input. Here are my answers to your questions above:
1. Yes I can do Python
2. Yes I can do that. So to be clear, the bulkGet and scan will return a fixed (Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte], Long)]) for (rowKey, Array[(columnFamily, column, value, timestamp)])
2.1 As for the bulkPut/Increment/Delete/CheckPut I think we need to give the user freedom to interact with the raw API. I have no problem building a simpler interface for the 80% use case but I don't want to fail the 20%.
3. The lowest version is 0.96. The reason is there was a major API change from 0.94 to 0.96+. So if we need to support 0.94 and below we need to make a different code base.
Let me know if this answers your questions and let me know if there is anything else I can do. I have learned so much from TD and I have grown so much from this process. Ted Malaska Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2758) UnionRDD's UnionPartition should not reference parent RDDs
Reynold Xin created SPARK-2758: -- Summary: UnionRDD's UnionPartition should not reference parent RDDs Key: SPARK-2758 URL: https://issues.apache.org/jira/browse/SPARK-2758 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0, 1.0.1, 1.0.2 Reporter: Reynold Xin Assignee: Reynold Xin UnionPartition has a non-transient field referencing the parent RDD, to be used in compute (iterator). That causes some trouble with task size because partition objects are supposed to be small. -- This message was sent by Atlassian JIRA (v6.2#6252)
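A minimal Scala sketch of the usual fix pattern (simplified, not the actual patch): mark the RDD reference transient and resolve the parent's Partition eagerly on the driver, so serializing the task ships only the small partition object.
{code}
// Simplified illustration: the parent RDD is not serialized with the
// task; only the parent's Partition (small) travels with it.
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

class UnionPartitionSketch[T](
    idx: Int,
    @transient private val rdd: RDD[T],  // dropped during serialization
    val parentPartitionIndex: Int)
  extends Partition {

  // resolved eagerly on the driver, before the task is shipped
  val parentPartition: Partition = rdd.partitions(parentPartitionIndex)

  override def index: Int = idx
}
{code}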
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080179#comment-14080179 ] Patrick Wendell commented on SPARK-2447: This is not entirely a duplicate, but it's similar to SPARK-1127. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2758) UnionRDD's UnionPartition should not reference parent RDDs
[ https://issues.apache.org/jira/browse/SPARK-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080184#comment-14080184 ] Apache Spark commented on SPARK-2758: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1675 UnionRDD's UnionPartition should not reference parent RDDs -- Key: SPARK-2758 URL: https://issues.apache.org/jira/browse/SPARK-2758 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0, 1.0.1, 1.0.2 Reporter: Reynold Xin Assignee: Reynold Xin UnionPartition has a non-transient field referencing the parent RDD, to be used in compute (iterator). That causes some trouble with task size because partition objects are supposed to be small. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-2706: -- Attachment: spark-hive.err This file shows the error I got after applying the tentative patch. Enable Spark to support Hive 0.13 - Key: SPARK-2706 URL: https://issues.apache.org/jira/browse/SPARK-2706 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.0.1 Reporter: Chunjun Xiao Attachments: spark-hive.err It seems Spark cannot work well with Hive 0.13. When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below. So, when can Spark be enabled to support Hive 0.13? Compile errors: {quote} [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: type mismatch; found : String required: Array[String] [ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc and ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties) [ERROR] val tableDesc = new TableDesc( [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: value getPartitionPath is not a member of org.apache.hadoop.hive.ql.metadata.Partition [ERROR] val partPath = partition.getPartitionPath [ERROR]^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: value appendReadColumnNames is not a member of object org.apache.hadoop.hive.serde2.ColumnProjectionUtils [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, attributes.map(_.name)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: type mismatch; found : org.apache.hadoop.fs.Path required: String [ERROR] SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: value getExternalTmpFileURI is not a member of org.apache.hadoop.hive.ql.Context [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] case bd: BigDecimal => new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] 8 errors found [DEBUG] Compilation failed (CompilerInterface) [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.579s] [INFO] Spark Project Core SUCCESS [2:39.805s] [INFO] Spark Project Bagel ... SUCCESS [21.148s] [INFO] Spark Project GraphX .. SUCCESS [59.950s] [INFO] Spark Project ML Library .. SUCCESS [1:08.771s] [INFO] Spark Project Streaming ... SUCCESS [1:17.759s] [INFO] Spark Project Tools ... 
SUCCESS [15.405s] [INFO] Spark Project Catalyst SUCCESS [1:17.405s] [INFO] Spark Project SQL . SUCCESS [1:11.094s] [INFO] Spark Project Hive FAILURE [11.121s] [INFO] Spark Project REPL SKIPPED [INFO] Spark Project YARN Parent POM . SKIPPED [INFO] Spark Project YARN Stable API . SKIPPED [INFO] Spark Project Assembly SKIPPED [INFO] Spark Project External Twitter
[jira] [Updated] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-2706: -- Attachment: spark-2706-v1.txt Tentative patch. I copied the Hive 0.13.1 artifacts to the local maven repo manually. Enable Spark to support Hive 0.13 - Key: SPARK-2706 URL: https://issues.apache.org/jira/browse/SPARK-2706 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.0.1 Reporter: Chunjun Xiao Attachments: spark-2706-v1.txt, spark-hive.err It seems Spark cannot work well with Hive 0.13. When I compiled Spark against Hive 0.13.1, I got the error messages attached; the full compile log is quoted in the update above. So, when can Spark be enabled to support Hive 0.13?
[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080250#comment-14080250 ] Erik Erlandson commented on SPARK-1021: --- I deferred the computation of the partition bounds this way, and it seems to work properly in my testing and the unit tests: https://github.com/erikerlandson/spark/compare/erikerlandson:rdd_drop_master...spark-1021 sortByKey() launches a cluster job when it shouldn't Key: SPARK-1021 URL: https://issues.apache.org/jira/browse/SPARK-1021 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.0, 0.9.0 Reporter: Andrew Ash Assignee: Mark Hamstra Labels: starter The sortByKey() method is listed as a transformation, not an action, in the documentation, but it launches a cluster job regardless. http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html Some discussion on the mailing list suggested that this is a problem with the rdd.count() call inside Partitioner.scala's rangeBounds method. https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102 Josh Rosen suggests that rangeBounds should be made into a lazy variable: {quote} I wonder whether making RangePartitioner.rangeBounds into a lazy val would fix this (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds() is never called before an action is performed. This could be tricky because it's called in the RangePartitioner.equals() method. Maybe it's sufficient to just compare the number of partitions, the ids of the RDDs used to create the RangePartitioner, and the sort ordering. This still supports the case where I range-partition one RDD and pass the same partitioner to a different RDD. It breaks support for the case where two range partitioners created on different RDDs happened to have the same rangeBounds(), but it seems unlikely that this would really harm performance, since range partitioners are rarely equal by chance. {quote} Can we please make this happen? I'll send a PR on GitHub to start the discussion and testing. -- This message was sent by Atlassian JIRA (v6.2#6252)
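To make the suggestion concrete, here is a minimal, self-contained sketch of the lazy-bounds idea (my own toy illustration, not Spark's actual RangePartitioner): the expensive bounds computation sits behind a lazy val, so constructing the partitioner is cheap, and equals() compares only cheap identifying fields so it can never force the computation.
{code}
// Toy stand-in for RangePartitioner: `data` models the RDD sample Spark
// would collect; evaluating it is the "cluster job" we want to defer.
class LazyRangePartitioner(val numPartitions: Int, data: => Seq[Int]) {

  // Deferred until first use: nothing is computed at construction time,
  // which is what keeps sortByKey() from launching a job eagerly.
  private lazy val rangeBounds: Array[Int] = {
    val sorted = data.sorted // forces `data`; assumes it is non-empty
    (1 until numPartitions).map(i => sorted(i * sorted.size / numPartitions)).toArray
  }

  // First call forces rangeBounds; in Spark this would happen during the
  // shuffle write of an actual action, not at transformation time.
  def getPartition(key: Int): Int = {
    var p = 0
    while (p < rangeBounds.length && key > rangeBounds(p)) p += 1
    p
  }

  // Per Josh Rosen's suggestion: compare cheap fields only, so equality
  // checks cannot trigger the deferred computation.
  override def equals(other: Any): Boolean = other match {
    case o: LazyRangePartitioner => o.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}
{code}
Constructing the partitioner then returns immediately; the first getPartition call is what forces the (possibly expensive) evaluation of `data`.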
[jira] [Resolved] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2741. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1667 [https://github.com/apache/spark/pull/1667] Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Patrick Wendell Fix For: 1.1.0 Attachments: SPARK-2741.patch The current Spark assembly contains Hive. This conflicts with Hive-on-Spark, which attempts to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2741: --- Assignee: Brock Noland (was: Patrick Wendell) Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Brock Noland Fix For: 1.1.0 Attachments: SPARK-2741.patch The current Spark assembly contains Hive. This conflicts with Hive-on-Spark, which attempts to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080264#comment-14080264 ] Brock Noland commented on SPARK-2741: - Thanks guys!! Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Brock Noland Fix For: 1.1.0 Attachments: SPARK-2741.patch The current Spark assembly contains Hive. This conflicts with Hive-on-Spark, which attempts to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1647) Prevent data loss when Streaming driver goes down
[ https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1647: - Fix Version/s: (was: 1.1.0) Prevent data loss when Streaming driver goes down - Key: SPARK-1647 URL: https://issues.apache.org/jira/browse/SPARK-1647 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Currently, when the driver goes down, any uncheckpointed data is lost from within Spark. If the system from which messages are pulled can replay messages, the data may be available, but for some systems, like Flume, this is not the case. Also, all windowing information is lost for windowing functions. We must persist the raw data somehow and be able to replay it if required. We must also persist the windowing information with the data itself. This will likely require quite a bit of work and will probably have to be split into several sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.2#6252)
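As a rough illustration of the "persist raw data and replay it" requirement, here is a toy write-ahead-log sketch. It is an assumption about one possible shape, not the design this issue eventually produced: each received block is flushed to durable storage before the source is acknowledged, so a restarted driver can replay unprocessed blocks.
{code}
// Toy block log: length-prefixed records appended to a local file. A real
// implementation would write to HDFS/S3 and track which blocks were already
// processed, so only the tail needs replaying after a driver restart.
import java.io.{DataInputStream, DataOutputStream, EOFException, FileInputStream, FileOutputStream}

class BlockLog(path: String) {
  private val out = new DataOutputStream(new FileOutputStream(path, true)) // append mode

  def append(block: Array[Byte]): Unit = synchronized {
    out.writeInt(block.length)
    out.write(block)
    out.flush() // only ack the source (e.g. Flume) after this returns
  }

  def close(): Unit = out.close()
}

object BlockLog {
  // Replay every block in the log, e.g. on driver restart.
  def replay(path: String)(handle: Array[Byte] => Unit): Unit = {
    val in = new DataInputStream(new FileInputStream(path))
    try {
      while (true) {
        val len = in.readInt() // throws EOFException at the end of the log
        val buf = new Array[Byte](len)
        in.readFully(buf)
        handle(buf)
      }
    } catch { case _: EOFException => () } finally in.close()
  }
}
{code}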
[jira] [Updated] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
[ https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1478: - Target Version/s: 1.2.0 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 --- Key: SPARK-1478 URL: https://issues.apache.org/jira/browse/SPARK-1478 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor FLUME-1915 added support for compression over the wire from the Avro sink to the Avro source. I would like to add this functionality to the FlumeReceiver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1600) flaky test case in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1600: - Fix Version/s: (was: 1.1.0) flaky test case in streaming.CheckpointSuite Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Nan Zhu The test case "recovery with file input stream" sometimes fails when Jenkins is very busy, even when the change under test is unrelated. I have hit it 3 times and have also seen it in other places; the latest example is https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/, where the modification touches only YARN-related files. I once reported it on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1600) flaky test case in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1600: - Target Version/s: 1.2.0 flaky test case in streaming.CheckpointSuite Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Nan Zhu The test case "recovery with file input stream" sometimes fails when Jenkins is very busy, even when the change under test is unrelated. I have hit it 3 times and have also seen it in other places; the latest example is https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/, where the modification touches only YARN-related files. I once reported it on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1409) Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite
[ https://issues.apache.org/jira/browse/SPARK-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1409: - Target Version/s: 1.2.0 Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite - Key: SPARK-1409 URL: https://issues.apache.org/jira/browse/SPARK-1409 Project: Spark Issue Type: Bug Components: Streaming Reporter: Michael Armbrust Assignee: Tathagata Das Here are just a few cases: https://travis-ci.org/apache/spark/jobs/22151827 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13709/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1409) Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite
[ https://issues.apache.org/jira/browse/SPARK-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1409: - Fix Version/s: (was: 1.1.0) Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite - Key: SPARK-1409 URL: https://issues.apache.org/jira/browse/SPARK-1409 Project: Spark Issue Type: Bug Components: Streaming Reporter: Michael Armbrust Assignee: Tathagata Das Here are just a few cases: https://travis-ci.org/apache/spark/jobs/22151827 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13709/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2759) The ability to read binary files into Spark
Kevin Mader created SPARK-2759: -- Summary: The ability to read binary files into Spark Key: SPARK-2759 URL: https://issues.apache.org/jira/browse/SPARK-2759 Project: Spark Issue Type: Improvement Components: Input/Output, Java API, Spark Core Reporter: Kevin Mader For reading images, compressed files, or other custom formats, it would be useful to have methods that could read files in as a byte array or DataInputStream so that other functions could then process the data. -- This message was sent by Atlassian JIRA (v6.2#6252)
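As a sketch of the requested API shape (the helper name is hypothetical, not a proposed implementation), the toy below produces (path, bytes) pairs. It reads whole files on the driver and then parallelizes them, which illustrates the interface but would not scale; a real version would read on the executors via a Hadoop InputFormat.
{code}
// Hypothetical helper illustrating the desired API: an RDD of
// (file path, file contents as bytes). Driver-side reads only; a scalable
// version would use a whole-file Hadoop InputFormat so executors do the I/O.
import java.nio.file.{Files, Paths}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BinaryFilesSketch {
  def binaryFiles(sc: SparkContext, paths: Seq[String]): RDD[(String, Array[Byte])] = {
    // Read each file into memory on the driver, then distribute the pairs.
    val contents = paths.map(p => (p, Files.readAllBytes(Paths.get(p))))
    sc.parallelize(contents)
  }
}
{code}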
[jira] [Created] (SPARK-2760) Caching tables from multiple databases does not work
Michael Armbrust created SPARK-2760: --- Summary: Caching tables from multiple databases does not work Key: SPARK-2760 URL: https://issues.apache.org/jira/browse/SPARK-2760 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2734) DROP TABLE should also uncache table
[ https://issues.apache.org/jira/browse/SPARK-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2734. - Resolution: Fixed Fix Version/s: 1.1.0 DROP TABLE should also uncache table Key: SPARK-2734 URL: https://issues.apache.org/jira/browse/SPARK-2734 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Fix For: 1.1.0 Steps to reproduce:
{code}
hql("CREATE TABLE test(a INT)")
hql("CACHE TABLE test")
hql("DROP TABLE test")
hql("SELECT * FROM test")
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
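The intended semantics, sketched with stand-in names (not Spark SQL's actual internals): the drop path must also evict any cached copy of the table, so the final SELECT fails cleanly instead of reading stale in-memory data.
{code}
// Stand-in names only; the point is that dropping a table always evicts
// its cached copy as well as its catalog entry.
import scala.collection.mutable

class CatalogSketch {
  type Row = Seq[Any]
  private val metastore = mutable.Map.empty[String, Seq[Row]]
  private val cachedTables = mutable.Map.empty[String, Seq[Row]]

  def createTable(name: String, rows: Seq[Row]): Unit = metastore(name) = rows

  def cacheTable(name: String): Unit =
    cachedTables(name) = metastore(name) // snapshot into the "in-memory" cache

  def select(name: String): Seq[Row] =
    cachedTables.getOrElse(name, metastore(name)) // cache wins if present

  def dropTable(name: String): Unit = {
    cachedTables.remove(name) // without this, select() after a drop would
    metastore.remove(name)    // still return stale cached rows: the bug above
  }
}
{code}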
[jira] [Updated] (SPARK-1580) [MLlib] ALS: Estimate communication and computation costs given a partitioner
[ https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tor Myklebust updated SPARK-1580: - Summary: [MLlib] ALS: Estimate communication and computation costs given a partitioner (was: ALS: Estimate communication and computation costs given a partitioner) [MLlib] ALS: Estimate communication and computation costs given a partitioner - Key: SPARK-1580 URL: https://issues.apache.org/jira/browse/SPARK-1580 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tor Myklebust Priority: Minor It would be nice to be able to estimate the amount of work needed to solve an ALS problem. The chief components of this work are computation time (time spent forming and solving the least squares problems) and communication cost (the number of bytes sent across the network). Communication cost depends heavily on how the users and products are partitioned. We currently do not try to cluster users or products so that fewer feature vectors need to be communicated. This is intended as a first step toward that end: we ought to be able to tell whether one partitioning is better than another. -- This message was sent by Atlassian JIRA (v6.2#6252)
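A back-of-the-envelope version of the communication side of such an estimate, as a hedged sketch (my own toy model, not this issue's eventual design): count, for each user, how many distinct product partitions need that user's feature vector, since each such pair costs one vector transfer per ALS iteration.
{code}
// Toy ALS communication model: `ratings` is the observed (user, product)
// pairs, `prodPart` assigns each product to a partition, and `rank` is the
// latent factor dimension (8 bytes per Double factor).
object AlsCommCost {
  def commCostBytes(ratings: Seq[(Int, Int)], prodPart: Int => Int, rank: Int): Long = {
    // Each distinct (user, destination partition) pair is one vector send.
    val sends = ratings.map { case (u, p) => (u, prodPart(p)) }.distinct.size
    sends.toLong * 8L * rank
  }
}
{code}
Comparing this number under two candidate partitioners is exactly the "is one partitioning better than another" test the description asks for.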
[jira] [Resolved] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2341. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1663 [https://github.com/apache/spark/pull/1663] loadLibSVMFile doesn't handle regression datasets - Key: SPARK-2341 URL: https://issues.apache.org/jira/browse/SPARK-2341 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Eustache Assignee: Sean Owen Priority: Minor Labels: easyfix Fix For: 1.1.0 Many datasets exist in LibSVM format for regression tasks [1], but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded, but in multiclass mode: each target value is interpreted as a class name! The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine. [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
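A sketch of the proposed parser, following the description's naming (the actual MLlib LabelParser trait may differ slightly): where the binary parser maps labels to 0/1, a regression parser simply keeps the raw target.
{code}
// Assumed shape of the parser interface; MLlib's real trait may differ.
trait LabelParser extends Serializable {
  def parse(labelString: String): Double
}

// Proposed fix: pass the LibSVM target through as a Double, e.g. "3.7" => 3.7,
// instead of coercing it to a class label.
object RegressionLabelParser extends LabelParser {
  override def parse(labelString: String): Double = labelString.toDouble
}
{code}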
[jira] [Commented] (SPARK-2654) Leveled logging in PySpark
[ https://issues.apache.org/jira/browse/SPARK-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080311#comment-14080311 ] Michael Yannakopoulos commented on SPARK-2654: -- Hi Davies, I can work on this issue. Please fill me in about the logging system that PySpark uses and where I can start contributing. Thanks! Leveled logging in PySpark -- Key: SPARK-2654 URL: https://issues.apache.org/jira/browse/SPARK-2654 Project: Spark Issue Type: Improvement Reporter: Davies Liu Add more leveled logging in PySpark; the logging level should be easily controlled via configuration and command-line arguments. -- This message was sent by Atlassian JIRA (v6.2#6252)