[jira] [Commented] (SPARK-6223) Avoid Build warning- enable implicit value scala.language.existentials visible
[ https://issues.apache.org/jira/browse/SPARK-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352837#comment-14352837 ] Vinod KC commented on SPARK-6223: - I'm working on this Avoid Build warning- enable implicit value scala.language.existentials visible -- Key: SPARK-6223 URL: https://issues.apache.org/jira/browse/SPARK-6223 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Vinod KC Priority: Trivial spark/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala:316: inferred existential type Option[(Class[_$4], org.apache.spark.sql.sources.BaseRelation)] forSome { type _$4 }, which cannot be expressed by wildcards, should be enabled by making the implicit value scala.language.existentials visible. This can be achieved by adding the import clause 'import scala.language.existentials' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
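For context, this warning fires whenever the compiler infers an existential type that wildcards cannot express. A minimal standalone sketch (not the actual ddl.scala code) of the pattern that triggers it under -feature, plus the one-line import the compiler suggests:

```scala
// Enabling the language feature, as the warning message suggests, silences it.
import scala.language.existentials

// Binding a wildcard-typed value to a val forces the compiler to infer an
// existential such as (Class[_$1], String) forSome { type _$1 } -- the same
// shape reported for ddl.scala:316. Without the import above, compiling
// with -feature would emit the warning quoted in this issue.
def capture(c: Class[_]): Option[(Class[_], String)] = {
  val pair = (c, "relation") // existential type inferred here
  Some(pair)
}
```

The alternative is passing the `-language:existentials` compiler flag build-wide instead of importing per file.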
[jira] [Closed] (SPARK-6224) Also collect NamedExpressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-6224. -- Resolution: Not a Problem Also collect NamedExpressions in PhysicalOperation -- Key: SPARK-6224 URL: https://issues.apache.org/jira/browse/SPARK-6224 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Currently in PhysicalOperation, only Alias expressions are collected. Similarly, NamedExpression can be collected for substitution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6224) Also collect NamedExpressions in PhysicalOperation
Liang-Chi Hsieh created SPARK-6224: -- Summary: Also collect NamedExpressions in PhysicalOperation Key: SPARK-6224 URL: https://issues.apache.org/jira/browse/SPARK-6224 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Currently in PhysicalOperation, only Alias expressions are collected. Similarly, NamedExpression can be collected for substitution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6225) Resolve most build warnings, 1.3.0 edition
Sean Owen created SPARK-6225: Summary: Resolve most build warnings, 1.3.0 edition Key: SPARK-6225 URL: https://issues.apache.org/jira/browse/SPARK-6225 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core, SQL, Streaming Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Post-1.3.0, I think it would be a good exercise to resolve a number of build warnings that have accumulated recently. See for example efforts begun at https://github.com/apache/spark/pull/4948 https://github.com/apache/spark/pull/4900 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6188) Instance types can be mislabeled when re-starting cluster with default arguments
[ https://issues.apache.org/jira/browse/SPARK-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6188. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4916 [https://github.com/apache/spark/pull/4916] Instance types can be mislabeled when re-starting cluster with default arguments Key: SPARK-6188 URL: https://issues.apache.org/jira/browse/SPARK-6188 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1 Reporter: Theodore Vasiloudis Priority: Minor Fix For: 1.4.0 This was discovered when investigating https://issues.apache.org/jira/browse/SPARK-5838. In short, when restarting a cluster that you launched with an alternative instance type, you have to provide the instance type(s) again in the {{./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>}} command. Otherwise it gets set to the default m1.large. This then affects the setup of the machines. I'll submit a pull request that takes care of this, so the user does not need to provide the instance type(s) again. EDIT: Example case where this becomes a problem: 1. User launches a cluster with instances that have 1 disk, e.g. m3.large. 2. The user stops the cluster. 3. When the user restarts the cluster with the start command without providing the instance type, the setup is performed using the default instance type, m1.large, which assumes 2 disks are present in the machine. 4. SPARK_LOCAL_DIRS is then set to /mnt/spark,/mnt2/spark. /mnt2 corresponds to the snapshot partition in an m3.large instance, which is only 8GB in size. When the user runs jobs that shuffle data, this partition fills up quickly, resulting in failed jobs due to "No space left on device" errors. Apart from this example one could come up with other cases where the setup of the machines is wrong due to assuming that they are of type m1.large. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
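Until the fix lands, restarting with the instance type stated explicitly avoids the mislabeling. A hedged sketch of the invocation (the `--instance-type` flag per the spark-ec2 script's options of that era; placeholders in angle brackets):

```shell
# Repeat the type used at launch so setup does not assume the m1.large default.
./spark-ec2 -i <key-file> --region=<ec2-region> \
  --instance-type=m3.large \
  start <cluster-name>
```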
[jira] [Commented] (SPARK-6224) Also collect NamedExpressions in PhysicalOperation
[ https://issues.apache.org/jira/browse/SPARK-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352945#comment-14352945 ] Apache Spark commented on SPARK-6224: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4949 Also collect NamedExpressions in PhysicalOperation -- Key: SPARK-6224 URL: https://issues.apache.org/jira/browse/SPARK-6224 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Currently in PhysicalOperation, only Alias expressions are collected. Similarly, NamedExpression can be collected for substitution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5986) Model import/export for KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353022#comment-14353022 ] Apache Spark commented on SPARK-5986: - User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/4951 Model import/export for KMeansModel --- Key: SPARK-5986 URL: https://issues.apache.org/jira/browse/SPARK-5986 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Support save/load for KMeansModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6223) Avoid Build warning- enable implicit value scala.language.existentials visible
Vinod KC created SPARK-6223: --- Summary: Avoid Build warning- enable implicit value scala.language.existentials visible Key: SPARK-6223 URL: https://issues.apache.org/jira/browse/SPARK-6223 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Vinod KC Priority: Trivial spark/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala:316: inferred existential type Option[(Class[_$4], org.apache.spark.sql.sources.BaseRelation)] forSome { type _$4 }, which cannot be expressed by wildcards, should be enabled by making the implicit value scala.language.existentials visible. This can be achieved by adding the import clause 'import scala.language.existentials' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6223) Avoid Build warning- enable implicit value scala.language.existentials visible
[ https://issues.apache.org/jira/browse/SPARK-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352866#comment-14352866 ] Apache Spark commented on SPARK-6223: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/4948 Avoid Build warning- enable implicit value scala.language.existentials visible -- Key: SPARK-6223 URL: https://issues.apache.org/jira/browse/SPARK-6223 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Vinod KC Priority: Trivial spark/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala:316: inferred existential type Option[(Class[_$4], org.apache.spark.sql.sources.BaseRelation)] forSome { type _$4 }, which cannot be expressed by wildcards, should be enabled by making the implicit value scala.language.existentials visible. This can be achieved by adding the import clause 'import scala.language.existentials' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6225) Resolve most build warnings, 1.3.0 edition
[ https://issues.apache.org/jira/browse/SPARK-6225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352984#comment-14352984 ] Apache Spark commented on SPARK-6225: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4950 Resolve most build warnings, 1.3.0 edition -- Key: SPARK-6225 URL: https://issues.apache.org/jira/browse/SPARK-6225 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core, SQL, Streaming Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Post-1.3.0, I think it would be a good exercise to resolve a number of build warnings that have accumulated recently. See for example efforts begun at https://github.com/apache/spark/pull/4948 https://github.com/apache/spark/pull/4900 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353129#comment-14353129 ] Sean Owen commented on SPARK-3066: -- My anecdotal experience with it was that getting an order-of-magnitude speedup meant losing a small but noticeable amount of quality in the top recommendations. That is, you would fail to consider as candidates some of the items that were actually top recs. The most actionable test / implementation I have to show this for ALS is ... https://github.com/cloudera/oryx/blob/master/als-common/src/it/java/com/cloudera/oryx/als/common/candidate/LocationSensitiveHashIT.java This could let you run tests for a certain scale, certain degree of hashing, etc., if you wanted to. I've actually tried to avoid needing LSH just for speed in order to avoid this tradeoff. Anyway for papers? I found this pretty complex treatment: http://papers.nips.cc/paper/5329-asymmetric-lsh-alsh-for-sublinear-time-maximum-inner-product-search-mips.pdf This has a little info on the quality of LSH: https://fruct.org/sites/default/files/files/conference15/Ponomarev_LSH_P2P.pdf It's one of those things where I'm sure it can be done better than the basic ways I know to do it, but haven't yet found a killer paper. Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. 
We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
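The three steps above can be sketched in plain Scala (an illustrative stand-in only: real code would broadcast the collected factor matrix to executors and call native level-3 BLAS for step 2, and the tiny factor matrices here are made-up data):

```scala
// Hypothetical rank-2 factors: one row per user / per product.
val userFactors    = Array(Array(1.0, 0.0), Array(0.5, 0.5))
val productFactors = Array(Array(0.2, 0.8), Array(0.9, 0.1), Array(0.4, 0.4))

def topK(k: Int): Array[Array[Int]] =
  userFactors.map { u =>
    // Step 2 stand-in: inner products against every product vector
    // (level-3 BLAS would score a whole block of users at once).
    val scores = productFactors.map(p => u.zip(p).map { case (a, b) => a * b }.sum)
    // Step 3 stand-in: keep the k highest-scoring product indices per user.
    scores.zipWithIndex.sortBy(-_._1).map(_._2).take(k)
  }
```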
[jira] [Updated] (SPARK-6188) Instance types can be mislabeled when re-starting cluster with default arguments
[ https://issues.apache.org/jira/browse/SPARK-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6188: - Shepherd: (was: Josh Rosen) Assignee: Theodore Vasiloudis Instance types can be mislabeled when re-starting cluster with default arguments Key: SPARK-6188 URL: https://issues.apache.org/jira/browse/SPARK-6188 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1 Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Priority: Minor Fix For: 1.4.0 This was discovered when investigating https://issues.apache.org/jira/browse/SPARK-5838. In short, when restarting a cluster that you launched with an alternative instance type, you have to provide the instance type(s) again in the {{./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>}} command. Otherwise it gets set to the default m1.large. This then affects the setup of the machines. I'll submit a pull request that takes care of this, so the user does not need to provide the instance type(s) again. EDIT: Example case where this becomes a problem: 1. User launches a cluster with instances that have 1 disk, e.g. m3.large. 2. The user stops the cluster. 3. When the user restarts the cluster with the start command without providing the instance type, the setup is performed using the default instance type, m1.large, which assumes 2 disks are present in the machine. 4. SPARK_LOCAL_DIRS is then set to /mnt/spark,/mnt2/spark. /mnt2 corresponds to the snapshot partition in an m3.large instance, which is only 8GB in size. When the user runs jobs that shuffle data, this partition fills up quickly, resulting in failed jobs due to "No space left on device" errors. Apart from this example one could come up with other cases where the setup of the machines is wrong due to assuming that they are of type m1.large. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6223) Avoid Build warning- enable implicit value scala.language.existentials visible
[ https://issues.apache.org/jira/browse/SPARK-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6223. -- Resolution: Duplicate I think this is a subset of a larger logical change, to clean up all similar build warnings at once. Avoid Build warning- enable implicit value scala.language.existentials visible -- Key: SPARK-6223 URL: https://issues.apache.org/jira/browse/SPARK-6223 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Vinod KC Priority: Trivial spark/sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala:316: inferred existential type Option[(Class[_$4], org.apache.spark.sql.sources.BaseRelation)] forSome { type _$4 }, which cannot be expressed by wildcards, should be enabled by making the implicit value scala.language.existentials visible. This can be achieved by adding the import clause 'import scala.language.existentials' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5986) Model import/export for KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353117#comment-14353117 ] Xusen Yin commented on SPARK-5986: -- Got it. Do you mind assigning SPARK-5991 to me? Thanks! Model import/export for KMeansModel --- Key: SPARK-5986 URL: https://issues.apache.org/jira/browse/SPARK-5986 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Support save/load for KMeansModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353030#comment-14353030 ] Cheng Lian commented on SPARK-6201: --- Played with Hive's implicit type conversion a bit more and found that Hive actually converts integers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{'1.00'}} would appear in the result. However, the result set is empty. References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100] INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table:
{code}
sqlc.jsonRDD(sc.parallelize(Seq("""{"a": "1"}""", """{"a": "2"}""", """{"a": "3"}"""))).registerTempTable("d")
{code}
The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}
Then,
{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
// => Array([1], [2])
{code}
where {{d.a}} and the constants 1, 2 are first cast to Double for the comparison, as you can see in the plan:
{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType)))
{noformat}
However, if I use
{code}
sql("select * from d where d.a in (1,2)").collect
{code}
the result is empty. The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}
*It seems the INSET implementation in SparkSQL doesn't coerce types implicitly, where Hive does. 
We should make SparkSQL conform to Hive's behavior, even though doing implicit coercion here is very confusing for comparing String and Int.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
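As a workaround until INSET coerces types, casting the column explicitly puts both sides under one type, matching the OR-of-equalities behavior. A hedged sketch against the table registered in the report (assuming the same sqlc context; untested against 1.2/1.3):

```scala
// Explicit cast so the IN list and the column compare as Double,
// sidestepping the missing coercion in INSET.
sql("select * from d where cast(d.a as double) in (1.0, 2.0)").collect()
```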
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353114#comment-14353114 ] Joseph K. Bradley commented on SPARK-3066: -- Oops, true, not an actual metric. LSH sounds reasonable. Do you know of use cases or how well it's been found to work for recommendation problems? Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Amelot updated SPARK-6227: - Affects Version/s: 1.2.1 PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot Priority: Minor The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5986) Model import/export for KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353109#comment-14353109 ] Joseph K. Bradley commented on SPARK-5986: -- I'd recommend doing the 2 separately to make smaller PRs. The Python ones can be matched with a subtask from this JIRA: [https://issues.apache.org/jira/browse/SPARK-5991] Thanks! Model import/export for KMeansModel --- Key: SPARK-5986 URL: https://issues.apache.org/jira/browse/SPARK-5986 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Support save/load for KMeansModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6227) PCA and SVD for PySpark
Julien Amelot created SPARK-6227: Summary: PCA and SVD for PySpark Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Julien Amelot Priority: Minor The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
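For reference, the Scala API that would need Python bindings is on RowMatrix. A sketch using the methods shown in the linked documentation (`sc` is an existing SparkContext; requires a Spark runtime):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))
val mat = new RowMatrix(rows)

// Top-k principal components, returned as a local n x k matrix.
val pc = mat.computePrincipalComponents(2)

// Truncated SVD; computeU = true also materializes U as a distributed matrix.
val svd = mat.computeSVD(2, computeU = true)
```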
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353047#comment-14353047 ] Sean Owen commented on SPARK-5134: -- Yep, I confirmed that ...
{code}
[INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.1:compile
...
[INFO]    +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO]    |  |  +- commons-cli:commons-cli:jar:1.2:compile
...
{code}
Well, FWIW, although unintentional I do think there are upsides to this change. It would be good to codify that in the build, I suppose, by updating the default version number. How about updating to 2.2.0 to match what has actually happened? This would not entail activating the Hadoop build profiles by default or anything. [~rdub] would you care to do the honors? Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
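Concretely, the suggestion would amount to a one-property change in the parent pom.xml. A sketch (property name as it appears in Spark's build of that era; the hadoop-* profiles would continue to override it):

```xml
<!-- Default Hadoop version when no hadoop-* profile is active.
     Bumped from 1.0.4 to 2.2.0 to match what published artifacts
     already pull in transitively. -->
<hadoop.version>2.2.0</hadoop.version>
```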
[jira] [Created] (SPARK-6226) Support model save/load in Python's KMeans
Joseph K. Bradley created SPARK-6226: Summary: Support model save/load in Python's KMeans Key: SPARK-6226 URL: https://issues.apache.org/jira/browse/SPARK-6226 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4734) [Streaming]limit the file Dstream size for each batch
[ https://issues.apache.org/jira/browse/SPARK-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4734. -- Resolution: Won't Fix I feel strongly that this sort of change introduces new problems and doesn't solve problems. It can't accelerate a streaming system's throughput; this is about smoothing peakiness, which is what message queueing systems are for. This would attempt to design a simple queueing system without the fault tolerance. In some cases, a larger batch duration also suffices to smooth peaks. [Streaming]limit the file Dstream size for each batch - Key: SPARK-4734 URL: https://issues.apache.org/jira/browse/SPARK-4734 Project: Spark Issue Type: New Feature Components: Streaming Reporter: 宿荣全 Priority: Minor File streaming scans new files on HDFS and processes them in each batch. The current implementation has some problems: 1. When the number of new files (or their total size) in a batch segment is very large, the processing time becomes very long and may exceed the slideDuration, which eventually delays dispatch of the next batch. 2. When the total size of the file DStream in one batch is very large, shuffling that data multiplies its memory occupation several times over, and the application slows down or is even terminated by the operating system. So if we set an upper limit on the input data per batch to control the batch processing time, both the job dispatch delay and the processing delay would be alleviated. Modification: add a new parameter spark.streaming.segmentSizeThreshold to InputDStream (the input data base class); the size of each batch's segments is set via this parameter from [spark-defaults.conf] or in source. All implementing classes of InputDStream act on segmentSizeThreshold accordingly. 
This patch modifies FileInputDStream: when new files are found, their names and sizes are put in a queue, and elements are packaged into batches whose total size stays within segmentSizeThreshold. Please see the source for the detailed logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
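The batching the patch describes can be sketched independently of Spark (a hypothetical helper; the real change lives inside FileInputDStream, and spark.streaming.segmentSizeThreshold is the config name proposed in the description):

```scala
import scala.collection.mutable

// Drain queued (fileName, sizeInBytes) entries into one batch until adding
// the next file would push the batch past the configured threshold.
def takeBatch(queue: mutable.Queue[(String, Long)],
              segmentSizeThreshold: Long): Seq[String] = {
  val batch = mutable.ArrayBuffer.empty[String]
  var total = 0L
  while (queue.nonEmpty && total + queue.head._2 <= segmentSizeThreshold) {
    val (name, size) = queue.dequeue()
    total += size
    batch += name
  }
  // Always make progress: a single oversized file is taken on its own.
  if (batch.isEmpty && queue.nonEmpty) batch += queue.dequeue()._1
  batch.toSeq
}
```

Leftover files simply stay queued for the next batch, which is what smooths a burst of new files across intervals.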
[jira] [Created] (SPARK-6230) Provide authentication and encryption for Spark's RPC
Marcelo Vanzin created SPARK-6230: - Summary: Provide authentication and encryption for Spark's RPC Key: SPARK-6230 URL: https://issues.apache.org/jira/browse/SPARK-6230 Project: Spark Issue Type: Sub-task Reporter: Marcelo Vanzin Make sure the RPC layer used by Spark supports the auth and encryption features of the network/common module. This kinda ignores akka; adding support for SASL to akka, while possible, seems to be at odds with the direction being taken in Spark, so let's restrict this to the new RPC layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353225#comment-14353225 ] Joseph K. Bradley commented on SPARK-3066: -- Thanks for the references! I'll take a look, but based on what you say, perhaps focusing on BLAS is the best path for now. Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248 ] Jianshi Huang edited comment on SPARK-6201 at 3/9/15 5:40 PM: -- Implicit coercion outside the numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please think about adding a switch, say bq. spark.sql.strict_mode = true(default) / false Jianshi was (Author: huangjs): Implicit coercion outside the numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please think about adding a switch, say ``` spark.sql.strict_mode = true(default) / false ``` Jianshi INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang Suppose we have the following table:
{code}
sqlc.jsonRDD(sc.parallelize(Seq("""{"a": "1"}""", """{"a": "2"}""", """{"a": "3"}"""))).registerTempTable("d")
{code}
The schema is
{noformat}
root
 |-- a: string (nullable = true)
{noformat}
Then,
{code}
sql("select * from d where (d.a = 1 or d.a = 2)").collect
// => Array([1], [2])
{code}
where {{d.a}} and the constants 1, 2 are first cast to Double for the comparison, as you can see in the plan:
{noformat}
Filter ((CAST(a#155, DoubleType) = CAST(1, DoubleType)) || (CAST(a#155, DoubleType) = CAST(2, DoubleType)))
{noformat}
However, if I use
{code}
sql("select * from d where d.a in (1,2)").collect
{code}
the result is empty. The physical plan shows it's using INSET:
{noformat}
== Physical Plan ==
Filter a#155 INSET (1,2)
 PhysicalRDD [a#155], MappedRDD[499] at map at JsonRDD.scala:47
{noformat}
*It seems the INSET implementation in SparkSQL doesn't coerce types implicitly, where Hive does. 
We should make SparkSQL conform to Hive's behavior, even though doing implicit coercion here is very confusing for comparing String and Int.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248 ] Jianshi Huang commented on SPARK-6201: -- Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please consider adding a switch, say spark.sql.strict_mode = true (default) / false. Jianshi
INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
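The missing step the issue describes can be sketched outside Spark. This is a hypothetical illustration, not Spark's actual `InSet` code: before testing membership, cast the (string) column value and every IN-list literal to a common numeric type, the same way the binary `=` path casts both sides to Double.

```scala
// Hypothetical sketch of the coercion INSET lacks (assumption: Double is the
// common type, mirroring the CAST(..., DoubleType) seen in the `=` plan).
object InsetCoercionSketch {
  def coercedIn(value: String, inList: Seq[Int]): Boolean = {
    val casted = value.toDouble           // CAST(a, DoubleType)
    inList.exists(_.toDouble == casted)   // compare in the common type
  }
}
```

With this coercion, `"1" IN (1, 2)` holds even though the column is a string, matching the behavior of the equivalent OR-of-equalities query.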
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353248#comment-14353248 ] Jianshi Huang edited comment on SPARK-6201 at 3/9/15 5:39 PM: -- Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please consider adding a switch, say
```
spark.sql.strict_mode = true (default) / false
```
Jianshi
was (Author: huangjs): Implicit coercion outside the Numeric domain is quite evil. I don't think Hive's behavior makes sense here. Raising an exception is fine in this case. And if you want to make it Hive compliant, then please consider adding a switch, say spark.sql.strict_mode = true (default) / false. Jianshi
INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353030#comment-14353030 ] Cheng Lian edited comment on SPARK-6201 at 3/9/15 5:10 PM: --- Played with Hive's implicit type conversion a bit more and found that Hive actually converts the numbers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{1.00}} should appear in the result. However, the result set is empty. For the expression {{1.00 IN (1.0)}}, a {{GenericUDFIn}} instance is created and called with the argument list {{(1.00, 1.0)}}. Then {{GenericUDFIn}} tries to convert all arguments into a common data type from left to right. Since double is allowed to be converted to string, {{1.0}} is converted into the string {{1.0}}.
References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100]
was (Author: lian cheng): Played with Hive's implicit type conversion a bit more and found that Hive actually converts the numbers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{1.00}} should appear in the result. However, the result set is empty.
References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100]
INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6229) Support encryption in network/common module
Marcelo Vanzin created SPARK-6229: - Summary: Support encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
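A minimal sketch of the knob involved, assuming the standard JDK SASL API (not Spark's wrapper code): encryption, as opposed to authentication alone, is requested through the quality-of-protection property when the DIGEST-MD5 client or server is created.

```scala
// Sketch only: the SASL quality-of-protection levels are "auth"
// (authentication only), "auth-int" (adds integrity checks), and
// "auth-conf" (adds confidentiality, i.e. encryption of the connection).
import javax.security.sasl.Sasl

object SaslEncryptionSketch {
  val encryptionProps = new java.util.HashMap[String, String]()
  encryptionProps.put(Sasl.QOP, "auth-conf")
}
```

These properties would be passed to `Sasl.createSaslClient`/`createSaslServer`; with "auth-conf" negotiated, payloads go through the mechanism's `wrap`/`unwrap` calls.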
[jira] [Comment Edited] (SPARK-6201) INSET should coerce types
[ https://issues.apache.org/jira/browse/SPARK-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353030#comment-14353030 ] Cheng Lian edited comment on SPARK-6201 at 3/9/15 5:13 PM: --- Played with Hive's implicit type conversion a bit more and found that Hive actually converts the numbers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{1.00}} should appear in the result. However, the result set is empty. For the expression {{1.00 IN (1.0)}}, a {{GenericUDFIn}} instance is created and called with the argument list {{(1.00, 1.0)}}. Then {{GenericUDFIn.initialize}} tries to convert all arguments into a common data type from left to right. Since double is allowed to be converted to string, {{1.0}} is converted into the string {{1.0}}.
References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100]
was (Author: lian cheng): Played with Hive's implicit type conversion a bit more and found that Hive actually converts the numbers to strings in your case:
{code:sql}
hive> create table t1 as select '1.00' as c1;
hive> select * from t1 where c1 in (1.0);
{code}
If {{c1}} were converted to numeric, then {{1.00}} should appear in the result. However, the result set is empty. For the expression {{1.00 IN (1.0)}}, a {{GenericUDFIn}} instance is created and called with the argument list {{(1.00, 1.0)}}. Then {{GenericUDFIn}} tries to convert all arguments into a common data type from left to right. Since double is allowed to be converted to string, {{1.0}} is converted into the string {{1.0}}.
References: # [Implicit type coercion support in existing database systems|http://chapeau.freevariable.com/2014/08/existing-system-coercion.html] by William Benton # [{{GenericUDFIn.initialize}}|https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java#L84-L100]
INSET should coerce types - Key: SPARK-6201 URL: https://issues.apache.org/jira/browse/SPARK-6201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Jianshi Huang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
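The Hive behavior described above can be re-told in a few lines of plain Scala. This is an assumption-labeled paraphrase, not Hive's code: when the IN arguments are widened left-to-right and the common type ends up being string, `'1.00' IN (1.0)` compares the strings "1.00" and "1.0", which differ, so the row is dropped.

```scala
// Paraphrase of the GenericUDFIn behavior described in the comment (an
// illustration, not Hive's implementation): the double literal is rendered
// as a string and compared to the string column value.
object CommonTypeInSketch {
  def stringTypedIn(column: String, literals: Seq[Double]): Boolean =
    literals.exists(lit => lit.toString == column) // both sides land on string
}
```

So `stringTypedIn("1.00", Seq(1.0))` is false, matching the empty result set observed in the Hive session above.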
[jira] [Commented] (SPARK-3278) Isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353252#comment-14353252 ] Vladimir Vladimirov commented on SPARK-3278: Has anyone benchmarked the performance of the Spark isotonic regression implementation on big datasets (100M, 1000M)? Isotonic regression --- Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add isotonic regression for score calibration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
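For readers unfamiliar with what the feature computes, here is a self-contained pool-adjacent-violators (PAV) sketch in plain Scala. It is not MLlib's implementation (which runs PAV per partition on an RDD); it just shows the fit a benchmark would be measuring: the non-decreasing sequence minimizing squared error.

```scala
// Minimal pool-adjacent-violators sketch: merge adjacent blocks while the
// previous block's mean violates the non-decreasing constraint, then expand
// each block back to its members' shared mean.
object PavSketch {
  def fit(y: Seq[Double]): Seq[Double] = {
    val blocks = scala.collection.mutable.ArrayBuffer.empty[(Double, Int)] // (sum, size)
    for (v <- y) {
      var (s, n) = (v, 1)
      // pool backwards while monotonicity is violated
      while (blocks.nonEmpty && blocks.last._1 / blocks.last._2 >= s / n) {
        val (ps, pn) = blocks.remove(blocks.length - 1)
        s += ps; n += pn
      }
      blocks += ((s, n))
    }
    blocks.toSeq.flatMap { case (s, n) => Seq.fill(n)(s / n) }
  }
}
```

For example, the out-of-order input 1, 3, 2 pools the last two points into their mean 2.5, yielding the isotonic fit 1, 2.5, 2.5.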
[jira] [Commented] (SPARK-5986) Model import/export for KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353221#comment-14353221 ] Joseph K. Bradley commented on SPARK-5986: -- Subtask assigned Model import/export for KMeansModel --- Key: SPARK-5986 URL: https://issues.apache.org/jira/browse/SPARK-5986 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Support save/load for KMeansModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6226) Support model save/load in Python's KMeans
[ https://issues.apache.org/jira/browse/SPARK-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6226: - Assignee: Xusen Yin Support model save/load in Python's KMeans -- Key: SPARK-6226 URL: https://issues.apache.org/jira/browse/SPARK-6226 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6228) Provide SASL support in network/common module
Marcelo Vanzin created SPARK-6228: - Summary: Provide SASL support in network/common module Key: SPARK-6228 URL: https://issues.apache.org/jira/browse/SPARK-6228 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin Currently, there's support for SASL in network/shuffle, but not in network/common. Moving the SASL code to network/common would enable other applications using that code to also support secure authentication and, later, encryption. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6219) Expand Python lint checks to check for compilation errors
[ https://issues.apache.org/jira/browse/SPARK-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353325#comment-14353325 ] Nicholas Chammas commented on SPARK-6219: - That's a good point; I haven't checked to see what's already covered in that way by unit tests. At the very least, I can say that this will catch stuff in spark-ec2 and the examples that unit tests currently do not cover. Also, it runs very, very quickly. Expand Python lint checks to check for compilation errors -- Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the code at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4911) Report the inputs and outputs of Spark jobs so that external systems can track data lineage
[ https://issues.apache.org/jira/browse/SPARK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353598#comment-14353598 ] Ted Malaska commented on SPARK-4911: Thanks [~sandyr]. Yes, this would be very helpful. Today there is a good bit of this information in the logs, but it is not standardized. I'm not 100% sure what the best solution is, but it is needed. I would be happy to help review. Ted Malaska Report the inputs and outputs of Spark jobs so that external systems can track data lineage Key: SPARK-4911 URL: https://issues.apache.org/jira/browse/SPARK-4911 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza When Spark runs a job, it would be useful to log its filesystem inputs and outputs somewhere. This allows external tools to track which persisted datasets are derived from other persisted datasets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4600) org.apache.spark.graphx.VertexRDD.diff does not work
[ https://issues.apache.org/jira/browse/SPARK-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353627#comment-14353627 ] Ankur Dave commented on SPARK-4600: --- As I wrote in SPARK-6022, this is the documented behavior for diff: for each vertex present in _both_ A and B, it returns only those with differing values. Also, it's only guaranteed to work if the VertexRDDs share a common ancestor, which is not true in this test. We don't currently have a set difference operator, which would do what you expect. org.apache.spark.graphx.VertexRDD.diff does not work Key: SPARK-4600 URL: https://issues.apache.org/jira/browse/SPARK-4600 Project: Spark Issue Type: Bug Components: GraphX Environment: scala 2.10.4 spark 1.1.0 Reporter: Teppei Tosa Assignee: Brennon York Labels: graphx VertexRDD.diff doesn't work. For example:
{code}
val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
setA.collect.foreach(println(_))
// (0,0)
// (1,1)
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt)))
setB.collect.foreach(println(_))
// (1,1)
// (2,2)
val diff = setA.diff(setB)
diff.collect.foreach(println(_))
// printed none
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
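The two semantics in this thread can be contrasted with plain Maps standing in for VertexRDDs. This is an illustrative sketch, not GraphX code: the documented `diff` keeps vertices present in *both* sides whose values differ (returning the other side's value), while the missing set-difference operator would keep this side's vertices that are absent from the other.

```scala
// Plain-Map sketch of the two operations discussed (illustration only).
object VertexDiffSketch {
  // Documented `diff` semantics: vertices in both maps whose values changed.
  def diff[K, V](a: Map[K, V], b: Map[K, V]): Map[K, V] =
    b.filter { case (k, v) => a.get(k).exists(_ != v) }

  // Hypothetical set-difference the reporter expected.
  def minus[K, V](a: Map[K, V], b: Map[K, V]): Map[K, V] =
    a.filterNot { case (k, _) => b.contains(k) }
}
```

On the reporter's data, `diff(Map(0L -> 0, 1L -> 1), Map(1L -> 1, 2L -> 2))` is empty, because the only shared vertex (1L) has an unchanged value; that matches the "printed none" observation, while `minus` would return the 0L entry.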
[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353687#comment-14353687 ] Imran Rashid commented on SPARK-6190: - Another observation as I've dug into the implementation a little more. {{LargeByteBuf}} (the version which wraps netty's {{ByteBuf}}) isn't necessary on the *sending* side. The large messages are in {{ChunkFetchSuccess}}, which is either sending a file region or something that is already available in an nio {{ByteBuffer}}. We do still need a {{LargeByteBuf}} to make it easy for the *receiving* end to get these messages, though. (Example decoder: https://github.com/apache/spark/blob/5e83a55daa30a19840214f77681248e112635bf6/network/common/src/main/java/org/apache/spark/network/protocol/FixedChunkLargeFrameDecoder.java) But that simplifies the API it needs to expose -- the encoding/decoding can still be done on {{ByteBuf}}, and we just expose the bulk of the data in the {{LargeByteBuf}} via conversion to an nio {{LargeByteBuffer}}, exposed as an {{InputStream}}. Here is a WIP branch that demonstrates the basics of transferring large blocks, though it's got a fair amount of cleanup necessary before you look too closely. I'm pretty sure some of the functionality here is not needed (e.g. {{LargeByteBuf#getInt}}, since the decoding can really just use one of the {{ByteBuf}}s). https://github.com/apache/spark/compare/master...squito:SPARK-6190_largeBB create LargeByteBuffer abstraction for eliminating 2GB limit on blocks -- Key: SPARK-6190 URL: https://issues.apache.org/jira/browse/SPARK-6190 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Imran Rashid Assignee: Imran Rashid Attachments: LargeByteBuffer.pdf A key component in eliminating the 2GB limit on blocks is creating a proper abstraction for storing more than 2GB. Currently Spark is limited by a reliance on nio ByteBuffer and netty ByteBuf, both of which are limited to 2GB. This task will introduce the new abstraction and the relevant implementation and utilities, without affecting the existing implementation at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
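The core idea of such an abstraction can be sketched in a few lines. This is a toy illustration, not the API proposed in the attached design doc: present several chunks, each individually under the 2GB array/buffer limit, as one logically contiguous buffer addressed by a Long position.

```scala
// Toy chunked-buffer sketch (illustration only): a Long-addressable view
// over several byte arrays, each of which stays under the 2GB Int limit.
class ChunkedBuf(chunks: Seq[Array[Byte]]) {
  val size: Long = chunks.map(_.length.toLong).sum

  // Walk the chunk list to resolve a Long position to (chunk, Int offset).
  def get(pos: Long): Byte = {
    var remaining = pos
    for (c <- chunks) {
      if (remaining < c.length) return c(remaining.toInt)
      remaining -= c.length
    }
    throw new IndexOutOfBoundsException(pos.toString)
  }
}
```

The interesting part is purely the addressing: positions and sizes are Longs at the outer interface, while every actual read bottoms out in an Int-indexed chunk.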
[jira] [Commented] (SPARK-6211) Test Python Kafka API using Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353737#comment-14353737 ] Tathagata Das commented on SPARK-6211: -- That's a good point. That requires the kafka-assembly. Fortunately, the external/kafka-assembly project already exists, which creates the JAR. All you need to figure out is how to add the generated kafka-assembly.jar to the Java classpath when pyspark is called for running tests. 1. bin/pyspark already calls bin/spark-submit, which can take the kafka-assembly jar as --jars path-to-jar. 2. The Python tests are run with bin/python/run-tests, which calls bin/pyspark. Please take a look at those to figure out how we can pass on the kafka assembly with --jars for the Kafka Python tests. Does that make sense? Test Python Kafka API using Python unit tests - Key: SPARK-6211 URL: https://issues.apache.org/jira/browse/SPARK-6211 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Saisai Shao Priority: Critical This is tricky in Python because KafkaStreamSuiteBase (which has the functionality for creating embedded Kafka clusters) is in the test package, which is not in the Python path. To fix that, we have two ways: 1. Add the test jar to the classpath in the Python test. That's kind of tricky. 2. Bring that into the src package (maybe renamed to KafkaTestUtils), and then wrap it in Python to use it from Python. If (2) does not add any extra test dependencies to the main Kafka pom, then (2) should be simpler to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms
---
--- Time: 142592237 ms
---
(1.0,4.0)
--- Time: 1425922371000 ms
---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions: * the receiver is from a queue (StreamingContext.queueStream) * if the commented-out print is un-commented * if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
was: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are enough for the program to stall with the same output as above. The "Stage..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions: * the receiver is from a queue (StreamingContext.queueStream) * if the commented-out print is un-commented * if the window+reduce is substituted with reduceByKeyAndWindow
The test application is the same as the one listed above.
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical
[jira] [Commented] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353487#comment-14353487 ] Sean Owen commented on SPARK-6232: -- I can't reproduce this, although I just tried a slightly different setup, using spark-shell and the latest master (~1.3.0). I used the code above otherwise, including words.print(). Maybe you can try that instead, just to see whether it works for you? That could narrow things down. You have at least two cores available, right? If you had only one core, the receiver could starve the workers. I have also in the past observed that Spark programs sometimes don't work right as an {{App}} due to weird closure problems; declaring a simple main() method resolves those. That's a wild guess, though. Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Tried builds with Scala 2.11 and 2.10 (for the Kafka receiver). Also tried the pre-built spark-1.2.1-bin-hadoop2.4.tgz. The bug reproduces in all cases on 3 different computers we've tried. Reporter: Platon Potapov Priority: Critical
Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window would slide over several non-empty RDDs). 2-3 such iterations are enough for the program to stall completely (no new events are processed) with the following output:
{code}
--- Time: 1425922369000 ms
---
--- Time: 142592237 ms
---
(1.0,4.0)
--- Time: 1425922371000 ms
---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases.
NOTE that the bug does not reproduce under the following conditions: * the receiver is from a queue (StreamingContext.queueStream) * if the commented-out print is un-commented * if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4911) Report the inputs and outputs of Spark jobs so that external systems can track data lineage
[ https://issues.apache.org/jira/browse/SPARK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353554#comment-14353554 ] Sandy Ryza commented on SPARK-4911: --- I know that [~malaskat] has played around with a solution that hacks around the lack of this, but it's definitely something still needed. Report the inputs and outputs of Spark jobs so that external systems can track data lineage Key: SPARK-4911 URL: https://issues.apache.org/jira/browse/SPARK-4911 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza When Spark runs a job, it would be useful to log its filesystem inputs and outputs somewhere. This allows external tools to track which persisted datasets are derived from other persisted datasets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)
[ https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353573#comment-14353573 ] Sean Owen commented on SPARK-5368: -- I feel qualified enough to review doc or config changes. It seems harmless enough if the intent is to allow passing some config straight through, and in fact it should already be possible in general. I don't know the Akka bits that well, but we can CC people who do. I know it may not be trivial to update to Akka 2.4, if that's what's on deck here. Anyway, is there a PR to look at? Spark should support NAT (via akka improvements) - Key: SPARK-5368 URL: https://issues.apache.org/jira/browse/SPARK-5368 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: jay vyas Fix For: 1.2.2 Spark sets up actors for akka with a set of variables which are defined in the {{AkkaUtils.scala}} class. A snippet: {noformat} 98 |akka.loggers = [akka.event.slf4j.Slf4jLogger] 99 |akka.stdout-loglevel = ERROR 100 |akka.jvm-exit-on-fatal-error = off 101 |akka.remote.require-cookie = $requireCookie 102 |akka.remote.secure-cookie = $secureCookie {noformat} We should allow users to pass in custom settings, for example, so that arbitrary akka modifications can be used at runtime for security, performance, logging, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
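One way the "pass arbitrary settings straight through" idea could look is sketched below. The `spark.akka.raw.` prefix is hypothetical (it is not an existing Spark configuration namespace); the sketch just strips a reserved prefix and emits the remainder as akka config lines to append to the generated block.

```scala
// Hypothetical pass-through sketch: any key under the (made-up) prefix
// "spark.akka.raw." is forwarded verbatim into the akka config string.
object AkkaPassThroughSketch {
  def render(conf: Map[String, String]): Seq[String] =
    conf.collect {
      case (k, v) if k.startsWith("spark.akka.raw.") =>
        s"${k.stripPrefix("spark.akka.raw.")} = $v"
    }.toSeq
}
```

For example, setting `spark.akka.raw.akka.remote.netty.tcp.bind-hostname` would surface the NAT-related akka option without Spark having to know about it.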
[jira] [Commented] (SPARK-6022) GraphX `diff` test incorrectly operating on values (not VertexId's)
[ https://issues.apache.org/jira/browse/SPARK-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353617#comment-14353617 ] Ankur Dave commented on SPARK-6022: --- [~maropu] is correct: the original intent of diff was to operate on values, not VertexIds. It was really written for internal use in [mapVertices|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L133] and [outerJoinVertices|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L284], which use it to find the set of vertices whose values have changed so they can ship only those to the edge partitions. Based on your test you're looking for the set difference. Maybe you could introduce a new method called minus? GraphX `diff` test incorrectly operating on values (not VertexId's) --- Key: SPARK-6022 URL: https://issues.apache.org/jira/browse/SPARK-6022 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York The current GraphX {{diff}} test operates on values rather than the VertexId's and, if {{diff}} were working properly (per [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600]), it should fail this test. The code to test {{diff}} should look like the below as it correctly generates {{VertexRDD}}'s with different {{VertexId}}'s to {{diff}} against. 
{code}
test("diff functionality with small concrete values") {
  withSpark { sc =>
    val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
    // setA := Set((0L, 0), (1L, 1))
    val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt + 2)))
    // setB := Set((1L, 3), (2L, 4))
    val diff = setA.diff(setB)
    assert(diff.collect.toSet == Set((2L, 4)))
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
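A set-difference {{minus}} as suggested in the comment would compare only VertexIds and ignore values entirely. A plain-Scala sketch of those semantics, modeling a vertex set as a Map (hypothetical helper, not the GraphX API):

```scala
object MinusSketch {
  type VertexId = Long

  // Proposed "minus" semantics: keep the vertices of `a` whose ids do not
  // appear in `b`, regardless of the values on either side.
  def minus[V](a: Map[VertexId, V], b: Map[VertexId, V]): Map[VertexId, V] =
    a.filter { case (id, _) => !b.contains(id) }

  def main(args: Array[String]): Unit = {
    val setA = Map(0L -> 0, 1L -> 1) // {(0L, 0), (1L, 1)}
    val setB = Map(1L -> 3, 2L -> 4) // {(1L, 3), (2L, 4)}
    // Only id 0L is in A but not in B; the changed value on id 1L is irrelevant.
    println(minus(setA, setB))
  }
}
```

This keeps diff free to retain its value-based semantics for internal use in mapVertices/outerJoinVertices, while minus gives users the id-based set difference the test above is really exercising.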
[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs
[ https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353634#comment-14353634 ] Joseph K. Bradley commented on SPARK-6113: -- Thanks! I don't think there are blockers. I'm going to get started on the initial PR, and we should be able to get your 1 open PR for GBTs in before merging becomes an issue. Stabilize DecisionTree and ensembles APIs - Key: SPARK-6113 URL: https://issues.apache.org/jira/browse/SPARK-6113 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical *Issue*: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design. *Proposal*: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details. [Design doc | https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]: This outlines current issues and the proposed API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6234) 10% Performance regression with Breeze upgrade
Nishkam Ravi created SPARK-6234: --- Summary: 10% Performance regression with Breeze upgrade Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)
[ https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353586#comment-14353586 ] Timothy St. Clair commented on SPARK-5368: -- [~sowen] IIRC there are other bugs around no longer maintaining an akka fork and updating to 2.4. https://issues.apache.org/jira/browse/SPARK-5293 Spark should support NAT (via akka improvements) - Key: SPARK-5368 URL: https://issues.apache.org/jira/browse/SPARK-5368 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: jay vyas Fix For: 1.2.2 Spark sets up actors for akka with a set of variables which are defined in the {{AkkaUtils.scala}} class. A snippet:
{noformat}
98 |akka.loggers = [akka.event.slf4j.Slf4jLogger]
99 |akka.stdout-loglevel = ERROR
100 |akka.jvm-exit-on-fatal-error = off
101 |akka.remote.require-cookie = $requireCookie
102 |akka.remote.secure-cookie = $secureCookie
{noformat}
We should allow users to pass in custom settings, for example, so that arbitrary akka modifications can be used at runtime for security, performance, logging, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5544) wholeTextFiles should recognize multiple input paths delimited by ,
[ https://issues.apache.org/jira/browse/SPARK-5544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353588#comment-14353588 ] Lev Khomich commented on SPARK-5544: I would like to work on this. [~mengxr], some things I need to clarify. 1. Should this behaviour also apply to `sc.newAPIHadoopFile` and `sc.binaryFiles`? 2. Does this change need to be explicitly documented elsewhere, because it can break backward compatibility (paths containing commas, for example)? wholeTextFiles should recognize multiple input paths delimited by , --- Key: SPARK-5544 URL: https://issues.apache.org/jira/browse/SPARK-5544 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng textFile takes delimited paths in a single path string. wholeTextFiles should behave the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
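Honoring comma-delimited input, as textFile already does, essentially means splitting the single path string before handing the pieces to Hadoop. A minimal sketch (hypothetical helper, not the actual patch), which also illustrates the backward-compatibility concern raised in the comment for paths that themselves contain commas:

```scala
object PathSplitSketch {
  // Split one comma-delimited path string into individual paths,
  // dropping empty segments (e.g. from a stray trailing comma).
  def splitPaths(paths: String): Seq[String] =
    paths.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    println(splitPaths("/data/a,/data/b")) // two inputs, as intended
    // Caveat: a single directory literally named "/data/x,y" would now be
    // misread as the two paths "/data/x" and "y" -- the compatibility risk
    // mentioned above, which is why the change may need documenting.
    println(splitPaths("/data/x,y"))
  }
}
```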
[jira] [Updated] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-677: - Target Version/s: 1.2.2, 1.4.0, 1.3.1 Affects Version/s: (was: 0.7.0) 1.4.0 1.3.0 1.0.2 1.1.1 1.2.1 Assignee: Davies Liu PySpark should not collect results through local filesystem --- Key: SPARK-677 URL: https://issues.apache.org/jira/browse/SPARK-677 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1, 1.4.0 Reporter: Josh Rosen Assignee: Davies Liu Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353641#comment-14353641 ] Apache Spark commented on SPARK-677: User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4923 PySpark should not collect results through local filesystem --- Key: SPARK-677 URL: https://issues.apache.org/jira/browse/SPARK-677 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1, 1.4.0 Reporter: Josh Rosen Assignee: Davies Liu Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
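The streaming-over-a-local-socket idea can be sketched in plain JVM terms: the driver serves the collected bytes on an ephemeral loopback port with simple length-prefix framing, and the Python side (stood in for by a plain client below) reads them back without touching the filesystem. All names are hypothetical; this is not the code from the PR above:

```scala
import java.io.{DataInputStream, DataOutputStream}
import java.net.{ServerSocket, Socket}

object LocalSocketCollect {
  // Serve one length-prefixed payload on an OS-chosen loopback port,
  // returning the port so a client can connect.
  def serve(payload: Array[Byte]): (Int, Thread) = {
    val server = new ServerSocket(0) // port 0: pick any free local port
    val t = new Thread(new Runnable {
      def run(): Unit = {
        val sock = server.accept()
        val out = new DataOutputStream(sock.getOutputStream)
        out.writeInt(payload.length) // framing: 4-byte length, then the bytes
        out.write(payload)
        out.close(); sock.close(); server.close()
      }
    })
    t.start()
    (server.getLocalPort, t)
  }

  // Stand-in for the Python side: read the framed payload back.
  def fetch(port: Int): Array[Byte] = {
    val sock = new Socket("localhost", port)
    val in = new DataInputStream(sock.getInputStream)
    val buf = new Array[Byte](in.readInt())
    in.readFully(buf)
    sock.close()
    buf
  }

  def main(args: Array[String]): Unit = {
    val (port, t) = serve("collected results".getBytes("UTF-8"))
    println(new String(fetch(port), "UTF-8"))
    t.join()
  }
}
```

The key property is that data never hits the buffer cache or physical disk; it flows driver-to-Python through the loopback interface.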
[jira] [Created] (SPARK-6233) Should spark.ml Models be distributed by default?
Joseph K. Bradley created SPARK-6233: Summary: Should spark.ml Models be distributed by default? Key: SPARK-6233 URL: https://issues.apache.org/jira/browse/SPARK-6233 Project: Spark Issue Type: Brainstorming Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley This JIRA is for discussing a potential change for the spark.ml package. *Issue*: When an Estimator runs, it often computes helpful side information which is not stored in the returned Model. (E.g., linear methods have RDDs of residuals.) It would be nice to have this information by default, rather than having to recompute it. *Suggestion*: Introduce a DistributedModel trait. Every Estimator in the spark.ml package should be able to return a distributed model with extra info computed during training. *Motivation*: This kind of info is one of the most useful aspects of R. E.g., when you train a linear model, you can immediately summarize or plot information about the residuals. For MLlib, the user currently has to take extra steps (and computation time) to recompute this info. *API*: My general idea is as follows.
{code}
trait Model

trait LocalModel extends Model

trait DistributedModel[LocalModelType <: LocalModel] extends Model {
  /** convert to local model */
  def toLocal: LocalModelType
}

class LocalLDAModel extends LocalModel

class DistributedLDAModel extends DistributedModel[LocalLDAModel] {
  def toLocal: LocalLDAModel = ???
}
{code}
*Issues with this API*:
* API stability: To keep the API stable in the future, either (a) all models should return DistributedModels, or (b) all models should return Models which can then be tested for the LocalModel or DistributedModel trait.
* memory “leaks”: Users may not expect models to store references to RDDs, so they may be surprised by how much storage is being used.
* naturally distributed models: Some models will simply be too large to be converted into LocalModels. It is unclear what to do here. 
*Is this worthwhile?* Pros: * Saving computation * Easier for users (skipping 1 more step of computing this info) Cons: * API issues * Limited savings on computation. In general, computing this info may take much less time than model training (e.g., computing residuals vs. training a GLM). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window slides over several non-empty RDDs). 2-3 such iterations are enough for the program to stall completely (no new events are processed) with the following output:
{code}
--- Time: 1425922369000 ms ---

--- Time: 1425922370000 ms ---
(1.0,4.0)

--- Time: 1425922371000 ms ---
(1.0,4.0)

[Stage 17:=(1 + 0) / 2]
{code}
The "Stage" message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}

was: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window slides over several non-empty RDDs). 2-3 such iterations are enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---

--- Time: 1425922370000 ms ---
(1.0,4.0)

--- Time: 1425922371000 ms ---
(1.0,4.0)

[Stage 17:=(1 + 0) / 2]
{code}
The "Stage" message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. 
Once per second, enter a number into the nc terminal (so that the window would slide over several non-empty RDDs). 2-3 such
[jira] [Resolved] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4355. -- Resolution: Fixed Target Version/s: 1.2.0, 1.1.1, 1.0.3 (was: 1.1.1, 1.2.0, 1.0.3) OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Labels: backport-needed Fix For: 1.0.3, 1.2.0, 1.1.1 It happens when the mean on one side is zero. I will send an PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4355: - Fix Version/s: 1.0.3 OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Labels: backport-needed Fix For: 1.1.1, 1.2.0, 1.0.3 It happens when the mean on one side is zero. I will send an PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Environment: Ubuntu, MacOS. Tried builds with scala 2.11 and 2.10 (for the kafka receiver). Also tried the pre-built spark-1.2.1-bin-hadoop2.4.tgz. The bug reproduces in all cases on 3 different computers we've tried on. was: Ubuntu, MacOS. Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Tried builds with scala 2.11 and 2.10 (for the kafka receiver). Also tried the pre-built spark-1.2.1-bin-hadoop2.4.tgz. The bug reproduces in all cases on 3 different computers we've tried on. Reporter: Platon Potapov Priority: Critical Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window slides over several non-empty RDDs). 2-3 such iterations are enough for the program to stall completely (no new events are processed) with the following output:
{code}
--- Time: 1425922369000 ms ---

--- Time: 1425922370000 ms ---
(1.0,4.0)

--- Time: 1425922371000 ms ---
(1.0,4.0)

[Stage 17:=(1 + 0) / 2]
{code}
The "Stage" message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow
Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-6222: Description: When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] was: When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5157 (PR # [4149|https://github.com/apache/spark/pull/4149]). 
I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6231) Join on two tables (generated from same one) is broken
[ https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6231: --- Labels: DataFrame (was: ) Join on two tables (generated from same one) is broken -- Key: SPARK-6231 URL: https://issues.apache.org/jira/browse/SPARK-6231 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Critical Labels: DataFrame If the two columns used in the joinExpr come from the same table, they have the same id, so the joinExpr is resolved the wrong way.
{code}
val df = sqlContext.load(path, "parquet")
val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")

scala> rmJoin.explain
== Physical Plan ==
CartesianProduct
 Filter (cust_id#0 = cust_id#0)
  Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS txns#7L]
   Exchange (HashPartitioning [cust_id#0], 200)
    Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS partialSets#25]
     PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at newParquet.scala:542
 Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
  Exchange (HashPartitioning [cust_id#17], 200)
   Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS PartialSum#38]
    PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at newParquet.scala:542
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified
[ https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6050: --- Fix Version/s: (was: 1.4.0) Spark on YARN does not work when --executor-cores is specified - Key: SPARK-6050 URL: https://issues.apache.org/jira/browse/SPARK-6050 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Environment: 2.5 based YARN cluster. Reporter: Mridul Muralidharan Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.3.0 There are multiple issues here (which I will detail as comments), but to reproduce: running the following ALWAYS hangs in our cluster with the 1.3 RC ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-cores 8 --num-executors 15 --driver-memory 4g --executor-memory 2g --queue webmap lib/spark-examples*.jar 10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3278) Isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353549#comment-14353549 ] Martin Zapletal commented on SPARK-3278: What particular benchmarks would you like to see? I can do them. Isotonic regression --- Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add isotonic regression for score calibration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)
[ https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353534#comment-14353534 ] Matthew Farrellee commented on SPARK-5368: -- [~srowen] will you take a look at this? i'm trying to run spark via kubernetes (master pod + master service + slave replicationcontroller), and the service layer is creating a NAT-like environment. Spark should support NAT (via akka improvements) - Key: SPARK-5368 URL: https://issues.apache.org/jira/browse/SPARK-5368 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: jay vyas Fix For: 1.2.2 Spark sets up actors for akka with a set of variables which are defined in the {{AkkaUtils.scala}} class. A snippet:
{noformat}
98 |akka.loggers = [akka.event.slf4j.Slf4jLogger]
99 |akka.stdout-loglevel = ERROR
100 |akka.jvm-exit-on-fatal-error = off
101 |akka.remote.require-cookie = $requireCookie
102 |akka.remote.secure-cookie = $secureCookie
{noformat}
We should allow users to pass in custom settings, for example, so that arbitrary akka modifications can be used at runtime for security, performance, logging, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353358#comment-14353358 ] Tathagata Das commented on SPARK-6222: -- Could you upload the stack traces and logs that show this error? The PR # 4149 is about automatically deleting old log files. Is it an error that WAL files are deleted automatically too early? [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6231) Join on two tables (generated from same one) is broken
Davies Liu created SPARK-6231: - Summary: Join on two tables (generated from same one) is broken Key: SPARK-6231 URL: https://issues.apache.org/jira/browse/SPARK-6231 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Critical If the two columns used in the joinExpr come from the same table, they have the same id, so the joinExpr is resolved the wrong way.
{code}
val df = sqlContext.load(path, "parquet")
val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")

scala> rmJoin.explain
== Physical Plan ==
CartesianProduct
 Filter (cust_id#0 = cust_id#0)
  Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS txns#7L]
   Exchange (HashPartitioning [cust_id#0], 200)
    Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS partialSets#25]
     PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at newParquet.scala:542
 Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
  Exchange (HashPartitioning [cust_id#17], 200)
   Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS PartialSum#38]
    PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at newParquet.scala:542
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
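The failure mode above boils down to attribute identity: both sides of the join carry {{cust_id#0}}, so the condition compares an attribute with itself, becomes a tautology, and the equi-join degenerates into a cartesian product with a trivial filter. A toy model of that in plain Scala (hypothetical {{AttrRef}}, not Catalyst's classes):

```scala
object SelfJoinIdClash {
  // Hypothetical stand-in for a Catalyst attribute reference: name plus expression id.
  final case class AttrRef(name: String, exprId: Long)

  // A condition like `left === right` is degenerate if both sides resolve
  // to the very same attribute (same expression id): it is always true.
  def isDegenerate(left: AttrRef, right: AttrRef): Boolean = left == right

  def main(args: Array[String]): Unit = {
    // Both DataFrames derive from the same `df`, so txns("cust_id") and
    // spend("cust_id") resolve to the same reference, cust_id#0.
    println(isDegenerate(AttrRef("cust_id", 0L), AttrRef("cust_id", 0L)))
    // With distinct ids the condition is a real join key: cust_id#0 = cust_id#17.
    println(isDegenerate(AttrRef("cust_id", 0L), AttrRef("cust_id", 17L)))
  }
}
```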
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353347#comment-14353347 ] Xiangrui Meng commented on SPARK-6192: -- [~Manglano] and [~leckie-chn] Thanks for your interest in GSoC with Spark MLlib! As [~MechCoder] mentioned, this JIRA was created for him based on his past experience and recent contributions to Spark MLlib. We tried to set a theme for the project but keep the actual tasks flexible. So it doesn't mean that we are blocking others from implementing these features. You can contribute any of these features at any time. It would be great if you could start with some small features or by helping review others' PRs. We need to know each other before we can plan a GSoC project, but I'm afraid that we may not have enough time to make it happen this year. Anyway, this is a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide a save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6232) Spark Streaming: simple application stalls processing
Platon Potapov created SPARK-6232: - Summary: Spark Streaming: simple application stalls processing Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 numbers are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if the window+reduce is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
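For reference, the windowed reduction in the report has a well-defined expected output even without Spark. Below is a plain-Scala sketch of the window(Seconds(4), Seconds(1)) + reduceByKey(_ + _) semantics, using hypothetical sample input (the numbers 1 and 3 typed into nc one second apart); the function name and batch contents are illustrative, not part of any Spark API:

```scala
// Each output batch should be the per-key sum over the last `windowLen` input
// batches -- this is what the stalled job should have kept printing.
def windowedReduceByKey(
    batches: Seq[Seq[(Double, Double)]], // one inner Seq per 1-second batch
    windowLen: Int                       // window length in batches (4 here)
): Seq[Map[Double, Double]] =
  batches.indices.map { i =>
    val window = batches.slice(math.max(0, i - windowLen + 1), i + 1).flatten
    window.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }

// "1" then "3" typed into nc, followed by two empty batches:
val batches = Seq(Seq((1.0, 1.0)), Seq((1.0, 3.0)), Seq.empty, Seq.empty)
val out = windowedReduceByKey(batches, windowLen = 4)
out.foreach(println)  // Map(1.0 -> 1.0), then Map(1.0 -> 4.0) three times
```

Per the bug report, the real job prints (1.0,4.0) for one or two batches and then stalls instead of continuing to emit a result every slide interval.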
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
was: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
was: Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if (window + reduceByKey) is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is the complete source code of a very simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number into the nc terminal (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall
[jira] [Commented] (SPARK-6228) Provide SASL support in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353412#comment-14353412 ] Apache Spark commented on SPARK-6228: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/4953 Provide SASL support in network/common module - Key: SPARK-6228 URL: https://issues.apache.org/jira/browse/SPARK-6228 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin Currently, there's support for SASL in network/shuffle, but not in network/common. Moving the SASL code to network/common would enable other applications using that code to also support secure authentication and, later, encryption. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6219) Expand Python lint checks to check for compilation errors
[ https://issues.apache.org/jira/browse/SPARK-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353473#comment-14353473 ] Sean Owen commented on SPARK-6219: -- OK, that's reasonable to make sure that everything gets compiled, not just what the tests cover. If it's fast, I suppose the only cost is complexity, but your changes are actually more about refactoring than the compilation. You would know this script well as you created most of it I think. I suppose it'd be best to get another Python person to look at it but I don't object, in the name of catching more stuff early. Expand Python lint checks to check for compilation errors -- Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the stuff at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6232) Spark Streaming: simple application stalls processing
[ https://issues.apache.org/jira/browse/SPARK-6232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Platon Potapov updated SPARK-6232: -- Description: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
The "Stage ..." message is output to stderr. We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if the window+reduce is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
was: Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 numbers are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
--- Time: 1425922370000 ms ---
(1.0,4.0)
--- Time: 1425922371000 ms ---
(1.0,4.0)
[Stage 17:=(1 + 0) / 2]
{code}
We've tried both standalone (local master) and clustered setups - it reproduces in all cases. We tried raw sockets and Kafka as a receiver - it reproduces in both cases. NOTE that the bug does not reproduce under the following conditions:
* the receiver is from a queue (StreamingContext.queueStream)
* if the commented-out print is un-commented
* if the window+reduce is substituted with reduceByKeyAndWindow

Here is the simple test application:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming._

object SparkStreamingTest extends App {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingTest")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  val lines0 = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER)
  val words = lines0.map(x => (1.0, x.toDouble))
  // words.print() // TODO: enable this print to avoid the program freeze
  val windowed = words.window(Seconds(4), Seconds(1))
  val grouped = windowed.reduceByKey(_ + _)
  grouped.print()
  ssc.start()
  ssc.awaitTermination()
}
{code}
Spark Streaming: simple application stalls processing - Key: SPARK-6232 URL: https://issues.apache.org/jira/browse/SPARK-6232 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: Ubuntu, MacOS. Reporter: Platon Potapov Priority: Critical Below is a snippet of a simple test application. Run it in one terminal window, and nc -lk in another. Once per second, enter a number (so that the window would slide over several non-empty RDDs). 2-3 such iterations are going to be enough for the program to stall with the following output:
{code}
--- Time: 1425922369000 ms ---
[jira] [Commented] (SPARK-6022) GraphX `diff` test incorrectly operating on values (not VertexId's)
[ https://issues.apache.org/jira/browse/SPARK-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353364#comment-14353364 ] Brennon York commented on SPARK-6022: - The test is correct (in what I believe {{diff}} should do). Maybe [~ankurd] can chime in here? You're also correct that the code implementing {{diff}} doesn't currently work properly, which is why I believe this test should correctly assess whether {{diff}} is operating correctly. GraphX `diff` test incorrectly operating on values (not VertexId's) --- Key: SPARK-6022 URL: https://issues.apache.org/jira/browse/SPARK-6022 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York The current GraphX {{diff}} test operates on values rather than the VertexId's and, if {{diff}} were working properly (per [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600]), it should fail this test. The code to test {{diff}} should look like the below, as it correctly generates {{VertexRDD}}'s with different {{VertexId}}'s to {{diff}} against.
{code}
test("diff functionality with small concrete values") {
  withSpark { sc =>
    val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
    // setA := Set((0L, 0), (1L, 1))
    val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt + 2)))
    // setB := Set((1L, 3), (2L, 4))
    val diff = setA.diff(setB)
    assert(diff.collect.toSet == Set((2L, 4)))
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
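The expected result in the test above can be modeled without a SparkContext. Here is a plain-Scala sketch of the semantics the test encodes -- keep the (VertexId, value) pairs of setB whose vertex ID does not appear in setA. This mirrors the test's expectation only; it is not necessarily GraphX's final {{diff}} contract, which is exactly what the comment thread is debating:

```scala
type VertexId = Long

// Same data as the test, as plain Scala sets:
val setA: Set[(VertexId, Int)] = (0L until 2L).map(id => (id, id.toInt)).toSet     // Set((0,0), (1,1))
val setB: Set[(VertexId, Int)] = (1L until 3L).map(id => (id, id.toInt + 2)).toSet // Set((1,3), (2,4))

// Diff keyed on vertex ID: (1L, 3) is dropped because ID 1 already exists in
// setA, even though its value differs.
val idsInA = setA.map(_._1)
val diff = setB.filter { case (id, _) => !idsInA.contains(id) }
println(diff)  // Set((2,4))
```

Note the asymmetry: a value-based set difference would instead keep (1L, 3) as well, which is the distinction the JIRA title is pointing at.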
[jira] [Commented] (SPARK-3278) Isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353419#comment-14353419 ] Xiangrui Meng commented on SPARK-3278: -- I don't know of any. It really depends on how many buckets it outputs. I can imagine problems with 100M buckets. Isotonic regression --- Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add isotonic regression for score calibration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3477) Clean up code in Yarn Client / ClientBase
[ https://issues.apache.org/jira/browse/SPARK-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353784#comment-14353784 ] Peter Rudenko commented on SPARK-3477: -- +1 to making these classes public again. There's [an article|http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/], and I also have a use case for submitting jobs programmatically. Clean up code in Yarn Client / ClientBase - Key: SPARK-3477 URL: https://issues.apache.org/jira/browse/SPARK-3477 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.2.0 With the addition of new features and support for multiple versions of YARN, the code has become cumbersome and could use some cleanup. We should also add comments and update the code to follow the new style guidelines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6234) 10% Performance regression with Breeze upgrade
[ https://issues.apache.org/jira/browse/SPARK-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353793#comment-14353793 ] Nishkam Ravi commented on SPARK-6234: - [~mengxr] Variant of org.apache.spark.examples.LocalKMeans which uses Breeze. Input dataset: 20GB, 6-node cluster. 10% Performance regression with Breeze upgrade -- Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6142) 10-12% Performance regression with finalize
[ https://issues.apache.org/jira/browse/SPARK-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353826#comment-14353826 ] Sean Owen commented on SPARK-6142: -- Is this resolved by reverting those commits then? 10-12% Performance regression with finalize - Key: SPARK-6142 URL: https://issues.apache.org/jira/browse/SPARK-6142 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Nishkam Ravi 10-12% performance regression in PageRank (and potentially other workloads) caused due to the use of finalize in ExternalAppendOnlyMap. Introduced by a commit on Feb 19th. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6005) Flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery
[ https://issues.apache.org/jira/browse/SPARK-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353838#comment-14353838 ] Tathagata Das commented on SPARK-6005: -- [~c...@koeninger.org] Can you check this out? Flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery Key: SPARK-6005 URL: https://issues.apache.org/jira/browse/SPARK-6005 Project: Spark Issue Type: Bug Components: Streaming Reporter: Iulian Dragos Labels: flaky-test, kafka, streaming [Link to failing test on Jenkins|https://ci.typesafe.com/view/Spark/job/spark-nightly-build/lastCompletedBuild/testReport/org.apache.spark.streaming.kafka/DirectKafkaStreamSuite/offset_recovery/] {code} The code passed to eventually never returned normally. Attempted 208 times over 10.00622791 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. {code} {code:title=Stack trace} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 208 times over 10.00622791 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.org$apache$spark$streaming$kafka$DirectKafkaStreamSuite$$anonfun$$sendDataAndWaitForReceive$1(DirectKafkaStreamSuite.scala:225)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply$mcV$sp(DirectKafkaStreamSuite.scala:287)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply(DirectKafkaStreamSuite.scala:211)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply(DirectKafkaStreamSuite.scala:211)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:39)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:39)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353840#comment-14353840 ] Tathagata Das commented on SPARK-6222: -- Which patch fixes the issue? [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Attachments: SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6234) 10% Performance regression with Breeze upgrade
[ https://issues.apache.org/jira/browse/SPARK-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353854#comment-14353854 ] Sean Owen commented on SPARK-6234: -- No, the thing that's not important here is the example implementation. It is not an example of using K-means in MLlib, but an example of a completely de novo, separate implementation of K-means that is provided as an example of using *Spark*. I don't know why Breeze or something that uses it would be slower though. The only thing here doing any serious computation is squaredDistance. That did change in 0.11: https://github.com/scalanlp/breeze/commit/5c26a9bceb1fbd621421fa459e1b1202e91f5e9b#diff-e9531f2d5b65b7140b75c0b1c4dab541 If you have the energy, a tightly-focused test case on this method that shows a performance hit would be useful to report against Breeze. I think all in all the positives of 0.11 outweigh negatives, but, this downside was not expected, if it is confirmed. If so it may not only affect this example. 10% Performance regression with Breeze upgrade -- Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
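Following up on the suggestion of a tightly-focused test case: a Spark-free harness for the suspect operation is easy to sketch in plain Scala. The hand-rolled loop below is only a baseline, not Breeze's implementation; to compare Breeze 0.10 vs 0.11 one would time breeze.linalg.squaredDistance on the same arrays (omitted here to keep the sketch dependency-free):

```scala
// Squared Euclidean distance, written out by hand as a baseline.
def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  var sum = 0.0
  var i = 0
  while (i < a.length) {
    val d = a(i) - b(i)
    sum += d * d
    i += 1
  }
  sum
}

// Crude timing helper; a real report against Breeze should use JMH, or at
// least warm-up iterations so the JIT has settled.
def time[A](body: => A): (A, Long) = {
  val t0 = System.nanoTime()
  val result = body
  (result, System.nanoTime() - t0)
}

val n = 1000
val a = Array.tabulate(n)(_.toDouble)
val b = Array.tabulate(n)(i => (i + 1).toDouble)
val (dist, nanos) = time(squaredDistance(a, b))
println(s"squaredDistance = $dist ($nanos ns)")  // dist == 1000.0: n terms of (-1)^2
```

Timing a single call is noisy; the point of the sketch is only that the workload can be isolated to one method, per the comment above.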
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353867#comment-14353867 ] Hari Shreedharan commented on SPARK-6222: - [~srowen] This patch is actually not intended to fix the issue, since this patch will cause the WAL to not be cleaned up - which is not something we want. This was only intended to help isolate the problem -- from this patch it is clear that we are somehow attempting to clear the WAL data prematurely, causing the regression - when and why I am not yet sure. [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Attachments: SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. 
Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2629) Improve performance of DStream.updateStateByKey
[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2629: - Summary: Improve performance of DStream.updateStateByKey (was: Improve performance of DStream.updateStateByKey using IndexRDD) Improve performance of DStream.updateStateByKey --- Key: SPARK-2629 URL: https://issues.apache.org/jira/browse/SPARK-2629 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5155) Python API for MQTT streaming
[ https://issues.apache.org/jira/browse/SPARK-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353913#comment-14353913 ] Tathagata Das commented on SPARK-5155: -- This issue is still blocking on us figuring out all the details of python API for Kafka. Python API for MQTT streaming - Key: SPARK-5155 URL: https://issues.apache.org/jira/browse/SPARK-5155 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu Assignee: Prabeesh K Python API for MQTT Utils -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6238) Support shuffle where individual blocks might be 2G
Reynold Xin created SPARK-6238: -- Summary: Support shuffle where individual blocks might be 2G Key: SPARK-6238 URL: https://issues.apache.org/jira/browse/SPARK-6238 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6237) Support network transfer for blocks larger than 2G
Reynold Xin created SPARK-6237: -- Summary: Support network transfer for blocks larger than 2G Key: SPARK-6237 URL: https://issues.apache.org/jira/browse/SPARK-6237 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5155) Python API for MQTT streaming
[ https://issues.apache.org/jira/browse/SPARK-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5155: - Target Version/s: 1.4.0 (was: 1.3.0) Python API for MQTT streaming - Key: SPARK-5155 URL: https://issues.apache.org/jira/browse/SPARK-5155 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu Assignee: Prabeesh K Python API for MQTT Utils -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6236) Support caching blocks larger than 2G
Reynold Xin created SPARK-6236: -- Summary: Support caching blocks larger than 2G Key: SPARK-6236 URL: https://issues.apache.org/jira/browse/SPARK-6236 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Due to the use of java.nio.ByteBuffer, BlockManager does not support blocks larger than 2G. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353940#comment-14353940 ] Reynold Xin commented on SPARK-6190: Hi [~imranr], As I said earlier, I would advise against attacking the network transfer problem at this point. We don't hear that often from users complaining about the 2G limit, and the complaints about the various issues probably drop by an order of magnitude in the following order: - caching 2g - fetching 2g non shuffle block - fetching 2g shuffle block - uploading 2g I think it'd make sense to solve the caching 2g limit first. It is important to think about the network part, but I would not try to address it here. It is much more complicated to deal with, e.g. transferring very large data in one shot brings all sorts of complicated resource management problems (e.g. large transfers blocking small ones, memory management, allocation...). For caching, I can think of two ways to do this. The first is to have a large byte buffer abstraction that encapsulates multiple, smaller buffers, as proposed here. Another is to assume the block manager can only handle blocks up to 2G, and then have the upper layers handle the chunking and reconnecting. It is not yet clear to me which one is better. While the first approach provides a better, clearer abstraction, the second approach would allow us to cache partial blocks. Do you have any thoughts on this? Now for the large buffer abstraction here -- I'm confused. The proposed design is read-only. How do we even create a buffer? create LargeByteBuffer abstraction for eliminating 2GB limit on blocks -- Key: SPARK-6190 URL: https://issues.apache.org/jira/browse/SPARK-6190 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Imran Rashid Assignee: Imran Rashid Attachments: LargeByteBuffer.pdf A key component in eliminating the 2GB limit on blocks is creating a proper abstraction for storing more than 2GB. 
Currently Spark is limited by a reliance on nio ByteBuffer and netty ByteBuf, both of which are limited at 2GB. This task will introduce the new abstraction and the relevant implementation and utilities, without affecting the existing implementation at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
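[Editor's note] As a rough illustration of the first option discussed above (a read-only view over multiple smaller buffers), here is a minimal Python sketch; the class and method names are invented, and a real implementation would wrap nio ByteBuffers on the JVM. The `allocate` path is one possible answer to the "how do we even create a buffer" question: build the view by chunking a payload into bounded pieces.

```python
class LargeByteBuffer:
    """Read-only view over multiple smaller chunks, sketching how an
    abstraction could sidestep a per-buffer size cap (2GB for ByteBuffer)."""

    def __init__(self, chunks):
        self._chunks = [bytes(c) for c in chunks]
        self._size = sum(len(c) for c in self._chunks)

    @classmethod
    def allocate(cls, data, chunk_size):
        # The 'create' path: split one logical payload into bounded chunks.
        return cls([data[i:i + chunk_size] for i in range(0, len(data), chunk_size)])

    def size(self):
        return self._size

    def get(self, pos, length):
        # A read may span chunk boundaries; stitch the pieces back together.
        out = bytearray()
        for chunk in self._chunks:
            if length <= 0:
                break
            if pos < len(chunk):
                piece = chunk[pos:pos + length]
                out += piece
                length -= len(piece)
                pos = 0
            else:
                pos -= len(chunk)
        return bytes(out)

# Chunks become: b"abcd", b"efgh", b"ij"; reads are addressed logically.
buf = LargeByteBuffer.allocate(b"abcdefghij", chunk_size=4)
```

The second option (keep the BlockManager capped and chunk in the upper layers) would move the `allocate`-style splitting out of the buffer and into the callers.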
[jira] [Commented] (SPARK-6128) Update Spark Streaming Guide for Spark 1.3
[ https://issues.apache.org/jira/browse/SPARK-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353818#comment-14353818 ] Apache Spark commented on SPARK-6128: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/4956 Update Spark Streaming Guide for Spark 1.3 -- Key: SPARK-6128 URL: https://issues.apache.org/jira/browse/SPARK-6128 Project: Spark Issue Type: Improvement Components: Documentation, Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Things to update - New Kafka Direct API - Python Kafka API - Add joins to streaming guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353842#comment-14353842 ] Hari Shreedharan commented on SPARK-6222: - The one on the jira. [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Attachments: SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5045) Update FlumePollingReceiver to use updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5045: - Target Version/s: (was: 1.3.0) Update FlumePollingReceiver to use updated Receiver API --- Key: SPARK-5045 URL: https://issues.apache.org/jira/browse/SPARK-5045 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5048) Add Flume to the Python Streaming API
[ https://issues.apache.org/jira/browse/SPARK-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5048: - Assignee: Hari Shreedharan Add Flume to the Python Streaming API - Key: SPARK-5048 URL: https://issues.apache.org/jira/browse/SPARK-5048 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Reporter: Tathagata Das Assignee: Hari Shreedharan This is a similar effort as SPARK-5047 is for Kafka, and should take the same approach as it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5048) Add Flume to the Python Streaming API
[ https://issues.apache.org/jira/browse/SPARK-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5048: - Target Version/s: 1.4.0 (was: 1.3.0) Add Flume to the Python Streaming API - Key: SPARK-5048 URL: https://issues.apache.org/jira/browse/SPARK-5048 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Reporter: Tathagata Das This is a similar effort as SPARK-5047 is for Kafka, and should take the same approach as it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5048) Add Flume to the Python Streaming API
[ https://issues.apache.org/jira/browse/SPARK-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353918#comment-14353918 ] Tathagata Das commented on SPARK-5048: -- [~hshreedharan] Can you take a crack at this? Add Flume to the Python Streaming API - Key: SPARK-5048 URL: https://issues.apache.org/jira/browse/SPARK-5048 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Reporter: Tathagata Das Assignee: Hari Shreedharan This is a similar effort as SPARK-5047 is for Kafka, and should take the same approach as it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Summary: Add encrypted shuffle in spark (was: Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle) Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the handling of shuffle data safer. This feature is necessary in Spark. We reuse the Hadoop encrypted shuffle feature in Spark, and because UGI credential info is necessary for encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6234) 10% Performance regression with Breeze upgrade
[ https://issues.apache.org/jira/browse/SPARK-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353843#comment-14353843 ] Nishkam Ravi commented on SPARK-6234: - Are we saying that Breeze's performance is unimportant, or are we saying that the performance of K-Means-with-Breeze is unimportant? If it's the former, we can promptly close this JIRA. If it's the latter, K-Means should be perceived as a random piece of code that exposes a performance bug in Breeze. 10% Performance regression with Breeze upgrade -- Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353858#comment-14353858 ] Sean Owen commented on SPARK-6222: -- [~hshreedharan] you can make a [WIP] pull request instead of a patch. It's easier to review that way. [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Attachments: SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5252) Streaming StatefulNetworkWordCount example hangs
[ https://issues.apache.org/jira/browse/SPARK-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353857#comment-14353857 ] Tathagata Das commented on SPARK-5252: -- [~LutzBuech] Can you try out the latest master and see if the problem still persists? This should have been fixed. Streaming StatefulNetworkWordCount example hangs Key: SPARK-5252 URL: https://issues.apache.org/jira/browse/SPARK-5252 Project: Spark Issue Type: Bug Components: Examples, PySpark, Streaming Affects Versions: 1.2.0 Environment: Ubuntu Linux Reporter: Lutz Buech Attachments: debug.txt Running the stateful network word count example in Python (on one local node): https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py At the beginning, when no data is streamed, empty status outputs are generated, only decorated by the current Time, e.g.: --- Time: 2015-01-14 17:58:20 --- --- Time: 2015-01-14 17:58:21 --- As soon as I stream some data via netcat, no new status updates will show. Instead, one line saying [Stage number: (2 + 0) / 3] where number is some integer number, e.g. 132. There is no further output on stdout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
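[Editor's note] For context, the heart of that example is the per-key state update passed to updateStateByKey. The sketch below is a plain-Python rendering of its shape, detached from Spark; `update_func` and the driver loop are illustrative (the running count is None the first time a key is seen):

```python
def update_func(new_values, running_count):
    # updateStateByKey calls this once per key per batch: fold the batch's
    # new counts into the running total; None means the key is new.
    return sum(new_values) + (running_count or 0)

# Simulate two batches of counts arriving for one key, e.g. the word "spark":
state = None
for batch in ([1, 1], [1]):
    state = update_func(batch, state)
print(state)  # -> 3
```

In the real example this function is applied to a DStream of (word, 1) pairs, and the hang described above occurs regardless of the update logic.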
[jira] [Commented] (SPARK-6234) 10% Performance regression with Breeze upgrade
[ https://issues.apache.org/jira/browse/SPARK-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353877#comment-14353877 ] Nishkam Ravi commented on SPARK-6234: - Right. This particular implementation can be thought of as a unit test case for Breeze (or squaredDistance if you will) for the purpose of this discussion. As mentioned in the PR, we seem to have three options: 1. Absorb the perf regression 2. Find the problem in Breeze and fix it while retaining 0.11 in Spark 3. Revert back to 0.10 (potentially open a JIRA for Breeze and upgrade to 0.11 when fixed) Assuming that we can/will report this problem against Breeze, for upstream Spark which of these three options do we prefer? For downstream, I'd recommend 3. 10% Performance regression with Breeze upgrade -- Key: SPARK-6234 URL: https://issues.apache.org/jira/browse/SPARK-6234 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi KMeans regresses by 10% with the Breeze upgrade from 0.10 to 0.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5042) Updated Receiver API to make it easier to write reliable receivers that ack source
[ https://issues.apache.org/jira/browse/SPARK-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5042: - Target Version/s: (was: 1.4.0) Updated Receiver API to make it easier to write reliable receivers that ack source -- Key: SPARK-5042 URL: https://issues.apache.org/jira/browse/SPARK-5042 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Receivers in Spark Streaming receive data from different sources and push them into Spark’s block manager. However, the received records must be chunked into blocks before being pushed into the BlockManager. Related to this, the Receiver API provides two kinds of store() - 1. store(single record) - The receiver implementation submits one record at a time and the system takes care of dividing the records into right-sized blocks and limiting the ingestion rates. In the future, it should also be able to do automatic rate / flow control. However, there is no feedback to the receiver on when blocks are formed, and thus no way to ensure reliability guarantees. Overall, receivers using this are easy to implement. 2. store(multiple records) - The receiver submits multiple records and that forms the blocks that are stored in the block manager. The receiver implementation has full control over block generation, which allows the receiver to acknowledge the source when blocks have been reliably received by the BlockManager and/or WriteAheadLog. However, the implementation of the receivers will not get automatic block sizing and rate control; the developer will have to take care of that. All this adds to the complexity of the receiver implementation. So, to summarize, option (2) has the advantage of full control over block generation, but the users have to deal with the complexity of generating blocks of the right block size and rate control. 
So we want to update this API such that it becomes easier for developers to achieve reliable receiving of records without sacrificing automatic block sizing and rate control. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
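[Editor's note] The middle ground this update aims for can be pictured as a block generator that keeps the one-record-at-a-time simplicity of (1) while signalling when each block is formed, so the receiver can acknowledge its source as in (2). The sketch below is hypothetical Python, not the actual Receiver API; the class and callback names are invented for illustration:

```python
class BlockGenerator:
    """Sketch: chunk record-at-a-time store() calls into fixed-size blocks,
    invoking a callback when each block is formed so the receiver can ack."""

    def __init__(self, block_size, on_block_ready):
        self.block_size = block_size
        # Callback, e.g. push the block to the BlockManager, then ack the source.
        self.on_block_ready = on_block_ready
        self._current = []

    def store(self, record):
        # The receiver keeps the simple single-record interface...
        self._current.append(record)
        if len(self._current) >= self.block_size:
            self.flush()

    def flush(self):
        # ...while the system decides block boundaries and reports them.
        if self._current:
            self.on_block_ready(list(self._current))
            self._current.clear()

blocks = []
gen = BlockGenerator(block_size=3, on_block_ready=blocks.append)
for r in range(7):
    gen.store(r)
gen.flush()  # emit the partial final block
# blocks == [[0, 1, 2], [3, 4, 5], [6]]
```

A real implementation would also flush on a timer and apply rate limiting, which is exactly the automatic behavior option (2) forgoes today.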
[jira] [Commented] (SPARK-2629) Improve performance of DStream.updateStateByKey
[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353906#comment-14353906 ] Tathagata Das commented on SPARK-2629: -- Since IndexRDD is not supposed to be added to the core Spark API, we are going to investigate other ways of improving the performance. Improve performance of DStream.updateStateByKey --- Key: SPARK-2629 URL: https://issues.apache.org/jira/browse/SPARK-2629 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
[ https://issues.apache.org/jira/browse/SPARK-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5205: - Target Version/s: 1.4.0, 1.3.1 (was: 1.3.0, 1.2.1) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI -- Key: SPARK-5205 URL: https://issues.apache.org/jira/browse/SPARK-5205 Project: Spark Issue Type: Bug Components: Streaming Reporter: uncleGen The kill link is used to kill a stage in a job. It works for all kinds of Spark jobs except Spark Streaming. To be specific, we can only kill the stage that runs the Receiver, but not kill the Receivers themselves. The stage can be killed and cleaned from the UI, but the receivers are still alive and receiving data. I think this does not fit with common sense. IMHO, killing the receiver stage should mean killing the receivers and stopping data reception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5046) Update KinesisReceiver to use updated Receiver API
[ https://issues.apache.org/jira/browse/SPARK-5046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5046: - Target Version/s: 1.4.0 (was: 1.3.0) Update KinesisReceiver to use updated Receiver API -- Key: SPARK-5046 URL: https://issues.apache.org/jira/browse/SPARK-5046 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Currently the KinesisReceiver is not reliable as it does not correctly acknowledge the source. This task is to update the receiver to use the updated Receiver API in SPARK-5042 and implement a reliable receiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org