[jira] [Closed] (SPARK-8687) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
[ https://issues.apache.org/jira/browse/SPARK-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8687. Resolution: Fixed Assignee: SaintBacchus Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor. --- Key: SPARK-8687 URL: https://issues.apache.org/jira/browse/SPARK-8687 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.5.0 Yarn will set +spark.yarn.credentials.file+ after *DriverEndpoint* is initialized, so the executor will fetch the old configuration, which causes the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8783) CTAS with WITH clause does not work
Keuntae Park created SPARK-8783: --- Summary: CTAS with WITH clause does not work Key: SPARK-8783 URL: https://issues.apache.org/jira/browse/SPARK-8783 Project: Spark Issue Type: Bug Components: SQL Reporter: Keuntae Park Priority: Minor Following CTAS with WITH clause query {code} CREATE TABLE with_table1 AS WITH T AS ( SELECT * FROM table1 ) SELECT * FROM T {code} induces following error {code} no such table T; line 7 pos 5 org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5 ... {code} I think that WITH clause within CTAS is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8769) toLocalIterator should mention it results in many jobs
[ https://issues.apache.org/jira/browse/SPARK-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8769. Resolution: Fixed Assignee: holdenk Fix Version/s: 1.4.2 1.5.0 Target Version/s: 1.5.0, 1.4.2 toLocalIterator should mention it results in many jobs -- Key: SPARK-8769 URL: https://issues.apache.org/jira/browse/SPARK-8769 Project: Spark Issue Type: Documentation Components: Documentation Reporter: holdenk Assignee: holdenk Priority: Trivial Fix For: 1.5.0, 1.4.2 toLocalIterator on RDDs should mention that it results in multiple jobs, and that if the input was the result of a wide transformation, it should be cached to avoid re-computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
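For illustration, a minimal Scala sketch of the documented advice (the object name and data are mine, not from the ticket): cache the result of a wide transformation before calling toLocalIterator, because toLocalIterator launches one job per partition and an uncached input would be recomputed for each of them.
{code:title=scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ToLocalIteratorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("toLocalIterator-demo").setMaster("local[2]"))
    // Result of a wide transformation (reduceByKey) that we want to iterate over locally.
    val counts = sc.parallelize(1 to 1000).map(x => (x % 10, 1)).reduceByKey(_ + _)
    // Cache first: toLocalIterator runs one job per partition, so an uncached
    // shuffle result would otherwise be recomputed for every partition.
    counts.persist(StorageLevel.MEMORY_ONLY)
    counts.toLocalIterator.foreach(println)
    sc.stop()
  }
}
{code}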
[jira] [Commented] (SPARK-8706) Implement Pylint / Prospector checks for PySpark
[ https://issues.apache.org/jira/browse/SPARK-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611611#comment-14611611 ] Manoj Kumar commented on SPARK-8706: Sorry for sounding dumb, but the present code downloads pep8 as a script. However it seems that pylint is a repo, which again has two dependencies. What is the preferred way to do this in Spark? Implement Pylint / Prospector checks for PySpark Key: SPARK-8706 URL: https://issues.apache.org/jira/browse/SPARK-8706 Project: Spark Issue Type: New Feature Components: Project Infra, PySpark Reporter: Josh Rosen It would be nice to implement Pylint / Prospector (https://github.com/landscapeio/prospector) checks for PySpark. As with the style checker rules, I'll imagine that we'll want to roll out new rules gradually in order to avoid a mass refactoring commit. For starters, we should create a pull request that introduces the harness for running the linters, add a configuration file which enables only the lint checks that currently pass, and install the required dependencies on Jenkins. Once we've done this, we can open a series of smaller followup PRs to gradually enable more linting checks and to fix existing violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8729) Spark app unable to instantiate the classes using the reflection
[ https://issues.apache.org/jira/browse/SPARK-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8729. -- Resolution: Not A Problem Spark app unable to instantiate the classes using the reflection Key: SPARK-8729 URL: https://issues.apache.org/jira/browse/SPARK-8729 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.3.0 Reporter: Murthy Chelankuri Priority: Critical Spark 1.3.0 is unable to instantiate classes using reflection (using Class.forName). It says the class is not found even though that class is available in the listed jars. The following is the exception I am getting on the executors java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:264) at kafka.utils.Utils$.createObject(Utils.scala:438) at kafka.producer.Producer.init(Producer.scala:61) The application works fine without any issues with version 1.2.0. I am planning to upgrade to 1.3.0 and found that it is not working. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
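For readers hitting the same symptom, a hedged sketch of the usual workaround (not necessarily the resolution recorded on this ticket): jars shipped with --jars are visible to the executor's context classloader, so load the class through it rather than through Class.forName's default lookup. Only the class name below comes from the report; everything else is illustrative.
{code:title=scala}
import org.apache.spark.{SparkConf, SparkContext}

object ReflectionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reflection-demo"))
    sc.parallelize(1 to 1).foreach { _ =>
      // The context classloader can see jars distributed to the executor,
      // unlike the system classloader that Class.forName uses by default.
      val loader = Thread.currentThread().getContextClassLoader
      val cls = Class.forName("com.abc.mq.msg.ObjectEncoder", true, loader)
      println(cls.getName)
    }
    sc.stop()
  }
}
{code}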
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:10 AM: - Steps were added to encode and decode the data, so the performance will not be as fast as before. At the same time, the code also has a security issue, for example saving the plain text in a configuration file and finally using it as part of the key. If you use a better cipher solution, the performance downgrade will be minimized; I think AES is a bit heavy. Also, the feature is based on hadoop 2.6, which is a limitation; that is why I said it relies on hadoop. Though the API is public and stable, you cannot ensure that it will not be changed, since it is not commercial software. was (Author: hujiayin): Steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key In the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop Though the API is public stable, however, you cannot ensure if the API will not be changed since it is not the comercial software. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; they are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses encryption algorithms OpenSSL provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3071) Increase default driver memory
[ https://issues.apache.org/jira/browse/SPARK-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3071. Resolution: Fixed Fix Version/s: 1.5.0 Increase default driver memory -- Key: SPARK-3071 URL: https://issues.apache.org/jira/browse/SPARK-3071 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.2 Reporter: Xiangrui Meng Assignee: Ilya Ganelin Fix For: 1.5.0 The current default is 512M, which is usually too small because the user also uses the driver to do some computation. In local mode, the executor memory setting is ignored and only driver memory is used, which provides more incentive to increase the default driver memory. I suggest: 1. 2GB in local mode, and warn users if executor memory is set to a bigger value; 2. the same as worker memory on an EC2 standalone server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
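A hedged sketch of how the value is typically raised today: {{spark.driver.memory}} must be known before the driver JVM starts, so in local mode it normally comes from spark-submit's --driver-memory flag or spark-defaults.conf rather than from application code; the snippet below only illustrates the proposed 2GB local-mode figure.
{code:title=scala}
import org.apache.spark.SparkConf

// Only effective if picked up before the driver JVM starts (e.g. via
// spark-submit --driver-memory 2g or spark-defaults.conf); shown here
// purely to illustrate the suggested 2GB local-mode default.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("driver-memory-demo")
  .set("spark.driver.memory", "2g")
{code}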
[jira] [Assigned] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8782: --- Assignee: Apache Spark (was: Josh Rosen) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes) Key: SPARK-8782 URL: https://issues.apache.org/jira/browse/SPARK-8782 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark Priority: Blocker Queries containing ORDER BY NULL currently result in a code generation exception: {code} public SpecificOrdering generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { return new SpecificOrdering(expr); } class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null; public SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { expressions = expr; } @Override public int compare(InternalRow a, InternalRow b) { InternalRow i = null; // Holds current row being evaluated. i = a; final Object primitive1 = null; i = b; final Object primitive3 = null; if (true true) { // Nothing } else if (true) { return -1; } else if (true) { return 1; } else { int comp = primitive1.compare(primitive3); if (comp != 0) { return comp; } } return 0; } } org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method named compare is not declared in any enclosing class nor any supertype, nor through a static import at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8782: --- Assignee: Josh Rosen (was: Apache Spark) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes) Key: SPARK-8782 URL: https://issues.apache.org/jira/browse/SPARK-8782 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Queries containing ORDER BY NULL currently result in a code generation exception: {code} public SpecificOrdering generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { return new SpecificOrdering(expr); } class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null; public SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { expressions = expr; } @Override public int compare(InternalRow a, InternalRow b) { InternalRow i = null; // Holds current row being evaluated. i = a; final Object primitive1 = null; i = b; final Object primitive3 = null; if (true true) { // Nothing } else if (true) { return -1; } else if (true) { return 1; } else { int comp = primitive1.compare(primitive3); if (comp != 0) { return comp; } } return 0; } } org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method named compare is not declared in any enclosing class nor any supertype, nor through a static import at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611543#comment-14611543 ] Apache Spark commented on SPARK-8782: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7179 GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes) Key: SPARK-8782 URL: https://issues.apache.org/jira/browse/SPARK-8782 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Queries containing ORDER BY NULL currently result in a code generation exception: {code} public SpecificOrdering generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { return new SpecificOrdering(expr); } class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null; public SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { expressions = expr; } @Override public int compare(InternalRow a, InternalRow b) { InternalRow i = null; // Holds current row being evaluated. i = a; final Object primitive1 = null; i = b; final Object primitive3 = null; if (true true) { // Nothing } else if (true) { return -1; } else if (true) { return 1; } else { int comp = primitive1.compare(primitive3); if (comp != 0) { return comp; } } return 0; } } org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method named compare is not declared in any enclosing class nor any supertype, nor through a static import at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
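A minimal Scala reproduction sketch, assuming an existing SparkContext {{sc}} and a registered temporary table named {{t}} (both hypothetical); the failure occurs when Catalyst generates an ordering for the NullType sort key.
{code:title=scala}
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` and a registered temp table "t".
val sqlContext = new SQLContext(sc)
sqlContext.sql("SELECT * FROM t ORDER BY NULL").collect()
{code}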
[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611555#comment-14611555 ] Antony Mayi commented on SPARK-8708: bq. Antony Mayi In your real case, how many partitions did ALS.predictAll return? 512 partitions, of which 511 are empty and a single one holds all 13M ratings. MatrixFactorizationModel.predictAll() populates single partition only - Key: SPARK-8708 URL: https://issues.apache.org/jira/browse/SPARK-8708 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Antony Mayi When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition despite using quite high parallelism. This degrades the performance of further processing (I can obviously run .partitionBy() to balance it, but that's still too costly, i.e. if running .predictAll() in a loop for thousands of products; it should rather be possible to do it somehow on the model, automatically). Below is an example on a tiny sample (same on a large dataset): {code:title=pyspark} r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) r4 = (2, 2, 2.0) r5 = (3, 1, 1.0) ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) ratings.getNumPartitions() 5 users = ratings.map(itemgetter(0)).distinct() model = ALS.trainImplicit(ratings, 1, seed=10) predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) predictions_for_2.glom().map(len).collect() [0, 0, 3, 0, 0] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
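A Scala sketch of the interim workaround (the report above uses PySpark; the Scala MLlib API is analogous), assuming an existing SparkContext {{sc}}: explicitly repartition the predictions before further processing.
{code:title=scala}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Assumes an existing SparkContext `sc`.
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 1.0), Rating(1, 2, 2.0), Rating(2, 1, 2.0),
  Rating(2, 2, 2.0), Rating(3, 1, 1.0)), 5)
val model = ALS.trainImplicit(ratings, 1, 5)
val users = ratings.map(_.user).distinct()
val predictions = model.predict(users.map(u => (u, 2)))
// The predictions may land in a single partition, so spread them out
// explicitly before any further processing.
val balanced = predictions.repartition(ratings.partitions.length)
balanced.glom().map(_.length).collect()
{code}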
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611571#comment-14611571 ] Sean Owen commented on SPARK-8781: -- Does this affect release artifacts or just the snapshot? That commit doesn't look related since it doesn't touch the lines you reference here. Are you sure? If it's 'fixed' by changing it, maybe something else is at work? Published POMs are no longer effective POMs Key: SPARK-8781 URL: https://issues.apache.org/jira/browse/SPARK-8781 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.2, 1.4.1, 1.5.0 Reporter: Konstantin Shaposhnikov POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom: {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_${scala.binary.version}</artifactId> <version>${project.version}</version> </dependency> ... {noformat} while it should be {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_2.11</artifactId> <version>${project.version}</version> </dependency> ... {noformat} The following commits are most likely the cause of it: - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129 - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78 - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 On branch-1.4 reverting the commit fixed the issue. See SPARK-3812 for additional details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8726) Wrong spark.executor.memory when using different EC2 master and worker machine types
[ https://issues.apache.org/jira/browse/SPARK-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611584#comment-14611584 ] Stefano Parmesan commented on SPARK-8726: - I've created a pull request for this issue: https://github.com/mesos/spark-ec2/pull/128 Wrong spark.executor.memory when using different EC2 master and worker machine types Key: SPARK-8726 URL: https://issues.apache.org/jira/browse/SPARK-8726 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Stefano Parmesan By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
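Until the deploy script is fixed, a hedged sketch of the obvious workaround is to set the value explicitly rather than rely on the generated default; the 6g figure below is only an example for an m1.large worker, not a recommendation from this ticket.
{code:title=scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ec2-memory-demo")
  // Override the min(master RAM, worker RAM) default computed by the
  // launcher so executors can use most of the worker's 7.5 GB.
  .set("spark.executor.memory", "6g")
{code}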
[jira] [Commented] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611650#comment-14611650 ] Apache Spark commented on SPARK-8787: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/7183 Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Priority: Trivial The parameter order of the @deprecated annotation in package object sql is wrong: deprecated("1.3.0", "use DataFrame"). This has to be changed to deprecated("use DataFrame", "1.3.0"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:03 AM: - Steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key In the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop Though the API is public stable, however, you cannot ensure if the API will not be changed since it is not the comercial software. was (Author: hujiayin): steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6 which make the process of shuffle data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES. CTR is one of the modes. We use two codec JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms jdk provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8688. Resolution: Fixed Assignee: SaintBacchus Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Hadoop Configuration has to disable client cache when writing or reading delegation tokens. --- Key: SPARK-8688 URL: https://issues.apache.org/jira/browse/SPARK-8688 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.5.0 In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, Spark will write and read the credentials. But if we don't disable the client cache via *fs.hdfs.impl.disable.cache*, Spark will use a cached FileSystem (which will use the old token) to upload or download the file. Then, when the old token expires, it can't gain the authorization to get/put on HDFS. (I only tested for a very short time with the configuration: dfs.namenode.delegation.token.renew-interval=3min dfs.namenode.delegation.token.max-lifetime=10min. I'm not sure whether it matters.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
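An illustrative Scala sketch of the behaviour described (not the patch itself; the HDFS path is hypothetical): with the client cache disabled, each lookup returns a fresh FileSystem that carries the current delegation token rather than a stale cached one.
{code:title=scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
// Ask for a fresh FileSystem instead of the process-wide cached one, so the
// current (renewed) delegation token is used for the credentials file.
hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)
val credentialsPath = new Path("hdfs:///user/spark/credentials") // hypothetical path
val fs = credentialsPath.getFileSystem(hadoopConf)
{code}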
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:02 AM: - steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software. was (Author: hujiayin): steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said reply on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6 which make the process of shuffle data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES. CTR is one of the modes. We use two codec JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms jdk provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8784) Add python API for hex/unhex
Davies Liu created SPARK-8784: - Summary: Add python API for hex/unhex Key: SPARK-8784 URL: https://issues.apache.org/jira/browse/SPARK-8784 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611600#comment-14611600 ] Konstantin Shaposhnikov commented on SPARK-8781: I believe this will affect both released and SNAPSHOT artefacts. Basically, as part of SPARK-3812 the build was changed to deploy effective POMs into the maven repository. E.g. in https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/1.4.0/spark-core_2.11-1.4.0.pom you won't find {{$\{scala.binary.version}}; it was resolved to 2.11 by maven during the build. This is required for the Scala 2.11 build to make sure that jars built with Scala 2.11 reference Scala 2.11 jars (e.g. spark-core_2.11 should depend on spark-launcher_2.11, not on spark-launcher_2.10). By default {{$\{scala.binary.version}} will be resolved to 2.10 because the scala-2.10 maven profile is active by default. Publishing of effective POMs is implemented using maven-shade-plugin. To be honest I am not sure how exactly it works. However, when I removed the following line from the parent POM {{<createDependencyReducedPom>false</createDependencyReducedPom>}} the build started to deploy effective POMs again. I hope my explanation helps. Published POMs are no longer effective POMs Key: SPARK-8781 URL: https://issues.apache.org/jira/browse/SPARK-8781 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.2, 1.4.1, 1.5.0 Reporter: Konstantin Shaposhnikov POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom: {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_${scala.binary.version}</artifactId> <version>${project.version}</version> </dependency> ... {noformat} while it should be {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_2.11</artifactId> <version>${project.version}</version> </dependency> ... {noformat} The following commits are most likely the cause of it: - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129 - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78 - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 On branch-1.4 reverting the commit fixed the issue. See SPARK-3812 for additional details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611602#comment-14611602 ] Davies Liu commented on SPARK-8632: --- [~justin.uang] Sounds interesting, could you sending out the PR? Poor Python UDF performance because of RDD caching -- Key: SPARK-8632 URL: https://issues.apache.org/jira/browse/SPARK-8632 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Justin Uang {quote} We have been running into performance problems using Python UDFs with DataFrames at large scale. From the implementation of BatchPythonEvaluation, it looks like the goal was to reuse the PythonRDD code. It caches the entire child RDD so that it can do two passes over the data. One to give to the PythonRDD, then one to join the python lambda results with the original row (which may have java objects that should be passed through). In addition, it caches all the columns, even the ones that don't need to be processed by the Python UDF. In the cases I was working with, I had a 500 column table, and i wanted to use a python UDF for one column, and it ended up caching all 500 columns. {quote} http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611605#comment-14611605 ] Apache Spark commented on SPARK-8785: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/7182 Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently, the parquet schema merging (ParquetRelation2.readSchema) may spend much time to merge duplicate schema. We can select only non duplicate schema and merge them later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611606#comment-14611606 ] Konstantin Shaposhnikov commented on SPARK-8781: The original commit that added effective POM publishing: https://github.com/apache/spark/commit/6e09c98b5d7ad92cf01a3b415008f48782f2f1a3 Published POMs are no longer effective POMs Key: SPARK-8781 URL: https://issues.apache.org/jira/browse/SPARK-8781 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.2, 1.4.1, 1.5.0 Reporter: Konstantin Shaposhnikov POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom: {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_${scala.binary.version}</artifactId> <version>${project.version}</version> </dependency> ... {noformat} while it should be {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_2.11</artifactId> <version>${project.version}</version> </dependency> ... {noformat} The following commits are most likely the cause of it: - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129 - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78 - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 On branch-1.4 reverting the commit fixed the issue. See SPARK-3812 for additional details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8785: --- Assignee: Apache Spark Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Apache Spark Currently, the parquet schema merging (ParquetRelation2.readSchema) may spend much time to merge duplicate schema. We can select only non duplicate schema and merge them later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8785: --- Assignee: (was: Apache Spark) Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently, the parquet schema merging (ParquetRelation2.readSchema) may spend much time to merge duplicate schema. We can select only non duplicate schema and merge them later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611547#comment-14611547 ] liyunzhang_intel commented on SPARK-5682: - [~hujiayin]: thanks for your comment. This feature is not based on hadoop 2.6; it was based on hadoop 2.6 in the original design. The latest design doc (20150506) shows that there are now two ways to implement encrypted shuffle in spark. Currently we only implement it on the spark-on-yarn framework. One is based on [Chimera (a project which strips the code related to CryptoInputStream/CryptoOutputStream from Hadoop to facilitate AES-NI based data encryption in other projects)|https://github.com/intel-hadoop/chimera] (see https://github.com/apache/spark/pull/5307). In the other, we implement all the crypto classes like CryptoInputStream/CryptoOutputStream in scala under the core/src/main/scala/org/apache/spark/crypto/ package (see https://github.com/apache/spark/pull/4491). As for the problem of importing the hadoop API in spark: if the interface of a hadoop class is public and stable, it can be used in spark. https://hadoop.apache.org/docs/current/api/org/apache/hadoop/classification/InterfaceStability.html says: {quote} Incompatible changes must not be made to classes marked as stable. {quote} which means that when a class is marked stable, later releases will not change it. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; they are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses encryption algorithms OpenSSL provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
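For readers unfamiliar with the stream wrapping involved, a small illustrative Scala sketch (not the proposed patch) of AES/CTR encryption with the JDK's JCE classes, the kind of codec the JceAesCtrCryptoCodec-based design describes; the all-zero key and IV are placeholders, not how keys would actually be managed.
{code:title=scala}
import java.io.ByteArrayOutputStream
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Placeholder key and IV (all zeros) purely for demonstration.
val key = new SecretKeySpec(new Array[Byte](16), "AES")
val iv  = new IvParameterSpec(new Array[Byte](16))
val cipher = Cipher.getInstance("AES/CTR/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, key, iv)

// Wrap an ordinary output stream so bytes are encrypted as they are written,
// which is the general shape of an encrypted shuffle output stream.
val sink = new ByteArrayOutputStream()
val encrypted = new CipherOutputStream(sink, cipher)
encrypted.write("shuffle bytes".getBytes("UTF-8"))
encrypted.close()
{code}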
[jira] [Created] (SPARK-8785) Improve Parquet schema merging
Liang-Chi Hsieh created SPARK-8785: -- Summary: Improve Parquet schema merging Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently, the parquet schema merging (ParquetRelation2.readSchema) may spend much time to merge duplicate schema. We can select only non duplicate schema and merge them later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
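An illustrative Scala sketch of the proposed idea; {{mergeTwo}} below is a hypothetical stand-in for the existing pairwise merge in ParquetRelation2.readSchema, and the point is only that deduplicating first shrinks the number of merges.
{code:title=scala}
import org.apache.spark.sql.types.StructType

// `mergeTwo` stands in for the existing pairwise schema merge; deduplicate
// the per-file schemas first so only distinct schemas are merged.
def mergeAll(schemas: Seq[StructType],
             mergeTwo: (StructType, StructType) => StructType): Option[StructType] =
  schemas.distinct.reduceOption(mergeTwo)
{code}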
[jira] [Created] (SPARK-8786) Create a wrapper for BinaryType
Davies Liu created SPARK-8786: - Summary: Create a wrapper for BinaryType Key: SPARK-8786 URL: https://issues.apache.org/jira/browse/SPARK-8786 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8786) Create a wrapper for BinaryType
[ https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8786: -- Description: The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper (internally) to do that. (was: The hashCode and equals() of Array[Byte] does check the bytes, we should create a wrapper to do that.) Create a wrapper for BinaryType --- Key: SPARK-8786 URL: https://issues.apache.org/jira/browse/SPARK-8786 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper (internally) to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
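A minimal Scala sketch of what such a wrapper could look like (illustrative only, not the actual implementation): equality and hashing are delegated to the byte contents via java.util.Arrays, unlike Array[Byte] which uses reference identity.
{code:title=scala}
import java.util.Arrays

// Wraps a byte array so that equals/hashCode look at the contents.
final class BinaryWrapper(val bytes: Array[Byte]) {
  override def hashCode(): Int = Arrays.hashCode(bytes)
  override def equals(other: Any): Boolean = other match {
    case that: BinaryWrapper => Arrays.equals(this.bytes, that.bytes)
    case _ => false
  }
}
{code}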
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611618#comment-14611618 ] Sean Owen commented on SPARK-8781: -- Right, I get all that. Yes, that makes it clear what the connection is to https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 -- it's the createDependencyReducedPom issue, maybe. [~andrewor14] do you have more color on why that bit was needed? Published POMs are no longer effective POMs Key: SPARK-8781 URL: https://issues.apache.org/jira/browse/SPARK-8781 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.2, 1.4.1, 1.5.0 Reporter: Konstantin Shaposhnikov POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom: {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_${scala.binary.version}</artifactId> <version>${project.version}</version> </dependency> ... {noformat} while it should be {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_2.11</artifactId> <version>${project.version}</version> </dependency> ... {noformat} The following commits are most likely the cause of it: - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129 - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78 - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 On branch-1.4 reverting the commit fixed the issue. See SPARK-3812 for additional details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8787) Change the parameter order of @deprecated in package object sql
Vinod KC created SPARK-8787: --- Summary: Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Priority: Trivial The parameter order of the @deprecated annotation in package object sql is wrong: deprecated("1.3.0", "use DataFrame"). This has to be changed to deprecated("use DataFrame", "1.3.0"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
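For reference, a sketch of the corrected annotation as it might look on the deprecated SchemaRDD alias; the surrounding package object here is illustrative, not Spark's actual one. In Scala, @deprecated takes (message, since), so the version string goes second.
{code:title=scala}
package object example {
  import org.apache.spark.sql.DataFrame

  // Correct argument order: the message comes first, the version second.
  @deprecated("use DataFrame", "1.3.0")
  type SchemaRDD = DataFrame
}
{code}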
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611527#comment-14611527 ] hujiayin commented on SPARK-5682: - Steps were added to encode and decode the data, so the performance will not be as fast as before. At the same time, the code also has a security issue, for example saving the plain text in a configuration file and finally using it as part of the key. Also, the feature is based on hadoop 2.6, which is a limitation; that is why I said it relies on hadoop. Though the API is public and stable, you cannot ensure that it will not be changed, since it is not commercial software. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; they are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses encryption algorithms OpenSSL provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8754) YarnClientSchedulerBackend doesn't stop gracefully in failure conditions
[ https://issues.apache.org/jira/browse/SPARK-8754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8754. Resolution: Fixed Fix Version/s: 1.4.2 1.5.0 Target Version/s: 1.5.0, 1.4.2 YarnClientSchedulerBackend doesn't stop gracefully in failure conditions Key: SPARK-8754 URL: https://issues.apache.org/jira/browse/SPARK-8754 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Reporter: Devaraj K Priority: Minor Fix For: 1.5.0, 1.4.2 {code:xml} java.lang.NullPointerException at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:151) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:421) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1447) at org.apache.spark.SparkContext.stop(SparkContext.scala:1651) at org.apache.spark.SparkContext.init(SparkContext.scala:572) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:621) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} If the application has FINISHED/FAILED/KILLED or failed to launch the application master, monitorThread is not initialized, but monitorThread.interrupt() is invoked as part of stop() without any check. This causes the NPE and also prevents the client from stopping. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
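A hedged sketch of the guard the description calls for; the names are illustrative, not the actual YarnClientSchedulerBackend fields.
{code:title=scala}
// Names are illustrative, not the actual YarnClientSchedulerBackend members.
object StopGuardSketch {
  @volatile private var monitorThread: Thread = null

  def stop(): Unit = {
    // Guard against the case where the application failed before the
    // monitor thread was ever started, which would otherwise cause an NPE.
    if (monitorThread != null) {
      monitorThread.interrupt()
    }
  }
}
{code}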
[jira] [Updated] (SPARK-8687) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
[ https://issues.apache.org/jira/browse/SPARK-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8687: - Fix Version/s: 1.4.2 Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor. --- Key: SPARK-8687 URL: https://issues.apache.org/jira/browse/SPARK-8687 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.5.0, 1.4.2 Yarn will set +spark.yarn.credentials.file+ after *DriverEndpoint* is initialized, so the executor will fetch the old configuration, which causes the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8687) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
[ https://issues.apache.org/jira/browse/SPARK-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8687: - Target Version/s: 1.5.0, 1.4.2 (was: 1.5.0) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor. --- Key: SPARK-8687 URL: https://issues.apache.org/jira/browse/SPARK-8687 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.5.0, 1.4.2 Yarn will set +spark.yarn.credentials.file+ after *DriverEndpoint* is initialized, so the executor will fetch the old configuration, which causes the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
[ https://issues.apache.org/jira/browse/SPARK-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8771. Resolution: Fixed Assignee: holdenk Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Actor system deprecation tag uses deprecated deprecation tag Key: SPARK-8771 URL: https://issues.apache.org/jira/browse/SPARK-8771 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: holdenk Assignee: holdenk Priority: Trivial Fix For: 1.5.0 The deprecation of the actor system adds a spurious build warning: {quote} @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(Actor system is no longer supported as of 1.4) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
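An illustrative sketch of the fix the warning asks for, supplying both the message and the since-version arguments; the enclosing object and the method body are placeholders, not the actual SparkEnv code.
{code:title=scala}
object DeprecationSketch {
  // Supplying both the message and the "since" version avoids the
  // "@deprecated now takes two arguments" build warning.
  @deprecated("Actor system is no longer supported as of 1.4", "1.4")
  def actorSystem: AnyRef = throw new UnsupportedOperationException("no longer supported")
}
{code}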
[jira] [Updated] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
[ https://issues.apache.org/jira/browse/SPARK-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8771: - Affects Version/s: 1.4.0 Actor system deprecation tag uses deprecated deprecation tag Key: SPARK-8771 URL: https://issues.apache.org/jira/browse/SPARK-8771 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: holdenk Priority: Trivial The deprecation of the actor system adds a spurious build warning: {quote} @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(Actor system is no longer supported as of 1.4) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8783) CTAS with WITH clause does not work
[ https://issues.apache.org/jira/browse/SPARK-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8783: --- Assignee: Apache Spark CTAS with WITH clause does not work --- Key: SPARK-8783 URL: https://issues.apache.org/jira/browse/SPARK-8783 Project: Spark Issue Type: Bug Components: SQL Reporter: Keuntae Park Assignee: Apache Spark Priority: Minor Following CTAS with WITH clause query {code} CREATE TABLE with_table1 AS WITH T AS ( SELECT * FROM table1 ) SELECT * FROM T {code} induces following error {code} no such table T; line 7 pos 5 org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5 ... {code} I think that WITH clause within CTAS is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8783) CTAS with WITH clause does not work
[ https://issues.apache.org/jira/browse/SPARK-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8783: --- Assignee: (was: Apache Spark) CTAS with WITH clause does not work --- Key: SPARK-8783 URL: https://issues.apache.org/jira/browse/SPARK-8783 Project: Spark Issue Type: Bug Components: SQL Reporter: Keuntae Park Priority: Minor Following CTAS with WITH clause query {code} CREATE TABLE with_table1 AS WITH T AS ( SELECT * FROM table1 ) SELECT * FROM T {code} induces following error {code} no such table T; line 7 pos 5 org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5 ... {code} I think that WITH clause within CTAS is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8783) CTAS with WITH clause does not work
[ https://issues.apache.org/jira/browse/SPARK-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611556#comment-14611556 ] Apache Spark commented on SPARK-8783: - User 'sirpkt' has created a pull request for this issue: https://github.com/apache/spark/pull/7180 CTAS with WITH clause does not work --- Key: SPARK-8783 URL: https://issues.apache.org/jira/browse/SPARK-8783 Project: Spark Issue Type: Bug Components: SQL Reporter: Keuntae Park Priority: Minor Following CTAS with WITH clause query {code} CREATE TABLE with_table1 AS WITH T AS ( SELECT * FROM table1 ) SELECT * FROM T {code} induces following error {code} no such table T; line 7 pos 5 org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5 ... {code} I think that WITH clause within CTAS is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8784) Add python API for hex/unhex
[ https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8784: --- Assignee: Apache Spark (was: Davies Liu) Add python API for hex/unhex Key: SPARK-8784 URL: https://issues.apache.org/jira/browse/SPARK-8784 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8784) Add python API for hex/unhex
[ https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8784: --- Assignee: Davies Liu (was: Apache Spark) Add python API for hex/unhex Key: SPARK-8784 URL: https://issues.apache.org/jira/browse/SPARK-8784 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8784) Add python API for hex/unhex
[ https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611575#comment-14611575 ] Apache Spark commented on SPARK-8784: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7181 Add python API for hex/unhex Key: SPARK-8784 URL: https://issues.apache.org/jira/browse/SPARK-8784 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
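For reference, the Python wrappers would mirror the behaviour of the SQL expressions; a rough sketch of that behaviour, assuming the functions follow their Hive counterparts (they were still being wired up when this sub-task was filed) and that {{sqlContext}} is available as in the shell:
{code}
// hex converts numbers or strings to their hexadecimal representation,
// unhex converts a hex string back to the underlying bytes.
sqlContext.sql("SELECT hex(17), hex('Spark'), unhex('537061726B')").show()
// hex(17) -> "11"; hex('Spark') -> "537061726B"; unhex reverses it.
{code}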
[jira] [Resolved] (SPARK-8691) Enable GZip for Web UI
[ https://issues.apache.org/jira/browse/SPARK-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8691. -- Resolution: Duplicate Enable GZip for Web UI -- Key: SPARK-8691 URL: https://issues.apache.org/jira/browse/SPARK-8691 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu When there are massive tasks in the stage page (such as, running {{sc.parallelize(1 to 10, 1).count()}}), the size of the stage page is large. Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
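To get a feel for the potential saving, highly repetitive markup such as a huge task table compresses extremely well; an illustration using the plain JDK gzip stream (this is not the actual Jetty wiring, just a sizing sketch):
{code}
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// A stand-in for a stage page with ~100k near-identical table rows.
val page = ("<tr><td>task</td><td>SUCCESS</td><td>10 s</td></tr>" * 100000).getBytes("UTF-8")
val bos = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(bos)
gz.write(page)
gz.close()
println(s"raw = ${page.length} bytes, gzipped = ${bos.size()} bytes")
{code}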
[jira] [Commented] (SPARK-6573) Convert inbound NaN values as null
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611565#comment-14611565 ] Josh Rosen commented on SPARK-6573: --- NaN can lead to confusing exceptions during sorting if it appears in a column. I just ran into an issue where Sort threw a Comparison method violates its general contract! error for data containing NaN columns. See my comments at https://github.com/apache/spark/pull/7179#discussion_r33749911 Convert inbound NaN values as null -- Key: SPARK-6573 URL: https://issues.apache.org/jira/browse/SPARK-6573 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Fabian Boehnlein In pandas it is common to use numpy.nan as the null value, for missing data or whatever. http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna createDataFrame however only works with None as null values, parsing them as None in the RDD. I suggest to add support for np.nan values in pandas DataFrames. current stracktrace when calling a DataFrame with object type columns with np.nan values (which are floats) {code} TypeError Traceback (most recent call last) ipython-input-38-34f0263f0bf4 in module() 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema) /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 339 schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio) 340 -- 341 return self.applySchema(data, schema) 342 343 def registerDataFrameAsTable(self, rdd, tableName): /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema) 246 247 for row in rows: -- 248 _verify_type(row, schema) 249 250 # convert python objects to sql data /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1064 length of fields (%d) % (len(obj), len(dataType.fields))) 1065 for v, f in zip(obj, dataType.fields): - 1066 _verify_type(v, f.dataType) 1067 1068 _cached_cls = weakref.WeakValueDictionary() /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1048 if type(obj) not in _acceptable_types[_type]: 1049 raise TypeError(%s can not accept object in type %s - 1050 % (dataType, type(obj))) 1051 1052 if isinstance(dataType, ArrayType): TypeError: StringType can not accept object in type type 'float'{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
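The contract violation is easy to reproduce in isolation, because every ordered comparison against NaN returns false, so a comparator written with < and > treats NaN as equal to everything while other pairs still compare unequal; a minimal illustration (not Spark's actual comparator):
{code}
val cmp = new java.util.Comparator[Double] {
  // Naive comparator: looks fine until NaN shows up.
  override def compare(a: Double, b: Double): Int =
    if (a < b) -1 else if (a > b) 1 else 0
}
println(cmp.compare(1.0, Double.NaN)) // 0, i.e. "equal"
println(cmp.compare(Double.NaN, 2.0)) // 0, i.e. "equal"
println(cmp.compare(1.0, 2.0))        // -1, contradicting the two results above
{code}
TimSort detects exactly this kind of inconsistency and throws the general-contract error.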
[jira] [Commented] (SPARK-8773) Throw type mismatch in check analysis for expressions with expected input types defined
[ https://issues.apache.org/jira/browse/SPARK-8773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611623#comment-14611623 ] Akhil Thatipamula commented on SPARK-8773: -- [~rxin] aren't we checking that already, |case e: Expression if e.checkInputDataTypes().isFailure| am I missing something? Throw type mismatch in check analysis for expressions with expected input types defined --- Key: SPARK-8773 URL: https://issues.apache.org/jira/browse/SPARK-8773 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8740) Support GitHub OAuth tokens in dev/merge_spark_pr.py
[ https://issues.apache.org/jira/browse/SPARK-8740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8740. Resolution: Fixed Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Support GitHub OAuth tokens in dev/merge_spark_pr.py Key: SPARK-8740 URL: https://issues.apache.org/jira/browse/SPARK-8740 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Josh Rosen Assignee: Josh Rosen Priority: Minor Fix For: 1.5.0 We should allow dev/merge_spark_pr.py to use personal GitHub OAuth tokens in order to make authenticated requests. This is necessary to work around per-IP rate limiting issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
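For reference, GitHub's API accepts a personal access token in the Authorization header, which is what lets authenticated requests escape the per-IP limit; the script itself is Python, so this is only an illustration of the header, with a placeholder token:
{code}
val url = new java.net.URL("https://api.github.com/rate_limit")
val conn = url.openConnection().asInstanceOf[java.net.HttpURLConnection]
conn.setRequestProperty("Authorization", "token <personal-access-token>")
println(conn.getResponseCode) // authenticated requests get a much higher quota
{code}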
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611553#comment-14611553 ] hujiayin commented on SPARK-5682: - Since the encrypted shuffle in Spark focuses on the common module, it may not be good to use the Hadoop API. On the other hand, the AES solution is a bit heavy for encoding/decoding live streaming data. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data process safer. This feature is necessary in Spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
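For context, the JDK already ships AES-CTR through JCE, which is the kind of primitive a JCE-based codec builds on; a minimal sketch, with key and IV handling that is purely illustrative and not how credentials would actually be managed:
{code}
import java.io.ByteArrayOutputStream
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// 128-bit key and counter IV, hard-coded only for illustration.
val key = new SecretKeySpec(Array.fill[Byte](16)(1.toByte), "AES")
val iv  = new IvParameterSpec(new Array[Byte](16))
val cipher = Cipher.getInstance("AES/CTR/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, key, iv)

val out = new ByteArrayOutputStream()
val enc = new CipherOutputStream(out, cipher)
enc.write("shuffle block bytes".getBytes("UTF-8"))
enc.close()
println(s"ciphertext length = ${out.size()}")
{code}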
[jira] [Assigned] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8787: --- Assignee: Apache Spark Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Assignee: Apache Spark Priority: Trivial Parameter order of @deprecated annotation in package object sql is wrong deprecated(1.3.0, use DataFrame) . This has to be changed to deprecated(use DataFrame, 1.3.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8787: --- Assignee: (was: Apache Spark) Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Priority: Trivial Parameter order of @deprecated annotation in package object sql is wrong deprecated(1.3.0, use DataFrame) . This has to be changed to deprecated(use DataFrame, 1.3.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
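For reference, Scala's {{@deprecated}} annotation takes the message first and the version second, so the corrected form looks like the following (the member shown is a placeholder, not the actual definition in the package object):
{code}
// Signature is @deprecated(message: String, since: String).
@deprecated("use DataFrame", "1.3.0")
def oldApi(): Unit = ()
{code}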
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611747#comment-14611747 ] Vincent Warmerdam commented on SPARK-8596: -- Cool, I would love to hear your end of the story. It seems the only remaining hurdle is getting the script to work. On a slightly different subject: I'm not just a frequent R user, I do a lot of Python as well. Is there a similar ticket for the IPython (Jupyter) notebook? It seems like the most appropriate GUI for the Python language. Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8788) Java unit test for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611666#comment-14611666 ] Apache Spark commented on SPARK-8788: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7184 Java unit test for PCA transformer -- Key: SPARK-8788 URL: https://issues.apache.org/jira/browse/SPARK-8788 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Java unit test for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8788) Java unit test for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8788: --- Assignee: (was: Apache Spark) Java unit test for PCA transformer -- Key: SPARK-8788 URL: https://issues.apache.org/jira/browse/SPARK-8788 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Java unit test for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8788) Java unit test for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8788: --- Assignee: Apache Spark Java unit test for PCA transformer -- Key: SPARK-8788 URL: https://issues.apache.org/jira/browse/SPARK-8788 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Assignee: Apache Spark Add Java unit test for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
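The test would exercise the same builder-style API the Scala suite uses; a rough sketch of those calls (shown in Scala for brevity, assuming the spark.ml PCA transformer as it was being added for 1.5 and a {{sqlContext}} as in the shell):
{code}
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

// Two toy rows with 5 features, projected down to 2 principal components.
val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
).map(Tuple1.apply)
val df = sqlContext.createDataFrame(data).toDF("features")

val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(2)
val model = pca.fit(df)
model.transform(df).select("pcaFeatures").show()
{code}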
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611680#comment-14611680 ] Vincent Warmerdam commented on SPARK-8684: -- Mhm... I've tried multiple approaches. My colleague even had a look at it and it left him without a clue. I made a Stack Overflow question for advice: http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the Amazon AMI forces you to use the Amazon repos if the package you need is also available in the Amazon package system... which only has old versions. Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611680#comment-14611680 ] Vincent Warmerdam edited comment on SPARK-8684 at 7/2/15 9:10 AM: -- Mhm... I've tried multiple approaches. My collegue even had a look at it and left him without a clue. Made a stackoverflow question for advice. http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the amazon AMI forces you to use the amazon repos if the package you need is also available in the amazon package system... which only have the old versions. Does anybody know of a place where we could ask amazon to just add it? was (Author: cantdutchthis): Mhm... I've tried multiple approaches. My collegue even had a look at it and left him without a clue. Made a stackoverflow question for advice. http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the amazon AMI forces you to use the amazon repos if the package you need is also available in the amazon package system... which only have the old versions. Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching a EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8244) string function: find_in_set
[ https://issues.apache.org/jira/browse/SPARK-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8244: --- Assignee: Cheng Hao (was: Apache Spark) string function: find_in_set Key: SPARK-8244 URL: https://issues.apache.org/jira/browse/SPARK-8244 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Priority: Minor find_in_set(string str, string strList): int Returns the first occurrence of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8244) string function: find_in_set
[ https://issues.apache.org/jira/browse/SPARK-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611698#comment-14611698 ] Apache Spark commented on SPARK-8244: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/7186 string function: find_in_set Key: SPARK-8244 URL: https://issues.apache.org/jira/browse/SPARK-8244 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Priority: Minor find_in_set(string str, string strList): int Returns the first occurrence of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
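Based on the Hive semantics quoted in the description, the expected behaviour once the expression lands would be the following (SQL only, per the ticket; shown through {{sqlContext.sql}} purely for illustration):
{code}
sqlContext.sql("SELECT find_in_set('ab', 'abc,b,ab,c,def')").show()   // 3
sqlContext.sql("SELECT find_in_set('ab,c', 'abc,b,ab,c,def')").show() // 0, the needle contains a comma
sqlContext.sql("SELECT find_in_set(null, 'abc,b,ab,c,def')").show()   // null
{code}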
[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611669#comment-14611669 ] Apache Spark commented on SPARK-8389: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/7185 Expose KafkaRDDs offsetRange in Python -- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
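For reference, the Scala side already exposes offset ranges on the RDDs produced by the direct Kafka API, which is what the Python API would need to mirror; a sketch, assuming a direct stream created with KafkaUtils.createDirectStream:
{code}
import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  // Each KafkaRDD carries the offset ranges it was built from.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic} partition ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
  }
}
{code}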
[jira] [Comment Edited] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611680#comment-14611680 ] Vincent Warmerdam edited comment on SPARK-8684 at 7/2/15 9:09 AM: -- Mhm... I've tried multiple approaches. My collegue even had a look at it and left him without a clue. Made a stackoverflow question for advice. http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the amazon AMI forces you to use the amazon repos if the package you need is also available in the amazon package system... which only have the old versions. was (Author: cantdutchthis): Mhm... I've tried multiple approaches. My collegue even had a look at it and left him without a clue. Make a stackoverflow question for advice. http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the amazon AMI forces you to use the amazon repos if the package you need is also available in the amazon package system... which only have the old versions. Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching a EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8463) No suitable driver found for write.jdbc
[ https://issues.apache.org/jira/browse/SPARK-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611670#comment-14611670 ] Reynold Xin commented on SPARK-8463: [~mlety2] can you test this the patch created by [~viirya]? No suitable driver found for write.jdbc --- Key: SPARK-8463 URL: https://issues.apache.org/jira/browse/SPARK-8463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Environment: Mesos, Ubuntu Reporter: Matthew Jones I am getting a java.sql.SQLException: No suitable driver found for jdbc:mysql://dbhost/test when using df.write.jdbc. I do not get this error when reading from the same database. This simple script can repeat the problem. First one must create a database called test with a table called table1 and insert some rows in it. The user test:secret must have read/write permissions. *testJDBC.scala:* import java.util.Properties import org.apache.spark.sql.Row import java.sql.Struct import org.apache.spark.sql.types.\{StructField, StructType, IntegerType, StringType} import org.apache.spark.\{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext val properties = new Properties() properties.setProperty(user, test) properties.setProperty(password, secret) val readTable = sqlContext.read.jdbc(jdbc:mysql://dbhost/test, table1, properties) print(readTable.show()) val rows = sc.parallelize(List(Row(1, write), Row(2, me))) val writeTable = sqlContext.createDataFrame(rows, StructType(List(StructField(id, IntegerType), StructField(name, StringType writeTable.write.jdbc(jdbc:mysql://dbhost/test, table2, properties)}} This is run using: {{spark-shell --conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.35-bin.jar --driver-class-path /path/to/mysql-connector-java-5.1.35-bin.jar --jars /path/to/mysql-connector-java-5.1.35-bin.jar -i:testJDBC.scala}} The read works fine and will print the rows in the table. The write fails with {{java.sql.SQLException: No suitable driver found for jdbc:mysql://dbhost/test}}. The new table is successfully created but it is empty. I have tested this on a Mesos cluster with Spark 1.4.0 and the current master branch as of June 18th. In the executor logs I do see before the error: INFO Utils: Fetching http://146.203.54.236:50624/jars/mysql-connector-java-5.1.35-bin.jar INFO Executor: Adding file:/tmp/mesos/slaves/.../mysql-connector-java-5.1.35-bin.jar to class loader A workaround is to add the mysql-connector-java-5.1.35-bin.jar to the same location on each executor node as defined in spark.executor.extraClassPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8726) Wrong spark.executor.memory when using different EC2 master and worker machine types
[ https://issues.apache.org/jira/browse/SPARK-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefano Parmesan updated SPARK-8726: Description: _(this is a mirror of [SPARK-8726|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). was:By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). Wrong spark.executor.memory when using different EC2 master and worker machine types Key: SPARK-8726 URL: https://issues.apache.org/jira/browse/SPARK-8726 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Stefano Parmesan _(this is a mirror of [SPARK-8726|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8726) Wrong spark.executor.memory when using different EC2 master and worker machine types
[ https://issues.apache.org/jira/browse/SPARK-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefano Parmesan updated SPARK-8726: Description: _(this is a mirror of [MESOS-2985|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). was: _(this is a mirror of [SPARK-8726|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). Wrong spark.executor.memory when using different EC2 master and worker machine types Key: SPARK-8726 URL: https://issues.apache.org/jira/browse/SPARK-8726 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Stefano Parmesan _(this is a mirror of [MESOS-2985|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
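Until the template is fixed, a workaround is to set the executor memory explicitly in the application instead of relying on the EC2 default; the value below is illustrative only:
{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "6g") // override the min(master, slave) default
{code}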
[jira] [Updated] (SPARK-8463) No suitable driver found for write.jdbc
[ https://issues.apache.org/jira/browse/SPARK-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8463: --- Shepherd: Reynold Xin Assignee: Liang-Chi Hsieh Target Version/s: 1.5.0, 1.4.2 No suitable driver found for write.jdbc --- Key: SPARK-8463 URL: https://issues.apache.org/jira/browse/SPARK-8463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Environment: Mesos, Ubuntu Reporter: Matthew Jones Assignee: Liang-Chi Hsieh I am getting a java.sql.SQLException: No suitable driver found for jdbc:mysql://dbhost/test when using df.write.jdbc. I do not get this error when reading from the same database. This simple script can repeat the problem. First one must create a database called test with a table called table1 and insert some rows in it. The user test:secret must have read/write permissions. *testJDBC.scala:* import java.util.Properties import org.apache.spark.sql.Row import java.sql.Struct import org.apache.spark.sql.types.\{StructField, StructType, IntegerType, StringType} import org.apache.spark.\{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext val properties = new Properties() properties.setProperty(user, test) properties.setProperty(password, secret) val readTable = sqlContext.read.jdbc(jdbc:mysql://dbhost/test, table1, properties) print(readTable.show()) val rows = sc.parallelize(List(Row(1, write), Row(2, me))) val writeTable = sqlContext.createDataFrame(rows, StructType(List(StructField(id, IntegerType), StructField(name, StringType writeTable.write.jdbc(jdbc:mysql://dbhost/test, table2, properties)}} This is run using: {{spark-shell --conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.35-bin.jar --driver-class-path /path/to/mysql-connector-java-5.1.35-bin.jar --jars /path/to/mysql-connector-java-5.1.35-bin.jar -i:testJDBC.scala}} The read works fine and will print the rows in the table. The write fails with {{java.sql.SQLException: No suitable driver found for jdbc:mysql://dbhost/test}}. The new table is successfully created but it is empty. I have tested this on a Mesos cluster with Spark 1.4.0 and the current master branch as of June 18th. In the executor logs I do see before the error: INFO Utils: Fetching http://146.203.54.236:50624/jars/mysql-connector-java-5.1.35-bin.jar INFO Executor: Adding file:/tmp/mesos/slaves/.../mysql-connector-java-5.1.35-bin.jar to class loader A workaround is to add the mysql-connector-java-5.1.35-bin.jar to the same location on each executor node as defined in spark.executor.extraClassPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8244) string function: find_in_set
[ https://issues.apache.org/jira/browse/SPARK-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8244: --- Assignee: Apache Spark (was: Cheng Hao) string function: find_in_set Key: SPARK-8244 URL: https://issues.apache.org/jira/browse/SPARK-8244 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Minor find_in_set(string str, string strList): int Returns the first occurrence of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8788) Java unit test for PCA transformer
Yanbo Liang created SPARK-8788: -- Summary: Java unit test for PCA transformer Key: SPARK-8788 URL: https://issues.apache.org/jira/browse/SPARK-8788 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Java unit test for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7401) Dot product and squared_distances should be vectorized in Vectors
[ https://issues.apache.org/jira/browse/SPARK-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7401: --- Priority: Major (was: Minor) Dot product and squared_distances should be vectorized in Vectors - Key: SPARK-7401 URL: https://issues.apache.org/jira/browse/SPARK-7401 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611811#comment-14611811 ] David Sabater commented on SPARK-8616: -- I would assume the error here is the lack of support for columns containing characters like ,;{}() = (This includes whitespaces which was my initial issue) If we are ok restricting this we just need to improve the error message when the exception is raised. I would suggest to revisit this in the Maillist to see what are the opinions out there. SQLContext doesn't handle tricky column names when loading from JDBC Key: SPARK-8616 URL: https://issues.apache.org/jira/browse/SPARK-8616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0 Reporter: Gergely Svigruha Reproduce: - create a table in a relational database (in my case sqlite) with a column name containing a space: CREATE TABLE my_table (id INTEGER, tricky column TEXT); - try to create a DataFrame using that table: sqlContext.read.format(jdbc).options(Map( url - jdbs:sqlite:..., dbtable - my_table)).load() java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such column: tricky) According to the SQL spec this should be valid: http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611811#comment-14611811 ] David Sabater edited comment on SPARK-8616 at 7/2/15 11:39 AM: --- I would assume the error here is the lack of support for column names containing characters like ,;{}() = (This includes whitespaces which was my initial issue) If we are ok restricting this we just need to improve the error message when the exception is raised. I would suggest to revisit this in the Maillist to see what are the opinions out there. was (Author: dsdinter): I would assume the error here is the lack of support for columns containing characters like ,;{}() = (This includes whitespaces which was my initial issue) If we are ok restricting this we just need to improve the error message when the exception is raised. I would suggest to revisit this in the Maillist to see what are the opinions out there. SQLContext doesn't handle tricky column names when loading from JDBC Key: SPARK-8616 URL: https://issues.apache.org/jira/browse/SPARK-8616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0 Reporter: Gergely Svigruha Reproduce: - create a table in a relational database (in my case sqlite) with a column name containing a space: CREATE TABLE my_table (id INTEGER, tricky column TEXT); - try to create a DataFrame using that table: sqlContext.read.format(jdbc).options(Map( url - jdbs:sqlite:..., dbtable - my_table)).load() java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such column: tricky) According to the SQL spec this should be valid: http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
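The SQL-99 reference boils down to delimited (double-quoted) identifiers: at the plain JDBC level the query works once the column is quoted, which is what the generated projection would need to emit. Connection string and driver availability are assumed for illustration:
{code}
// Requires the SQLite JDBC driver on the classpath; path is illustrative.
val conn = java.sql.DriverManager.getConnection("jdbc:sqlite:/tmp/test.db")
val rs = conn.createStatement()
  .executeQuery("""SELECT "tricky column" FROM my_table""")
while (rs.next()) println(rs.getString("tricky column"))
{code}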
[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611960#comment-14611960 ] Daniel Darabos commented on SPARK-5945: --- At the moment we have a ton of these infinite retries. A stage is retried a few dozen times, then its parent goes missing and Spark starts retrying the parent until it also goes missing... We are still debugging the cause of our fetch failures, but I just wanted to mention that if there were a {{spark.stage.maxFailures}} option, we would be setting it to 1 at this point. Thanks for all the work on this bug. Even if it's not fixed yet, it's very informative. Spark should not retry a stage infinitely on a FetchFailedException --- Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid Assignee: Ilya Ganelin While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the cause of each retry as well to still have the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. 
Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2319) Number of tasks on executors become negative after executor failures
[ https://issues.apache.org/jira/browse/SPARK-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611849#comment-14611849 ] KaiXinXIaoLei commented on SPARK-2319: -- Using the latest version (1.4), I also met the same problem. Number of tasks on executors become negative after executor failures Key: SPARK-2319 URL: https://issues.apache.org/jira/browse/SPARK-2319 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Andrew Or Fix For: 1.4.0 Attachments: num active tasks become negative (-16).jpg See attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8787: - Assignee: Vinod KC Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Assignee: Vinod KC Priority: Trivial Fix For: 1.5.0, 1.4.2 Parameter order of @deprecated annotation in package object sql is wrong deprecated(1.3.0, use DataFrame) . This has to be changed to deprecated(use DataFrame, 1.3.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8787. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.2 Issue resolved by pull request 7183 [https://github.com/apache/spark/pull/7183] Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Priority: Trivial Fix For: 1.4.2, 1.5.0 Parameter order of @deprecated annotation in package object sql is wrong deprecated(1.3.0, use DataFrame) . This has to be changed to deprecated(use DataFrame, 1.3.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8791) Make a better hashcode for InternalRow
Cheng Hao created SPARK-8791: Summary: Make a better hashcode for InternalRow Key: SPARK-8791 URL: https://issues.apache.org/jira/browse/SPARK-8791 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, InternalRow does not handle complex data types well when computing its hashCode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
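The usual fix is the standard combine-field-hashes pattern, with array-typed values hashed by content instead of by reference; a generic sketch, not the actual InternalRow code:
{code}
// Fold a hash across all fields, delegating arrays to java.util.Arrays so
// that complex values with equal contents produce equal hash codes.
def hashOf(values: Seq[Any]): Int =
  values.foldLeft(37) { (h, v) =>
    val fieldHash = v match {
      case null             => 0
      case a: Array[Byte]   => java.util.Arrays.hashCode(a)
      case a: Array[AnyRef] => java.util.Arrays.deepHashCode(a)
      case other            => other.hashCode()
    }
    31 * h + fieldHash
  }
{code}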
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Description: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. was: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m until OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Description: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. was: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2319) Number of tasks on executors become negative after executor failures
[ https://issues.apache.org/jira/browse/SPARK-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611856#comment-14611856 ] KaiXinXIaoLei edited comment on SPARK-2319 at 7/2/15 12:08 PM: --- Using 1.4, num active tasks become negative, and Complete Tasks is more bigger than Total Tasks was (Author: kaixinxiaolei): Using 1.4, num active tasks become negative, and Complete Tasks is more bigger then Total Tasks Number of tasks on executors become negative after executor failures Key: SPARK-2319 URL: https://issues.apache.org/jira/browse/SPARK-2319 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Andrew Or Fix For: 1.4.0 Attachments: active tasks.png, num active tasks become negative (-16).jpg See attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6833) Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.
[ https://issues.apache.org/jira/browse/SPARK-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6833: - Assignee: Sun Rui Extend `addPackage` so that any given R file can be sourced in the worker before functions are run. --- Key: SPARK-6833 URL: https://issues.apache.org/jira/browse/SPARK-6833 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Assignee: Sun Rui Priority: Minor Fix For: 1.5.0 Similar to how extra python files or packages can be specified (in zip / egg formats), it will be good to support the ability to add extra R files to the executors working directory. One thing that needs to be investigated is if this will just work out of the box using the spark-submit flag --files ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Attachment: executor.log driver.log BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m until OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Attachment: webui-executor.png BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m until OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Attachment: webui-slow-task.png BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m until OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Description: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. was: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8789) improve SQLQuerySuite resilience by dropping tables in setup
[ https://issues.apache.org/jira/browse/SPARK-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8789: --- Assignee: Apache Spark improve SQLQuerySuite resilience by dropping tables in setup Key: SPARK-8789 URL: https://issues.apache.org/jira/browse/SPARK-8789 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Steve Loughran Assignee: Apache Spark Priority: Minor When some of the tests in {{SQLQuerySuite}} have problems, follow-up test runs fail because the tables are still present. This can be addressed by dropping the tables at startup and by adding try/finally clauses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8789) improve SQLQuerySuite resilience by dropping tables in setup
[ https://issues.apache.org/jira/browse/SPARK-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8789: --- Assignee: (was: Apache Spark) improve SQLQuerySuite resilience by dropping tables in setup Key: SPARK-8789 URL: https://issues.apache.org/jira/browse/SPARK-8789 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Steve Loughran Priority: Minor When some of the tests in {{SQLQuerySuite}} have problems, follow-up test runs fail because the tables are still present. This can be addressed by dropping the tables at startup and by adding try/finally clauses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8791) Make a better hashcode for InternalRow
[ https://issues.apache.org/jira/browse/SPARK-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611947#comment-14611947 ] Apache Spark commented on SPARK-8791: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/7189 Make a better hashcode for InternalRow -- Key: SPARK-8791 URL: https://issues.apache.org/jira/browse/SPARK-8791 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, InternalRow does not handle complex data types well when computing its hashCode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
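For context, the difficulty is that JVM arrays hash by identity, so a row holding an array, map, or nested structure needs its field contents hashed recursively. The following Scala snippet is only a sketch of that idea and is not Spark's InternalRow code; it combines element hashes with scala.util.hashing.MurmurHash3, and the sample "rows" are plain Seqs standing in for real rows.
{code}
// Sketch only: recursive, content-based hashing for fields that may contain
// arrays, sequences, or maps. Not Spark's actual InternalRow implementation.
import scala.util.hashing.MurmurHash3

object RowHashSketch {
  // Hash one field value, descending into complex types so structurally equal
  // values produce the same hash (Array.hashCode alone is identity-based).
  def hashValue(value: Any): Int = value match {
    case null         => 0
    case a: Array[_]  => MurmurHash3.orderedHash(a.toSeq.map(hashValue))
    case s: Seq[_]    => MurmurHash3.orderedHash(s.map(hashValue))
    case m: Map[_, _] => MurmurHash3.unorderedHash(m.toSeq.map { case (k, v) => (hashValue(k), hashValue(v)) })
    case other        => other.hashCode()
  }

  // Hash a whole "row" (here just a Seq of field values) by mixing field hashes in order.
  def hashRow(fields: Seq[Any]): Int = MurmurHash3.orderedHash(fields.map(hashValue))

  def main(args: Array[String]): Unit = {
    val r1 = Seq(1, Array("a", "b"), Map("k" -> 2))
    val r2 = Seq(1, Array("a", "b"), Map("k" -> 2))
    // Structurally equal rows hash the same, even though the embedded arrays
    // are different object instances.
    println(hashRow(r1) == hashRow(r2)) // true
  }
}
{code}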
[jira] [Assigned] (SPARK-8791) Make a better hashcode for InternalRow
[ https://issues.apache.org/jira/browse/SPARK-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8791: --- Assignee: (was: Apache Spark) Make a better hashcode for InternalRow -- Key: SPARK-8791 URL: https://issues.apache.org/jira/browse/SPARK-8791 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, InternalRow does not handle complex data types well when computing its hashCode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8792) Add Python API for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8792: --- Assignee: (was: Apache Spark) Add Python API for PCA transformer -- Key: SPARK-8792 URL: https://issues.apache.org/jira/browse/SPARK-8792 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611832#comment-14611832 ] Vincent Warmerdam commented on SPARK-8596: -- By the way, I now have scripts that install RStudio (just ran and confirmed). The code is here: https://github.com/koaning/spark-ec2/tree/rstudio-install https://github.com/koaning/spark/tree/rstudio-install When initializing with this command: ./spark-ec2 --key-pair=spark-df --identity-file=/Users/code/Downloads/spark-df.pem --region=eu-west-1 -s 1 --instance-type=c3.2xlarge --spark-ec2-git-repo=https://github.com/koaning/spark-ec2 --spark-ec2-git-branch=rstudio-install launch mysparkr I can confirm that RStudio is installed and that a correct user is added. There are two concerns: - Should we not force the user to supply the password themselves? Setting a standard password seems like a security vulnerability. - I am not sure if this gets installed on all the slave nodes. I added this module (https://github.com/koaning/spark-ec2/blob/rstudio-install/rstudio/init.sh) and we only need it on the master node. I wonder what the best way is to ensure this. Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman This will make it convenient for R users to use SparkR from their browsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611832#comment-14611832 ] Vincent Warmerdam edited comment on SPARK-8596 at 7/2/15 11:55 AM: --- By the way, I now have scripts that install RStudio (just ran and confirmed). The code is here: https://github.com/koaning/spark-ec2/tree/rstudio-install (added rstudio as a module) https://github.com/koaning/spark/tree/rstudio-install When initializing with this command: ./spark-ec2 --key-pair=spark-df --identity-file=/Users/code/Downloads/spark-df.pem --region=eu-west-1 -s 1 --instance-type=c3.2xlarge --spark-ec2-git-repo=https://github.com/koaning/spark-ec2 --spark-ec2-git-branch=rstudio-install launch mysparkr I can confirm that RStudio is installed and that a correct user is added. There are two concerns: - Should we not force the user to supply the password themselves? Setting a standard password seems like a security vulnerability. - I am not sure if this gets installed on all the slave nodes. I added this module (https://github.com/koaning/spark-ec2/blob/rstudio-install/rstudio/init.sh) and we only need it on the master node. I wonder what the best way is to ensure this. was (Author: cantdutchthis): By the way, I now have scripts that install RStudio (just ran and confirmed). The code is here: https://github.com/koaning/spark-ec2/tree/rstudio-install https://github.com/koaning/spark/tree/rstudio-install When initializing with this command: ./spark-ec2 --key-pair=spark-df --identity-file=/Users/code/Downloads/spark-df.pem --region=eu-west-1 -s 1 --instance-type=c3.2xlarge --spark-ec2-git-repo=https://github.com/koaning/spark-ec2 --spark-ec2-git-branch=rstudio-install launch mysparkr I can confirm that RStudio is installed and that a correct user is added. There are two concerns: - Should we not force the user to supply the password themselves? Setting a standard password seems like a security vulnerability. - I am not sure if this gets installed on all the slave nodes. I added this module (https://github.com/koaning/spark-ec2/blob/rstudio-install/rstudio/init.sh) and we only need it on the master node. I wonder what the best way is to ensure this. Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman This will make it convenient for R users to use SparkR from their browsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2319) Number of tasks on executors becomes negative after executor failures
[ https://issues.apache.org/jira/browse/SPARK-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-2319: - Attachment: active tasks.png Using 1.4, the number of active tasks becomes negative, and Complete Tasks is larger than Total Tasks. Number of tasks on executors becomes negative after executor failures Key: SPARK-2319 URL: https://issues.apache.org/jira/browse/SPARK-2319 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Andrew Or Fix For: 1.4.0 Attachments: active tasks.png, num active tasks become negative (-16).jpg See attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8789) improve SQLQuerySuite resilience by dropping tables in setup
[ https://issues.apache.org/jira/browse/SPARK-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611871#comment-14611871 ] Apache Spark commented on SPARK-8789: - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/7188 improve SQLQuerySuite resilience by dropping tables in setup Key: SPARK-8789 URL: https://issues.apache.org/jira/browse/SPARK-8789 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Steve Loughran Priority: Minor When some of the tests in {{SQLQuerySuite}} have problems, follow-up test runs fail because the tables are still present. This can be addressed by dropping the tables at startup and by adding try/finally clauses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
[ https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8746: - Assignee: Christian Kadner Need to update download link for Hive 0.13.1 jars (HiveComparisonTest) -- Key: SPARK-8746 URL: https://issues.apache.org/jira/browse/SPARK-8746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Christian Kadner Assignee: Christian Kadner Priority: Trivial Labels: documentation, test Fix For: 1.5.0, 1.4.2 Original Estimate: 1h Remaining Estimate: 1h The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) describes how to generate golden answer files for new Hive comparison test cases. However, the download link for the Hive 0.13.1 jars points to https://hive.apache.org/downloads.html, and none of the linked mirror sites still has the 0.13.1 version. We need to update the link to https://archive.apache.org/dist/hive/hive-0.13.1/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
[ https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8746. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.2 Issue resolved by pull request 7144 [https://github.com/apache/spark/pull/7144] Need to update download link for Hive 0.13.1 jars (HiveComparisonTest) -- Key: SPARK-8746 URL: https://issues.apache.org/jira/browse/SPARK-8746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Christian Kadner Priority: Trivial Labels: documentation, test Fix For: 1.4.2, 1.5.0 Original Estimate: 1h Remaining Estimate: 1h The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) describes how to generate golden answer files for new Hive comparison test cases. However, the download link for the Hive 0.13.1 jars points to https://hive.apache.org/downloads.html, and none of the linked mirror sites still has the 0.13.1 version. We need to update the link to https://archive.apache.org/dist/hive/hive-0.13.1/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8790) BlockManager.reregister causes OOM
Patrick Liu created SPARK-8790: -- Summary: BlockManager.reregister causes OOM Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu We run Spark SQL 1.2.1 on YARN. A SQL query consists of 100 tasks; most of them finish in 10s, but one lasts for 16m. The web UI shows that the executor has been running GC for 15m until it OOMs. The log shows that the executor first tries to connect to the master to report a broadcast value, but the network is not available, so the executor cannot contact the master. The executor then loses its connection with the master, and the master requires the executor to re-register. When the executor reports all of its blocks (reportAllBlocks) to the master, the network is still unstable, so the requests sometimes time out. Finally, the executor OOMs. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8789) improve SQLQuerySuite resilience by dropping tables in setup
Steve Loughran created SPARK-8789: - Summary: improve SQLQuerySuite resilience by dropping tables in setup Key: SPARK-8789 URL: https://issues.apache.org/jira/browse/SPARK-8789 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Steve Loughran Priority: Minor When some of the tests in {{SQLQuerySuite}} have problems, follow-up test runs fail because the tables are still present. This can be addressed by dropping the tables at startup and by adding try/finally clauses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
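For illustration, here is a minimal, self-contained Scala sketch of the two clean-up measures the issue proposes: dropping possibly-leftover tables during setup, and wrapping each test's table usage in try/finally. It is not the actual Spark patch; the table names ("test", "jt") are made up, and the in-memory createTable/dropTableIfExists helpers stand in for sqlContext.sql("CREATE TABLE ...") and sqlContext.sql("DROP TABLE IF EXISTS ...") in the real suite.
{code}
// Sketch of the cleanup pattern: setup-time drops plus per-test try/finally.
object SqlQuerySuiteCleanupSketch {
  // In-memory stand-in for the SQL catalog used by the real test suite.
  private val existingTables = scala.collection.mutable.Set.empty[String]

  private def createTable(name: String): Unit = existingTables += name
  private def dropTableIfExists(name: String): Unit = existingTables -= name

  // Run `body`, then always drop `table`, even if `body` throws.
  private def withTable(table: String)(body: => Unit): Unit =
    try body
    finally dropTableIfExists(table)

  def main(args: Array[String]): Unit = {
    // Simulate a previous run that died mid-test and left a table behind.
    existingTables += "test"

    // Setup-time cleanup: drop anything a failed earlier run may have left.
    Seq("test", "jt").foreach(dropTableIfExists)

    // Per-test cleanup: the table is dropped even though the test body fails.
    try {
      withTable("test") {
        createTable("test")
        sys.error("simulated test failure")
      }
    } catch {
      case _: RuntimeException => () // normally reported by the test framework
    }

    assert(existingTables.isEmpty, "no tables should be left behind")
    println("cleanup pattern verified: no leftover tables")
  }
}
{code}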
[jira] [Assigned] (SPARK-8791) Make a better hashcode for InternalRow
[ https://issues.apache.org/jira/browse/SPARK-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8791: --- Assignee: Apache Spark Make a better hashcode for InternalRow -- Key: SPARK-8791 URL: https://issues.apache.org/jira/browse/SPARK-8791 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Apache Spark Priority: Minor Currently, InternalRow does not handle complex data types well when computing its hashCode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8792) Add Python API for PCA transformer
Yanbo Liang created SPARK-8792: -- Summary: Add Python API for PCA transformer Key: SPARK-8792 URL: https://issues.apache.org/jira/browse/SPARK-8792 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
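For reference, the Scala ml.feature.PCA transformer that the proposed Python API would mirror is used roughly as below. This is a sketch against the 1.5-era Scala API, not the Python change itself; the sample vectors and column names are made up, and the Python wrapper would presumably expose the same k / inputCol / outputCol parameters.
{code}
// Sketch: fit a PCA model on a vector column and project it to k components.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

object PcaUsageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pca-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Toy input: three 5-dimensional vectors in a "features" column.
    val data = Seq(
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      Vectors.dense(1.0, 2.0, 0.0, 3.0, 9.0))
    val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")

    val model = new PCA()
      .setInputCol("features")     // column holding the input vectors
      .setOutputCol("pcaFeatures") // column that will hold the projected vectors
      .setK(3)                     // keep the top 3 principal components
      .fit(df)

    model.transform(df).select("pcaFeatures").show()
    sc.stop()
  }
}
{code}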
[jira] [Commented] (SPARK-8792) Add Python API for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612002#comment-14612002 ] Apache Spark commented on SPARK-8792: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7190 Add Python API for PCA transformer -- Key: SPARK-8792 URL: https://issues.apache.org/jira/browse/SPARK-8792 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8792) Add Python API for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8792: --- Assignee: Apache Spark Add Python API for PCA transformer -- Key: SPARK-8792 URL: https://issues.apache.org/jira/browse/SPARK-8792 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Assignee: Apache Spark Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8793) error/warning with pyspark WholeTextFiles.first
Diana Carroll created SPARK-8793: Summary: error/warning with pyspark WholeTextFiles.first Key: SPARK-8793 URL: https://issues.apache.org/jira/browse/SPARK-8793 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Diana Carroll Priority: Minor Attachments: wholefilesbug.txt In Spark 1.3.0, calling first() on sc.wholeTextFiles() does not work correctly in PySpark. It works fine in Scala. I created a directory with two tiny, simple text files. This works: {code}sc.wholeTextFiles("testdata").collect(){code} This doesn't: {code}sc.wholeTextFiles("testdata").first(){code} The main error message is: {code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in stage 12.0 (TID 12) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main process() File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft while taken < left: ImportError: No module named iter {code} I will attach the full stack trace to the JIRA. I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0). Tested in both Python 2.6 and 2.7, same results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
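For comparison, the Scala equivalent that the reporter says works fine looks roughly like the sketch below: wholeTextFiles returns an RDD of (fileName, fileContent) pairs, so first() yields one such pair. The "testdata" path is just the reporter's example directory, not a real dataset.
{code}
// Sketch of the working Scala counterpart to the failing PySpark calls.
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("whole-text-files").setMaster("local[2]"))

    val files = sc.wholeTextFiles("testdata")   // RDD[(String, String)]: (path, content)
    println(files.collect().length)             // number of files in the directory
    val (firstPath, firstContent) = files.first()
    println(s"$firstPath -> ${firstContent.take(40)}")

    sc.stop()
  }
}
{code}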
[jira] [Updated] (SPARK-8793) error/warning with pyspark WholeTextFiles.first
[ https://issues.apache.org/jira/browse/SPARK-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diana Carroll updated SPARK-8793: - Attachment: wholefilesbug.txt error/warning with pyspark WholeTextFiles.first --- Key: SPARK-8793 URL: https://issues.apache.org/jira/browse/SPARK-8793 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Diana Carroll Priority: Minor Attachments: wholefilesbug.txt In Spark 1.3.0, calling first() on sc.wholeTextFiles() does not work correctly in PySpark. It works fine in Scala. I created a directory with two tiny, simple text files. This works: {code}sc.wholeTextFiles("testdata").collect(){code} This doesn't: {code}sc.wholeTextFiles("testdata").first(){code} The main error message is: {code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in stage 12.0 (TID 12) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main process() File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft while taken < left: ImportError: No module named iter {code} I will attach the full stack trace to the JIRA. I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0). Tested in both Python 2.6 and 2.7, same results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org