[jira] [Commented] (SPARK-1638) Executors fail to come up if spark.executor.extraJavaOptions is set
[ https://issues.apache.org/jira/browse/SPARK-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983156#comment-13983156 ]

Sean Owen commented on SPARK-1638:
----------------------------------

Almost certainly a duplicate of https://issues.apache.org/jira/browse/SPARK-1609

> Executors fail to come up if spark.executor.extraJavaOptions is set
> -------------------------------------------------------------------
>
>                 Key: SPARK-1638
>                 URL: https://issues.apache.org/jira/browse/SPARK-1638
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>         Environment: Bring up a cluster in EC2 using spark-ec2 scripts
>            Reporter: Kalpit Shah
>             Fix For: 1.0.0
>
> If you try to launch a PySpark shell with spark.executor.extraJavaOptions set to -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, the executors never come up on any of the workers. I see the following error in the log file:
>
> Spark Executor Command: /usr/lib/jvm/java/bin/java -cp /root/c3/lib/*::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar: -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xms13312M -Xmx13312M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@HOSTNAME:45429/user/CoarseGrainedScheduler 7 HOSTNAME 4 akka.tcp://sparkWorker@HOSTNAME:39727/user/Worker app-20140423224526-
>
> Unrecognized VM option 'UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
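The "Unrecognized VM option" line suggests the whole extraJavaOptions string was handed to the JVM as one token, so only the first "-XX:+" prefix was consumed and the rest was swallowed into that single option. A minimal sketch of the required behavior (illustrative only, not the actual Spark fix): split the option string on whitespace into separate JVM arguments before building the executor command.

```scala
// Hypothetical sketch: the value of spark.executor.extraJavaOptions must be
// tokenized on whitespace into individual JVM arguments; passing it through
// as a single token makes the JVM see one giant "-XX:+..." option and abort.
object SplitJavaOpts {
  def split(opts: String): Seq[String] =
    opts.trim.split("\\s+").toSeq.filter(_.nonEmpty)
}
```

For example, `SplitJavaOpts.split("-XX:+UseCompressedOops -verbose:gc")` yields the two tokens `-XX:+UseCompressedOops` and `-verbose:gc`, each of which can then be appended to the command line separately. (Real shells and process builders also have to cope with quoted arguments containing spaces, which a plain whitespace split does not handle.)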
[jira] [Commented] (SPARK-1638) Executors fail to come up if spark.executor.extraJavaOptions is set
[ https://issues.apache.org/jira/browse/SPARK-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983168#comment-13983168 ]

Kalpit Shah commented on SPARK-1638:
------------------------------------

Yeah, it's very likely. I am going to pull the latest master and retest the fix for SPARK-1609 today. I will close this ticket after validation.
[jira] [Created] (SPARK-1658) Correctly identify if maven is installed and working
Rahul Singhal created SPARK-1658:
---------------------------------

             Summary: Correctly identify if maven is installed and working
                 Key: SPARK-1658
                 URL: https://issues.apache.org/jira/browse/SPARK-1658
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 1.0.0
            Reporter: Rahul Singhal
            Priority: Trivial

The current test in make-distribution.sh to identify whether maven is installed is incorrect: because the output of mvn is piped through tail, the exit code that ends up being checked is tail's rather than mvn's.
[jira] [Commented] (SPARK-1659) improvements spark-submit usage
[ https://issues.apache.org/jira/browse/SPARK-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983197#comment-13983197 ]

Guoqiang Li commented on SPARK-1659:
------------------------------------

{code}
./bin/spark-submit /opt/spark/classes/toona-assembly-1.0.0-SNAPSHOT.jar --verbose --master spark://spark:7077 --deploy-mode client --class com.zhe800.toona.als.computation.DealCF 20140425
{code}
outputs:
{code}
Using properties file: /opt/spark/spark-1.0.0-cdh3/conf/spark-defaults.conf
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.akka.askTimeout=120
Adding default property: spark.default.parallelism=32
Adding default property: spark.executor.extraJavaOptions=-Xss5m -server -XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent -XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:PermSize=150M -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=150M
Adding default property: spark.ui.killEnabled=false
Adding default property: spark.storage.memoryFraction=0.7
Adding default property: spark.locality.wait=1
Adding default property: spark.executor.memory=13g
Adding default property: spark.master=spark://spark:7077
Adding default property: spark.storage.blockManagerTimeoutIntervalMs=600
Adding default property: spark.akka.timeout=120
Adding default property: spark.akka.frameSize=1600
Adding default property: spark.broadcast.blockSize=4096
Adding default property: spark.eventLog.dir=/opt/spark/logs/
Adding default property: spark.driver.extraJavaOptions=-Xss5m -server -XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent -XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:PermSize=150M -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=150M
Using properties file: /opt/spark/spark-1.0.0-cdh3/conf/spark-defaults.conf
[... the same "Using properties file" / "Adding default property" block is repeated verbatim a second time ...]
{code}

The "Using properties file" block is printed twice.

> improvements spark-submit usage
> -------------------------------
>
>                 Key: SPARK-1659
>                 URL: https://issues.apache.org/jira/browse/SPARK-1659
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, YARN
>            Reporter: Guoqiang Li
>             Fix For: 1.0.0
>
> Delete spark-submit obsolete usage: --arg ARG
[jira] [Commented] (SPARK-1649) DataType should contain nullable bit
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983318#comment-13983318 ]

Michael Armbrust commented on SPARK-1649:
-----------------------------------------

Why do you think it would be better to have the nullability bit in the data type? Both attribute references and struct fields already have a nullable bit, so we can always describe whether or not a given attribute can be null. Right now we use primitive datatypes mostly as enums, so adding this bit to them would mean that everywhere we pattern match on a datatype we would need to include a wildcard for nullability. This would also require a pretty big change to all expressions, since right now we determine nullability propagation independent of datatype.

> DataType should contain nullable bit
> ------------------------------------
>
>                 Key: SPARK-1649
>                 URL: https://issues.apache.org/jira/browse/SPARK-1649
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Andre Schumacher
>            Priority: Critical
>
> For the underlying storage layer it would simplify things such as schema conversions, predicate filter determination and such to record in the data type itself whether a column can be nullable. So the DataType type could look like this:
>
> abstract class DataType(nullable: Boolean = true)
>
> Concrete subclasses could then override the nullable val. Mostly this could be left as the default, but when types can be contained in nested types one could optimize for, e.g., arrays with elements that are nullable and those that are not.
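A toy sketch of the pattern-matching concern raised above (class names are illustrative, not Catalyst's actual types): once nullability lives in the type's constructor, the enum-style case objects become case classes, and every match must carry a wildcard or binding for the nullable flag.

```scala
// Hypothetical sketch: moving nullability into DataType (not actual Catalyst code).
abstract class DataType(val nullable: Boolean = true)
case class IntType(override val nullable: Boolean = true) extends DataType(nullable)
case class StringType(override val nullable: Boolean = true) extends DataType(nullable)

object DataTypeDemo {
  def describe(dt: DataType): String = dt match {
    // Each case now needs a wildcard for the nullable flag, where previously
    // `case IntType => ...` against a case object would have sufficed.
    case IntType(_)    => "int"
    case StringType(_) => "string"
  }
}
```

This is the cost Michael describes: every existing `case IntType =>`-style match across the codebase would have to be rewritten, even where nullability is irrelevant.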
[jira] [Commented] (SPARK-975) Spark Replay Debugger
[ https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983603#comment-13983603 ]

Kousuke Saruta commented on SPARK-975:
--------------------------------------

Hi [~lian cheng]. Are there any updates on this issue?

> Spark Replay Debugger
> ---------------------
>
>                 Key: SPARK-975
>                 URL: https://issues.apache.org/jira/browse/SPARK-975
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 0.9.0
>            Reporter: Cheng Lian
>              Labels: arthur, debugger
>
> The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf]. [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur Dave|https://github.com/ankurdave], is an old implementation of the Spark debugger which demonstrated both the elegance and power behind the RDD abstraction. Unfortunately, the corresponding GitHub branch was never merged into the master branch, and development stopped two years ago. For more information about Arthur, please refer to [the Spark Debugger wiki page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub repository.
>
> As a useful tool for Spark application debugging and analysis, it would be nice to have a complete Spark debugger. In [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new implementation of the Spark debugger, the Spark Replay Debugger (SRD). [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview for discussion. In the current version, I have only implemented features that illustrate the basic mechanisms. There are still features that appeared in Arthur but are missing in SRD, such as checksum-based nondeterminism detection and single-task debugging with a conventional debugger (like {{jdb}}). However, these features can easily be built on the current SRD framework. To minimize code-review effort, I intentionally left them out of the current version.
>
> Attached is the visualization of the MLlib ALS application (with 1 iteration) generated by SRD. For more information, please refer to [the SRD overview document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/].
[jira] [Commented] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983721#comment-13983721 ]

Tathagata Das commented on SPARK-944:
-------------------------------------

It would be great if you could submit an example soon, and even better if we can make it into Spark 1.0 ;)

> Give example of writing to HBase from Spark Streaming
> -----------------------------------------------------
>
>                 Key: SPARK-944
>                 URL: https://issues.apache.org/jira/browse/SPARK-944
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Patrick Wendell
>            Assignee: Patrick Cogan
>             Fix For: 1.0.0
[jira] [Created] (SPARK-1660) Centralize the definition of property names and default values
Kousuke Saruta created SPARK-1660:
----------------------------------

             Summary: Centralize the definition of property names and default values
                 Key: SPARK-1660
                 URL: https://issues.apache.org/jira/browse/SPARK-1660
             Project: Spark
          Issue Type: Improvement
    Affects Versions: 1.0.0
            Reporter: Kousuke Saruta

Property names and default values are defined in many places throughout the code. Let's consolidate them and clean this up.
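One common shape for such a cleanup (a hedged sketch only; the object and field names below are illustrative, not Spark's actual API): gather every key and its default into a single constants object, and route lookups through one helper so call sites stop re-typing string literals.

```scala
// Hypothetical sketch of centralizing property keys and defaults.
object SparkConfConstants {
  val ExecutorMemoryKey     = "spark.executor.memory"
  val ExecutorMemoryDefault = "512m"
  val AkkaAskTimeoutKey     = "spark.akka.askTimeout"
  val AkkaAskTimeoutDefault = "30"

  // Callers look properties up through one helper instead of repeating
  // the string key and its default at every call site.
  def getProperty(conf: Map[String, String], key: String, default: String): String =
    conf.getOrElse(key, default)
}
```

With this layout, a typo in a key or a mismatched default becomes a compile-time reference error at the call site rather than a silently divergent string.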
[jira] [Resolved] (SPARK-1652) Fixes and improvements for spark-submit/configs
[ https://issues.apache.org/jira/browse/SPARK-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-1652.
------------------------------------
    Resolution: Fixed

> Fixes and improvements for spark-submit/configs
> -----------------------------------------------
>
>                 Key: SPARK-1652
>                 URL: https://issues.apache.org/jira/browse/SPARK-1652
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>            Reporter: Patrick Wendell
>            Assignee: Patrick Wendell
>            Priority: Blocker
>             Fix For: 1.0.0
>
> These are almost all a result of my config patch. Unfortunately the changes were difficult to unit-test, and there were several edge cases reported.
[jira] [Resolved] (SPARK-1657) Spark submit should fail gracefully if YARN support not enabled
[ https://issues.apache.org/jira/browse/SPARK-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-1657.
------------------------------------
    Resolution: Fixed

> Spark submit should fail gracefully if YARN support not enabled
> ---------------------------------------------------------------
>
>                 Key: SPARK-1657
>                 URL: https://issues.apache.org/jira/browse/SPARK-1657
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, YARN
>            Reporter: Patrick Wendell
>            Assignee: Patrick Wendell
>            Priority: Blocker
>             Fix For: 1.0.0
>
> Currently it throws a ClassNotFoundException when trying to reflectively load the class. We should check whether the YARN Client class is loadable and throw a nicer exception if it is not found.
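A minimal, self-contained sketch of the "check loadability, throw a nicer exception" pattern the ticket describes (the method name and message wording are illustrative, not the actual spark-submit code):

```scala
// Hypothetical sketch: probe for a class before using it reflectively, and
// translate the raw ClassNotFoundException into an actionable error message.
object YarnSupportCheck {
  def loadYarnClientClass(className: String): Class[_] =
    try Class.forName(className)
    catch {
      case _: ClassNotFoundException =>
        throw new IllegalStateException(
          s"YARN support not enabled: could not load $className. " +
          "Rebuild Spark with YARN support to use this deploy mode.")
    }
}
```

The point of the wrapper is that the user sees a message naming the missing feature and the remedy, instead of a bare stack trace from the reflection machinery.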
[jira] [Resolved] (SPARK-1625) Ensure all legacy YARN options are supported with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-1625.
------------------------------------
    Resolution: Duplicate

> Ensure all legacy YARN options are supported with spark-submit
> --------------------------------------------------------------
>
>                 Key: SPARK-1625
>                 URL: https://issues.apache.org/jira/browse/SPARK-1625
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, YARN
>            Reporter: Patrick Wendell
>            Assignee: Patrick Wendell
>            Priority: Blocker
>             Fix For: 1.0.0
[jira] [Updated] (SPARK-1549) Add python support to spark-submit script
[ https://issues.apache.org/jira/browse/SPARK-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1549:
-----------------------------------
    Priority: Blocker  (was: Major)

> Add python support to spark-submit script
> -----------------------------------------
>
>                 Key: SPARK-1549
>                 URL: https://issues.apache.org/jira/browse/SPARK-1549
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Priority: Blocker
>             Fix For: 1.0.0
[jira] [Commented] (SPARK-1649) DataType should contain nullable bit
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983963#comment-13983963 ]

Andre Schumacher commented on SPARK-1649:
-----------------------------------------

OK, I now understand that this would be a bigger change. It's not just struct fields for nested types, but also array element types, map value types, etc. IMHO it would be cleaner to have it inside the DataType. But since this seems to be mostly relevant for nested types, could one have a special DataType for them, something like NestedDataType(val nullable: Boolean) extends DataType?
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983971#comment-13983971 ]

Hari Shreedharan commented on SPARK-1645:
-----------------------------------------

Yep, that is correct. I'd like to contribute to the design as much as possible, so perhaps we can work on the design document together. Once we start looking into this, we will definitely have to proceed on multiple fronts so we can get more of these features committed faster.

> Improve Spark Streaming compatibility with Flume
> ------------------------------------------------
>
>                 Key: SPARK-1645
>                 URL: https://issues.apache.org/jira/browse/SPARK-1645
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>            Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, else Flume cannot send data to it. We can fix this by adding a Flume receiver that polls Flume, and a Flume sink that supports this.
> * The receiver sends acks to Flume before the driver knows about the data. The new receiver should also handle this case.
> * Data loss when the driver goes down - this is true for any streaming ingest, not just Flume. I will file a separate jira for this and we should work on it there. This is a longer-term project and requires considerable development work.
>
> I intend to start working on these soon. Any input is appreciated. (It'd be great if someone could add me as a contributor on jira, so I can assign this jira to myself.)
[jira] [Commented] (SPARK-1649) DataType should contain nullable bit
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983985#comment-13983985 ]

Andre Schumacher commented on SPARK-1649:
-----------------------------------------

Thinking about it a bit longer... could Nullable maybe be a mixin? But what should the default be, nullable or not nullable?
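One way to read the mixin idea (purely illustrative; none of these names exist in Catalyst): nullability as a stacked trait, so leaf types stay plain enum-like objects and only the types that care about the flag opt in. The default-direction question then becomes which trait the base mixin answers.

```scala
// Hypothetical sketch of nullability as a mixin rather than a constructor bit.
abstract class DataType
trait Nullable { def nullable: Boolean = true }          // default: nullable
trait NotNullable extends Nullable { override def nullable: Boolean = false }

// Only container types carry the flag; leaf types remain plain case objects
// and keep their cheap enum-style pattern matching.
case class ArrayType(elementType: DataType) extends DataType with Nullable
case class NonNullArrayType(elementType: DataType) extends DataType with NotNullable
case object IntegerType extends DataType
```

This keeps Michael's pattern-matching objection at bay for leaf types, at the cost of duplicating container types (or parameterizing them) whenever both nullabilities are needed.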