[jira] [Commented] (SPARK-1638) Executors fail to come up if spark.executor.extraJavaOptions is set

2014-04-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983156#comment-13983156
 ] 

Sean Owen commented on SPARK-1638:
--

Almost certainly a duplicate of https://issues.apache.org/jira/browse/SPARK-1609

 Executors fail to come up if spark.executor.extraJavaOptions is set 
 --

 Key: SPARK-1638
 URL: https://issues.apache.org/jira/browse/SPARK-1638
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
 Environment: Bring up a cluster in EC2 using spark-ec2 scripts
Reporter: Kalpit Shah
 Fix For: 1.0.0


 If you try to launch a PySpark shell with spark.executor.extraJavaOptions 
 set to -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc 
 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, the executors never come up on 
 any of the workers.
 I see the following error in the log file:
 Spark Executor Command: /usr/lib/jvm/java/bin/java -cp 
 /root/c3/lib/*::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar:
  -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc 
 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xms13312M -Xmx13312M 
 org.apache.spark.executor.CoarseGrainedExecutorBackend 
 akka.tcp://spark@HOSTNAME:45429/user/CoarseGrainedScheduler 7 HOSTNAME 
 4 akka.tcp://sparkWorker@HOSTNAME:39727/user/Worker 
 app-20140423224526-
 
 Unrecognized VM option 'UseCompressedOops -XX:+UseCompressedStrings 
 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'
 Error: Could not create the Java Virtual Machine.
 Error: A fatal exception has occurred. Program will exit.
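For context: the "Unrecognized VM option" line shows the whole option string reaching the JVM as a single argument; only the first {{-XX:+}} prefix is parsed and the rest is glued into one "option". A minimal Scala sketch of the broken vs. working argument construction (not Spark's actual launch code):

{code}
// Minimal sketch (not Spark's actual launch code) of why the executor dies.
val extraJavaOpts = "-XX:+UseCompressedOops -XX:+UseCompressedStrings " +
  "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

// Broken: the whole string becomes ONE argv entry, so the JVM reports
// Unrecognized VM option 'UseCompressedOops -XX:+UseCompressedStrings ...'
val brokenCommand = Seq("java", extraJavaOpts, "MainClass")

// Working: tokenize first so each flag is its own argv entry. A real
// implementation should also honor quoting inside the option string.
val javaOptTokens = extraJavaOpts.split("\\s+").filter(_.nonEmpty).toSeq
val workingCommand = Seq("java") ++ javaOptTokens ++ Seq("MainClass")
{code}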
  





[jira] [Commented] (SPARK-1638) Executors fail to come up if spark.executor.extraJavaOptions is set

2014-04-28 Thread Kalpit Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983168#comment-13983168
 ] 

Kalpit Shah commented on SPARK-1638:


Yeah, it's very likely. I am going to pull the latest master and retest the fix 
for SPARK-1609 today. I will close this ticket after validation.






[jira] [Created] (SPARK-1658) Correctly identify if maven is installed and working

2014-04-28 Thread Rahul Singhal (JIRA)
Rahul Singhal created SPARK-1658:


 Summary: Correctly identify if maven is installed and working
 Key: SPARK-1658
 URL: https://issues.apache.org/jira/browse/SPARK-1658
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Rahul Singhal
Priority: Trivial


The current test in make-distribution.sh that checks whether Maven is installed 
is incorrect: because the mvn invocation is piped through {{tail}}, the exit code 
being checked is {{tail}}'s rather than {{mvn}}'s. (In bash, {{$?}} after a 
pipeline reflects only the last command; {{PIPESTATUS}} exposes the earlier ones.)





[jira] [Commented] (SPARK-1659) improvements spark-submit usage

2014-04-28 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983197#comment-13983197
 ] 

Guoqiang Li commented on SPARK-1659:


Running {{./bin/spark-submit /opt/spark/classes/toona-assembly-1.0.0-SNAPSHOT.jar 
--verbose --master spark://spark:7077 --deploy-mode client --class 
com.zhe800.toona.als.computation.DealCF 20140425}} prints:
{code}
Using properties file: /opt/spark/spark-1.0.0-cdh3/conf/spark-defaults.conf
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.akka.askTimeout=120
Adding default property: spark.default.parallelism=32
Adding default property: spark.executor.extraJavaOptions=-Xss5m -server 
-XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent 
-XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:PermSize=150M 
-XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=150M
Adding default property: spark.ui.killEnabled=false
Adding default property: spark.storage.memoryFraction=0.7
Adding default property: spark.locality.wait=1
Adding default property: spark.executor.memory=13g
Adding default property: spark.master=spark://spark:7077
Adding default property: spark.storage.blockManagerTimeoutIntervalMs=600
Adding default property: spark.akka.timeout=120
Adding default property: spark.akka.frameSize=1600
Adding default property: spark.broadcast.blockSize=4096
Adding default property: spark.eventLog.dir=/opt/spark/logs/
Adding default property: spark.driver.extraJavaOptions=-Xss5m -server 
-XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent 
-XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:PermSize=150M 
-XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=150M
Using properties file: /opt/spark/spark-1.0.0-cdh3/conf/spark-defaults.conf
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.akka.askTimeout=120
Adding default property: spark.default.parallelism=32
Adding default property: spark.executor.extraJavaOptions=-Xss5m -server 
-XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent 
-XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:PermSize=150M 
-XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=150M
Adding default property: spark.ui.killEnabled=false
Adding default property: spark.storage.memoryFraction=0.7
Adding default property: spark.locality.wait=1
Adding default property: spark.executor.memory=13g
Adding default property: spark.master=spark://spark:7077
Adding default property: spark.storage.blockManagerTimeoutIntervalMs=600
Adding default property: spark.akka.timeout=120
Adding default property: spark.akka.frameSize=1600
Adding default property: spark.broadcast.blockSize=4096
Adding default property: spark.eventLog.dir=/opt/spark/logs/
Adding default property: spark.driver.extraJavaOptions=-Xss5m -server 
-XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent 
-XX:+CMSClassUnloadingEnabled -XX:+AggressiveOpts -XX:PermSize=150M 
-XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=150M
{code}
The block of default properties above is printed twice.
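For reference, a minimal sketch of loading the defaults file once and logging each property a single time (simplified; not SparkSubmit's actual argument-handling code, and the conf path is illustrative):

{code}
import java.io.FileInputStream
import java.util.Properties

// Simplified sketch, not SparkSubmit's actual code: load the defaults
// file exactly once (lazy val) and log each property a single time.
object DefaultsLoader {
  lazy val defaults: Map[String, String] = {
    val props = new Properties()
    val in = new FileInputStream("/opt/spark/conf/spark-defaults.conf")
    try props.load(in) finally in.close()
    val names = props.stringPropertyNames().toArray(Array.empty[String])
    names.map(k => k -> props.getProperty(k)).toMap
  }

  def printDefaults(): Unit =
    defaults.foreach { case (k, v) =>
      println(s"Adding default property: $k=$v")
    }
}
{code}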

 improvements spark-submit usage
 ---

 Key: SPARK-1659
 URL: https://issues.apache.org/jira/browse/SPARK-1659
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, YARN
Reporter: Guoqiang Li
 Fix For: 1.0.0


 Delete the obsolete spark-submit usage option: --arg ARG





[jira] [Commented] (SPARK-1649) DataType should contain nullable bit

2014-04-28 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983318#comment-13983318
 ] 

Michael Armbrust commented on SPARK-1649:
-

Why do you think it would be better to have the nullability bit in the data 
type?  Both attribute references and struct fields already have a nullable bit, 
so we can always describe whether or not a given attribute can be null.

Right now we use primitive datatypes mostly as enums, so adding this bit to 
them would mean that everywhere we pattern match on a datatype we would need to 
include a wildcard for nullability.  This would also require a pretty big 
change to all expressions, since right now we determine nullability propagation 
independently of datatype.
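To illustrate the pattern-matching concern with simplified stand-in types (not Catalyst's actual hierarchy):

{code}
// Simplified stand-ins, not Catalyst's actual hierarchy.
// Today primitive data types behave like enum values:
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType

def defaultSize(dt: DataType): Int = dt match {
  case IntegerType => 4
  case StringType  => 8
}

// With a nullable bit baked into the type, the singletons become case
// classes, and every existing match needs a wildcard for the bit:
sealed abstract class DataType2(val nullable: Boolean)
case class IntegerType2(override val nullable: Boolean = true) extends DataType2(nullable)
case class StringType2(override val nullable: Boolean = true) extends DataType2(nullable)

def defaultSize2(dt: DataType2): Int = dt match {
  case IntegerType2(_) => 4 // wildcard for nullability on every case
  case StringType2(_)  => 8
}
{code}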

 DataType should contain nullable bit
 

 Key: SPARK-1649
 URL: https://issues.apache.org/jira/browse/SPARK-1649
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Andre Schumacher
Priority: Critical

 For the underlying storage layer it would simplify things such as schema 
 conversions, predicate filter determination and so on to record in the data 
 type itself whether a column is nullable. So the DataType type could look 
 like this:
 abstract class DataType(nullable: Boolean = true)
 Concrete subclasses could then override the nullable val. Mostly this could 
 be left as the default but when types can be contained in nested types one 
 could optimize for, e.g., arrays with elements that are nullable and those 
 that are not.





[jira] [Commented] (SPARK-975) Spark Replay Debugger

2014-04-28 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983603#comment-13983603
 ] 

Kousuke Saruta commented on SPARK-975:
--

Hi [~lian cheng]. Are there any updates on this issue?

 Spark Replay Debugger
 -

 Key: SPARK-975
 URL: https://issues.apache.org/jira/browse/SPARK-975
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Cheng Lian
  Labels: arthur, debugger

 The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical 
 report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf].
 [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur 
 Dave|https://github.com/ankurdave], is an old implementation of the Spark 
 debugger, which demonstrated both the elegance and power behind the RDD 
 abstraction.  Unfortunately, the corresponding GitHub branch was never merged 
 into master, and development stopped two years ago.  For more information 
 about Arthur, please refer to [the Spark Debugger Wiki 
 page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub 
 repository.
 As a useful tool for Spark application debugging and analysis, it would be 
 nice to have a complete Spark debugger.  In 
 [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new 
 implementation of the Spark debugger, the Spark Replay Debugger (SRD).
 [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview 
 for discussion.  In the current version, I only implemented features that can 
 illustrate the basic mechanisms.  There are still features that appeared in 
 Arthur but are missing in SRD, such as checksum-based nondeterminism detection 
 and single-task debugging with a conventional debugger (like {{jdb}}).  However, 
 these features can easily be built on the current SRD framework.  To minimize 
 code review effort, I intentionally left them out of the current version.
 Attached is the visualization of the MLlib ALS application (with 1 iteration) 
 generated by SRD.  For more information, please refer to [the SRD overview 
 document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/].





[jira] [Commented] (SPARK-944) Give example of writing to HBase from Spark Streaming

2014-04-28 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983721#comment-13983721
 ] 

Tathagata Das commented on SPARK-944:
-

It would be great if you can submit an example soon, so that we can make it 
into Spark 1.0 ;)
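In the meantime, here is a rough sketch of the kind of example being requested, with hypothetical table and column names (the HBase calls are the standard client API of that era):

{code}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

// Rough sketch with hypothetical table/column names. The HTable is created
// inside foreachPartition because HBase connections are not serializable.
def saveToHBase(stream: DStream[(String, String)]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val table = new HTable(HBaseConfiguration.create(), "events")
      partition.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()
    }
  }
{code}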

 Give example of writing to HBase from Spark Streaming
 -

 Key: SPARK-944
 URL: https://issues.apache.org/jira/browse/SPARK-944
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Patrick Cogan
 Fix For: 1.0.0








[jira] [Created] (SPARK-1660) Centralize the definition of property names and default values

2014-04-28 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-1660:
-

 Summary: Centralize the definition of property names and default 
values
 Key: SPARK-1660
 URL: https://issues.apache.org/jira/browse/SPARK-1660
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Kousuke Saruta


There are many duplicated definitions of property names and default values in 
the code. Let's consolidate and clean them up.
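A minimal sketch of one possible shape for this, with hypothetical names:

{code}
// Hypothetical sketch: one place that owns each property name and default.
case class ConfigEntry[T](key: String, default: T)

object ConfigKeys {
  val ExecutorMemory = ConfigEntry("spark.executor.memory", "512m")
  val AkkaFrameSize  = ConfigEntry("spark.akka.frameSize", 10)
  val LocalityWait   = ConfigEntry("spark.locality.wait", 3000L)
}

// Callers go through the entry instead of retyping the string and default:
def getString(conf: Map[String, String], e: ConfigEntry[String]): String =
  conf.getOrElse(e.key, e.default)
{code}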





[jira] [Resolved] (SPARK-1652) Fixes and improvements for spark-submit/configs

2014-04-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1652.


Resolution: Fixed

 Fixes and improvements for spark-submit/configs
 ---

 Key: SPARK-1652
 URL: https://issues.apache.org/jira/browse/SPARK-1652
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 These are almost all a result of my config patch. Unfortunately, the changes 
 were difficult to unit-test, and several edge cases were reported.





[jira] [Resolved] (SPARK-1657) Spark submit should fail gracefully if YARN support not enabled

2014-04-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1657.


Resolution: Fixed

 Spark submit should fail gracefully if YARN support not enabled
 ---

 Key: SPARK-1657
 URL: https://issues.apache.org/jira/browse/SPARK-1657
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, YARN
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 Currently it throws a ClassNotFoundException when trying to reflectively load 
 the class. We should check whether the YARN Client class is loadable and throw 
 a nicer exception if it is not found.
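A minimal sketch of that check (the class name is the real one; the message wording is illustrative):

{code}
// Sketch: probe for the YARN Client class and fail with a clear message.
def loadYarnClient(): Class[_] =
  try {
    Class.forName("org.apache.spark.deploy.yarn.Client")
  } catch {
    case e: ClassNotFoundException =>
      throw new IllegalStateException(
        "Could not load YARN classes. This copy of Spark may not have been " +
        "compiled with YARN support.", e)
  }
{code}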





[jira] [Resolved] (SPARK-1625) Ensure all legacy YARN options are supported with spark-submit

2014-04-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1625.


Resolution: Duplicate

 Ensure all legacy YARN options are supported with spark-submit
 --

 Key: SPARK-1625
 URL: https://issues.apache.org/jira/browse/SPARK-1625
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0








[jira] [Updated] (SPARK-1549) Add python support to spark-submit script

2014-04-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1549:
---

Priority: Blocker  (was: Major)

 Add python support to spark-submit script
 -

 Key: SPARK-1549
 URL: https://issues.apache.org/jira/browse/SPARK-1549
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0








[jira] [Commented] (SPARK-1649) DataType should contain nullable bit

2014-04-28 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983963#comment-13983963
 ] 

Andre Schumacher commented on SPARK-1649:
-

OK, I now understand that this would be a bigger change.

It's not just struct fields: for nested types there are also array element 
types, map value types, etc. IMHO it would be cleaner to have it inside the 
DataType. But since this seems to be mostly relevant for nested types, could 
one have a special DataType for them, something like NestedDataType(val 
nullable: Boolean) extends DataType?
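A quick sketch of that suggestion with hypothetical names, where only container types carry the element-nullability bit:

{code}
// Hypothetical sketch of the NestedDataType idea.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType

sealed abstract class NestedDataType(val nullable: Boolean) extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean)
  extends NestedDataType(containsNull)
case class MapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean)
  extends NestedDataType(valueContainsNull)
{code}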






[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-28 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983971#comment-13983971
 ] 

Hari Shreedharan commented on SPARK-1645:
-

Yep, that is correct. I'd like to contribute to the design as much as 
possible, so perhaps we can work on the design document together. Once we start 
looking into this, we will definitely have to proceed on multiple fronts so we 
can get more of these features committed faster.

 Improve Spark Streaming compatibility with Flume
 

 Key: SPARK-1645
 URL: https://issues.apache.org/jira/browse/SPARK-1645
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Currently the following issues affect Spark Streaming and Flume compatibility:
 * If a Spark worker goes down, it needs to be restarted on the same node, 
 else Flume cannot send data to it. We can fix this by adding a Flume receiver 
 that polls Flume, and a Flume sink that supports this (see the sketch below).
 * The receiver sends acks to Flume before the driver knows about the data. The 
 new receiver should also handle this case.
 * Data loss when driver goes down - This is true for any streaming ingest, 
 not just Flume. I will file a separate jira for this and we should work on it 
 there. This is a longer term project and requires considerable development 
 work.
 I intend to start working on these soon. Any input is appreciated. (It'd be 
 great if someone can add me as a contributor on jira, so I can assign the 
 jira to myself).
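A rough sketch of the polling-receiver idea (the Receiver API is Spark Streaming's; {{pollSink}}/{{ackSink}} are hypothetical stand-ins for the eventual sink protocol):

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Rough sketch: a receiver that pulls batches from a Flume sink and acks
// only after Spark has stored the data. pollSink/ackSink are hypothetical.
class PollingFlumeReceiver(host: String, port: Int)
  extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK_SER) {

  def onStart(): Unit = new Thread("flume-poller") {
    override def run(): Unit = while (!isStopped()) {
      val batch = pollSink(host, port)   // pull a batch from the sink
      batch.foreach(b => store(b))       // hand records to Spark first...
      ackSink(host, port)                // ...then ack, so a crash replays
    }
  }.start()

  def onStop(): Unit = ()

  private def pollSink(h: String, p: Int): Seq[Array[Byte]] = Seq.empty // stub
  private def ackSink(h: String, p: Int): Unit = ()                     // stub
}
{code}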





[jira] [Commented] (SPARK-1649) DataType should contain nullable bit

2014-04-28 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983985#comment-13983985
 ] 

Andre Schumacher commented on SPARK-1649:
-

Thinking about it a bit longer: could Nullable maybe be a mixin? But what 
should the default be, nullable or not nullable?
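A tiny sketch of the mixin idea (hypothetical, and it surfaces exactly that default question):

{code}
// Hypothetical: nullability as a mixin trait. The default question remains:
// should a plain DataType be nullable or non-nullable?
sealed trait DataType { def nullable: Boolean = false } // or true?
trait Nullable extends DataType { override def nullable: Boolean = true }

case object IntegerType extends DataType
case object NullableIntegerType extends DataType with Nullable
{code}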



