[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958014#comment-13958014 ]

Shivaram Venkataraman commented on SPARK-1391:
----------------------------------------------

Oh, and yes, I'd be happy to test out any patch / WIP.

BlockManager cannot transfer blocks larger than 2G in size
-----------------------------------------------------------
Key: SPARK-1391
URL: https://issues.apache.org/jira/browse/SPARK-1391
Project: Spark
Issue Type: Bug
Components: Block Manager, Shuffle
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman

If a task tries to remotely access a cached RDD block, I get an exception when the block size is > 2G. The exception is pasted below. Memory capacities are huge these days (> 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is > 2G.

{noformat}
14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message
java.lang.ArrayIndexOutOfBoundsException
        at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
        at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
        at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
        at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
        at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
        at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
        at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
        at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
        at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
        at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
        at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
        at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
        at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
        at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
        at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
        at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
        at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
        at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
{noformat}
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958013#comment-13958013 ]

Shivaram Venkataraman commented on SPARK-1391:
----------------------------------------------

I am not using any fastutil version explicitly. I am just using Spark's master branch from around March 23rd. (The exact commit I am synced to is https://github.com/apache/spark/commit/8265dc7739caccc59bc2456b2df055ca96337fe4)
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960100#comment-13960100 ]

Shivaram Venkataraman commented on SPARK-1391:
----------------------------------------------

Thanks for the patch. I will try this out in the next couple of days and get back.

BlockManager cannot transfer blocks larger than 2G in size
-----------------------------------------------------------
Key: SPARK-1391
URL: https://issues.apache.org/jira/browse/SPARK-1391
Project: Spark
Issue Type: Bug
Components: Block Manager, Shuffle
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman
Assignee: Min Zhou
Attachments: SPARK-1391.diff
[jira] [Created] (SPARK-1614) Move Mesos protobufs out of TaskState
Shivaram Venkataraman created SPARK-1614:
--------------------------------------------

Summary: Move Mesos protobufs out of TaskState
Key: SPARK-1614
URL: https://issues.apache.org/jira/browse/SPARK-1614
Project: Spark
Issue Type: Bug
Components: Mesos
Affects Versions: 0.9.1
Reporter: Shivaram Venkataraman
Priority: Minor

To isolate usage of Mesos protobufs, it would be good to move them out of TaskState into either a new class (MesosUtils?) or CoarseGrainedMesos{Executor, Backend}. This would allow applications that build Spark to run without including Mesos's protobuf classes in their shaded jars. This is one way to avoid protobuf conflicts between Mesos and Hadoop (https://issues.apache.org/jira/browse/MESOS-1203).
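To make the proposal concrete, here is a rough sketch of what the isolating helper might look like. The object name and the exact state pairing are assumptions for illustration, not code from the Spark or Mesos sources; the point is only that the org.apache.mesos.Protos import lives in one Mesos-specific file instead of in TaskState.

{code}
// Hypothetical sketch -- MesosTaskStateUtils is an assumed name.
import org.apache.mesos.Protos.{TaskState => MesosTaskState}
import org.apache.spark.TaskState

private[spark] object MesosTaskStateUtils {
  // Illustrative Spark -> Mesos state mapping; the real pairing must match
  // whatever TaskState currently does before the protobuf code is moved out.
  def toMesos(state: TaskState.TaskState): MesosTaskState = state match {
    case TaskState.LAUNCHING => MesosTaskState.TASK_STARTING
    case TaskState.RUNNING   => MesosTaskState.TASK_RUNNING
    case TaskState.FINISHED  => MesosTaskState.TASK_FINISHED
    case TaskState.FAILED    => MesosTaskState.TASK_FAILED
    case TaskState.KILLED    => MesosTaskState.TASK_KILLED
    case TaskState.LOST      => MesosTaskState.TASK_LOST
  }
}
{code}

With the conversion isolated like this, only the Mesos scheduler/executor backends would need the Mesos protobuf classes on their classpath.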
[jira] [Commented] (SPARK-2046) Support config properties that are changeable across tasks/stages within a job
[ https://issues.apache.org/jira/browse/SPARK-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019458#comment-14019458 ]

Shivaram Venkataraman commented on SPARK-2046:
----------------------------------------------

FWIW, I have an older implementation that did this using LocalProperties in SparkContext: https://github.com/shivaram/spark-1/commit/256a34c12d4f3c8ed1a09174f331868a7bf30e11. I haven't tested it in a setting with multiple jobs running at the same time, though.

Support config properties that are changeable across tasks/stages within a job
-------------------------------------------------------------------------------
Key: SPARK-2046
URL: https://issues.apache.org/jira/browse/SPARK-2046
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Zongheng Yang

Suppose an application consists of multiple stages, where some stages contain computation-intensive tasks and other stages contain less computation-intensive (or otherwise ordinary) tasks. For such a job to run efficiently, it might make sense to give the user a way to set spark.task.cpus to a high number right before the computation-intensive stages/tasks are generated in the user code, and to set the property to a lower number for other stages/tasks. As a first step, supporting this feature across stages rather than at the more fine-grained task level might suffice.
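For reference, a minimal sketch of how the LocalProperties approach would read from user code. SparkContext.setLocalProperty is an existing API; whether the scheduler actually honors a per-stage spark.task.cpus override is precisely what this issue proposes, so treat that behavior (and the expensiveModel function) as hypothetical.

{code}
val sc = new org.apache.spark.SparkContext("local[4]", "per-stage-config")
val data = sc.parallelize(1 to 1000000)

// Computation-intensive stage: ask for more cores per task (proposed behavior).
sc.setLocalProperty("spark.task.cpus", "4")
val heavy = data.map(x => expensiveModel(x)).collect()  // expensiveModel is hypothetical

// Ordinary stage: drop back to one core per task.
sc.setLocalProperty("spark.task.cpus", "1")
val light = data.map(_ + 1).count()
{code}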
[jira] [Commented] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065256#comment-14065256 ]

Shivaram Venkataraman commented on SPARK-2316:
----------------------------------------------

I'd just like to add that in cases where we have many thousands of blocks, this stack trace occupies one core constantly on the Master and is probably one of the reasons why the WebUI stops functioning after a certain point.

StorageStatusListener should avoid O(blocks) operations
--------------------------------------------------------
Key: SPARK-2316
URL: https://issues.apache.org/jira/browse/SPARK-2316
Project: Spark
Issue Type: Bug
Components: Spark Core, Web UI
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Andrew Or

In the case where jobs are frequently causing dropped blocks, the storage status listener can become a bottleneck. This is slow for a few reasons: one is that we use Scala collection operations, the other is that we perform operations that are O(number of blocks). I think using a few indices here could make this much faster.

{code}
at java.lang.Integer.valueOf(Integer.java:642)
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
at org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82)
at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328)
at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327)
at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105)
at org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82)
at org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56)
at org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67)
- locked 0xa27ebe30 (a org.apache.spark.ui.storage.StorageListener)
{code}
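As a sketch of the "few indices" idea: rather than re-grouping every block by RDD id on each task end (the groupBy visible in the trace above), the listener could maintain a per-RDD index that is updated incrementally, O(1) per changed block. This is an illustration assuming the public BlockId/BlockStatus types from the storage package, not the actual listener code.

{code}
import scala.collection.mutable
import org.apache.spark.storage.{BlockId, BlockStatus, RDDBlockId}

class RddBlockIndex {
  private val blocksByRdd =
    mutable.HashMap[Int, mutable.HashMap[BlockId, BlockStatus]]()

  // Called once per updated block instead of rescanning all blocks.
  def updateBlock(id: BlockId, status: BlockStatus): Unit = id match {
    case RDDBlockId(rddId, _) =>
      val rddBlocks = blocksByRdd.getOrElseUpdate(rddId, mutable.HashMap())
      if (status.storageLevel.isValid) rddBlocks(id) = status
      else rddBlocks.remove(id)  // the block was dropped
    case _ => // non-RDD blocks are not indexed here
  }

  def blocksForRdd(rddId: Int): collection.Map[BlockId, BlockStatus] =
    blocksByRdd.getOrElse(rddId, mutable.HashMap.empty)
}
{code}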
[jira] [Created] (SPARK-2563) Make number of connection retries configurable
Shivaram Venkataraman created SPARK-2563:
--------------------------------------------

Summary: Make number of connection retries configurable
Key: SPARK-2563
URL: https://issues.apache.org/jira/browse/SPARK-2563
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Shivaram Venkataraman
Priority: Minor

In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. We should make the number of retries before failing configurable to handle these cases.
[jira] [Commented] (SPARK-2563) Make number of connection retries configurable
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065735#comment-14065735 ]

Shivaram Venkataraman commented on SPARK-2563:
----------------------------------------------

https://github.com/apache/spark/pull/1471
[jira] [Commented] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074692#comment-14074692 ]

Shivaram Venkataraman commented on SPARK-2316:
----------------------------------------------

On a related note, can we have flags to turn off some of the UI listeners? If the StorageTab is going to be too expensive to update, it would be good to have a way to turn it off and just have the JobProgress show up in the UI.

StorageStatusListener should avoid O(blocks) operations
--------------------------------------------------------
Key: SPARK-2316
URL: https://issues.apache.org/jira/browse/SPARK-2316
Project: Spark
Issue Type: Bug
Components: Spark Core, Web UI
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Andrew Or
Priority: Critical
[jira] [Updated] (SPARK-2563) Re-open sockets to handle connect timeouts
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-2563:
-----------------------------------------

Description:
In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. If the connection attempt times out, the socket gets closed and from [1] we get a ClosedChannelException. We should check if the socket was closed due to a timeout, open a new socket, and try to connect. FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux (I ran `echo 8 > /proc/sys/net/ipv4/tcp_syn_retries`).

[1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573

was: In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. We should make the number of retries before failing configurable to handle these cases.

Summary: Re-open sockets to handle connect timeouts (was: Make number of connection retries configurable)

Re-open sockets to handle connect timeouts
-------------------------------------------
Key: SPARK-2563
URL: https://issues.apache.org/jira/browse/SPARK-2563
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Shivaram Venkataraman
Priority: Minor
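A minimal sketch of the re-open idea, written with blocking NIO for brevity (Spark's ConnectionManager actually uses non-blocking channels, so the real fix would look different): since a SocketChannel whose connect attempt timed out is closed and unusable, each retry must open a fresh channel.

{code}
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

def connectWithRetries(address: InetSocketAddress, maxRetries: Int): SocketChannel = {
  var lastError: java.io.IOException = null
  for (_ <- 1 to maxRetries) {
    val channel = SocketChannel.open()  // a new socket for every attempt
    try {
      channel.socket().connect(address, 10000)  // 10s connect timeout
      return channel
    } catch {
      case e: java.io.IOException =>
        channel.close()  // a timed-out socket cannot be reconnected
        lastError = e
    }
  }
  throw new java.io.IOException(s"Failed to connect to $address", lastError)
}
{code}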
[jira] [Comment Edited] (SPARK-2563) Re-open sockets to handle connect timeouts
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065735#comment-14065735 ]

Shivaram Venkataraman edited comment on SPARK-2563 at 7/28/14 5:43 PM:
-----------------------------------------------------------------------

More details about the bug are at -https://github.com/apache/spark/pull/1471-

was (Author: shivaram):
https://github.com/apache/spark/pull/1471
[jira] [Created] (SPARK-2723) Block Manager should catch exceptions in putValues
Shivaram Venkataraman created SPARK-2723:
--------------------------------------------

Summary: Block Manager should catch exceptions in putValues
Key: SPARK-2723
URL: https://issues.apache.org/jira/browse/SPARK-2723
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman

The BlockManager should catch exceptions encountered while writing out files to disk. Right now these exceptions get counted as user-level task failures and the job is aborted after failing 4 times. We should either fail the executor or handle this better to prevent the job from dying. I ran into an issue where one disk on a large EC2 cluster failed and this resulted in a long-running job terminating.

Longer term, should we also look at black-listing local directories when one of them becomes unusable?

Exception pasted below:

{noformat}
14/07/29 00:55:39 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /mnt2/spark/spark-local-20140728175256-e7cb/28/broadcast_264_piece20 (Input/output error)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
        at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:79)
        at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:66)
        at org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:847)
        at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:267)
        at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:256)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.storage.MemoryStore.ensureFreeSpace(MemoryStore.scala:256)
        at org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:179)
        at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:76)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:663)
        at org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
{noformat}
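One way to read the proposal, sketched with hypothetical names (BlockStorageException does not exist in Spark): wrap the disk write so that an IOException surfaces as a storage-level failure the scheduler can act on, rather than a user-level task failure counted against the 4-failure limit.

{code}
// Hypothetical error type for disk/storage failures.
class BlockStorageException(msg: String, cause: Throwable) extends Exception(msg, cause)

def putValuesSafely(blockId: String)(write: => Unit): Unit = {
  try {
    write
  } catch {
    case e: java.io.IOException =>
      // A dying disk should fail the executor (or black-list the directory),
      // not burn through the job's task-failure budget.
      throw new BlockStorageException(s"Failed to write block $blockId to disk", e)
  }
}
{code}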
[jira] [Created] (SPARK-2774) Set preferred locations for reduce tasks
Shivaram Venkataraman created SPARK-2774:
--------------------------------------------

Summary: Set preferred locations for reduce tasks
Key: SPARK-2774
URL: https://issues.apache.org/jira/browse/SPARK-2774
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Shivaram Venkataraman

Currently we do not set preferred locations for reduce tasks in Spark. This patch proposes setting preferred locations based on the map output sizes and locations tracked by the MapOutputTracker. This is useful in two conditions:
1. When you have a small job in a large cluster, it can be useful to co-locate map and reduce tasks to avoid going over the network.
2. If there is a lot of data skew in the map stage outputs, then it is beneficial to place the reducer close to the largest output.
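The core of the proposal can be sketched in a few lines: for each reducer, rank the hosts holding map output by how many bytes they would serve to that reducer (information the MapOutputTracker already tracks) and take the top hosts as preferred locations. This is an illustration, not the eventual patch.

{code}
def preferredReduceLocations(
    bytesByHost: Map[String, Long],  // host -> map-output bytes destined for one reducer
    numLocations: Int = 2): Seq[String] = {
  bytesByHost.toSeq
    .sortBy { case (_, bytes) => -bytes }  // largest outputs first, which handles skew
    .take(numLocations)
    .map { case (host, _) => host }
}
{code}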
[jira] [Created] (SPARK-2950) Add GC time and Shuffle Write time to JobLogger output
Shivaram Venkataraman created SPARK-2950:
--------------------------------------------

Summary: Add GC time and Shuffle Write time to JobLogger output
Key: SPARK-2950
URL: https://issues.apache.org/jira/browse/SPARK-2950
Project: Spark
Issue Type: Improvement
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor

The JobLogger is very useful for performing offline performance profiling of Spark jobs. GC time and shuffle write time are available in TaskMetrics but are currently missing from the JobLogger output. This change adds these two fields.
[jira] [Updated] (SPARK-2950) Add GC time and Shuffle Write time to JobLogger output
[ https://issues.apache.org/jira/browse/SPARK-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-2950:
-----------------------------------------

Fix Version/s: 1.2.0

Add GC time and Shuffle Write time to JobLogger output
-------------------------------------------------------
Key: SPARK-2950
URL: https://issues.apache.org/jira/browse/SPARK-2950
Project: Spark
Issue Type: Improvement
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
Fix For: 1.2.0
[jira] [Resolved] (SPARK-2950) Add GC time and Shuffle Write time to JobLogger output
[ https://issues.apache.org/jira/browse/SPARK-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-2950.
------------------------------------------

Resolution: Fixed
[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112855#comment-14112855 ]

Shivaram Venkataraman commented on SPARK-3215:
----------------------------------------------

This looks very interesting -- one thing that would be very useful is to make the RPC interface language-agnostic. This would make it possible to submit Python or R jobs to a SparkContext without embedding a JVM in the driver process. Could we use Thrift or Protocol Buffers or something like that? Also, it would be great to make a tentative list of RPCs that are required to get a simple application to work.

Add remote interface for SparkContext
--------------------------------------
Key: SPARK-3215
URL: https://issues.apache.org/jira/browse/SPARK-3215
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Marcelo Vanzin
Labels: hive
Attachments: RemoteSparkContext.pdf

A quick description of the issue: as part of running Hive jobs on top of Spark, it's desirable to have a SparkContext that is running in the background and listening for job requests for a particular user session. Running multiple contexts in the same JVM is not a very good solution. Not only does SparkContext currently have issues sharing the same JVM among multiple instances, but doing so also turns the JVM running the contexts into a huge bottleneck in the system. So I'm proposing a solution where we have a SparkContext that is running in a separate process, and listening for requests from the client application via some RPC interface (most probably Akka). I'll attach a document shortly with the current proposal. Let's use this bug to discuss the proposal and any other suggestions.
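To seed the "tentative list of RPCs" discussion, here is one possible minimal interface, written as a Scala trait purely for readability. The names are suggestions, and the actual definition would live in whatever IDL (Thrift, Protocol Buffers, ...) gets chosen.

{code}
trait RemoteSparkContext {
  def createContext(conf: Map[String, String]): String             // returns a session id
  def submitJob(sessionId: String, serializedJob: Array[Byte]): Long  // returns a job id
  def jobStatus(sessionId: String, jobId: Long): String
  def fetchResult(sessionId: String, jobId: Long): Array[Byte]
  def cancelJob(sessionId: String, jobId: Long): Unit
  def stopContext(sessionId: String): Unit
}
{code}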
[jira] [Created] (SPARK-3659) Set EC2 version to 1.1.0 in master branch
Shivaram Venkataraman created SPARK-3659:
--------------------------------------------

Summary: Set EC2 version to 1.1.0 in master branch
Key: SPARK-3659
URL: https://issues.apache.org/jira/browse/SPARK-3659
Project: Spark
Issue Type: Bug
Components: EC2
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor

Master branch should be in sync with branch-1.1.
[jira] [Created] (SPARK-3674) Add support for launching YARN clusters in spark-ec2
Shivaram Venkataraman created SPARK-3674:
--------------------------------------------

Summary: Add support for launching YARN clusters in spark-ec2
Key: SPARK-3674
URL: https://issues.apache.org/jira/browse/SPARK-3674
Project: Spark
Issue Type: Bug
Components: EC2
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman

Right now spark-ec2 only supports launching Spark Standalone clusters. While this is sufficient for basic usage, it is hard to test features or do performance benchmarking on YARN. It would be good to add support for installing and configuring an Apache YARN cluster at a fixed version -- say the latest stable version, 2.4.0.
[jira] [Commented] (SPARK-3522) Make spark-ec2 verbosity configurable
[ https://issues.apache.org/jira/browse/SPARK-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151348#comment-14151348 ]

Shivaram Venkataraman commented on SPARK-3522:
----------------------------------------------

It would be good, but I think most of the output in spark-ec2 comes from the shell scripts that install things like HDFS, Spark, etc. So this would be less of a Python logging change and more of a change in the shell scripts in spark-ec2. The other thing to consider is that the output is often the only way to figure out what went wrong (and why) during cluster launch. So it might be better to save it to a file (spark-ec2-cluster-name-launch.log), as re-running spark-ec2 with more logging can sometimes be expensive.

Make spark-ec2 verbosity configurable
--------------------------------------
Key: SPARK-3522
URL: https://issues.apache.org/jira/browse/SPARK-3522
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

When launching a cluster, {{spark-ec2}} spits out a lot of stuff that feels like debug output. It would be better for the user if {{spark-ec2}} did the following:
* default to info output level
* allow an option to increase verbosity and include debug output

This will require converting most of the {{print}} statements in the script to use Python's {{logging}} module and setting output levels ({{INFO}}, {{WARN}}, {{DEBUG}}) for each statement.
[jira] [Commented] (SPARK-2008) Enhance spark-ec2 to be able to add and remove slaves to an existing cluster
[ https://issues.apache.org/jira/browse/SPARK-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153770#comment-14153770 ]

Shivaram Venkataraman commented on SPARK-2008:
----------------------------------------------

This will be a very useful feature for spark-ec2 and is a good issue to work on. I think removing slaves should be relatively easy to implement, as systems like HDFS and Spark should be resilient to slaves being removed. For adding slaves, we'll need a new script that will run setup-slave.sh (https://github.com/mesos/spark-ec2/blob/v3/setup-slave.sh) and bring up Datanodes, Spark workers, etc.

Enhance spark-ec2 to be able to add and remove slaves to an existing cluster
-----------------------------------------------------------------------------
Key: SPARK-2008
URL: https://issues.apache.org/jira/browse/SPARK-2008
Project: Spark
Issue Type: New Feature
Components: EC2
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Minor

Per [the discussion here|http://apache-spark-user-list.1001560.n3.nabble.com/Having-spark-ec2-join-new-slaves-to-existing-cluster-td3783.html]:
{quote}
I would like to be able to use spark-ec2 to launch new slaves and add them to an existing, running cluster. Similarly, I would also like to remove slaves from an existing cluster. Use cases include:
* Oh snap, I sized my cluster incorrectly. Let me add/remove some slaves.
* During scheduled batch processing, I want to add some new slaves, perhaps on spot instances. When that processing is done, I want to kill them. (Cruel, I know.)

I gather this is not possible at the moment. spark-ec2 appears to be able to launch new slaves for an existing cluster only if the master is stopped. I also do not see any ability to remove slaves from a cluster.
{quote}
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153774#comment-14153774 ]

Shivaram Venkataraman commented on SPARK-3434:
----------------------------------------------

I'll post a design doc by sometime tonight. We also have a reference implementation that I will add a link to, and we can base our discussion off that.

Distributed block matrix
-------------------------
Key: SPARK-3434
URL: https://issues.apache.org/jira/browse/SPARK-3434
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Xiangrui Meng

This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.:
1. matrix multiplication
2. matrix factorization (QR, LU, ...)

Let's discuss the partitioning and storage and how they fit into the above use cases. Questions:
1. Should it be backed by a single RDD that contains all of the sub-matrices, or many RDDs, each containing only one sub-matrix?
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162755#comment-14162755 ]

Shivaram Venkataraman commented on SPARK-3821:
----------------------------------------------

1. Yes - the same stuff is installed on master and slaves. In fact, they have the same AMI.
2. The base Spark AMI is created using `create_image.sh` (from a base Amazon AMI) -- after that we pass the AMI ID to `spark_ec2.py`, which calls `setup.sh` on the master.

Develop an automated way of creating Spark images (AMI, Docker, and others)
----------------------------------------------------------------------------
Key: SPARK-3821
URL: https://issues.apache.org/jira/browse/SPARK-3821
Project: Spark
Issue Type: Improvement
Components: Build, EC2
Reporter: Nicholas Chammas

Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167478#comment-14167478 ]

Shivaram Venkataraman commented on SPARK-3434:
----------------------------------------------

~brkyvz -- We are just adding a few more test cases to classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here.
[jira] [Comment Edited] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167478#comment-14167478 ]

Shivaram Venkataraman edited comment on SPARK-3434 at 10/10/14 8:45 PM:
------------------------------------------------------------------------

[~brkyvz] -- We are just adding a few more test cases to classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here.

was (Author: shivaram):
~brkyvz -- We are just adding a few more test cases to classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here.
[jira] [Assigned] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman reassigned SPARK-3434:
--------------------------------------------

Assignee: Shivaram Venkataraman

Distributed block matrix
-------------------------
Key: SPARK-3434
URL: https://issues.apache.org/jira/browse/SPARK-3434
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Xiangrui Meng
Assignee: Shivaram Venkataraman
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171272#comment-14171272 ]

Shivaram Venkataraman commented on SPARK-3434:
----------------------------------------------

Sorry for the delay in getting back -- I've posted a design doc at http://goo.gl/0eE5fh and a reference implementation at https://github.com/amplab/ml-matrix. The design doc and the reference implementation use Spark as a library, so this works as a standalone library in case somebody wants to try it out.

Some more points to note regarding the integration:
1. The existing implementation uses Breeze matrices in the interface, but we will change that to use the local Matrix trait already present in Spark.
2. The matrix layouts will also extend the DistributedMatrix class in MLlib, and we could create a new interface BlockDistributedMatrix from the interface in amplab/ml-matrix.
3. We can also use this JIRA or create a new JIRA to discuss what algorithms / operations should be merged into Spark. I think TSQR and NormalEquations should be pretty useful. Other algorithms like 2-D BlockQR and BlockCoordinateDescent can be merged later if we feel they are useful (these haven't been pushed to ml-matrix yet).

I will create a first patch for the matrix formats in a couple of days. Please let me know if there are any questions / clarifications.
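For readers who don't want to open the design doc, here is a rough sketch of the single-RDD block layout under discussion. The class name and Breeze-based interface mirror point 1 above; treat this as illustrative, not the ml-matrix or MLlib API.

{code}
import breeze.linalg.DenseMatrix
import org.apache.spark.rdd.RDD

// One RDD holds all sub-matrices, keyed by their (blockRow, blockCol) grid position.
class BlockDistributedMatrix(
    val blocks: RDD[((Int, Int), DenseMatrix[Double])],
    val rowsPerBlock: Int,
    val colsPerBlock: Int) {

  lazy val numRowBlocks: Int = blocks.keys.map(_._1).reduce(math.max) + 1
  lazy val numColBlocks: Int = blocks.keys.map(_._2).reduce(math.max) + 1

  // Assumes uniform block sizes; ragged edge blocks would need separate handling.
  def numRows: Long = numRowBlocks.toLong * rowsPerBlock
  def numCols: Long = numColBlocks.toLong * colsPerBlock
}
{code}

Keying blocks by grid position makes operations like multiplication expressible as joins on block coordinates, which is one of the use cases listed in the issue description.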
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173954#comment-14173954 ]

Shivaram Venkataraman commented on SPARK-3957:
----------------------------------------------

I think it needs to be tracked in the Block Manager -- however, we also need to track this on a per-executor basis and not just at the driver. Right now, AFAIK, executors do not report new broadcast blocks to the master, to reduce communication. However, we could add broadcast blocks to some periodic report. [~andrewor] might know more.

Broadcast variable memory usage not reflected in UI
----------------------------------------------------
Key: SPARK-3957
URL: https://issues.apache.org/jira/browse/SPARK-3957
Project: Spark
Issue Type: Bug
Components: Block Manager, Web UI
Affects Versions: 1.0.2, 1.1.0
Reporter: Shivaram Venkataraman
Assignee: Nan Zhu

Memory used by broadcast variables is not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor, but this number doesn't include memory used by broadcast variables. Similarly, the storage tab only shows the list of RDDs cached and how much memory they use. We should add a separate column / tab for broadcast variables to make it easier to debug.
[jira] [Created] (SPARK-3973) Print callSite information for broadcast variables
Shivaram Venkataraman created SPARK-3973:
--------------------------------------------

Summary: Print callSite information for broadcast variables
Key: SPARK-3973
URL: https://issues.apache.org/jira/browse/SPARK-3973
Project: Spark
Issue Type: Bug
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
Fix For: 1.2.0

Printing call site information for broadcast variables will help in debugging which variables are used, when they are used, etc.
[jira] [Created] (SPARK-4030) `destroy` method in Broadcast should be public
Shivaram Venkataraman created SPARK-4030:
--------------------------------------------

Summary: `destroy` method in Broadcast should be public
Key: SPARK-4030
URL: https://issues.apache.org/jira/browse/SPARK-4030
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Shivaram Venkataraman

The destroy method in Broadcast.scala [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/Broadcast.scala#L91] is right now marked as private[spark]. This prevents long-running applications from cleaning up memory used by broadcast variables on the driver. Also, as broadcast variables are always created with persistence MEMORY_AND_DISK, this slows down jobs when old broadcast variables are flushed to disk. Making `destroy` public can help applications control the lifetime.
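For context, this is what the change enables from application code. A minimal usage sketch, with buildLargeMap standing in for whatever driver-side state is being broadcast (it is a hypothetical function, not a Spark API):

{code}
val lookupTable = sc.broadcast(buildLargeMap())  // buildLargeMap is hypothetical
val counts = rdd.map(x => lookupTable.value.getOrElse(x, 0)).collect()

// Once no future stage needs it, reclaim memory on the driver and the executors.
lookupTable.destroy()

// Any accidental later use of lookupTable.value now fails fast
// instead of silently re-fetching a stale value.
{code}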
[jira] [Created] (SPARK-4031) Read broadcast variables on use
Shivaram Venkataraman created SPARK-4031:
--------------------------------------------

Summary: Read broadcast variables on use
Key: SPARK-4031
URL: https://issues.apache.org/jira/browse/SPARK-4031
Project: Spark
Issue Type: Bug
Components: Block Manager, Spark Core
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman

This is a proposal to change the broadcast variable implementations in Spark to only read values when they are used, rather than at deserialization time. This change will be very helpful (and in our use cases required) for complex applications which have a large number of broadcast variables. For example, if broadcast variables are class members, they are captured in closures even when they are not used. We could also consider cleaning closures more aggressively, but that might be a more complex change.
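A stripped-down sketch of the "read on use" idea (this is the shape of the proposal, not Spark's actual TorrentBroadcast/HttpBroadcast code): the object that ships inside a closure carries only an id, and the value is fetched lazily the first time .value is called on an executor.

{code}
class LazyBroadcast[T](val id: Long, fetch: Long => T) extends Serializable {
  // Fetched at most once per JVM, and only if .value is actually called;
  // a broadcast captured in a closure but never used costs nothing to read.
  @transient private lazy val _value: T = fetch(id)
  def value: T = _value
}
{code}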
[jira] [Commented] (SPARK-4030) `destroy` method in Broadcast should be public
[ https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178532#comment-14178532 ]

Shivaram Venkataraman commented on SPARK-4030:
----------------------------------------------

Yes - there is a bunch of logic around `valid` which checks for destroyed broadcast variables. I don't mind having a more esoteric option that is harder to use -- like unpersist(dropFromMaster=true) -- which you can't use by mistake.

`destroy` method in Broadcast should be public
-----------------------------------------------
Key: SPARK-4030
URL: https://issues.apache.org/jira/browse/SPARK-4030
Project: Spark
Issue Type: Improvement
Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Shivaram Venkataraman
[jira] [Commented] (SPARK-4030) `destroy` method in Broadcast should be public
[ https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182181#comment-14182181 ]

Shivaram Venkataraman commented on SPARK-4030:
----------------------------------------------

Great -- I'll send a PR and also include the change to capture the callSite and print it out if `assertValid` fails.
[jira] [Assigned] (SPARK-4030) `destroy` method in Broadcast should be public
[ https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman reassigned SPARK-4030:
--------------------------------------------

Assignee: Shivaram Venkataraman

`destroy` method in Broadcast should be public
-----------------------------------------------
Key: SPARK-4030
URL: https://issues.apache.org/jira/browse/SPARK-4030
Project: Spark
Issue Type: Improvement
Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
[jira] [Resolved] (SPARK-4030) `destroy` method in Broadcast should be public
[ https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-4030.
------------------------------------------

Resolution: Fixed
Fix Version/s: 1.2.0
[jira] [Commented] (SPARK-4030) `destroy` method in Broadcast should be public
[ https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185263#comment-14185263 ]

Shivaram Venkataraman commented on SPARK-4030:
----------------------------------------------

Issue resolved by pull request 2922: https://github.com/apache/spark/pull/2922
[jira] [Resolved] (SPARK-4031) Read broadcast variables on use
[ https://issues.apache.org/jira/browse/SPARK-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-4031. -- Resolution: Fixed Fix Version/s: 1.2.0 Read broadcast variables on use --- Key: SPARK-4031 URL: https://issues.apache.org/jira/browse/SPARK-4031 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Fix For: 1.2.0 This is a proposal to change the broadcast variable implementations in Spark to only read values when they are used, rather than on deserialization. This change will be very helpful (and in our use cases, required) for complex applications which have a large number of broadcast variables. For example, if broadcast variables are class members, they are captured in closures even when they are not used. We could also consider cleaning closures more aggressively, but that might be a more complex change.
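The proposal amounts to wrapping the broadcast value so that deserializing a task closure does not force a read. A rough sketch of that pattern, offered as an illustration rather than the actual Broadcast implementation (the class and its names are hypothetical):
{code}
// Read-on-use wrapper: deserializing an object that holds this reference is
// cheap; the underlying value is only loaded the first time `value` is called.
class LazyValue[T](load: () => T) extends Serializable {
  @transient private lazy val cached: T = load()
  def value: T = cached
}
{code}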
[jira] [Commented] (SPARK-4031) Read broadcast variables on use
[ https://issues.apache.org/jira/browse/SPARK-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187122#comment-14187122 ] Shivaram Venkataraman commented on SPARK-4031: -- Issue resolved by pull request 2871 https://github.com/apache/spark/pull/2871 Read broadcast variables on use --- Key: SPARK-4031 URL: https://issues.apache.org/jira/browse/SPARK-4031 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Fix For: 1.2.0 This is a proposal to change the broadcast variable implementations in Spark to only read values when they are used, rather than on deserialization. This change will be very helpful (and in our use cases, required) for complex applications which have a large number of broadcast variables. For example, if broadcast variables are class members, they are captured in closures even when they are not used. We could also consider cleaning closures more aggressively, but that might be a more complex change.
[jira] [Updated] (SPARK-4137) Relative paths don't get handled correctly by spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-4137: - Assignee: Nicholas Chammas Relative paths don't get handled correctly by spark-ec2 --- Key: SPARK-4137 URL: https://issues.apache.org/jira/browse/SPARK-4137 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.2.0
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203757#comment-14203757 ] Shivaram Venkataraman commented on SPARK-3821: -- [~nchammas] Thanks for putting this together -- this is looking great! I just had a couple of quick questions / clarifications: 1. My preference would be to just have a single AMI across Spark versions, for a couple of reasons. First, it reduces steps for every release (even though creating AMIs is definitely much simpler now!). Also, the number of AMIs we maintain could get large if we do this for every minor and major release like 1.1.1. [~pwendell] could probably comment more on the release process etc. 2. Could you clarify whether Hadoop is pre-installed in the new AMIs or is it still installed on startup? The flexibility we right now have of switching between Hadoop 1, Hadoop 2, YARN etc. is useful for testing. (Related Packer question: are the [init scripts|https://github.com/nchammas/spark-ec2/blob/packer/packer/spark-packer.json#L129] run during AMI creation or during startup?) 3. Do you have some benchmarks for the new AMI without Spark 1.1.0 pre-installed? [We right now have old AMI vs. new AMI with Spark|https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md#new-amis---latest-os-updates-and-spark-110-pre-installed-single-run]. I see a couple of huge wins in the new AMI (from SSH wait time, ganglia init etc.) which I guess we should get even without Spark being pre-installed. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205131#comment-14205131 ] Shivaram Venkataraman commented on SPARK-3821: -- Regarding reducing init time, I think there are simple things we can do in init.sh that will get us most of the way there. For example, we can download the tar.gz files for Hadoop and Spark on each machine and untar them in parallel, instead of rsync-ing at the end. But we can revisit this in a separate change, I guess. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Commented] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
[ https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215640#comment-14215640 ] Shivaram Venkataraman commented on SPARK-3337: -- [~andrewor14] can we pull this into 1.1.1? A lot of people ran into this bug in the AMPCamp exercises as their install paths had spaces. Paranoid quoting in shell to allow install dirs with spaces within. --- Key: SPARK-3337 URL: https://issues.apache.org/jira/browse/SPARK-3337 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Fix For: 1.2.0
[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230878#comment-14230878 ] Shivaram Venkataraman commented on SPARK-3963: -- [~pwendell] This looks pretty useful -- was this postponed from 1.2? I have a use case that needs Hadoop file names and was wondering if there was a workaround before this is implemented. Support getting task-scoped properties from TaskContext --- Key: SPARK-3963 URL: https://issues.apache.org/jira/browse/SPARK-3963 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell This is a proposal for a minor feature. Given stabilization of the TaskContext API, it would be nice to have a mechanism for Spark jobs to access properties that are defined with task-level scope by Spark RDDs. I'd like to propose adding a simple properties hash map with some standard Spark properties that users can access. Later it would be nice to support users setting these properties, but for now, to keep it simple in 1.2, I'd prefer users not be able to set them. The main use case is providing the file name from Hadoop RDDs, a very common request. But I'd imagine us using this for other things later on. We could also use this to expose some of the taskMetrics, such as the input bytes. {code} val data = sc.textFile("s3n://.../2014/*/*/*.json") data.mapPartitions { iter => val tc = TaskContext.get val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME) val parts = fileName.split("/") val (year, month, day) = (parts(3), parts(4), parts(5)) ... } {code} Internally we'd have a method called setProperty, but this wouldn't be exposed initially. This is structured as a simple (String, String) hash map for ease of porting to Python.
[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230900#comment-14230900 ] Shivaram Venkataraman commented on SPARK-3963: -- Thanks. I somehow missed `mapPartitionsWithInputSplit` -- that will work for now. Support getting task-scoped properties from TaskContext --- Key: SPARK-3963 URL: https://issues.apache.org/jira/browse/SPARK-3963 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell This is a proposal for a minor feature. Given stabilization of the TaskContext API, it would be nice to have a mechanism for Spark jobs to access properties that are defined with task-level scope by Spark RDDs. I'd like to propose adding a simple properties hash map with some standard Spark properties that users can access. Later it would be nice to support users setting these properties, but for now, to keep it simple in 1.2, I'd prefer users not be able to set them. The main use case is providing the file name from Hadoop RDDs, a very common request. But I'd imagine us using this for other things later on. We could also use this to expose some of the taskMetrics, such as the input bytes. {code} val data = sc.textFile("s3n://.../2014/*/*/*.json") data.mapPartitions { iter => val tc = TaskContext.get val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME) val parts = fileName.split("/") val (year, month, day) = (parts(3), parts(4), parts(5)) ... } {code} Internally we'd have a method called setProperty, but this wouldn't be exposed initially. This is structured as a simple (String, String) hash map for ease of porting to Python.
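For anyone landing here with the same need, a sketch of that workaround, assuming a line-oriented Hadoop input (the s3n path is the elided example from the description; the cast is needed because {{mapPartitionsWithInputSplit}} is defined on HadoopRDD, not RDD):
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

val raw = sc.hadoopFile("s3n://.../2014/*/*/*.json",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

// Pair every record with the name of the file it came from.
val withFileName = raw.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
    val file = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (file, line.toString) }
  }
{code}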
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252053#comment-14252053 ] Shivaram Venkataraman commented on SPARK-2075: -- So just to make sure I understand things correctly: is it the case that the jar published to Maven (spark-core 1.1.1) is built using Hadoop 2 dependencies, while the hadoop1 assembly jar that is distributed is built using Hadoop 1 (obviously...)? [~srowen] While I see that we officially support submitting jobs using spark-submit, it is surprising to me that other deployment methods would fail this way (from the user's perspective, the Spark versions at compile time and run time presumably match up?). We should at the very least document this, but it would also be good to see if there is a workaround. Anonymous classes are missing from Spark distribution - Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Priority: Critical Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar | grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class {code}
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252360#comment-14252360 ] Shivaram Venkataraman commented on SPARK-2075: -- Hmm -- looking at the release steps, it looks like the release on Maven should be built from Hadoop 1.0.4. [~pwendell] or [~andrewor14] might be able to throw more light on this. (BTW I wonder if we can trace the source of this mismatch for the case reported by [~sunrui], where the Hadoop1 distribution of Spark 1.1.1 doesn't work with the Maven central jar.) I see your high-level point that this is not about spark-submit per se, but about having the exact same binary on the server and as a compile-time dependency. It's just unfortunate that having the same Spark version number isn't sufficient. Also, is the workaround right now to rebuild Spark from source using `make-distribution`, do `mvn install`, rebuild the application, and deploy Spark using the assembly jar? Anonymous classes are missing from Spark distribution - Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Priority: Critical Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar | grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class {code}
[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255364#comment-14255364 ] Shivaram Venkataraman commented on SPARK-2075: -- [~sunrui] What I can see from this JIRA discussion (and [~srowen] please correct me if I am wrong) is that Hadoop 1 vs. Hadoop 2 is one of the causes of incompatibility. It is _not the only_ reason, and I don't think we exactly know why the pre-built binary for 1.1.0 is different from the Maven version. I think the best-practice advice is to use the exact same jar in the application and in the runtime. Marking Spark as a provided dependency in the application build and using spark-submit is one way of achieving this. Or one can publish a local build to Maven and use the same local build to start the cluster etc. Anonymous classes are missing from Spark distribution - Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Assignee: Shixiong Zhu Priority: Critical Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar | grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class {code}
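For concreteness, a sketch of the first suggestion as an sbt build setting (the version shown is illustrative; the same idea applies to a Maven pom with {{<scope>provided</scope>}}):
{code}
// build.sbt: compile against Spark but do not bundle it into the application
// assembly; the cluster's own Spark assembly provides these classes at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"
{code}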
[jira] [Commented] (SPARK-4977) spark-ec2 start resets all the spark/conf configurations
[ https://issues.apache.org/jira/browse/SPARK-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259480#comment-14259480 ] Shivaram Venkataraman commented on SPARK-4977: -- I've run into this before too, but it's not very easy to fix. The reason most conf files get overwritten is that hostnames change on EC2 when machines are stopped and started, and we need to update the hostnames in the config files. I guess there are a couple of solutions I can think of: 1. Provide an extension-like mechanism where we source a script which contains user-defined options (like spark-env-extensions.sh) and don't overwrite this file during start / stop. 2. Separate out conf files which need hostname changes vs. those that don't, and only overwrite the former. This will need changes to `deploy_templates.py` in our current setup. spark-ec2 start resets all the spark/conf configurations Key: SPARK-4977 URL: https://issues.apache.org/jira/browse/SPARK-4977 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Reporter: Noah Young Priority: Minor Running `spark-ec2 start` to restart an already-launched cluster causes the cluster setup scripts to be run, which reset any existing spark configuration files on the remote machines. The expected behavior is that all the modules (tachyon, hadoop, spark itself) should be restarted, and perhaps the master configuration copy-dir'd out, but anything in spark/conf should (at least optionally) be left alone. As far as I know, one must create and execute their own init script to set all spark configurables as needed after restarting a cluster.
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263193#comment-14263193 ] Shivaram Venkataraman commented on SPARK-3821: -- Yeah, you are right that the times are pretty close between the Packer AMI and the base AMI. I was just curious if I was missing something. I don't think there is much else I had in mind -- having the full cluster launch times for the existing AMI vs. Packer would be good, and it would also be good to see how Packer compares to images created using [create_image.sh|https://github.com/mesos/spark-ec2/blob/v4/create_image.sh] Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263181#comment-14263181 ] Shivaram Venkataraman commented on SPARK-3821: -- [~nchammas] Thanks for the benchmark. One thing I am curious about is why the Packer AMI is faster than launching just the base Amazon AMI. Is this because we spend some time installing things on the base AMI that we avoid with Packer? Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14289679#comment-14289679 ] Shivaram Venkataraman commented on SPARK-5386: -- A couple of things might be worth inspecting: 1. It might be interesting to see if this is a problem in `reduce` or in the `map` stage, i.e. does running a count after the parallelize work? 2. The error message indicates a request for around 2.3G of memory, which seems to indicate that a bunch of these vectors are being created at once. It'd be interesting to see what happens when, say, p = 2 in your script. Reduce fails with vectors of big length --- Key: SPARK-5386 URL: https://issues.apache.org/jira/browse/SPARK-5386 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Environment: Overall: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers Spark: ./spark-shell --executor-memory 8G --driver-memory 8G spark.driver.maxResultSize 0 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space Reporter: Alexander Ulanov Fix For: 1.3.0 Code: import org.apache.spark.mllib.rdd.RDDFunctions._ import breeze.linalg._ import org.apache.log4j._ Logger.getRootLogger.setLevel(Level.OFF) val n = 6000 val p = 12 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n)) vv.reduce(_ + _) When executing in the shell it crashes after some period of time. One of the nodes contains the following in stdout: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (malloc) failed to allocate 2863661056 bytes for committing reserved memory. # An error report file with more information is saved as: # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log During the execution there is a message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Connection from server-12.net/10.10.10.10:54701 closed
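Concretely, the two checks suggested above could look like the following in the reporter's shell session (a sketch reusing the names and imports from the script in the description):
{code}
// 1. Isolate the map stage: if this count succeeds, the failure is in reduce.
vv.count()

// 2. Lower the parallelism: with only 2 partitions, far fewer vectors are
//    allocated concurrently on each machine.
val vv2 = sc.parallelize(0 until 2, 2).map(i => DenseVector.rand[Double](n))
vv2.reduce(_ + _)
{code}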
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14289731#comment-14289731 ] Shivaram Venkataraman commented on SPARK-5386: -- Note that having 2 worker instances and 2 cores per worker would make it 4 tasks per machine. And if the `count` works and `reduce` fails, then it looks like it has something to do with allocating extra vectors to hold the result in each partition ([1]) etc. I don't know much about the Scala implementation of reduceLeft, or about ways to trace down where the memory allocations are coming from. [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L865 Reduce fails with vectors of big length --- Key: SPARK-5386 URL: https://issues.apache.org/jira/browse/SPARK-5386 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Environment: Overall: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers Spark: ./spark-shell --executor-memory 8G --driver-memory 8G spark.driver.maxResultSize 0 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space Reporter: Alexander Ulanov Fix For: 1.3.0 Code: import org.apache.spark.mllib.rdd.RDDFunctions._ import breeze.linalg._ import org.apache.log4j._ Logger.getRootLogger.setLevel(Level.OFF) val n = 6000 val p = 12 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n)) vv.count() vv.reduce(_ + _) When executing in the shell it crashes after some period of time. One of the nodes contains the following in stdout: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (malloc) failed to allocate 2863661056 bytes for committing reserved memory. # An error report file with more information is saved as: # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log During the execution there is a message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Connection from server-12.net/10.10.10.10:54701 closed
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14289957#comment-14289957 ] Shivaram Venkataraman commented on SPARK-5386: -- Results are merged on the driver one at a time. You can see the merge function that is called right below, at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L873 However, I don't know if there is anything that limits the rate at which results are fetched etc. Reduce fails with vectors of big length --- Key: SPARK-5386 URL: https://issues.apache.org/jira/browse/SPARK-5386 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Environment: Overall: 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers Spark: ./spark-shell --executor-memory 8G --driver-memory 8G spark.driver.maxResultSize 0 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space Reporter: Alexander Ulanov Fix For: 1.3.0 Code: import org.apache.spark.mllib.rdd.RDDFunctions._ import breeze.linalg._ import org.apache.log4j._ Logger.getRootLogger.setLevel(Level.OFF) val n = 6000 val p = 12 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n)) vv.count() vv.reduce(_ + _) When executing in the shell it crashes after some period of time. One of the nodes contains the following in stdout: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot allocate memory' (errno=12) # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (malloc) failed to allocate 2863661056 bytes for committing reserved memory. # An error report file with more information is saved as: # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log During the execution there is a message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Connection from server-12.net/10.10.10.10:54701 closed
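One possible mitigation, sketched here as a workaround rather than a fix for the underlying allocation behavior, is the tree-shaped reduce in MLlib's RDDFunctions, which the reporter's script already imports; it merges partial results on the executors so the driver receives far fewer vectors:
{code}
// treeReduce combines partial results in multiple levels on the executors,
// so the driver no longer merges one partial result per partition by itself.
import org.apache.spark.mllib.rdd.RDDFunctions._
vv.treeReduce(_ + _, depth = 2)
{code}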
[jira] [Created] (SPARK-5654) Integrate SparkR into Apache Spark
Shivaram Venkataraman created SPARK-5654: Summary: Integrate SparkR into Apache Spark Key: SPARK-5654 URL: https://issues.apache.org/jira/browse/SPARK-5654 Project: Spark Issue Type: New Feature Reporter: Shivaram Venkataraman The SparkR project [1] provides a light-weight frontend to launch Spark jobs from R. The project was started at the AMPLab around a year ago and has been incubated as its own project to make sure it can be easily merged into upstream Spark, i.e. it does not introduce any external dependencies etc. SparkR's goals are similar to PySpark's, and it shares a similar design pattern as described in our meetup talk [2] and Spark Summit presentation [3]. Integrating SparkR into the Apache project will enable R users to use Spark out of the box and, given R's large user base, it will help the Spark project reach more users. Additionally, work-in-progress features like providing R integration with ML Pipelines and DataFrames can be better achieved by development in a unified code base. SparkR is available under the Apache 2.0 License and does not have any external dependencies other than requiring users to have R and Java installed on their machines. SparkR's developers come from many organizations including UC Berkeley, Alteryx and Intel, and we will support future development and maintenance after the integration. [1] https://github.com/amplab-extras/SparkR-pkg [2] http://files.meetup.com/3138542/SparkR-meetup.pdf [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2
[jira] [Commented] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve
[ https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282117#comment-14282117 ] Shivaram Venkataraman commented on SPARK-5246: -- Yes - this can be resolved. However, I can't seem to assign this to [~vgrigor]. Not sure if this needs some JIRA permissions. spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve -- Key: SPARK-5246 URL: https://issues.apache.org/jira/browse/SPARK-5246 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor How to reproduce: 1) http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html should be sufficient to set up a VPC for this bug. After you have followed that guide, start a new instance in the VPC and ssh to it (through the NAT server). 2) The user starts a cluster in the VPC: {code} ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript Setting up security groups... (omitted for brevity) 10.1.1.62 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop no org.apache.spark.deploy.master.Master to stop starting org.apache.spark.deploy.master.Master, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out failed to launch org.apache.spark.deploy.master.Master: at java.net.InetAddress.getLocalHost(InetAddress.java:1469) ... 12 more full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out 10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out 10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker: 10.1.1.62: at java.net.InetAddress.getLocalHost(InetAddress.java:1469) 10.1.1.62: ... 12 more 10.1.1.62: full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out [timing] spark-standalone setup: 00h 00m 28s (omitted for brevity) {code} /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out: {code} Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 --webui-port 8080 15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, HUP, INT] Exception in thread "main" java.net.UnknownHostException: ip-10-1-1-151: ip-10-1-1-151: Name or service not known at java.net.InetAddress.getLocalHost(InetAddress.java:1473) at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620) at org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612) at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612) at org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613) at org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613) at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665) at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.util.Utils$.localHostName(Utils.scala:665) at org.apache.spark.deploy.master.MasterArguments.<init>(MasterArguments.scala:27) at org.apache.spark.deploy.master.Master$.main(Master.scala:819) at org.apache.spark.deploy.master.Master.main(Master.scala) Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not known at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) at java.net.InetAddress.getLocalHost(InetAddress.java:1469) ... 12 more {code} The problem is that an instance launched in a VPC may not be able to resolve its own local hostname. Please see https://forums.aws.amazon.com/thread.jspa?threadID=92092. I am going to submit a fix for this problem since I need this functionality asap.
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277334#comment-14277334 ] Shivaram Venkataraman commented on SPARK-3821: -- Regarding the pre-built distributions, AFAIK we don't support full Hadoop2, as in YARN. We run CDH4, which has some parts of Hadoop2, but with MapReduce. There is an open PR to add support for Hadoop2 at https://github.com/mesos/spark-ec2/pull/77 and you can see that it gets the right [prebuilt Spark|https://github.com/mesos/spark-ec2/pull/77/files#diff-1d040c3294246f2b59643d63868fc2adR97] in that case. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Commented] (SPARK-1302) httpd doesn't start in spark-ec2 (cc2.8xlarge)
[ https://issues.apache.org/jira/browse/SPARK-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316541#comment-14316541 ] Shivaram Venkataraman commented on SPARK-1302: -- [~soid] Could you let us know which Spark version you were using to launch the cluster? The fix for spark-ec2 was merged into `branch-1.3` (and the master branch). httpd doesn't start in spark-ec2 (cc2.8xlarge) -- Key: SPARK-1302 URL: https://issues.apache.org/jira/browse/SPARK-1302 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 0.9.0 Reporter: Shivaram Venkataraman Priority: Minor In a cc2.8xlarge EC2 cluster launched from the master branch, httpd won't start (i.e. ganglia doesn't work). The reason seems to be that httpd.conf is wrong (newer httpd version?). The config file contains a bunch of non-existent modules, and this happens because we overwrite the default conf with our config file from spark-ec2. We could explore using patch or something like that to just apply the diff we need.
[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster
[ https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325354#comment-14325354 ] Shivaram Venkataraman commented on SPARK-5629: -- This sounds fine to me, and I really like YAML -- does Python have native support for printing out YAML? One thing we should probably do is mark this as experimental, as we might not be able to maintain backwards compatibility etc. (On that note, are YAML parsers backwards compatible? i.e. if we add a new field in the next release, will it break things?) Add spark-ec2 action to return info about an existing cluster - Key: SPARK-5629 URL: https://issues.apache.org/jira/browse/SPARK-5629 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor You can launch multiple clusters using spark-ec2. At some point, you might just want to get some information about an existing cluster. Use cases include: * Wanting to check something about your cluster in the EC2 web console. * Wanting to feed information about your cluster to another tool (e.g. as described in [SPARK-5627]). So, in addition to the [existing actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]: * {{launch}} * {{destroy}} * {{login}} * {{stop}} * {{start}} * {{get-master}} * {{reboot-slaves}} We add a new action, {{describe}}, which describes an existing cluster if given a cluster name, and all clusters if not. Some examples: {code} # describes all clusters launched by spark-ec2 spark-ec2 describe {code} {code} # describes cluster-1 spark-ec2 describe cluster-1 {code} In combination with the proposal in [SPARK-5627]: {code} # describes cluster-3 in a machine-readable way (e.g. JSON) spark-ec2 describe cluster-3 --machine-readable {code} Parallels in similar tools include: * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju * [{{starcluster listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node] from MIT StarCluster
[jira] [Commented] (SPARK-757) Deserialization Exception partway into long running job with Netty - MLbase
[ https://issues.apache.org/jira/browse/SPARK-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324687#comment-14324687 ] Shivaram Venkataraman commented on SPARK-757: - Since the shuffle implementation has changed recently, I think this can be marked as obsolete. Deserialization Exception partway into long running job with Netty - MLbase --- Key: SPARK-757 URL: https://issues.apache.org/jira/browse/SPARK-757 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.0 Reporter: Evan Sparks Assignee: Shivaram Venkataraman Attachments: imgnet_shiv6.log.gz, joblogs.tgz, newerlogs.tgz, newworklog.log, shivlogs.tgz Using Netty for communication, I see some deserialization errors that are crashing my job about 30% of the way through an iterative 10-step job. This happens reliably around the same point of the job after multiple attempts. Logs on master and a couple of affected workers attached per request from Shivaram. 13/05/31 23:19:12 INFO cluster.TaskSetManager: Serialized task 11.0:454 as 3414 bytes in 0 ms 13/05/31 23:19:14 INFO cluster.TaskSetManager: Finished TID 11344 in 55289 ms (progress: 312/1000) 13/05/31 23:19:14 INFO scheduler.DAGScheduler: Completed ResultTask(11, 344) 13/05/31 23:19:14 INFO cluster.ClusterScheduler: parentName:,name:TaskSet_11,runningTasks:143 13/05/31 23:19:14 INFO cluster.TaskSetManager: Starting task 11.0:455 as TID 11455 on slave 8: ip-10-60-217-218.ec2.internal:56262 (NODE_LOCAL) 13/05/31 23:19:14 INFO cluster.TaskSetManager: Serialized task 11.0:455 as 3414 bytes in 0 ms 13/05/31 23:19:17 INFO cluster.TaskSetManager: Lost TID 11412 (task 11.0:412) 13/05/31 23:19:17 INFO cluster.TaskSetManager: Loss was due to java.io.EOFException java.io.EOFException at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2322) at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2791) at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:798) at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298) at spark.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:18) at spark.JavaDeserializationStream.<init>(JavaSerializer.scala:18) at spark.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:53) at spark.storage.BlockManager.dataDeserialize(BlockManager.scala:925) at spark.storage.BlockFetcherIterator$NettyBlockFetcherIterator$$anonfun$5.apply(BlockFetcherIterator.scala:279) at spark.storage.BlockFetcherIterator$NettyBlockFetcherIterator$$anonfun$5.apply(BlockFetcherIterator.scala:279) at spark.storage.BlockFetcherIterator$NettyBlockFetcherIterator.next(BlockFetcherIterator.scala:318) at spark.storage.BlockFetcherIterator$NettyBlockFetcherIterator.next(BlockFetcherIterator.scala:239) at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440) at spark.util.CompletionIterator.hasNext(CompletionIterator.scala:9) at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457) at scala.collection.Iterator$class.foreach(Iterator.scala:772) at scala.collection.Iterator$$anon$22.foreach(Iterator.scala:451) at spark.Aggregator.combineCombinersByKey(Aggregator.scala:33) at spark.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:72) at spark.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:72) at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19) at spark.RDD.computeOrReadCheckpoint(RDD.scala:220) at spark.RDD.iterator(RDD.scala:209) at spark.scheduler.ResultTask.run(ResultTask.scala:84) at spark.executor.Executor$TaskRunner.run(Executor.scala:104) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679)
[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster
[ https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324907#comment-14324907 ] Shivaram Venkataraman commented on SPARK-5629: -- Is there an example output for `describe` that you have in mind? Also, I am not sure it'll be easy to list all the clusters, as spark-ec2 looks up clusters by the security group / cluster id. Add spark-ec2 action to return info about an existing cluster - Key: SPARK-5629 URL: https://issues.apache.org/jira/browse/SPARK-5629 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor You can launch multiple clusters using spark-ec2. At some point, you might just want to get some information about an existing cluster. Use cases include: * Wanting to check something about your cluster in the EC2 web console. * Wanting to feed information about your cluster to another tool (e.g. as described in [SPARK-5627]). So, in addition to the [existing actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]: * {{launch}} * {{destroy}} * {{login}} * {{stop}} * {{start}} * {{get-master}} * {{reboot-slaves}} We add a new action, {{describe}}, which describes an existing cluster if given a cluster name, and all clusters if not. Some examples: {code} # describes all clusters launched by spark-ec2 spark-ec2 describe {code} {code} # describes cluster-1 spark-ec2 describe cluster-1 {code} In combination with the proposal in [SPARK-5627]: {code} # describes cluster-3 in a machine-readable way (e.g. JSON) spark-ec2 describe cluster-3 --machine-readable {code} Parallels in similar tools include: * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju * [{{starcluster listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node] from MIT StarCluster
[jira] [Commented] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes
[ https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272830#comment-14272830 ] Shivaram Venkataraman commented on SPARK-5008: -- Hmm, I think https://github.com/mesos/spark-ec2/pull/66 probably broke this in some way. We made some tweaks to keep spark-ec2 backwards compatible by symlinking /vol3 to /vol -- however, I think the new behavior is now broken, as persistent-hdfs expects /vol to exist and can't find it. I think one fix might be to create a symlink from /vol0 to /vol if /vol3 doesn't exist -- or we could also change core-site.xml in persistent-hdfs to pick up all the volumes. Persistent HDFS does not recognize EBS Volumes -- Key: SPARK-5008 URL: https://issues.apache.org/jira/browse/SPARK-5008 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script. -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 --ebs-vol-num 1 Reporter: Brad Willard The cluster is built with correct-size EBS volumes. It creates the volume at /dev/xvds, and it is mounted to /vol0. However, when you start persistent hdfs with the start-all script, it starts but it isn't correctly configured to use the EBS volume. I'm assuming some symlinks or expected mounts are not correctly configured. This has worked flawlessly on all previous versions of Spark. I have a stupid workaround by installing pssh and mucking with it by mounting it to /vol, which worked; however it does not work between restarts.
[jira] [Commented] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267296#comment-14267296 ] Shivaram Venkataraman commented on SPARK-5122: -- Yes, I think removing Shark should be fine. We can also get rid of the Spark-to-Shark version map in spark_ec2.py Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} anymore. (?)
[jira] [Updated] (SPARK-4948) Use pssh instead of bash-isms and remove unnecessary operations
[ https://issues.apache.org/jira/browse/SPARK-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-4948: - Assignee: Nicholas Chammas Use pssh instead of bash-isms and remove unnecessary operations --- Key: SPARK-4948 URL: https://issues.apache.org/jira/browse/SPARK-4948 Project: Spark Issue Type: Sub-task Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.3.0 Remove unnecessarily high sleep times in {{setup.sh}}, as well as unnecessary SSH calls to pre-approve keys. Replace bash-isms like {{while ... command ... wait}} with {{pssh}}.
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276505#comment-14276505 ] Shivaram Venkataraman commented on SPARK-3821: -- [~nchammas] Yes -- That sounds good Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template.
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313269#comment-14313269 ] Shivaram Venkataraman commented on SPARK-5676: -- Yes - it is managed as a self-contained project. However, bugs in that project are often experienced by Spark users, so we end up with issues created here. I think filing these issues under the EC2 component is a fine thing to do, as it does affect Spark usage on EC2. License missing from spark-ec2 repo --- Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main spark repo). It would be great to fix this (sooner better than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo
[ https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313210#comment-14313210 ] Shivaram Venkataraman commented on SPARK-5676: -- Just to clarify some things: - Having a separate repo for cluster launch scripts was a conscious decision, in order to separate out release-level changes from runtime settings (like Ganglia config etc.) - Though the repository exists in mesos/spark-ec2, AFAIK it is only used by the Spark EC2 scripts. In fact, we do track some bugs in that repo using issues in the Spark JIRA. - However, from what I can see, I don't think the repository's license affects the Spark project in any way. It is not distributed as a part of any artifact we build, and EC2 support is in itself a strictly optional component. That said, it is good to have a LICENSE file, so we will add one. License missing from spark-ec2 repo --- Key: SPARK-5676 URL: https://issues.apache.org/jira/browse/SPARK-5676 Project: Spark Issue Type: Bug Components: EC2 Reporter: Florian Verhein There is no LICENSE file or license headers in the code in the spark-ec2 repo. Also, I believe there is no contributor license agreement notification in place (like there is in the main spark repo). It would be great to fix this (sooner better than later, while the contributors list is small), so that users wishing to use this part of Spark are not in doubt over licensing issues.
[jira] [Commented] (SPARK-6231) Join on two tables (generated from same one) is broken
[ https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372447#comment-14372447 ] Shivaram Venkataraman commented on SPARK-6231: -- [~marmbrus] I've sent the dataset to you by email. The code that used to cause this bug is at https://gist.github.com/shivaram/4ff0a9c226dda2030507 Join on two tables (generated from same one) is broken -- Key: SPARK-6231 URL: https://issues.apache.org/jira/browse/SPARK-6231 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Blocker Labels: DataFrame If the two columns used in the joinExpr come from the same table, they have the same expression id, and then the joinExpr is resolved in the wrong way. {code} val df = sqlContext.load(path, "parquet") val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns")) val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend")) val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner") scala> rmJoin.explain == Physical Plan == CartesianProduct Filter (cust_id#0 = cust_id#0) Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS txns#7L] Exchange (HashPartitioning [cust_id#0], 200) Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS partialSets#25] PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at newParquet.scala:542 Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8] Exchange (HashPartitioning [cust_id#17], 200) Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS PartialSum#38] PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at newParquet.scala:542 {code}
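While the bug stands, a common workaround (sketched here under the assumption that renaming the key on one side is acceptable) is to make the two join columns stop sharing an expression id:
{code}
// Rename the key on one side so the join condition refers to two distinct
// attributes instead of two copies of cust_id#0.
val spendRenamed = spend.withColumnRenamed("cust_id", "spend_cust_id")
val rmJoin = txns.join(spendRenamed, txns("cust_id") === spendRenamed("spend_cust_id"), "inner")
{code}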
[jira] [Commented] (SPARK-6246) spark-ec2 can't handle clusters with 100 nodes
[ https://issues.apache.org/jira/browse/SPARK-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355403#comment-14355403 ] Shivaram Venkataraman commented on SPARK-6246: -- Hmm - This seems like a bad problem. And it looks like an AWS-side change rather than a boto change, I guess. [~nchammas] Similar to the EC2Box issue above, can we also batch calls to `get_instances` 100 instances at a time? spark-ec2 can't handle clusters with 100 nodes Key: SPARK-6246 URL: https://issues.apache.org/jira/browse/SPARK-6246 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.3.0 Reporter: Nicholas Chammas Priority: Minor This appears to be a new restriction, perhaps resulting from our upgrade of boto. Maybe it's a new restriction from EC2. Not sure yet. We didn't have this issue around the Spark 1.1.0 time frame from what I can remember. I'll track down where the issue is and when it started. Attempting to launch a cluster with 100 slaves yields the following: {code}
Spark AMI: ami-35b1885c
Launching instances...
Launched 100 slaves in us-east-1c, regid = r-9c408776
Launched master in us-east-1c, regid = r-92408778
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state.ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
Traceback (most recent call last):
  File "./ec2/spark_ec2.py", line 1338, in <module>
    main()
  File "./ec2/spark_ec2.py", line 1330, in main
    real_main()
  File "./ec2/spark_ec2.py", line 1170, in real_main
    cluster_state='ssh-ready'
  File "./ec2/spark_ec2.py", line 795, in wait_for_cluster_state
    statuses = conn.get_all_instance_status(instance_ids=[i.id for i in cluster_instances])
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 737, in get_all_instance_status
    InstanceStatusSet, verb='POST')
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1204, in get_object
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
{code} This problem seems to be with {{get_all_instance_status()}}, though I am not sure if other methods are affected too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
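A minimal sketch of the batching fix suggested in the comment above, under the assumption that we simply chunk the instance-ID list before calling boto; the helper name is hypothetical and not the actual spark_ec2.py change. {code}
def get_all_statuses_batched(conn, cluster_instances, batch_size=100):
    # conn is a boto EC2Connection; cluster_instances is a list of boto instances.
    # Each get_all_instance_status() call stays under the 100-ID limit EC2 enforces.
    statuses = []
    instance_ids = [i.id for i in cluster_instances]
    for start in range(0, len(instance_ids), batch_size):
        batch = instance_ids[start:start + batch_size]
        statuses.extend(conn.get_all_instance_status(instance_ids=batch))
    return statuses
{code} wait_for_cluster_state() would then call this helper instead of passing the full ID list in one request.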
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352317#comment-14352317 ] Shivaram Venkataraman commented on SPARK-5134: -- Yeah so this did change in 1.2 and I think I mentioned it to Patrick when it affected a couple of other projects of mine. The main problem there was that even if you have an explicit Hadoop 1 dependency in your project, SBT picks up the highest version required while building an assembly jar for the project -- Thus with Spark linked against Hadoop 2.2, one would require an exclusion rule to use Hadoop 1. It might be good to add this to the docs or to some of the example Quick Start documentation we have Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352514#comment-14352514 ] Shivaram Venkataraman commented on SPARK-6220: -- Seems like a good idea and the syntax sounds good to me. Just curious: Are these the only two boto calls we use? Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code}
spark-ec2 \
    ...
    --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
    --ec2-instance-option ebs_optimized=True
{code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
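One plausible shape for the pass-through logic, sketched in Python; the parsing helper and its literal handling are illustrative assumptions, not the final spark-ec2 implementation. {code}
import ast

def parse_ec2_options(option_strings):
    # Turn ["ebs_optimized=True", "instance_initiated_shutdown_behavior=terminate"]
    # into {"ebs_optimized": True, "instance_initiated_shutdown_behavior": "terminate"}.
    kwargs = {}
    for opt in option_strings:
        key, _, value = opt.partition("=")
        try:
            # Interpret Python literals such as True/False/123; fall back to a raw string.
            kwargs[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            kwargs[key] = value
    return kwargs

# The resulting dict would then be splatted into the underlying boto call, e.g.
# conn.run_instances(image_id, **parse_ec2_options(opts.ec2_instance_options))
{code}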
[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+
[ https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352374#comment-14352374 ] Shivaram Venkataraman commented on SPARK-5134: -- Yeah, if you exclude Spark's Hadoop dependency, things work correctly for Hadoop 1. There are some additional issues that come up in 1.2 due to the Guava changes, but those are not related to the default Hadoop version change. I think the documentation to update would be [1], but I am thinking it would be good to mention this in the Quick Start guide [2] as well. [1] https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/docs/hadoop-third-party-distributions.md#linking-applications-to-the-hadoop-version [2] https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/docs/quick-start.md#self-contained-applications Bump default Hadoop version to 2+ - Key: SPARK-5134 URL: https://issues.apache.org/jira/browse/SPARK-5134 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor [~srowen] and I discussed bumping [the default hadoop version in the parent POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122] from {{1.0.4}} to something more recent. There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6088) UI is malformed when tasks fetch remote results
[ https://issues.apache.org/jira/browse/SPARK-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6088: - Attachment: Screenshot 2015-02-28 18.24.42.png UI is malformed when tasks fetch remote results --- Key: SPARK-6088 URL: https://issues.apache.org/jira/browse/SPARK-6088 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Attachments: Screenshot 2015-02-28 18.24.42.png There are two issues when tasks get remote results: (1) The status never changes from GET_RESULT to SUCCEEDED (2) The time to get the result is shown as the absolute time (resulting in a nonsensical output that says getting the result took 1 million hours) rather than the elapsed time cc [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6089) Size of task result fetched can't be found in UI
Shivaram Venkataraman created SPARK-6089: Summary: Size of task result fetched can't be found in UI Key: SPARK-6089 URL: https://issues.apache.org/jira/browse/SPARK-6089 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Shivaram Venkataraman When you do a large collect the amount of data fetched as task result from each task is not present in the WebUI. We should make this appear under the 'Output' column (both per-task and in executor-level aggregation) [cc ~kayousterhout] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6088) UI is malformed when tasks fetch remote results
[ https://issues.apache.org/jira/browse/SPARK-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341896#comment-14341896 ] Shivaram Venkataraman commented on SPARK-6088: -- For some reason, the get-result time is also included in the Scheduler Delay. The attached screenshot shows a get result that took 33 mins and how this shows up in the scheduler delay. UI is malformed when tasks fetch remote results --- Key: SPARK-6088 URL: https://issues.apache.org/jira/browse/SPARK-6088 Project: Spark Issue Type: Bug Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Attachments: Screenshot 2015-02-28 18.24.42.png There are two issues when tasks get remote results: (1) The status never changes from GET_RESULT to SUCCEEDED (2) The time to get the result is shown as the absolute time (resulting in a nonsensical output that says getting the result took 1 million hours) rather than the elapsed time cc [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
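The elapsed-versus-absolute confusion in bug (2) comes down to subtracting the wrong baseline; a minimal illustration in Python, with hypothetical field names (the actual UI code differs). {code}
def get_result_duration_ms(result_fetch_start_ms, finish_time_ms):
    # Correct: the elapsed time between starting the remote fetch and finishing.
    return finish_time_ms - result_fetch_start_ms
    # The buggy behavior amounts to rendering finish_time_ms itself as a duration,
    # which turns an epoch timestamp into output like "1 million hours".
{code}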
[jira] [Updated] (SPARK-6089) Size of task result fetched can't be found in UI
[ https://issues.apache.org/jira/browse/SPARK-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6089: - Description: When you do a large collect the amount of data fetched as task result from each task is not present in the WebUI. We should make this appear under the 'Output' column (both per-task and in executor-level aggregation) cc [~kayousterhout] was: When you do a large collect the amount of data fetched as task result from each task is not present in the WebUI. We should make this appear under the 'Output' column (both per-task and in executor-level aggregation) [cc ~kayousterhout] Size of task result fetched can't be found in UI Key: SPARK-6089 URL: https://issues.apache.org/jira/browse/SPARK-6089 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Shivaram Venkataraman When you do a large collect the amount of data fetched as task result from each task is not present in the WebUI. We should make this appear under the 'Output' column (both per-task and in executor-level aggregation) cc [~kayousterhout] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6881. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5493 [https://github.com/apache/spark/pull/5493] Change the checkpoint directory name from checkpoints to checkpoint --- Key: SPARK-6881 URL: https://issues.apache.org/jira/browse/SPARK-6881 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Hao Priority: Trivial Fix For: 1.4.0 The name {{checkpoint}}, not {{checkpoints}}, is the one included in .gitignore, so the directory name should match. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6818) Support column deletion in SparkR DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6818: - Assignee: Sun Rui Support column deletion in SparkR DataFrame API --- Key: SPARK-6818 URL: https://issues.apache.org/jira/browse/SPARK-6818 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman Assignee: Sun Rui Fix For: 1.4.0 We should support deleting columns using traditional R syntax i.e. something like df$age <- NULL should delete the `age` column -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6818) Support column deletion in SparkR DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6818. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5655 [https://github.com/apache/spark/pull/5655] Support column deletion in SparkR DataFrame API --- Key: SPARK-6818 URL: https://issues.apache.org/jira/browse/SPARK-6818 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman Fix For: 1.4.0 We should support deleting columns using traditional R syntax i.e. something like df$age <- NULL should delete the `age` column -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6797) Add support for YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6797: - Assignee: Sun Rui Add support for YARN cluster mode - Key: SPARK-6797 URL: https://issues.apache.org/jira/browse/SPARK-6797 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Assignee: Sun Rui Priority: Critical SparkR currently does not work in YARN cluster mode as the R package is not shipped along with the assembly jar to the YARN AM. We could try to use the support for archives in YARN to send out the R package as a zip file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7033) Use JavaRDD.partitions() instead of JavaRDD.splits()
[ https://issues.apache.org/jira/browse/SPARK-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-7033. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5628 [https://github.com/apache/spark/pull/5628] Use JavaRDD.partitions() instead of JavaRDD.splits() Key: SPARK-7033 URL: https://issues.apache.org/jira/browse/SPARK-7033 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Minor Fix For: 1.4.0 In numPartitions(), JavaRDD.splits() is called to get the number of partitions in an RDD. But JavaRDD.splits() is deprecated. Use JavaRDD.partitions() instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6852) Accept numeric as numPartitions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6852. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5613 [https://github.com/apache/spark/pull/5613] Accept numeric as numPartitions in SparkR - Key: SPARK-6852 URL: https://issues.apache.org/jira/browse/SPARK-6852 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Davies Liu Priority: Critical Fix For: 1.4.0 All the API should accept numeric as numPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6852) Accept numeric as numPartitions in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6852: - Assignee: Sun Rui Accept numeric as numPartitions in SparkR - Key: SPARK-6852 URL: https://issues.apache.org/jira/browse/SPARK-6852 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Davies Liu Assignee: Sun Rui Priority: Critical Fix For: 1.4.0 All the API should accept numeric as numPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6824) Fill the docs for DataFrame API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6824: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-7228 Fill the docs for DataFrame API in SparkR - Key: SPARK-6824 URL: https://issues.apache.org/jira/browse/SPARK-6824 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Priority: Blocker Some of the DataFrame functions in SparkR do not have complete roxygen docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6815) Support accumulators in R
[ https://issues.apache.org/jira/browse/SPARK-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6815: - Target Version/s: 1.5.0 (was: 1.4.0) Support accumulators in R - Key: SPARK-6815 URL: https://issues.apache.org/jira/browse/SPARK-6815 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor SparkR doesn't support accumulators right now. It might be good to add support for this to get feature parity with PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6825) Data sources implementation to support `sequenceFile`
[ https://issues.apache.org/jira/browse/SPARK-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6825: - Target Version/s: 1.5.0 (was: 1.4.0) Data sources implementation to support `sequenceFile` - Key: SPARK-6825 URL: https://issues.apache.org/jira/browse/SPARK-6825 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman SequenceFiles are a widely used input format and right now they are not supported in SparkR. It would be good to add support for SequenceFiles by implementing a new data source that can create a DataFrame from a SequenceFile. However as SequenceFiles can have arbitrary types, we probably need to map them to User-defined types in SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
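For reference, PySpark already exposes roughly the capability SPARK-6825 asks for, which makes it a reasonable model for the SparkR data source; a rough sketch, where the path and column names are placeholders. {code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="seqfile-example")
sqlContext = SQLContext(sc)

# sequenceFile() deserializes Writable keys/values into an RDD of (key, value) pairs.
pairs = sc.sequenceFile("hdfs:///path/to/seqfile")
# Promote the pair RDD to a DataFrame; arbitrary value types would need UDT mapping.
df = sqlContext.createDataFrame(pairs, ["key", "value"])
{code}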
[jira] [Updated] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6816: - Target Version/s: 1.5.0 (was: 1.4.0) Add SparkConf API to configure SparkR - Key: SPARK-6816 URL: https://issues.apache.org/jira/browse/SPARK-6816 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the only way to configure SparkR is to pass in arguments to sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
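For comparison, this is the SparkConf pattern on the Python side, which the proposed SparkR API would presumably mirror; settings shown are just examples. {code}
from pyspark import SparkConf, SparkContext

# Build configuration with a chainable object instead of init arguments.
conf = (SparkConf()
        .setAppName("my-app")
        .setMaster("local[2]")
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)
{code}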
[jira] [Updated] (SPARK-6838) Explore using Reference Classes instead of S4 objects
[ https://issues.apache.org/jira/browse/SPARK-6838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6838: - Target Version/s: 1.5.0 (was: 1.4.0) Explore using Reference Classes instead of S4 objects - Key: SPARK-6838 URL: https://issues.apache.org/jira/browse/SPARK-6838 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor The current RDD and PipelinedRDD are represented as S4 objects. R has a newer OO system: Reference Classes (RC or R5). It is a more message-passing style of OO, and instances are mutable objects. This is not an important issue, and it should require only trivial work. It could also remove the somewhat awkward @ operator in S4. R6 is also worth checking out; it feels closer to an ordinary object-oriented language. https://github.com/wch/R6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6803: - Target Version/s: 1.5.0 (was: 1.4.0) [SparkR] Support SparkR Streaming - Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6833) Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.
[ https://issues.apache.org/jira/browse/SPARK-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6833: - Target Version/s: 1.5.0 (was: 1.4.0) Extend `addPackage` so that any given R file can be sourced in the worker before functions are run. --- Key: SPARK-6833 URL: https://issues.apache.org/jira/browse/SPARK-6833 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Similar to how extra Python files or packages can be specified (in zip / egg formats), it would be good to support the ability to add extra R files to the executors' working directory. One thing that needs to be investigated is whether this will just work out of the box using the spark-submit flag --files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6813) SparkR style guide
[ https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6813: - Target Version/s: 1.5.0 (was: 1.4.0) SparkR style guide -- Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on some of the guidelines we use and some of the best practices in R. Some examples of R style guides are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on an automatic style-checking tool. https://github.com/jimhester/lintr seems promising. We could have an R style guide based on the one from Google [1], adjusting some of the rules based on the conversations in Spark: 1. Line length: maximum 100 characters 2. No limit on function names (the API should be similar to other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6809) Make numPartitions optional in pairRDD APIs
[ https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6809: - Priority: Major (was: Critical) Make numPartitions optional in pairRDD APIs --- Key: SPARK-6809 URL: https://issues.apache.org/jira/browse/SPARK-6809 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6820: - Priority: Critical (was: Major) Convert NAs to null type in SparkR DataFrames - Key: SPARK-6820 URL: https://issues.apache.org/jira/browse/SPARK-6820 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman Priority: Critical While converting RDD or local R DataFrame to a SparkR DataFrame we need to handle missing values or NAs. We should convert NAs to SparkSQL's null type to handle the conversion correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6799) Add dataframe examples for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6799: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-7228 Add dataframe examples for SparkR - Key: SPARK-6799 URL: https://issues.apache.org/jira/browse/SPARK-6799 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical We should add more data frame usage examples for SparkR . This can be similar to the python examples at https://github.com/apache/spark/blob/1b2aab8d5b9cc2ff702506038bd71aa8debe7ca0/examples/src/main/python/sql.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6809) Make numPartitions optional in pairRDD APIs
[ https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6809: - Target Version/s: 1.5.0 (was: 1.4.0) Make numPartitions optional in pairRDD APIs --- Key: SPARK-6809 URL: https://issues.apache.org/jira/browse/SPARK-6809 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6826) `hashCode` support for arbitrary R objects
[ https://issues.apache.org/jira/browse/SPARK-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6826: - Target Version/s: 1.5.0 (was: 1.4.0) `hashCode` support for arbitrary R objects -- Key: SPARK-6826 URL: https://issues.apache.org/jira/browse/SPARK-6826 Project: Spark Issue Type: Bug Components: SparkR Reporter: Shivaram Venkataraman From the SparkR JIRA: digest::digest looks interesting, but it seems to be more heavyweight than our requirements. One relatively easy way to do this is to serialize the given R object into a string (serialize(object, ascii=T)) and then just call the string hashCode function on it. FWIW it looks like digest follows a similar strategy, where the md5sum / shasum etc. are calculated on serialized objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
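A sketch of that serialize-then-hash strategy, written in Python purely for illustration: SparkR would use R's serialize() plus the string hashCode rather than pickle and md5, and this assumes serialization is deterministic for the objects involved. {code}
import hashlib
import pickle

def object_hash(obj):
    # Serialize the object to a byte string, then hash the bytes; this mirrors
    # hashing the output of serialize(object, ascii=T) in R.
    return int(hashlib.md5(pickle.dumps(obj)).hexdigest(), 16)
{code}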
[jira] [Created] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
Shivaram Venkataraman created SPARK-7230: Summary: Make RDD API private in SparkR for Spark 1.4 Key: SPARK-7230 URL: https://issues.apache.org/jira/browse/SPARK-7230 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Critical This ticket proposes making the RDD API in SparkR private for the 1.4 release. The motivation for doing so is discussed in a larger design document aimed at a more top-down design of the SparkR APIs. A first cut that discusses motivation and proposed changes can be found at http://goo.gl/GLHKZI The main points in that document that relate to this ticket are: - The RDD API requires knowledge of the distributed system and is pretty low level. This is not very suitable for a number of R users who are used to more high-level packages that work out of the box. - The RDD implementation in SparkR is not fully robust right now: we are missing features like spilling for aggregation, handling partitions which don't fit in memory etc. There are further limitations like lack of hashCode for non-native types etc. which might affect user experience. The only change we will make for now is to not export the RDD functions as public methods in the SparkR package, and I will create another ticket for discussing the public API for 1.5 in more detail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6814) Support sorting for any data type in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6814: - Target Version/s: 1.5.0 (was: 1.4.0) Support sorting for any data type in SparkR --- Key: SPARK-6814 URL: https://issues.apache.org/jira/browse/SPARK-6814 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical I get various "return status == 0 is false" and "unimplemented type" errors trying to get data out of any RDD with top() or collect(). The errors are not consistent. I think Spark is installed properly because some operations do work. I apologize if I'm missing something easy or not providing the right diagnostic info – I'm new to SparkR, and this seems to be the only resource for SparkR issues. Some logs: {code}
Browse[1]> top(estep.rdd, 1L)
Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> <Anonymous> -> func -> FUN -> FUN -> order
Execution halted
15/02/13 19:11:57 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.SparkException: R computation failed with
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> <Anonymous> -> func -> FUN -> FUN -> order
Execution halted
	at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.scheduler.Task.run(Task.scala:54)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
15/02/13 19:11:57 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, localhost): org.apache.spark.SparkException: R computation failed with
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> <Anonymous> -> func -> FUN -> FUN -> order
Execution halted
        edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7228) SparkR public API for 1.4 release
Shivaram Venkataraman created SPARK-7228: Summary: SparkR public API for 1.4 release Key: SPARK-7228 URL: https://issues.apache.org/jira/browse/SPARK-7228 Project: Spark Issue Type: Umbrella Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Critical This in an umbrella ticket to track the public APIs and documentation to be released as a part of SparkR in the 1.4 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6832) Handle partial reads in SparkR JVM to worker communication
[ https://issues.apache.org/jira/browse/SPARK-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6832: - Target Version/s: 1.5.0 (was: 1.4.0) Handle partial reads in SparkR JVM to worker communication -- Key: SPARK-6832 URL: https://issues.apache.org/jira/browse/SPARK-6832 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor After we move to using a socket between the R worker and the JVM, it's possible that readBin() in R will return partial results (for example, when interrupted by a signal). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
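The standard remedy is a read loop that retries until the requested byte count arrives; a Python sketch of the pattern (the SparkR fix would wrap readBin() in the same kind of loop). {code}
def read_exactly(sock, n):
    # Keep reading until exactly n bytes have been received; a single recv()
    # (like a single readBin()) may legitimately return fewer bytes.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("connection closed after %d of %d bytes" % (len(buf), n))
        buf += chunk
    return buf
{code}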
[jira] [Updated] (SPARK-7226) Support math functions in R DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-7226: - Priority: Critical (was: Major) Support math functions in R DataFrame - Key: SPARK-7226 URL: https://issues.apache.org/jira/browse/SPARK-7226 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.4.0 Reporter: Reynold Xin Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org