[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958014#comment-13958014
 ] 

Shivaram Venkataraman commented on SPARK-1391:
--

Oh and yes, I'd be happy to test out any patch / WIP

 BlockManager cannot transfer blocks larger than 2G in size
 --

 Key: SPARK-1391
 URL: https://issues.apache.org/jira/browse/SPARK-1391
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Shuffle
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman

 If a task tries to remotely access a cached RDD block, I get an exception 
 when the block size is > 2G. The exception is pasted below.
 Memory capacities are huge these days (> 60G), and many workflows depend on 
 having large blocks in memory, so it would be good to fix this bug.
 I don't know if the same thing happens on shuffles if one transfer (from 
 mapper to reducer) is > 2G.
 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
 message
 java.lang.ArrayIndexOutOfBoundsException
 at 
 it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
 at 
 it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
 at 
 it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
 at 
 org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
 at 
 org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
 at 
 org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
 at 
 org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
 at 
 org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
 at 
 org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
 at 
 org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
 at 
 org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at 
 org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at 
 org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
 at 
 org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
 at 
 org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
 at 
 org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
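
The root cause is that JVM byte arrays and ByteBuffers are indexed by Int, so any code path that serializes a whole block into a single in-memory buffer (as the FastByteArrayOutputStream in the trace does) breaks once the write position passes Integer.MAX_VALUE. A minimal sketch, not taken from the Spark code base, of the constraint and of one common chunking workaround:

{code}
import java.nio.ByteBuffer

object TwoGigLimit {
  // A single JVM array or ByteBuffer is indexed by Int, so its capacity is
  // bounded by Int.MaxValue (~2 GB); serializing a larger block into one
  // buffer cannot work.
  def singleBuffer(size: Long): ByteBuffer = {
    require(size <= Int.MaxValue, s"cannot allocate one buffer of $size bytes")
    ByteBuffer.allocate(size.toInt)
  }

  // One possible workaround (illustrative only): represent a large block as a
  // sequence of fixed-size chunks instead of a single array.
  def chunkedBuffers(size: Long, chunkSize: Int = 64 * 1024 * 1024): Seq[ByteBuffer] = {
    val numChunks = ((size + chunkSize - 1) / chunkSize).toInt
    (0 until numChunks).map { i =>
      val len = math.min(chunkSize.toLong, size - i.toLong * chunkSize).toInt
      ByteBuffer.allocate(len)
    }
  }
}
{code}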



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958013#comment-13958013
 ] 

Shivaram Venkataraman commented on SPARK-1391:
--

I am not using any fastutil version explicitly. I am just using Spark's master 
branch from around March 23rd. (The exact commit I am synced to is 
https://github.com/apache/spark/commit/8265dc7739caccc59bc2456b2df055ca96337fe4)

 BlockManager cannot transfer blocks larger than 2G in size
 --

 Key: SPARK-1391
 URL: https://issues.apache.org/jira/browse/SPARK-1391
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Shuffle
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman

 If a task tries to remotely access a cached RDD block, I get an exception 
 when the block size is > 2G. The exception is pasted below.
 Memory capacities are huge these days (> 60G), and many workflows depend on 
 having large blocks in memory, so it would be good to fix this bug.
 I don't know if the same thing happens on shuffles if one transfer (from 
 mapper to reducer) is > 2G.
 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
 message
 java.lang.ArrayIndexOutOfBoundsException
 at 
 it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
 at 
 it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
 at 
 it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
 at 
 org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
 at 
 org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
 at 
 org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
 at 
 org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
 at 
 org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
 at 
 org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
 at 
 org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
 at 
 org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at 
 org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at 
 org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
 at 
 org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
 at 
 org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
 at 
 org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-04 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960100#comment-13960100
 ] 

Shivaram Venkataraman commented on SPARK-1391:
--

Thanks for the patch. I will try this out in the next couple of days and get 
back.

 BlockManager cannot transfer blocks larger than 2G in size
 --

 Key: SPARK-1391
 URL: https://issues.apache.org/jira/browse/SPARK-1391
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Shuffle
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman
Assignee: Min Zhou
 Attachments: SPARK-1391.diff


 If a task tries to remotely access a cached RDD block, I get an exception 
 when the block size is > 2G. The exception is pasted below.
 Memory capacities are huge these days (> 60G), and many workflows depend on 
 having large blocks in memory, so it would be good to fix this bug.
 I don't know if the same thing happens on shuffles if one transfer (from 
 mapper to reducer) is > 2G.
 {noformat}
 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
 message
 java.lang.ArrayIndexOutOfBoundsException
 at 
 it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
 at 
 it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
 at 
 it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
 at 
 org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
 at 
 org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
 at 
 org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
 at 
 org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
 at 
 org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
 at 
 org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
 at 
 org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
 at 
 org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at 
 org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at 
 org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
 at 
 org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
 at 
 org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
 at 
 org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1614) Move Mesos protobufs out of TaskState

2014-04-24 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-1614:


 Summary: Move Mesos protobufs out of TaskState
 Key: SPARK-1614
 URL: https://issues.apache.org/jira/browse/SPARK-1614
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 0.9.1
Reporter: Shivaram Venkataraman
Priority: Minor


To isolate usage of Mesos protobufs it would be good to move them out of 
TaskState into either a new class (MesosUtils ?) or 
CoarseGrainedMesos{Executor, Backend}.

This would allow applications that build Spark to run without including protobuf 
from Mesos in their shaded jars. This is one way to avoid protobuf conflicts 
between Mesos and Hadoop (https://issues.apache.org/jira/browse/MESOS-1203).




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2046) Support config properties that are changeable across tasks/stages within a job

2014-06-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019458#comment-14019458
 ] 

Shivaram Venkataraman commented on SPARK-2046:
--

FWIW I have an older implementation that did this using LocalProperties in 
SparkContext. 
https://github.com/shivaram/spark-1/commit/256a34c12d4f3c8ed1a09174f331868a7bf30e11
 

I haven't tested it in a setting with multiple jobs running at the same time, 
though.
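
For context on the LocalProperties mechanism mentioned above, here is a minimal sketch of the call pattern from user code. Whether the scheduler actually honors spark.task.cpus set this way is exactly what this JIRA and the linked commit are about; the sketch only shows how a per-stage override would be toggled.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object PerStageCpus {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-stage-cpus"))

    // Local properties are thread-local and propagated to all tasks of jobs
    // submitted from this thread, which is what makes per-stage overrides possible.
    // Whether spark.task.cpus is actually read from local properties is the
    // open question in this JIRA.
    sc.setLocalProperty("spark.task.cpus", "4")
    val expensive = sc.parallelize(1 to 1000, 8).map(x => x * x).count()

    // Drop back to the default for the lighter stages.
    sc.setLocalProperty("spark.task.cpus", "1")
    val cheap = sc.parallelize(1 to 1000, 8).filter(_ % 2 == 0).count()

    println(s"expensive=$expensive cheap=$cheap")
    sc.stop()
  }
}
{code}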

 Support config properties that are changeable across tasks/stages within a job
 --

 Key: SPARK-2046
 URL: https://issues.apache.org/jira/browse/SPARK-2046
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Zongheng Yang

 Suppose an application consists of multiple stages, where some stages contain 
 computation-intensive tasks, and other stages contain less 
 computation-intensive (or otherwise ordinary) tasks. 
 For such a job to run efficiently, it might make sense to provide the user a 
 way to set spark.task.cpus to a high number right before the 
 computation-intensive stages/tasks are generated in the user code, 
 and to set the property to a lower number for other stages/tasks.
 As a first step, supporting this feature at the stage level instead of the more 
 fine-grained task level might suffice. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations

2014-07-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065256#comment-14065256
 ] 

Shivaram Venkataraman commented on SPARK-2316:
--

I'd just like to add that in cases where we have many thousands of blocks, the 
code in this stack trace constantly occupies one core on the Master and is 
probably one of the reasons why the WebUI stops functioning after a certain point. 

 StorageStatusListener should avoid O(blocks) operations
 ---

 Key: SPARK-2316
 URL: https://issues.apache.org/jira/browse/SPARK-2316
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Andrew Or

 In the case where jobs are frequently causing dropped blocks, the storage 
 status listener can become a bottleneck. This is slow for a few reasons: one being 
 that we use Scala collection operations, the other being that we perform operations 
 that are O(number of blocks). I think using a few indices here could make 
 this much faster.
 {code}
  at java.lang.Integer.valueOf(Integer.java:642)
 at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
 at 
 org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82)
 at 
 scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328)
 at 
 scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327)
 at 
 scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
 at 
 scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
 at 
 scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
 at 
 scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
 at 
 scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327)
 at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105)
 at 
 org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82)
 at 
 org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56)
 at 
 org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67)
 - locked 0xa27ebe30 (a 
 org.apache.spark.ui.storage.StorageListener)
 {code}
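
A minimal sketch, not the actual fix, of the kind of indexing the description hints at: keep blocks indexed by RDD id and update the index incrementally, instead of re-grouping every block on each task end. RDDBlockId and BlockStatus here are simplified stand-ins for the real storage types.

{code}
import scala.collection.mutable

// Simplified stand-ins for the real storage types.
case class RDDBlockId(rddId: Int, splitIndex: Int)
case class BlockStatus(memSize: Long, diskSize: Long)

class IndexedStorageStatus {
  // Index blocks by RDD id so per-RDD summaries cost O(blocks of that RDD),
  // not O(all blocks) as a full groupBy on every task end would.
  private val blocksByRdd =
    mutable.HashMap.empty[Int, mutable.HashMap[RDDBlockId, BlockStatus]]

  def updateBlock(id: RDDBlockId, status: BlockStatus): Unit = {
    val perRdd = blocksByRdd.getOrElseUpdate(id.rddId, mutable.HashMap.empty)
    if (status.memSize == 0 && status.diskSize == 0) perRdd.remove(id)
    else perRdd(id) = status
  }

  def memUsedByRdd(rddId: Int): Long =
    blocksByRdd.get(rddId).map(_.values.map(_.memSize).sum).getOrElse(0L)
}
{code}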



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2563) Make number of connection retries configurable

2014-07-17 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-2563:


 Summary: Make number of connection retries configurable
 Key: SPARK-2563
 URL: https://issues.apache.org/jira/browse/SPARK-2563
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shivaram Venkataraman
Priority: Minor


In a large EC2 cluster, I often see the first shuffle stage in a job fail due 
to connection timeout exceptions. We should make the number of retries before 
failing configurable to handle these cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2563) Make number of connection retries configurable

2014-07-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065735#comment-14065735
 ] 

Shivaram Venkataraman commented on SPARK-2563:
--

https://github.com/apache/spark/pull/1471

 Make number of connection retries configurable
 --

 Key: SPARK-2563
 URL: https://issues.apache.org/jira/browse/SPARK-2563
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shivaram Venkataraman
Priority: Minor

 In a large EC2 cluster, I often see the first shuffle stage in a job fail due 
 to connection timeout exceptions. We should make the number of retries before 
 failing configurable to handle these cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations

2014-07-25 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074692#comment-14074692
 ] 

Shivaram Venkataraman commented on SPARK-2316:
--

On a related note, can we have flags to turn off some of the UI listeners? If 
the StorageTab is going to be too expensive to update, it'll be good to have a 
way to turn it off and just have the JobProgress show up in the UI.

 StorageStatusListener should avoid O(blocks) operations
 ---

 Key: SPARK-2316
 URL: https://issues.apache.org/jira/browse/SPARK-2316
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Andrew Or
Priority: Critical

 In the case where jobs are frequently causing dropped blocks, the storage 
 status listener can become a bottleneck. This is slow for a few reasons: one being 
 that we use Scala collection operations, the other being that we perform operations 
 that are O(number of blocks). I think using a few indices here could make 
 this much faster.
 {code}
  at java.lang.Integer.valueOf(Integer.java:642)
 at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
 at 
 org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82)
 at 
 scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328)
 at 
 scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327)
 at 
 scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
 at 
 scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
 at 
 scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
 at 
 scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
 at 
 scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327)
 at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105)
 at 
 org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82)
 at 
 org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56)
 at 
 org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67)
 - locked 0xa27ebe30 (a 
 org.apache.spark.ui.storage.StorageListener)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2563) Re-open sockets to handle connect timeouts

2014-07-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-2563:
-

Description: 
In a large EC2 cluster, I often see the first shuffle stage in a job fail due 
to connection timeout exceptions. 

 If the connection attempt times out, the socket gets closed and from [1] we 
get a ClosedChannelException.  We should check if the Socket was closed due to 
a timeout and open a new socket and try to connect. 

FWIW, I was able to work around my problems by increasing the number of SYN 
retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries)

[1] 
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573

  was:In a large EC2 cluster, I often see the first shuffle stage in a job fail 
due to connection timeout exceptions. We should make the number of retries 
before failing configurable to handle these cases.

Summary: Re-open sockets to handle connect timeouts  (was: Make number 
of connection retries configurable)

 Re-open sockets to handle connect timeouts
 --

 Key: SPARK-2563
 URL: https://issues.apache.org/jira/browse/SPARK-2563
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shivaram Venkataraman
Priority: Minor

 In a large EC2 cluster, I often see the first shuffle stage in a job fail due 
 to connection timeout exceptions. 
  If the connection attempt times out, the socket gets closed and from [1] we 
 get a ClosedChannelException.  We should check if the Socket was closed due 
 to a timeout and open a new socket and try to connect. 
 FWIW, I was able to work around my problems by increasing the number of SYN 
 retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries)
 [1] 
 http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573
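
A minimal sketch of the retry idea described above, written against a plain blocking java.net.Socket rather than Spark's NIO-based ConnectionManager; it only illustrates "if the connect times out, open a fresh socket and try again", not the actual patch.

{code}
import java.net.{InetSocketAddress, Socket, SocketTimeoutException}

object ConnectWithRetry {
  // Retry the connect with a brand new Socket on timeout, since a Socket
  // that failed to connect is closed and cannot be reused.
  def connect(host: String, port: Int, timeoutMs: Int, maxRetries: Int): Socket = {
    var attempt = 0
    while (true) {
      val socket = new Socket()
      try {
        socket.connect(new InetSocketAddress(host, port), timeoutMs)
        return socket
      } catch {
        case e: SocketTimeoutException =>
          socket.close()
          attempt += 1
          if (attempt > maxRetries) throw e
      }
    }
    throw new IllegalStateException("unreachable")
  }
}
{code}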



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2563) Re-open sockets to handle connect timeouts

2014-07-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065735#comment-14065735
 ] 

Shivaram Venkataraman edited comment on SPARK-2563 at 7/28/14 5:43 PM:
---

More details about the bug are at -https://github.com/apache/spark/pull/1471-


was (Author: shivaram):
https://github.com/apache/spark/pull/1471

 Re-open sockets to handle connect timeouts
 --

 Key: SPARK-2563
 URL: https://issues.apache.org/jira/browse/SPARK-2563
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shivaram Venkataraman
Priority: Minor

 In a large EC2 cluster, I often see the first shuffle stage in a job fail due 
 to connection timeout exceptions. 
  If the connection attempt times out, the socket gets closed and from [1] we 
 get a ClosedChannelException.  We should check if the Socket was closed due 
 to a timeout and open a new socket and try to connect. 
 FWIW, I was able to work around my problems by increasing the number of SYN 
 retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries)
 [1] 
 http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2723) Block Manager should catch exceptions in putValues

2014-07-28 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-2723:


 Summary: Block Manager should catch exceptions in putValues
 Key: SPARK-2723
 URL: https://issues.apache.org/jira/browse/SPARK-2723
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman


The BlockManager should catch exceptions encountered while writing out files to 
disk. Right now these exceptions get counted as user-level task failures and 
the job is aborted after failing 4 times. We should either fail the executor or 
handle this better to prevent the job from dying.

I ran into an issue where one disk on a large EC2 cluster failed and this 
resulted in a long-running job terminating. Longer term, we should also look at 
black-listing local directories when one of them becomes unusable?

Exception pasted below:

14/07/29 00:55:39 WARN scheduler.TaskSetManager: Loss was due to 
java.io.FileNotFoundException
java.io.FileNotFoundException: 
/mnt2/spark/spark-local-20140728175256-e7cb/28/broadcast_264_piece20 
(Input/output error)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:79)
at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:66)
at 
org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:847)
at 
org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:267)
at 
org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:256)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.storage.MemoryStore.ensureFreeSpace(MemoryStore.scala:256)
at org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:179)
at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:76)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:663)
at org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
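
A minimal sketch of the kind of handling the description asks for, not the real DiskStore code: wrap the write in a try/catch, report the failure at the storage layer instead of letting it surface as a user-level task failure, and remember directories that have thrown I/O errors (the black-listing idea). The reporting hook here is hypothetical.

{code}
import java.io.{File, FileOutputStream, IOException}
import scala.collection.mutable

class SafeDiskWriter {
  // Directories that have produced I/O errors; a real implementation would
  // stop allocating new blocks to these (the "black-listing" idea above).
  private val badDirs = mutable.Set.empty[File]

  def putBytes(dir: File, name: String, bytes: Array[Byte]): Boolean = {
    val file = new File(dir, name)
    try {
      val out = new FileOutputStream(file)
      try out.write(bytes) finally out.close()
      true
    } catch {
      case e: IOException =>
        badDirs += dir
        // Hypothetical hook: surface this as an executor/storage-level fault
        // rather than a user-level task failure.
        System.err.println(s"Failed to write block $name to $dir: ${e.getMessage}")
        false
    }
  }
}
{code}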



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2774) Set preferred locations for reduce tasks

2014-07-31 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-2774:


 Summary: Set preferred locations for reduce tasks
 Key: SPARK-2774
 URL: https://issues.apache.org/jira/browse/SPARK-2774
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shivaram Venkataraman


Currently we do not set preferred locations for reduce tasks in Spark. This 
patch proposes setting preferred locations based on the map output sizes and 
locations tracked by the MapOutputTracker. This is useful in two situations:

1. When you have a small job in a large cluster it can be useful to co-locate 
map and reduce tasks to avoid going over the network
2. If there is a lot of data skew in the map stage outputs, then it is 
beneficial to place the reducer close to the largest output.
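
A minimal sketch of the placement heuristic described above: given per-host output sizes for one reduce partition, prefer the hosts holding the most bytes. The data structures are simplified stand-ins; an actual patch would read these sizes from the MapOutputTracker.

{code}
object ReduceLocality {
  // For one reduce partition, pick the hosts that hold the largest share of
  // its map output, so the reducer is co-located with most of its input.
  def preferredLocations(bytesByHost: Map[String, Long],
                         numPreferred: Int = 2,
                         minFraction: Double = 0.2): Seq[String] = {
    val total = bytesByHost.values.sum.max(1L)
    bytesByHost.toSeq
      .filter { case (_, bytes) => bytes.toDouble / total >= minFraction }
      .sortBy { case (_, bytes) => -bytes }
      .take(numPreferred)
      .map(_._1)
  }
}

// Example: host2 holds most of the output for this partition, so it comes first.
// ReduceLocality.preferredLocations(Map("host1" -> 100L, "host2" -> 900L))
{code}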



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2950) Add GC time and Shuffle Write time to JobLogger output

2014-08-09 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-2950:


 Summary: Add GC time and Shuffle Write time to JobLogger output
 Key: SPARK-2950
 URL: https://issues.apache.org/jira/browse/SPARK-2950
 Project: Spark
  Issue Type: Improvement
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor


The JobLogger is very useful for performing offline performance profiling of 
Spark jobs. GC Time and Shuffle Write time are available in TaskMetrics but are 
currently missing from the JobLogger output. This change adds these two fields.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2950) Add GC time and Shuffle Write time to JobLogger output

2014-08-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-2950:
-

Fix Version/s: 1.2.0

 Add GC time and Shuffle Write time to JobLogger output
 --

 Key: SPARK-2950
 URL: https://issues.apache.org/jira/browse/SPARK-2950
 Project: Spark
  Issue Type: Improvement
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
 Fix For: 1.2.0


 The JobLogger is very useful for performing offline performance profiling of 
 Spark jobs. GC Time and Shuffle Write time are available in TaskMetrics but 
 are currently missing from the JobLogger output. This change adds these two 
 fields.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2950) Add GC time and Shuffle Write time to JobLogger output

2014-08-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-2950.
--

Resolution: Fixed

 Add GC time and Shuffle Write time to JobLogger output
 --

 Key: SPARK-2950
 URL: https://issues.apache.org/jira/browse/SPARK-2950
 Project: Spark
  Issue Type: Improvement
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
 Fix For: 1.2.0


 The JobLogger is very useful for performing offline performance profiling of 
 Spark jobs. GC Time and Shuffle Write time are available in TaskMetrics but 
 are currently missing from the JobLogger output. This change adds these two 
 fields.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-08-27 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14112855#comment-14112855
 ] 

Shivaram Venkataraman commented on SPARK-3215:
--

This looks very interesting -- one thing that would be very useful is to make 
the RPC interface language-agnostic. This would make it possible to submit 
Python or R jobs to a SparkContext without embedding a JVM in the driver 
process. Could we use Thrift or Protocol Buffers or something like that? 

Also it'll be great to make a tentative list of RPCs that are required to get a 
simple application to work.
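
Purely as an illustration of what such a tentative list might look like (none of this is taken from the attached proposal), here are a few operations a minimal remote-SparkContext interface would probably need:

{code}
// Hypothetical, minimal remote-context interface; the real proposal may use
// Akka, Thrift, or Protocol Buffers, and different operation names.
trait RemoteSparkContext {
  def createContext(appName: String, conf: Map[String, String]): String // returns a session id
  def submitJob(sessionId: String, serializedJob: Array[Byte]): String  // returns a job id
  def jobStatus(sessionId: String, jobId: String): String
  def fetchResult(sessionId: String, jobId: String): Array[Byte]
  def cancelJob(sessionId: String, jobId: String): Unit
  def stopContext(sessionId: String): Unit
}
{code}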

 Add remote interface for SparkContext
 -

 Key: SPARK-3215
 URL: https://issues.apache.org/jira/browse/SPARK-3215
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Marcelo Vanzin
  Labels: hive
 Attachments: RemoteSparkContext.pdf


 A quick description of the issue: as part of running Hive jobs on top of 
 Spark, it's desirable to have a SparkContext that is running in the 
 background and listening for job requests for a particular user session.
 Running multiple contexts in the same JVM is not a very good solution. Not 
 only does SparkContext currently have issues sharing the same JVM among multiple 
 instances, but doing so also turns the JVM running the contexts into a huge 
 bottleneck in the system.
 So I'm proposing a solution where we have a SparkContext that is running in a 
 separate process, and listening for requests from the client application via 
 some RPC interface (most probably Akka).
 I'll attach a document shortly with the current proposal. Let's use this bug 
 to discuss the proposal and any other suggestions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3659) Set EC2 version to 1.1.0 in master branch

2014-09-23 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-3659:


 Summary: Set EC2 version to 1.1.0 in master branch
 Key: SPARK-3659
 URL: https://issues.apache.org/jira/browse/SPARK-3659
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor


Master branch should be in sync with branch-1.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3674) Add support for launching YARN clusters in spark-ec2

2014-09-23 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-3674:


 Summary: Add support for launching YARN clusters in spark-ec2
 Key: SPARK-3674
 URL: https://issues.apache.org/jira/browse/SPARK-3674
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman


Right now spark-ec2 only supports launching Spark Standalone clusters. While 
this is sufficient for basic usage, it is hard to test features or do 
performance benchmarking on YARN. It would be good to add support for 
installing and configuring an Apache YARN cluster at a fixed version -- say the 
latest stable version, 2.4.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3522) Make spark-ec2 verbosity configurable

2014-09-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151348#comment-14151348
 ] 

Shivaram Venkataraman commented on SPARK-3522:
--

It would be good, but I think most of the output in spark-ec2 comes from the 
shell scripts that install things like HDFS, Spark etc. So this would be less 
of a Python logging change and more of a change to the shell scripts in spark-ec2.

Also the other thing to consider is that the output is often the only way to 
figure out what / why things went wrong during cluster launch. So it might be 
better to save it to a file (spark-ec2-cluster-name-launch.log) as sometimes 
re-running spark-ec2 with more logging could be expensive.

 Make spark-ec2 verbosity configurable
 -

 Key: SPARK-3522
 URL: https://issues.apache.org/jira/browse/SPARK-3522
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 When launching a cluster, {{spark-ec2}} spits out a lot of stuff that feels 
 like debug output. It would be better for the user if {{spark-ec2}} did the 
 following:
 * default to info output level
 * allow option to increase verbosity and include debug output
 This will require converting most of the {{print}} statements in the script 
 to use Python's {{logging}} module and setting output levels ({{INFO}}, 
 {{WARN}}, {{DEBUG}}) for each statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2008) Enhance spark-ec2 to be able to add and remove slaves to an existing cluster

2014-09-30 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153770#comment-14153770
 ] 

Shivaram Venkataraman commented on SPARK-2008:
--

This will be a very useful feature for spark-ec2 and is a good issue to work 
on. I think removing slaves should be relatively easy to implement, as systems 
like HDFS and Spark should be resilient to slaves being removed. 

For adding slaves we'll need a new script that runs setup-slave.sh 
(https://github.com/mesos/spark-ec2/blob/v3/setup-slave.sh) and brings up 
DataNodes, Spark workers, etc.

 Enhance spark-ec2 to be able to add and remove slaves to an existing cluster
 

 Key: SPARK-2008
 URL: https://issues.apache.org/jira/browse/SPARK-2008
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Minor

 Per [the discussion 
 here|http://apache-spark-user-list.1001560.n3.nabble.com/Having-spark-ec2-join-new-slaves-to-existing-cluster-td3783.html]:
 {quote}
 I would like to be able to use spark-ec2 to launch new slaves and add them to 
 an existing, running cluster. Similarly, I would also like to remove slaves 
 from an existing cluster.
 Use cases include:
 * Oh snap, I sized my cluster incorrectly. Let me add/remove some slaves.
 * During scheduled batch processing, I want to add some new slaves, perhaps 
 on spot instances. When that processing is done, I want to kill them. (Cruel, 
 I know.)
 I gather this is not possible at the moment. spark-ec2 appears to be able to 
 launch new slaves for an existing cluster only if the master is stopped. I 
 also do not see any ability to remove slaves from a cluster.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-09-30 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153774#comment-14153774
 ] 

Shivaram Venkataraman commented on SPARK-3434:
--

I'll post a design doc sometime tonight. We also have a reference 
implementation that I will add a link to, and we can base our discussion on 
that.

 Distributed block matrix
 

 Key: SPARK-3434
 URL: https://issues.apache.org/jira/browse/SPARK-3434
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 This JIRA is for discussing distributed matrices stored in block 
 sub-matrices. The main challenge is the partitioning scheme to allow adding 
 linear algebra operations in the future, e.g.:
 1. matrix multiplication
 2. matrix factorization (QR, LU, ...)
 Let's discuss the partitioning and storage and how they fit into the above 
 use cases.
 Questions:
 1. Should it be backed by a single RDD that contains all of the sub-matrices 
 or many RDDs, each containing only one sub-matrix?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2014-10-07 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162755#comment-14162755
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

1. Yes - the same stuff is installed on master and slaves. In fact they have 
the same AMI.

2. The base Spark AMI is created using `create_image.sh` (from a base Amazon 
AMI) -- After that we pass in the AMI-ID to `spark_ec2.py` which calls 
`setup.sh` on the master.  

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas

 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-10-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167478#comment-14167478
 ] 

Shivaram Venkataraman commented on SPARK-3434:
--

~brkyvz -- We are just adding a few more test cases to classes to make sure our 
interfaces look fine. I'll also create a simple design doc and post it here.

 Distributed block matrix
 

 Key: SPARK-3434
 URL: https://issues.apache.org/jira/browse/SPARK-3434
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 This JIRA is for discussing distributed matrices stored in block 
 sub-matrices. The main challenge is the partitioning scheme to allow adding 
 linear algebra operations in the future, e.g.:
 1. matrix multiplication
 2. matrix factorization (QR, LU, ...)
 Let's discuss the partitioning and storage and how they fit into the above 
 use cases.
 Questions:
 1. Should it be backed by a single RDD that contains all of the sub-matrices 
 or many RDDs, each containing only one sub-matrix?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3434) Distributed block matrix

2014-10-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167478#comment-14167478
 ] 

Shivaram Venkataraman edited comment on SPARK-3434 at 10/10/14 8:45 PM:


[~brkyvz] -- We are just adding a few more test cases to classes to make sure 
our interfaces look fine. I'll also create a simple design doc and post it here.


was (Author: shivaram):
~brkyvz -- We are just adding a few more test cases to classes to make sure our 
interfaces look fine. I'll also create a simple design doc and post it here.

 Distributed block matrix
 

 Key: SPARK-3434
 URL: https://issues.apache.org/jira/browse/SPARK-3434
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 This JIRA is for discussing distributed matrices stored in block 
 sub-matrices. The main challenge is the partitioning scheme to allow adding 
 linear algebra operations in the future, e.g.:
 1. matrix multiplication
 2. matrix factorization (QR, LU, ...)
 Let's discuss the partitioning and storage and how they fit into the above 
 use cases.
 Questions:
 1. Should it be backed by a single RDD that contains all of the sub-matrices 
 or many RDDs, each containing only one sub-matrix?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3434) Distributed block matrix

2014-10-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-3434:


Assignee: Shivaram Venkataraman

 Distributed block matrix
 

 Key: SPARK-3434
 URL: https://issues.apache.org/jira/browse/SPARK-3434
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Shivaram Venkataraman

 This JIRA is for discussing distributed matrices stored in block 
 sub-matrices. The main challenge is the partitioning scheme to allow adding 
 linear algebra operations in the future, e.g.:
 1. matrix multiplication
 2. matrix factorization (QR, LU, ...)
 Let's discuss the partitioning and storage and how they fit into the above 
 use cases.
 Questions:
 1. Should it be backed by a single RDD that contains all of the sub-matrices 
 or many RDDs, each containing only one sub-matrix?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-10-14 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171272#comment-14171272
 ] 

Shivaram Venkataraman commented on SPARK-3434:
--

Sorry for the delay in getting back -- I've posted a design doc at 
http://goo.gl/0eE5fh and a reference implementation at 
https://github.com/amplab/ml-matrix. The design doc and the reference 
implementation use Spark as a library -- so this works as a standalone library 
in case somebody wants to try it out.

Some more points to note regarding the integration:
1. The existing implementation uses breeze matrices in the interface but we 
will change that to use the local Matrix trait already present in Spark.
2. The matrix layouts will also extend the DistributedMatrix class in MLlib and 
we could create a new interface BlockDistributedMatrix from the interface in 
amplab/ml-matrix.
3. We can also use this JIRA or create a new JIRA to discuss which algorithms / 
operations should be merged into Spark. I think TSQR and NormalEquations should 
be pretty useful. Other algorithms like 2-D BlockQR and BlockCoordinateDescent 
can be merged later if we feel they're useful (these haven't been pushed to 
ml-matrix yet).

I will create a first patch for the matrix formats in a couple of days. Please 
let me know if there are any questions / clarifications.
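
As a rough illustration of what a block-partitioned layout could look like (this is illustrative only, not the interface from amplab/ml-matrix or the eventual MLlib API):

{code}
import org.apache.spark.rdd.RDD

// Illustrative only: one RDD whose elements are (block row, block column)
// coordinates plus a dense block stored as a flat array.
case class MatrixBlock(blockRow: Int, blockCol: Int, numRows: Int, numCols: Int,
                       values: Array[Double])

class SketchBlockMatrix(val blocks: RDD[((Int, Int), MatrixBlock)],
                        val rowsPerBlock: Int, val colsPerBlock: Int) {
  // Grid dimensions derived from the block coordinates; a real implementation
  // would also validate that block sizes are consistent.
  def numBlockRows: Int = blocks.map(_._1._1).reduce(math.max) + 1
  def numBlockCols: Int = blocks.map(_._1._2).reduce(math.max) + 1
}
{code}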

 Distributed block matrix
 

 Key: SPARK-3434
 URL: https://issues.apache.org/jira/browse/SPARK-3434
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 This JIRA is for discussing distributed matrices stored in block 
 sub-matrices. The main challenge is the partitioning scheme to allow adding 
 linear algebra operations in the future, e.g.:
 1. matrix multiplication
 2. matrix factorization (QR, LU, ...)
 Let's discuss the partitioning and storage and how they fit into the above 
 use cases.
 Questions:
 1. Should it be backed by a single RDD that contains all of the sub-matrices 
 or many RDDs with each contains only one sub-matrix?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI

2014-10-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173954#comment-14173954
 ] 

Shivaram Venkataraman commented on SPARK-3957:
--

I think it needs to be tracked in the Block Manager -- however, we also need to 
track this on a per-executor basis and not just at the driver. Right now, AFAIK, 
executors do not report new broadcast blocks to the master, in order to reduce 
communication. However, we could add broadcast blocks to some periodic report. 
[~andrewor] might know more.

 Broadcast variable memory usage not reflected in UI
 ---

 Key: SPARK-3957
 URL: https://issues.apache.org/jira/browse/SPARK-3957
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Web UI
Affects Versions: 1.0.2, 1.1.0
Reporter: Shivaram Venkataraman
Assignee: Nan Zhu

 Memory used by broadcast variables is not reflected in the memory usage 
 reported in the WebUI. For example, the executors tab shows the memory used in 
 each executor but this number doesn't include memory used by broadcast 
 variables. Similarly, the storage tab only shows the list of cached RDDs and how 
 much memory they use.  
 We should add a separate column / tab for broadcast variables to make it 
 easier to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3973) Print callSite information for broadcast variables

2014-10-16 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-3973:


 Summary: Print callSite information for broadcast variables
 Key: SPARK-3973
 URL: https://issues.apache.org/jira/browse/SPARK-3973
 Project: Spark
  Issue Type: Bug
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
 Fix For: 1.2.0


Printing call site information for broadcast variables will help in debugging 
which variables are used, when they are used etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4030) `destroy` method in Broadcast should be public

2014-10-20 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-4030:


 Summary: `destroy` method in Broadcast should be public
 Key: SPARK-4030
 URL: https://issues.apache.org/jira/browse/SPARK-4030
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Shivaram Venkataraman


The destroy method in Broadcast.scala 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/Broadcast.scala#L91]
 is right now marked as private[spark]

This prevents long-running applications from cleaning up memory used by 
broadcast variables on the driver.  Also as broadcast variables are always 
created with persistence MEMORY_DISK, this slows down jobs when old broadcast 
variables are flushed to disk. 

Making `destroy` public can help applications control the lifetime.
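
A minimal sketch of what a long-running driver could do once `destroy` is public, as this issue proposes; today only `unpersist` is callable from user code:

{code}
import org.apache.spark.SparkContext

object BroadcastLifetime {
  // Assumes Broadcast.destroy() is public, which is what this issue proposes.
  def broadcastAndRelease(sc: SparkContext, lookup: Map[String, Int]): Long = {
    val bc = sc.broadcast(lookup)
    val hits = sc.parallelize(Seq("a", "b", "c")).filter(k => bc.value.contains(k)).count()
    // Remove the broadcast data from the executors *and* the driver once the
    // stage that used it has finished, instead of waiting for GC-driven cleanup.
    bc.destroy()
    hits
  }
}
{code}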



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4031) Read broadcast variables on use

2014-10-20 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-4031:


 Summary: Read broadcast variables on use
 Key: SPARK-4031
 URL: https://issues.apache.org/jira/browse/SPARK-4031
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman


This is a proposal to change the broadcast variable implementations in Spark to 
only read values when they are used rather than at deserialization time.

This change will be very helpful (and in our use cases required) for complex 
applications which have a large number of broadcast variables. For example if 
broadcast variables are class members, they are captured in closures even when 
they are not used.

We could also consider cleaning closures more aggressively, but that might be a 
more complex change.
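
A minimal sketch of the read-on-use idea outside of Spark's actual Broadcast classes: the serialized object carries only an id, and the value is fetched lazily the first time value is called. fetchFromBlockManager is a hypothetical hook, not a real Spark API.

{code}
// Illustrative only; not Spark's Broadcast implementation.
class LazyBroadcast[T](val id: Long,
                       fetchFromBlockManager: Long => T) extends Serializable {
  // The value is not read when a task deserializes this object; it is fetched
  // the first time `value` is called, so unused broadcast members captured in
  // a closure cost nothing.
  @transient private lazy val _value: T = fetchFromBlockManager(id)
  def value: T = _value
}
{code}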



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4030) `destroy` method in Broadcast should be public

2014-10-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178532#comment-14178532
 ] 

Shivaram Venkataraman commented on SPARK-4030:
--

Yes - there is a bunch of logic around `valid` which checks for destroyed 
broadcast variables. I don't mind having a more esoteric option that is harder 
to use -- like unpersist(dropFromMaster=true) -- which you can't use by 
mistake. 

 `destroy` method in Broadcast should be public
 --

 Key: SPARK-4030
 URL: https://issues.apache.org/jira/browse/SPARK-4030
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Shivaram Venkataraman

 The destroy method in Broadcast.scala 
 [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/Broadcast.scala#L91]
  is right now marked as private[spark]
 This prevents long-running applications from cleaning up memory used by 
 broadcast variables on the driver.  Also as broadcast variables are always 
 created with persistence MEMORY_DISK, this slows down jobs when old broadcast 
 variables are flushed to disk. 
 Making `destroy` public can help applications control the lifetime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4030) `destroy` method in Broadcast should be public

2014-10-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182181#comment-14182181
 ] 

Shivaram Venkataraman commented on SPARK-4030:
--

Great -- I'll send a PR and also include the change to capture the callSite and 
print it out if `assertValid` fails.

 `destroy` method in Broadcast should be public
 --

 Key: SPARK-4030
 URL: https://issues.apache.org/jira/browse/SPARK-4030
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Shivaram Venkataraman

 The destroy method in Broadcast.scala 
 [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/Broadcast.scala#L91]
  is right now marked as private[spark]
 This prevents long-running applications from cleaning up memory used by 
 broadcast variables on the driver.  Also as broadcast variables are always 
 created with persistence MEMORY_DISK, this slows down jobs when old broadcast 
 variables are flushed to disk. 
 Making `destroy` public can help applications control the lifetime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4030) `destroy` method in Broadcast should be public

2014-10-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman reassigned SPARK-4030:


Assignee: Shivaram Venkataraman

 `destroy` method in Broadcast should be public
 --

 Key: SPARK-4030
 URL: https://issues.apache.org/jira/browse/SPARK-4030
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman

 The destroy method in Broadcast.scala 
 [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/Broadcast.scala#L91]
  is right now marked as private[spark]
 This prevents long-running applications from cleaning up memory used by 
 broadcast variables on the driver.  Also as broadcast variables are always 
 created with persistence MEMORY_DISK, this slows down jobs when old broadcast 
 variables are flushed to disk. 
 Making `destroy` public can help applications control the lifetime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4030) `destroy` method in Broadcast should be public

2014-10-27 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-4030.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 `destroy` method in Broadcast should be public
 --

 Key: SPARK-4030
 URL: https://issues.apache.org/jira/browse/SPARK-4030
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
 Fix For: 1.2.0


 The destroy method in Broadcast.scala 
 [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/Broadcast.scala#L91]
  is right now marked as private[spark]
 This prevents long-running applications from cleaning up memory used by 
 broadcast variables on the driver.  Also as broadcast variables are always 
 created with persistence MEMORY_DISK, this slows down jobs when old broadcast 
 variables are flushed to disk. 
 Making `destroy` public can help applications control the lifetime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4030) `destroy` method in Broadcast should be public

2014-10-27 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185263#comment-14185263
 ] 

Shivaram Venkataraman commented on SPARK-4030:
--

Issue resolved by pull request 2922
https://github.com/apache/spark/pull/2922

 `destroy` method in Broadcast should be public
 --

 Key: SPARK-4030
 URL: https://issues.apache.org/jira/browse/SPARK-4030
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
 Fix For: 1.2.0


 The destroy method in Broadcast.scala 
 [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/Broadcast.scala#L91]
  is right now marked as private[spark]
 This prevents long-running applications from cleaning up memory used by 
 broadcast variables on the driver.  Also as broadcast variables are always 
 created with persistence MEMORY_DISK, this slows down jobs when old broadcast 
 variables are flushed to disk. 
 Making `destroy` public can help applications control the lifetime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4031) Read broadcast variables on use

2014-10-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-4031.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 Read broadcast variables on use
 ---

 Key: SPARK-4031
 URL: https://issues.apache.org/jira/browse/SPARK-4031
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
 Fix For: 1.2.0


 This is a proposal to change the broadcast variable implementations in Spark 
 to only read values when they are used rather than on deserializing.
 This change will be very helpful (and in our use cases required) for complex 
 applications which have a large number of broadcast variables. For example if 
 broadcast variables are class members, they are captured in closures even 
 when they are not used.
 We could also consider cleaning closures more aggressively, but that might be 
 a more complex change.
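 To illustrate the capture problem with a hypothetical pipeline class (names made up; only the pattern matters):
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.rdd.RDD

 class Pipeline(@transient private val sc: SparkContext) extends Serializable {
   // Both broadcasts are fields of the class.
   val dict: Broadcast[Map[String, Int]] = sc.broadcast(Map("a" -> 1, "b" -> 2))
   val stopWords: Broadcast[Set[String]] = sc.broadcast(Set("the", "of"))

   def withoutStopWords(input: RDD[String]): RDD[String] =
     // Referencing the `stopWords` field captures `this`, so the serialized
     // closure also carries the `dict` handle. If broadcast values are read
     // eagerly on deserialization, every executor fetches `dict` as well,
     // even though this job never calls dict.value.
     input.filter(w => !stopWords.value.contains(w))
 }
 {code}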



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4031) Read broadcast variables on use

2014-10-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187122#comment-14187122
 ] 

Shivaram Venkataraman commented on SPARK-4031:
--

Issue resolved by pull request 2871
https://github.com/apache/spark/pull/2871

 Read broadcast variables on use
 ---

 Key: SPARK-4031
 URL: https://issues.apache.org/jira/browse/SPARK-4031
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
 Fix For: 1.2.0


 This is a proposal to change the broadcast variable implementations in Spark 
 to only read values when they are used rather than on deserializing.
 This change will be very helpful (and in our use cases required) for complex 
 applications which have a large number of broadcast variables. For example if 
 broadcast variables are class members, they are captured in closures even 
 when they are not used.
 We could also consider cleaning closures more aggressively, but that might be 
 a more complex change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4137) Relative paths don't get handled correctly by spark-ec2

2014-11-05 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-4137:
-
Assignee: Nicholas Chammas

 Relative paths don't get handled correctly by spark-ec2
 ---

 Key: SPARK-4137
 URL: https://issues.apache.org/jira/browse/SPARK-4137
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2014-11-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203757#comment-14203757
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

[~nchammas] Thanks for putting this together -- this is looking great! I just 
had a couple of quick questions and clarifications.

1. My preference would be to have a single AMI across Spark versions, for a 
couple of reasons. First, it reduces the steps for every release (even though 
creating AMIs is definitely much simpler now!). Also, the number of AMIs we 
maintain could get large if we do this for every minor and major release like 
1.1.1. [~pwendell] could probably comment more on the release process etc.

2. Could you clarify whether Hadoop is pre-installed in the new AMIs or is it still 
installed on startup? The flexibility we currently have of switching between 
Hadoop 1, Hadoop 2, YARN etc. is useful for testing. (Related Packer question: 
are the [init scripts|https://github.com/nchammas/spark-ec2/blob/packer/packer/spark-packer.json#L129]
 run during AMI creation or during startup?)

3. Do you have some benchmarks for the new AMI without Spark 1.1.0 
pre-installed? [We currently have old AMI vs. new AMI with 
Spark|https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md#new-amis---latest-os-updates-and-spark-110-pre-installed-single-run]. 
I see a couple of huge wins in the new AMI (from SSH wait time, ganglia init 
etc.) which I guess we should get even without Spark being pre-installed.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2014-11-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205131#comment-14205131
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

Regarding reducing init time, I think there are simple things we can do in 
init.sh that will get us most of the way there. For example, we can download 
the tar.gz files for Hadoop and Spark on each machine and untar them in parallel, 
instead of rsync-ing at the end. But we can revisit this in a separate change, I 
guess.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.

2014-11-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215640#comment-14215640
 ] 

Shivaram Venkataraman commented on SPARK-3337:
--

[~andrewor14] can we pull this into 1.1.1? A lot of people ran into this bug 
in the AMPCamp exercises because their install paths had spaces.

 Paranoid quoting in shell to allow install dirs with spaces within.
 ---

 Key: SPARK-3337
 URL: https://issues.apache.org/jira/browse/SPARK-3337
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext

2014-12-01 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230878#comment-14230878
 ] 

Shivaram Venkataraman commented on SPARK-3963:
--

[~pwendell] This looks pretty useful -- was this postponed from 1.2? I have a 
use case that needs Hadoop file names and was wondering if there is a 
workaround until this is implemented.

 Support getting task-scoped properties from TaskContext
 ---

 Key: SPARK-3963
 URL: https://issues.apache.org/jira/browse/SPARK-3963
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell

 This is a proposal for a minor feature. Given stabilization of the 
 TaskContext API, it would be nice to have a mechanism for Spark jobs to 
 access properties that are defined based on task-level scope by Spark RDD's. 
 I'd like to propose adding a simple properties hash map with some standard 
 spark properties that users can access. Later it would be nice to support 
 users setting these properties, but for now to keep it simple in 1.2. I'd 
 prefer users not be able to set them.
 The main use case is providing the file name from Hadoop RDD's, a very common 
 request. But I'd imagine us using this for other things later on. We could 
 also use this to expose some of the taskMetrics, such as e.g. the input bytes.
 {code}
 val data = sc.textFile("s3n://..2014/*/*/*.json")
 data.mapPartitions { iter =>
   val tc = TaskContext.get
   val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
   val parts = fileName.split("/")
   val (year, month, day) = (parts(3), parts(4), parts(5))
   ...
 }
 {code}
 Internally we'd have a method called setProperty, but this wouldn't be 
 exposed initially. This is structured as a simple (String, String) hash map 
 for ease of porting to python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext

2014-12-01 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230900#comment-14230900
 ] 

Shivaram Venkataraman commented on SPARK-3963:
--

Thanks. I somehow missed `mapPartitionsWithInputSplit` -- that will work for 
now.
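
For anyone else looking for the same thing, a minimal sketch of that workaround (assuming the old `mapred` TextInputFormat and an existing SparkContext `sc`; the tuple output is just illustrative):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.HadoopRDD

def linesWithFileName(sc: SparkContext, path: String) = {
  val hadoopRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](path)
    .asInstanceOf[HadoopRDD[LongWritable, Text]]

  // Each partition knows its InputSplit; for file-based inputs this is a
  // FileSplit, which carries the path of the file the partition was read from.
  hadoopRdd.mapPartitionsWithInputSplit { (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
    val file = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (file, line.toString) }   // copy the reused Text object
  }
}
{code}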

 Support getting task-scoped properties from TaskContext
 ---

 Key: SPARK-3963
 URL: https://issues.apache.org/jira/browse/SPARK-3963
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell

 This is a proposal for a minor feature. Given stabilization of the 
 TaskContext API, it would be nice to have a mechanism for Spark jobs to 
 access properties that are defined based on task-level scope by Spark RDD's. 
 I'd like to propose adding a simple properties hash map with some standard 
 spark properties that users can access. Later it would be nice to support 
 users setting these properties, but for now to keep it simple in 1.2. I'd 
 prefer users not be able to set them.
 The main use case is providing the file name from Hadoop RDD's, a very common 
 request. But I'd imagine us using this for other things later on. We could 
 also use this to expose some of the taskMetrics, such as e.g. the input bytes.
 {code}
 val data = sc.textFile("s3n://..2014/*/*/*.json")
 data.mapPartitions { iter =>
   val tc = TaskContext.get
   val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
   val parts = fileName.split("/")
   val (year, month, day) = (parts(3), parts(4), parts(5))
   ...
 }
 {code}
 Internally we'd have a method called setProperty, but this wouldn't be 
 exposed initially. This is structured as a simple (String, String) hash map 
 for ease of porting to python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252053#comment-14252053
 ] 

Shivaram Venkataraman commented on SPARK-2075:
--

So just to make sure I understand things correctly: is it the case that the jar 
published to Maven (spark-core-1.1.1) is built using Hadoop 2 dependencies, while 
the Hadoop 1 assembly jar that is distributed is built using Hadoop 1 
(obviously...)?

[~srowen] While I see that we officially support submitting jobs using 
spark-submit, it is surprising to me that other deployment methods would fail 
this way (from the user's perspective, the Spark versions at compile time and 
run time presumably match up?). We should at the very least document 
this, but it would also be good to see if there is a workaround.

 Anonymous classes are missing from Spark distribution
 -

 Key: SPARK-2075
 URL: https://issues.apache.org/jira/browse/SPARK-2075
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Core
Affects Versions: 1.0.0
Reporter: Paul R. Brown
Priority: Critical

 Running a job built against the Maven dep for 1.0.0 and the hadoop1 
 distribution produces:
 {code}
 java.lang.ClassNotFoundException:
 org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
 {code}
 Here's what's in the Maven dep as of 1.0.0:
 {code}
 jar tvf 
 ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
  | grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 13:57:58 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 13:57:58 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 {code}
 And here's what's in the hadoop1 distribution:
 {code}
 jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
 {code}
 I.e., it's not there.  It is in the hadoop2 distribution:
 {code}
 jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 07:29:54 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 07:29:54 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252360#comment-14252360
 ] 

Shivaram Venkataraman commented on SPARK-2075:
--

Hmm -- looking at the release steps, it looks like the release on Maven should 
be built from Hadoop 1.0.4. [~pwendell] or [~andrewor14] might be able to shed more 
light on this. (BTW, I wonder if we can trace the source of this mismatch for 
the case reported by [~sunrui], where the Hadoop 1 distribution of Spark 
1.1.1 doesn't work with the Maven Central jar.)

I see your high-level point that this is not about spark-submit per se, but 
about having the exact same binary on the server and as a compile-time 
dependency. It's just unfortunate that having the same Spark version number 
isn't sufficient. Also, is the workaround right now to rebuild Spark from source 
using `make-distribution`, do `mvn install`, rebuild the application, and deploy 
Spark using the assembly jar?

 Anonymous classes are missing from Spark distribution
 -

 Key: SPARK-2075
 URL: https://issues.apache.org/jira/browse/SPARK-2075
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Core
Affects Versions: 1.0.0
Reporter: Paul R. Brown
Priority: Critical

 Running a job built against the Maven dep for 1.0.0 and the hadoop1 
 distribution produces:
 {code}
 java.lang.ClassNotFoundException:
 org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
 {code}
 Here's what's in the Maven dep as of 1.0.0:
 {code}
 jar tvf 
 ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
  | grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 13:57:58 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 13:57:58 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 {code}
 And here's what's in the hadoop1 distribution:
 {code}
 jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
 {code}
 I.e., it's not there.  It is in the hadoop2 distribution:
 {code}
 jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 07:29:54 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 07:29:54 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-12-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255364#comment-14255364
 ] 

Shivaram Venkataraman commented on SPARK-2075:
--

[~sunrui] What I can see from this JIRA discussion (and [~srowen] please 
correct me if I am wrong) is that Hadoop 1 vs. Hadoop 2 is one of the causes of 
incompatibility. It is _not the only_ reason, and I don't think we exactly know 
why the pre-built binary for 1.1.0 is different from the Maven version.  

I think the best-practice advice is to use the exact same jar in the 
application and at runtime. Marking Spark as a provided dependency in the 
application build and using spark-submit is one way of achieving this. Or one 
can publish a local build to Maven and use the same local build to start the 
cluster etc.
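
As a sketch of the first option, in the application's build.sbt (version numbers illustrative):

{code}
name := "my-spark-app"

scalaVersion := "2.10.4"

// "provided" keeps spark-core (and its transitive Hadoop dependencies) out of
// the application assembly, so at runtime the job uses exactly the classes in
// the cluster's spark-assembly jar rather than a second, possibly different,
// copy bundled with the app.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
{code}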

 Anonymous classes are missing from Spark distribution
 -

 Key: SPARK-2075
 URL: https://issues.apache.org/jira/browse/SPARK-2075
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Core
Affects Versions: 1.0.0
Reporter: Paul R. Brown
Assignee: Shixiong Zhu
Priority: Critical

 Running a job built against the Maven dep for 1.0.0 and the hadoop1 
 distribution produces:
 {code}
 java.lang.ClassNotFoundException:
 org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
 {code}
 Here's what's in the Maven dep as of 1.0.0:
 {code}
 jar tvf 
 ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
  | grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 13:57:58 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 13:57:58 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 {code}
 And here's what's in the hadoop1 distribution:
 {code}
 jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
 {code}
 I.e., it's not there.  It is in the hadoop2 distribution:
 {code}
 jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
   1519 Mon May 26 07:29:54 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
   1560 Mon May 26 07:29:54 PDT 2014 
 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4977) spark-ec2 start resets all the spark/conf configurations

2014-12-27 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259480#comment-14259480
 ] 

Shivaram Venkataraman commented on SPARK-4977:
--

I've run into this before too, but it's not very easy to fix. The reason most 
conf files get overwritten is that hostnames change on EC2 when machines are 
stopped and started, and we need to update the hostnames in the config files. 
There are a couple of solutions I can think of:

1. Provide an extension-like mechanism where we source a script that contains 
user-defined options (like spark-env-extensions.sh) and don't overwrite this 
file during start / stop. 
2. Separate out the conf files that need hostname changes from those that don't, 
and only overwrite the former. This would need changes to `deploy_templates.py` 
in our current setup.

 spark-ec2 start resets all the spark/conf configurations
 

 Key: SPARK-4977
 URL: https://issues.apache.org/jira/browse/SPARK-4977
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Noah Young
Priority: Minor

 Running `spark-ec2 start` to restart an already-launched cluster causes the 
 cluster setup scripts to be run, which reset any existing spark configuration 
 files on the remote machines. The expected behavior is that all the modules 
 (tachyon, hadoop, spark itself) should be restarted, and perhaps the master 
 configuration copy-dir'd out, but anything in spark/conf should (at least 
 optionally) be left alone.
 As far as I know, one must create and execute their own init script to set 
 all spark configurables as needed after restarting a cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263193#comment-14263193
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

Yeah, you are right that the times are pretty close for the Packer and base AMIs. 
I was just curious if I was missing something. I don't think there is much else I 
had in mind -- having the full cluster launch times for the existing AMI vs. Packer 
would be good, and it would also be good to see how Packer compares to images 
created using 
[create_image.sh|https://github.com/mesos/spark-ec2/blob/v4/create_image.sh].

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263181#comment-14263181
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

[~nchammas] Thanks for the benchmark. One thing I am curious about is why the 
Packer AMI is faster than launching just the base Amazon AMI. Is this because 
we spend some time installing things on the base AMI that we avoid with Packer 
? 

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14289679#comment-14289679
 ] 

Shivaram Venkataraman commented on SPARK-5386:
--

A couple of things might be worth inspecting:

1. It might be interesting to see whether this is a problem in `reduce` or in the 
`map` stage, i.e. does running a count after the parallelize work?

2. The error message indicates a request for around 2.3G of memory, which suggests 
that a bunch of these vectors are being created at once. It'd be 
interesting to see what happens when, say, p = 2 in your script.
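
For concreteness, against the reporter's snippet the two checks could look like this in the shell (sizes kept as in the report):

{code}
import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._

val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))

// (1) If the count already fails, the problem is in the map stage, not in reduce.
vv.count()
vv.reduce(_ + _)

// (2) Repeat with fewer partitions, so fewer vectors are alive per machine at once.
val vv2 = sc.parallelize(0 until 2, 2).map(i => DenseVector.rand[Double](n))
vv2.reduce(_ + _)
{code}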

 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 6000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the nodes 
 contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14289731#comment-14289731
 ] 

Shivaram Venkataraman commented on SPARK-5386:
--

Note that having 2 worker instances and 2 cores per worker would make it 4 
tasks per machine. And if the `count` works and `reduce` fails, then it looks 
like it has something to do with allocating extra vectors to hold the result in 
each partition ([1]) etc. I don't know much about the Scala implementation of 
reduceLeft or ways to track down where the memory allocations are coming from.

[1] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L865
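
For readers following that link, the shape of the code there is roughly the following (a local simplification, not the actual Spark source):

{code}
// Each task folds its own partition with reduceLeft, producing one partial
// result per partition; the driver then merges those partials one by one.
def reduceSketch[T](partitions: Seq[Iterator[T]])(f: (T, T) => T): T = {
  val partials: Seq[Option[T]] =
    partitions.map(it => if (it.hasNext) Some(it.reduceLeft(f)) else None)

  partials.flatten.reduceLeftOption(f)
    .getOrElse(throw new UnsupportedOperationException("empty collection"))
}
{code}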

 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 6000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
 vv.count()
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the nodes 
 contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14289957#comment-14289957
 ] 

Shivaram Venkataraman commented on SPARK-5386:
--

Results are merged on the driver one at a time. You can see the merge function 
that is called right below at 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L873

However, I don't know if there is anything that limits the rate at which results 
are fetched etc.
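
As an aside, if the driver-side merging turns out to be the bottleneck, `treeReduce` from the mllib RDDFunctions (already imported in the snippet above) might sidestep it by combining partial sums on the executors first (a sketch, sizes illustrative):

{code}
import org.apache.spark.mllib.rdd.RDDFunctions._   // provides treeReduce
import breeze.linalg._

val n = 6000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))

// Partial sums are merged on the executors over `depth` rounds, so the driver
// receives only a few already-combined vectors instead of one per partition.
val total = vv.treeReduce(_ + _, depth = 2)
{code}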


 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 6000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
 vv.count()
 vv.reduce(_ + _)
 When executing in the shell, it crashes after some period of time. One of the nodes 
 contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5654) Integrate SparkR into Apache Spark

2015-02-06 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-5654:


 Summary: Integrate SparkR into Apache Spark
 Key: SPARK-5654
 URL: https://issues.apache.org/jira/browse/SPARK-5654
 Project: Spark
  Issue Type: New Feature
Reporter: Shivaram Venkataraman


The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
from R. The project was started at the AMPLab around a year ago and has been 
incubated as its own project to make sure it can be easily merged into upstream 
Spark, i.e. it does not introduce any external dependencies etc. SparkR’s goals are 
similar to PySpark’s, and it shares a similar design pattern, as described in our 
meetup talk [2] and Spark Summit presentation [3].

Integrating SparkR into the Apache project will enable R users to use Spark out 
of the box and, given R’s large user base, it will help the Spark project reach 
more users. Additionally, work-in-progress features like R 
integration with ML Pipelines and DataFrames can be better achieved by 
development in a unified code base.

SparkR is available under the Apache 2.0 License and does not have any external 
dependencies other than requiring users to have R and Java installed on their 
machines. SparkR’s developers come from many organizations including UC 
Berkeley, Alteryx, and Intel, and we will support future development and maintenance 
after the integration.

[1] https://github.com/amplab-extras/SparkR-pkg
[2] http://files.meetup.com/3138542/SparkR-meetup.pdf
[3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve

2015-01-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282117#comment-14282117
 ] 

Shivaram Venkataraman commented on SPARK-5246:
--

Yes - this can be resolved. However I can't seem to assign this to [~vgrigor]. 
Not sure if this needs some JIRA permissions.

 spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does 
 not resolve
 --

 Key: SPARK-5246
 URL: https://issues.apache.org/jira/browse/SPARK-5246
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Vladimir Grigor

 How to reproduce: 
 1) http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
 should be sufficient to set up a VPC for this bug. After you have followed that 
 guide, start a new instance in the VPC and ssh to it (through the NAT server).
 2) user starts a cluster in VPC:
 {code}
 ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
 --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
 Setting up security groups...
 
 (omitted for brevity)
 10.1.1.62
 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
 no org.apache.spark.deploy.master.Master to stop
 starting org.apache.spark.deploy.master.Master, logging to 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 failed to launch org.apache.spark.deploy.master.Master:
   at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
   ... 12 more
 full log in 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
 10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
 10.1.1.62:at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
 10.1.1.62:... 12 more
 10.1.1.62: full log in 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
 [timing] spark-standalone setup:  00h 00m 28s
  
 (omitted for brevity)
 {code}
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 {code}
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp 
 :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar
  -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
 org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 
 --webui-port 8080
 
 15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, 
 HUP, INT]
 Exception in thread "main" java.net.UnknownHostException: ip-10-1-1-151: 
 ip-10-1-1-151: Name or service not known
 at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
 at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
 at 
 org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
 at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
 at 
 org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
 at 
 org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
 at 
 org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
 at 
 org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
 at 
 org.apache.spark.deploy.master.MasterArguments.<init>(MasterArguments.scala:27)
 at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
 at org.apache.spark.deploy.master.Master.main(Master.scala)
 Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not 
 known
 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
 at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
 at 
 java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
 at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
 ... 12 more
 {code}
 The problem is that an instance launched in a VPC may not be able to resolve its own local 
 hostname. Please see 
 https://forums.aws.amazon.com/thread.jspa?threadID=92092.
 I am going to submit a fix for this problem since I need this functionality 
 ASAP.



--
This message was sent by 

[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-14 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277334#comment-14277334
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

Regarding the pre-built distributions, AFAIK we don't support full Hadoop 2 as 
in YARN. We run CDH4, which has some parts of Hadoop 2, but with MapReduce. There 
is an open PR to add support for Hadoop 2 at 
https://github.com/mesos/spark-ec2/pull/77 and you can see that it gets the 
right [prebuilt 
Spark|https://github.com/mesos/spark-ec2/pull/77/files#diff-1d040c3294246f2b59643d63868fc2adR97]
 in that case.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1302) httpd doesn't start in spark-ec2 (cc2.8xlarge)

2015-02-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316541#comment-14316541
 ] 

Shivaram Venkataraman commented on SPARK-1302:
--

[~soid] Could you let us know which Spark version you were using to launch the 
cluster? The fix for spark-ec2 was merged into `branch-1.3` (and the master 
branch).

 httpd doesn't start in spark-ec2 (cc2.8xlarge)
 --

 Key: SPARK-1302
 URL: https://issues.apache.org/jira/browse/SPARK-1302
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 0.9.0
Reporter: Shivaram Venkataraman
Priority: Minor

 In a cc2.8xlarge EC2 cluster launched from master branch, httpd won't start 
 (i.e ganglia doesn't work). The reason seems to be httpd.conf is wrong (newer 
 httpd version ?).  The config file contains a bunch of non-existent modules 
 and this happens because we overwrite the default conf with our config file 
 from spark-ec2. We could explore using patch or something like that to just 
 apply the diff we need 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325354#comment-14325354
 ] 

Shivaram Venkataraman commented on SPARK-5629:
--

This sounds fine to me and I really like YAML -- does Python have native 
support for printing out YAML?
One thing we should probably do is mark this as experimental, as we might not 
be able to maintain backwards compatibility etc. (On that note, are YAML parsers 
backwards compatible? i.e. if we add a new field in the next release, will it 
break things?)

 Add spark-ec2 action to return info about an existing cluster
 -

 Key: SPARK-5629
 URL: https://issues.apache.org/jira/browse/SPARK-5629
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 You can launch multiple clusters using spark-ec2. At some point, you might 
 just want to get some information about an existing cluster.
 Use cases include:
 * Wanting to check something about your cluster in the EC2 web console.
 * Wanting to feed information about your cluster to another tool (e.g. as 
 described in [SPARK-5627]).
 So, in addition to the [existing 
 actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
 * {{launch}}
 * {{destroy}}
 * {{login}}
 * {{stop}}
 * {{start}}
 * {{get-master}}
 * {{reboot-slaves}}
 We add a new action, {{describe}}, which describes an existing cluster if 
 given a cluster name, and all clusters if not.
 Some examples:
 {code}
 # describes all clusters launched by spark-ec2
 spark-ec2 describe
 {code}
 {code}
 # describes cluster-1
 spark-ec2 describe cluster-1
 {code}
 In combination with the proposal in [SPARK-5627]:
 {code}
 # describes cluster-3 in a machine-readable way (e.g. JSON)
 spark-ec2 describe cluster-3 --machine-readable
 {code}
 Parallels in similar tools include:
 * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
 * [{{starcluster 
 listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
  from MIT StarCluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-757) Deserialization Exception partway into long running job with Netty - MLbase

2015-02-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324687#comment-14324687
 ] 

Shivaram Venkataraman commented on SPARK-757:
-

Since the shuffle implementation has changed recently I think this can be 
marked as obsolete

 Deserialization Exception partway into long running job with Netty - MLbase
 ---

 Key: SPARK-757
 URL: https://issues.apache.org/jira/browse/SPARK-757
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.0
Reporter: Evan Sparks
Assignee: Shivaram Venkataraman
 Attachments: imgnet_shiv6.log.gz, joblogs.tgz, newerlogs.tgz, 
 newworklog.log, shivlogs.tgz


 Using Netty for communication I see some deserialization errors that are 
 crashing my job about 30% of the way through an iterative 10-step job. 
 Happens reliably around the same point of the job after multiple attempts.
 Logs on master and a couple of affected workers attached per request from 
 Shivaram.
 13/05/31 23:19:12 INFO cluster.TaskSetManager: Serialized task 11.0:454 as 
 3414 bytes in 0 ms
 13/05/31 23:19:14 INFO cluster.TaskSetManager: Finished TID 11344 in 55289 ms 
 (progress: 312/1000)
 13/05/31 23:19:14 INFO scheduler.DAGScheduler: Completed ResultTask(11, 344)
 13/05/31 23:19:14 INFO cluster.ClusterScheduler: 
 parentName:,name:TaskSet_11,runningTasks:143
 13/05/31 23:19:14 INFO cluster.TaskSetManager: Starting task 11.0:455 as TID 
 11455 on slave 8: ip-10-60-217-218.ec2.internal:56262 (NODE_LOCAL)
 13/05/31 23:19:14 INFO cluster.TaskSetManager: Serialized task 11.0:455 as 
 3414 bytes in 0 ms
 13/05/31 23:19:17 INFO cluster.TaskSetManager: Lost TID 11412 (task 11.0:412)
 13/05/31 23:19:17 INFO cluster.TaskSetManager: Loss was due to 
 java.io.EOFException
 java.io.EOFException
 at 
 java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2322)
 at 
 java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2791)
 at 
 java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:798)
 at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
 at 
 spark.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:18)
 at spark.JavaDeserializationStream.<init>(JavaSerializer.scala:18)
 at 
 spark.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:53)
 at spark.storage.BlockManager.dataDeserialize(BlockManager.scala:925)
 at 
 spark.storage.BlockFetcherIterator$NettyBlockFetcherIterator$$anonfun$5.apply(BlockFetcherIterator.scala:279)
 at 
 spark.storage.BlockFetcherIterator$NettyBlockFetcherIterator$$anonfun$5.apply(BlockFetcherIterator.scala:279)
 at 
 spark.storage.BlockFetcherIterator$NettyBlockFetcherIterator.next(BlockFetcherIterator.scala:318)
 at 
 spark.storage.BlockFetcherIterator$NettyBlockFetcherIterator.next(BlockFetcherIterator.scala:239)
 at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
 at spark.util.CompletionIterator.hasNext(CompletionIterator.scala:9)
 at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457)
 at scala.collection.Iterator$class.foreach(Iterator.scala:772)
 at scala.collection.Iterator$$anon$22.foreach(Iterator.scala:451)
 at spark.Aggregator.combineCombinersByKey(Aggregator.scala:33)
 at 
 spark.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:72)
 at 
 spark.PairRDDFunctions$$anonfun$combineByKey$2.apply(PairRDDFunctions.scala:72)
 at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19)
 at spark.RDD.computeOrReadCheckpoint(RDD.scala:220)
 at spark.RDD.iterator(RDD.scala:209)
 at spark.scheduler.ResultTask.run(ResultTask.scala:84)
 at spark.executor.Executor$TaskRunner.run(Executor.scala:104)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:679)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster

2015-02-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324907#comment-14324907
 ] 

Shivaram Venkataraman commented on SPARK-5629:
--

Is there an example output for `describe` that you have in mind? Also, I am not sure 
it'll be easy to list all the clusters, as spark-ec2 looks up clusters by the 
security group / cluster id.

 Add spark-ec2 action to return info about an existing cluster
 -

 Key: SPARK-5629
 URL: https://issues.apache.org/jira/browse/SPARK-5629
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 You can launch multiple clusters using spark-ec2. At some point, you might 
 just want to get some information about an existing cluster.
 Use cases include:
 * Wanting to check something about your cluster in the EC2 web console.
 * Wanting to feed information about your cluster to another tool (e.g. as 
 described in [SPARK-5627]).
 So, in addition to the [existing 
 actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]:
 * {{launch}}
 * {{destroy}}
 * {{login}}
 * {{stop}}
 * {{start}}
 * {{get-master}}
 * {{reboot-slaves}}
 We add a new action, {{describe}}, which describes an existing cluster if 
 given a cluster name, and all clusters if not.
 Some examples:
 {code}
 # describes all clusters launched by spark-ec2
 spark-ec2 describe
 {code}
 {code}
 # describes cluster-1
 spark-ec2 describe cluster-1
 {code}
 In combination with the proposal in [SPARK-5627]:
 {code}
 # describes cluster-3 in a machine-readable way (e.g. JSON)
 spark-ec2 describe cluster-3 --machine-readable
 {code}
 Parallels in similar tools include:
 * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju
 * [{{starcluster 
 listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node]
  from MIT StarCluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes

2015-01-10 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272830#comment-14272830
 ] 

Shivaram Venkataraman commented on SPARK-5008:
--

Hmm, I think https://github.com/mesos/spark-ec2/pull/66 probably broke this in 
some way. We made some tweaks to keep spark-ec2 backwards compatible by 
symlinking /vol3 to /vol -- however, I think the new behavior is now broken, as 
persistent-hdfs expects /vol to exist and can't find it.

I think one fix might be to create a symlink from /vol0 to /vol if /vol3 
doesn't exist -- or we could also change core-site.xml in persistent-hdfs to 
pick up all the volumes.

 Persistent HDFS does not recognize EBS Volumes
 --

 Key: SPARK-5008
 URL: https://issues.apache.org/jira/browse/SPARK-5008
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
 Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script.
 -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 
 --ebs-vol-num 1
Reporter: Brad Willard

 Cluster is built with correct-size EBS volumes. It creates the volume at 
 /dev/xvds and it is mounted to /vol0. However, when you start persistent HDFS 
 with the start-all script, it starts but it isn't correctly configured to use the 
 EBS volume.
 I'm assuming some symlinks or expected mounts are not correctly configured.
 This has worked flawlessly on all previous versions of Spark.
 I have a stupid workaround of installing pssh and mucking with it by mounting 
 the volume to /vol, which worked; however, it does not work between restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5122) Remove Shark from spark-ec2

2015-01-06 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267296#comment-14267296
 ] 

Shivaram Venkataraman commented on SPARK-5122:
--

Yes, I think removing Shark should be fine. We can also get rid of the 
Spark-to-Shark version map in spark_ec2.py.

 Remove Shark from spark-ec2
 ---

 Key: SPARK-5122
 URL: https://issues.apache.org/jira/browse/SPARK-5122
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} 
 anymore. (?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4948) Use pssh instead of bash-isms and remove unnecessary operations

2015-01-06 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-4948:
-
Assignee: Nicholas Chammas

 Use pssh instead of bash-isms and remove unnecessary operations
 ---

 Key: SPARK-4948
 URL: https://issues.apache.org/jira/browse/SPARK-4948
 Project: Spark
  Issue Type: Sub-task
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 Remove unnecessarily high sleep times in {{setup.sh}}, as well as unnecessary 
 SSH calls to pre-approve keys.
 Replace bash-isms like {{while ... command ... & wait}} with {{pssh}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276505#comment-14276505
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

[~nchammas] Yes -- That sounds good

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313269#comment-14313269
 ] 

Shivaram Venkataraman commented on SPARK-5676:
--

Yes - it is managed as a self-contained project. However bugs in that project 
are often experienced by Spark users, so we end up with issues created here. I 
think filing these issues under the component EC2 is a fine thing to do as it 
does affect Spark usage on EC2.

 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or licence headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner rather than later, while the contributor 
 list is small), so that users wishing to use this part of Spark are not in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5676) License missing from spark-ec2 repo

2015-02-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313210#comment-14313210
 ] 

Shivaram Venkataraman commented on SPARK-5676:
--

Just to clarify some things: 

- Having a separate repo for cluster launch scripts was a conscious decision in 
order to separate out release level changes from runtime settings (like Ganglia 
config etc.)
- Though the repository exists in mesos/spark-ec2, AFAIK it is only used by the 
Spark EC2 scripts. In fact we do track some bugs in that repo using issues in 
the Spark JIRA. 
- However from what I can see, I don't think the repository's license affects 
the Spark project in any way. It is not distributed as part of any artifact we 
build, and EC2 support is in itself a strictly optional component. That said, it 
is good to have a LICENSE file, so we will add one.

 License missing from spark-ec2 repo
 ---

 Key: SPARK-5676
 URL: https://issues.apache.org/jira/browse/SPARK-5676
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Florian Verhein

 There is no LICENSE file or licence headers in the code in the spark-ec2 
 repo. Also, I believe there is no contributor license agreement notification 
 in place (like there is in the main spark repo).
 It would be great to fix this (sooner rather than later, while the contributor 
 list is small), so that users wishing to use this part of Spark are not in 
 doubt over licensing issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6231) Join on two tables (generated from same one) is broken

2015-03-20 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372447#comment-14372447
 ] 

Shivaram Venkataraman commented on SPARK-6231:
--

[~marmbrus] I've sent the dataset to you by email. The code that used to cause 
this bug is at https://gist.github.com/shivaram/4ff0a9c226dda2030507

 Join on two tables (generated from same one) is broken
 --

 Key: SPARK-6231
 URL: https://issues.apache.org/jira/browse/SPARK-6231
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Michael Armbrust
Priority: Blocker
  Labels: DataFrame

 If the two columns used in joinExpr come from the same table, they have the 
 same id, and then the joinExpr is explained in the wrong way.
 {code}
 val df = sqlContext.load(path, "parquet")
 val txns = df.groupBy("cust_id").agg($"cust_id", 
 countDistinct($"day_num").as("txns"))
 val spend = df.groupBy("cust_id").agg($"cust_id", 
 sum($"extended_price").as("spend"))
 val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")
 scala> rmJoin.explain
 == Physical Plan ==
 CartesianProduct
  Filter (cust_id#0 = cust_id#0)
   Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS 
 txns#7L]
Exchange (HashPartitioning [cust_id#0], 200)
 Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS 
 partialSets#25]
  PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at 
 newParquet.scala:542
  Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
   Exchange (HashPartitioning [cust_id#17], 200)
Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS 
 PartialSum#38]
 PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at 
 newParquet.scala:542
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6246) spark-ec2 can't handle clusters with > 100 nodes

2015-03-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355403#comment-14355403
 ] 

Shivaram Venkataraman commented on SPARK-6246:
--

Hmm - This seems like a bad problem. And it looks like an AWS-side change rather 
than a boto change, I guess.
[~nchammas] Similar to the EC2Box issue above, can we also batch calls to 
`get_instances` 100 instances at a time?
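For illustration, such batching might look roughly like the sketch below in spark_ec2.py; the helper name is hypothetical, and the only real API used is boto's {{get_all_instance_status}}.
{code}
# Hypothetical helper: query instance statuses in batches of 100 so a single
# request never exceeds the EC2 limit on instance IDs per call.
def get_instance_statuses(conn, cluster_instances, batch_size=100):
    instance_ids = [i.id for i in cluster_instances]
    statuses = []
    for start in range(0, len(instance_ids), batch_size):
        batch = instance_ids[start:start + batch_size]
        statuses.extend(conn.get_all_instance_status(instance_ids=batch))
    return statuses
{code}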

 spark-ec2 can't handle clusters with > 100 nodes
 

 Key: SPARK-6246
 URL: https://issues.apache.org/jira/browse/SPARK-6246
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.3.0
Reporter: Nicholas Chammas
Priority: Minor

 This appears to be a new restriction, perhaps resulting from our upgrade of 
 boto. Maybe it's a new restriction from EC2. Not sure yet.
 We didn't have this issue around the Spark 1.1.0 time frame from what I can 
 remember. I'll track down where the issue is and when it started.
 Attempting to launch a cluster with 100 slaves yields the following:
 {code}
 Spark AMI: ami-35b1885c
 Launching instances...
 Launched 100 slaves in us-east-1c, regid = r-9c408776
 Launched master in us-east-1c, regid = r-92408778
 Waiting for AWS to propagate instance metadata...
 Waiting for cluster to enter 'ssh-ready' state.ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the 
 maximum number of instance IDs that can be specificied (100). Please specify 
 fewer than 100 instance 
 IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
 Traceback (most recent call last):
   File "./ec2/spark_ec2.py", line 1338, in <module>
 main()
   File "./ec2/spark_ec2.py", line 1330, in main
 real_main()
   File "./ec2/spark_ec2.py", line 1170, in real_main
 cluster_state='ssh-ready'
   File "./ec2/spark_ec2.py", line 795, in wait_for_cluster_state
 statuses = conn.get_all_instance_status(instance_ids=[i.id for i in 
 cluster_instances])
   File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 
 737, in get_all_instance_status
 InstanceStatusSet, verb='POST')
   File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 
 1204, in get_object
 raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the 
 maximum number of instance IDs that can be specificied (100). Please specify 
 fewer than 100 instance 
 IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
 {code}
 This problem seems to be with {{get_all_instance_status()}}, though I am not 
 sure if other methods are affected too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+

2015-03-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352317#comment-14352317
 ] 

Shivaram Venkataraman commented on SPARK-5134:
--

Yeah so this did change in 1.2, and I think I mentioned it to Patrick when it 
affected a couple of other projects of mine. The main problem there was that 
even if you have an explicit Hadoop 1 dependency in your project, SBT picks up 
the highest version required while building an assembly jar for the project -- 
thus, with Spark linked against Hadoop 2.2, one would require an exclusion rule 
to use Hadoop 1. It might be good to add this to the docs or to some of the 
example Quick Start documentation we have.

 Bump default Hadoop version to 2+
 -

 Key: SPARK-5134
 URL: https://issues.apache.org/jira/browse/SPARK-5134
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor

 [~srowen] and I discussed bumping [the default hadoop version in the parent 
 POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]
  from {{1.0.4}} to something more recent.
 There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-03-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352514#comment-14352514
 ] 

Shivaram Venkataraman commented on SPARK-6220:
--

Seems like a good idea and the syntax sounds good to me. Just curious: are 
these the only two boto calls we use?

 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably also let users pass arbitrary EC2 options through spark-ec2 in some 
 generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 \
 ...
 --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
 --ec2-instance-option ebs_optimized=True
 {code}
 I'm not sure about the exact syntax of the extended options, but something 
 like this will do the trick as long as it can be made to pass the options 
 correctly to boto in most cases.
 I followed the example of {{ssh}}, which supports multiple extended options 
 similarly.
 {code}
 ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
 {code}
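 For illustration, the pass-through might be wired up roughly as sketched below; the parsing helper and its typing rules are assumptions, and only boto's keyword-argument interface to {{run_instances}} is real.
 {code}
 # Illustrative sketch: turn repeated "key=value" strings from --ec2-instance-option
 # into keyword arguments for the underlying boto call. The helper is hypothetical.
 def parse_extended_options(raw_options):
     kwargs = {}
     for raw in raw_options or []:
         key, _, value = raw.partition('=')
         # very crude typing: recognize booleans, leave everything else as strings
         if value in ('True', 'False'):
             value = (value == 'True')
         kwargs[key] = value
     return kwargs

 # e.g. something along the lines of:
 #   extra = parse_extended_options(opts.ec2_instance_option)
 #   conn.run_instances(image_id, key_name=opts.key_pair, **extra)
 {code}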



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5134) Bump default Hadoop version to 2+

2015-03-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352374#comment-14352374
 ] 

Shivaram Venkataraman commented on SPARK-5134:
--

Yeah, if you exclude Spark's Hadoop dependency, things work correctly for 
Hadoop 1. There are some additional issues that come up in 1.2 due to the 
Guava changes, but those are not related to the default Hadoop version change. 
I think the documentation to update would be [1], but it would also be good to 
mention this in the Quick Start guide [2].

[1] 
https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/docs/hadoop-third-party-distributions.md#linking-applications-to-the-hadoop-version
[2] 
https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/docs/quick-start.md#self-contained-applications

 Bump default Hadoop version to 2+
 -

 Key: SPARK-5134
 URL: https://issues.apache.org/jira/browse/SPARK-5134
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.2.0
Reporter: Ryan Williams
Priority: Minor

 [~srowen] and I discussed bumping [the default hadoop version in the parent 
 POM|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]
  from {{1.0.4}} to something more recent.
 There doesn't seem to be a good reason that it was set/kept at {{1.0.4}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6088) UI is malformed when tasks fetch remote results

2015-02-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6088:
-
Attachment: Screenshot 2015-02-28 18.24.42.png

 UI is malformed when tasks fetch remote results
 ---

 Key: SPARK-6088
 URL: https://issues.apache.org/jira/browse/SPARK-6088
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
 Attachments: Screenshot 2015-02-28 18.24.42.png


 There are two issues when tasks get remote results:
 (1) The status never changes from GET_RESULT to SUCCEEDED
 (2) The time to get the result is shown as the absolute time (resulting in a 
 non-sensical output that says getting the result took 1 million hours) 
 rather than the elapsed time
 cc [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6089) Size of task result fetched can't be found in UI

2015-02-28 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-6089:


 Summary: Size of task result fetched can't be found in UI
 Key: SPARK-6089
 URL: https://issues.apache.org/jira/browse/SPARK-6089
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Shivaram Venkataraman


When you do a large collect the amount of data fetched as task result from each 
task is not present in the WebUI. 

We should make this appear under the 'Output' column (both per-task and in 
executor-level aggregation)

[cc ~kayousterhout]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6088) UI is malformed when tasks fetch remote results

2015-02-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341896#comment-14341896
 ] 

Shivaram Venkataraman commented on SPARK-6088:
--

Also, for some reason the get-result time is included in the Scheduler 
Delay. The attached screenshot shows how the get result took 33 mins and how 
this shows up in the scheduler delay.

 UI is malformed when tasks fetch remote results
 ---

 Key: SPARK-6088
 URL: https://issues.apache.org/jira/browse/SPARK-6088
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
 Attachments: Screenshot 2015-02-28 18.24.42.png


 There are two issues when tasks get remote results:
 (1) The status never changes from GET_RESULT to SUCCEEDED
 (2) The time to get the result is shown as the absolute time (resulting in a 
 non-sensical output that says getting the result took 1 million hours) 
 rather than the elapsed time
 cc [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6089) Size of task result fetched can't be found in UI

2015-02-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6089:
-
Description: 
When you do a large collect the amount of data fetched as task result from each 
task is not present in the WebUI. 

We should make this appear under the 'Output' column (both per-task and in 
executor-level aggregation)

cc [~kayousterhout]

  was:
When you do a large collect the amount of data fetched as task result from each 
task is not present in the WebUI. 

We should make this appear under the 'Output' column (both per-task and in 
executor-level aggregation)

[cc ~kayousterhout]


 Size of task result fetched can't be found in UI
 

 Key: SPARK-6089
 URL: https://issues.apache.org/jira/browse/SPARK-6089
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Shivaram Venkataraman

 When you do a large collect the amount of data fetched as task result from 
 each task is not present in the WebUI. 
 We should make this appear under the 'Output' column (both per-task and in 
 executor-level aggregation)
 cc [~kayousterhout]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-6881.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5493
[https://github.com/apache/spark/pull/5493]

 Change the checkpoint directory name from checkpoints to checkpoint
 ---

 Key: SPARK-6881
 URL: https://issues.apache.org/jira/browse/SPARK-6881
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Hao
Priority: Trivial
 Fix For: 1.4.0


 The name {{checkpoint}}, instead of {{checkpoints}}, is included in .gitignore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6818) Support column deletion in SparkR DataFrame API

2015-04-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6818:
-
Assignee: Sun Rui

 Support column deletion in SparkR DataFrame API
 ---

 Key: SPARK-6818
 URL: https://issues.apache.org/jira/browse/SPARK-6818
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Shivaram Venkataraman
Assignee: Sun Rui
 Fix For: 1.4.0


 We should support deleting columns using traditional R syntax i.e. something 
 like df$age <- NULL
 should delete the `age` column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6818) Support column deletion in SparkR DataFrame API

2015-04-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-6818.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5655
[https://github.com/apache/spark/pull/5655]

 Support column deletion in SparkR DataFrame API
 ---

 Key: SPARK-6818
 URL: https://issues.apache.org/jira/browse/SPARK-6818
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Shivaram Venkataraman
 Fix For: 1.4.0


 We should support deleting columns using traditional R syntax i.e. something 
 like df$age <- NULL
 should delete the `age` column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6797) Add support for YARN cluster mode

2015-04-22 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6797:
-
Assignee: Sun Rui

 Add support for YARN cluster mode
 -

 Key: SPARK-6797
 URL: https://issues.apache.org/jira/browse/SPARK-6797
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Sun Rui
Priority: Critical

 SparkR currently does not work in YARN cluster mode as the R package is not 
 shipped along with the assembly jar to the YARN AM. We could try to use the 
 support for archives in YARN to send out the R package as a zip file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7033) Use JavaRDD.partitions() instead of JavaRDD.splits()

2015-04-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-7033.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5628
[https://github.com/apache/spark/pull/5628]

 Use JavaRDD.partitions() instead of JavaRDD.splits()
 

 Key: SPARK-7033
 URL: https://issues.apache.org/jira/browse/SPARK-7033
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Sun Rui
Priority: Minor
 Fix For: 1.4.0


 In numPartitions(), JavaRDD.splits() is called to get the number of 
 partitions in an RDD. But JavaRDD.splits() is deprecated. Use 
 JavaRDD.partitions() instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6852) Accept numeric as numPartitions in SparkR

2015-04-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-6852.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5613
[https://github.com/apache/spark/pull/5613]

 Accept numeric as numPartitions in SparkR
 -

 Key: SPARK-6852
 URL: https://issues.apache.org/jira/browse/SPARK-6852
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Davies Liu
Priority: Critical
 Fix For: 1.4.0


 All the API should accept numeric as numPartitions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6852) Accept numeric as numPartitions in SparkR

2015-04-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6852:
-
Assignee: Sun Rui

 Accept numeric as numPartitions in SparkR
 -

 Key: SPARK-6852
 URL: https://issues.apache.org/jira/browse/SPARK-6852
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Davies Liu
Assignee: Sun Rui
Priority: Critical
 Fix For: 1.4.0


 All the API should accept numeric as numPartitions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6824) Fill the docs for DataFrame API in SparkR

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6824:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-7228

 Fill the docs for DataFrame API in SparkR
 -

 Key: SPARK-6824
 URL: https://issues.apache.org/jira/browse/SPARK-6824
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Blocker

 Some of the DataFrame functions in SparkR do not have complete roxygen docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6815) Support accumulators in R

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6815:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Support accumulators in R
 -

 Key: SPARK-6815
 URL: https://issues.apache.org/jira/browse/SPARK-6815
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 SparkR doesn't support accumulators right now. It might be good to add 
 support for this to get feature parity with PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6825) Data sources implementation to support `sequenceFile`

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6825:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Data sources implementation to support `sequenceFile`
 -

 Key: SPARK-6825
 URL: https://issues.apache.org/jira/browse/SPARK-6825
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Shivaram Venkataraman

 SequenceFiles are a widely used input format and right now they are not 
 supported in SparkR. 
 It would be good to add support for SequenceFiles by implementing a new data 
 source that can create a DataFrame from a SequenceFile. However as 
 SequenceFiles can have arbitrary types, we probably need to map them to 
 User-defined types in SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6816) Add SparkConf API to configure SparkR

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6816:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Add SparkConf API to configure SparkR
 -

 Key: SPARK-6816
 URL: https://issues.apache.org/jira/browse/SPARK-6816
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 Right now the only way to configure SparkR is to pass in arguments to 
 sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python 
 to make configuration easier



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6838) Explore using Reference Classes instead of S4 objects

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6838:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Explore using Reference Classes instead of S4 objects
 -

 Key: SPARK-6838
 URL: https://issues.apache.org/jira/browse/SPARK-6838
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 The current RDD and PipelinedRDD are represented as S4 objects. R has a new 
 OO system: Reference Classes (RC or R5). It seems to be a more message-passing 
 style of OO, and instances are mutable objects. It is not an important issue, 
 but it should also require only trivial work. It could also remove the kind-of 
 awkward {{@}} operator in S4.
 R6 is also worth checking out; it feels closer to an ordinary object-oriented 
 language. https://github.com/wch/R6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6803) [SparkR] Support SparkR Streaming

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6803:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 [SparkR] Support SparkR Streaming
 -

 Key: SPARK-6803
 URL: https://issues.apache.org/jira/browse/SPARK-6803
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, Streaming
Reporter: Hao
 Fix For: 1.4.0


 Adds an R API for Spark Streaming.
 An experimental version is presented in repo [1], which follows the PySpark 
 streaming design. Also, this PR can be further broken down into sub-task 
 issues.
 [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6833) Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6833:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Extend `addPackage` so that any given R file can be sourced in the worker 
 before functions are run.
 ---

 Key: SPARK-6833
 URL: https://issues.apache.org/jira/browse/SPARK-6833
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 Similar to how extra Python files or packages can be specified (in zip / egg 
 formats), it would be good to support the ability to add extra R files to the 
 executors' working directory.
 One thing that needs to be investigated is whether this will just work out of 
 the box using the spark-submit flag {{--files}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6813) SparkR style guide

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6813:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 SparkR style guide
 --

 Key: SPARK-6813
 URL: https://issues.apache.org/jira/browse/SPARK-6813
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman

 We should develop a SparkR style guide document based on some of the 
 guidelines we use and some of the best practices in R.
 Some examples of R style guides are:
 http://r-pkgs.had.co.nz/r.html#style 
 http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
 A related issue is to work on an automatic style-checking tool. 
 https://github.com/jimhester/lintr seems promising.
 We could have an R style guide based on the one from Google [1], and adjust 
 some of the rules based on the conventions used in Spark:
 1. Line length: maximum 100 characters
 2. No limit on function name length (the API should be similar to other languages)
 3. Allow S4 objects/methods



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6809) Make numPartitions optional in pairRDD APIs

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6809:
-
Priority: Major  (was: Critical)

 Make numPartitions optional in pairRDD APIs
 ---

 Key: SPARK-6809
 URL: https://issues.apache.org/jira/browse/SPARK-6809
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Davies Liu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6820) Convert NAs to null type in SparkR DataFrames

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6820:
-
Priority: Critical  (was: Major)

 Convert NAs to null type in SparkR DataFrames
 -

 Key: SPARK-6820
 URL: https://issues.apache.org/jira/browse/SPARK-6820
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Shivaram Venkataraman
Priority: Critical

 While converting an RDD or a local R data frame to a SparkR DataFrame, we need 
 to handle missing values (NAs).
 We should convert NAs to Spark SQL's null type to handle the conversion 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6799) Add dataframe examples for SparkR

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6799:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-7228

 Add dataframe examples for SparkR
 -

 Key: SPARK-6799
 URL: https://issues.apache.org/jira/browse/SPARK-6799
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Critical

 We should add more data frame usage examples for SparkR. These can be similar 
 to the Python examples at 
 https://github.com/apache/spark/blob/1b2aab8d5b9cc2ff702506038bd71aa8debe7ca0/examples/src/main/python/sql.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6809) Make numPartitions optional in pairRDD APIs

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6809:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Make numPartitions optional in pairRDD APIs
 ---

 Key: SPARK-6809
 URL: https://issues.apache.org/jira/browse/SPARK-6809
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Davies Liu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6826) `hashCode` support for arbitrary R objects

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6826:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 `hashCode` support for arbitrary R objects
 --

 Key: SPARK-6826
 URL: https://issues.apache.org/jira/browse/SPARK-6826
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Shivaram Venkataraman

 From the SparkR JIRA
 digest::digest looks interesting, but it seems to be more heavyweight than 
 our requirements. One relatively easy way to do this is to serialize the 
 given R object into a string (serialize(object, ascii=T)) and then just call 
 the string hashCode function on this. FWIW it looks like digest follows a 
 similar strategy where the md5sum / shasum etc. are calculated on serialized 
 objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-7230:


 Summary: Make RDD API private in SparkR for Spark 1.4
 Key: SPARK-7230
 URL: https://issues.apache.org/jira/browse/SPARK-7230
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Critical


This ticket proposes making the RDD API in SparkR private for the 1.4 release. 
The motivation for doing so is discussed in a larger design document aimed at 
a more top-down design of the SparkR APIs. A first cut that discusses the 
motivation and proposed changes can be found at http://goo.gl/GLHKZI

The main points in that document that relate to this ticket are:
- The RDD API requires knowledge of the distributed system and is pretty low 
level. This is not very suitable for a number of R users who are used to more 
high-level packages that work out of the box.
- The RDD implementation in SparkR is not fully robust right now: we are 
missing features like spilling for aggregation, handling partitions which don't 
fit in memory etc. There are further limitations like lack of hashCode for 
non-native types etc. which might affect user experience.

The only change we will make for now is to not export the RDD functions as 
public methods in the SparkR package, and I will create another ticket to 
discuss the public API for 1.5 in more detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6814) Support sorting for any data type in SparkR

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6814:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Support sorting for any data type in SparkR
 ---

 Key: SPARK-6814
 URL: https://issues.apache.org/jira/browse/SPARK-6814
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Critical

 I get various "return status == 0 is false" and "unimplemented type" errors 
 trying to get data out of any RDD with top() or collect(). The errors are not 
 consistent. I think Spark is installed properly because some operations do 
 work. I apologize if I'm missing something easy or not providing the right 
 diagnostic info – I'm new to SparkR, and this seems to be the only resource 
 for SparkR issues.
 Some logs:
 {code}
 Browse[1]> top(estep.rdd, 1L)
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : 
   unimplemented type 'list' in 'orderVector1'
 Calls: do.call ... Reduce -> <Anonymous> -> func -> FUN -> FUN -> order
 Execution halted
 15/02/13 19:11:57 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
 org.apache.spark.SparkException: R computation failed with
  Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : 
   unimplemented type 'list' in 'orderVector1'
 Calls: do.call ... Reduce -> <Anonymous> -> func -> FUN -> FUN -> order
 Execution halted
   at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 15/02/13 19:11:57 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, 
 localhost): org.apache.spark.SparkException: R computation failed with
  Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : 
   unimplemented type 'list' in 'orderVector1'
 Calls: do.call ... Reduce -> <Anonymous> -> func -> FUN -> FUN -> order
 Execution halted
 edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7228) SparkR public API for 1.4 release

2015-04-29 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-7228:


 Summary: SparkR public API for 1.4 release
 Key: SPARK-7228
 URL: https://issues.apache.org/jira/browse/SPARK-7228
 Project: Spark
  Issue Type: Umbrella
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Critical


This is an umbrella ticket to track the public APIs and documentation to be 
released as a part of SparkR in the 1.4 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6832) Handle partial reads in SparkR JVM to worker communication

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6832:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Handle partial reads in SparkR JVM to worker communication
 --

 Key: SPARK-6832
 URL: https://issues.apache.org/jira/browse/SPARK-6832
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 After we move to using a socket between the R worker and the JVM, it's 
 possible that readBin() in R will return partial results (for example, when 
 interrupted by a signal).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7226) Support math functions in R DataFrame

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-7226:
-
Priority: Critical  (was: Major)

 Support math functions in R DataFrame
 -

 Key: SPARK-7226
 URL: https://issues.apache.org/jira/browse/SPARK-7226
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Reynold Xin
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


