[jira] [Commented] (SPARK-6599) Improve reliability and usability of Kinesis-based Spark Streaming

2015-07-31 Thread Arun Ramakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650054#comment-14650054
 ] 

Arun Ramakrishnan commented on SPARK-6599:
--

[~tdas] Curious about the design docs for this. 

 Improve reliability and usability of Kinesis-based Spark Streaming
 --

 Key: SPARK-6599
 URL: https://issues.apache.org/jira/browse/SPARK-6599
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das

 Currently, the KinesisReceiver can lose some data in the case of certain 
 failures (receiver and driver failures). Using write-ahead logs can 
 mitigate some of the problem, but it is not ideal because WALs don't work 
 with S3 (eventual consistency, etc.), which is the most likely file system 
 to be used in the EC2 environment. Hence, we have to take a different 
 approach to improving reliability for Kinesis.
 A detailed design doc on how this can be achieved will be added later.
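 For context, a minimal sketch of the existing WAL mitigation mentioned 
 above (assuming Spark 1.2+; the checkpoint path is hypothetical):

{noformat}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Turn on the receiver write-ahead log, the mitigation the description
// calls insufficient when the checkpoint directory lives on S3.
val conf = new SparkConf()
  .setAppName("kinesis-wal-sketch")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(2))
// The WAL is written under the checkpoint directory; on EC2 this is
// typically S3, which is exactly the eventual-consistency problem
// noted above.
ssc.checkpoint("s3n://my-bucket/checkpoints")  // hypothetical bucket
{noformat}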







[jira] [Commented] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator

2014-07-01 Thread Arun Ramakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049435#comment-14049435
 ] 

Arun Ramakrishnan commented on SPARK-2047:
--

I can work on this. [~matei], can you assign this to me?

 Use less memory in AppendOnlyMap.destructiveSortedIterator
 --

 Key: SPARK-2047
 URL: https://issues.apache.org/jira/browse/SPARK-2047
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia

 This method tries to sort the key-value pairs in the map in place, but it 
 ends up allocating a Tuple2 object for each one, which costs a nontrivial 
 amount of memory (32 or more bytes per entry on a 64-bit JVM). We could 
 instead sort the objects in place within the data array, or allocate an int 
 array of indices and sort those using a custom comparator. The latter is 
 probably the easiest to begin with.
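 For illustration, a rough sketch of the index-array idea (hypothetical 
 helper; it assumes the map's flat layout with the key at 2*i and the 
 value at 2*i+1):

{noformat}
import java.util.Comparator

// Hypothetical sketch: sort entry indices instead of materializing a
// Tuple2 per entry. Boxed Integers are still allocated here; a real
// implementation would sort a primitive int array with a hand-written
// sorter (java.util.Arrays.sort on int[] takes no comparator), or sort
// the pairs in place within the data array itself.
def sortedIndices(data: Array[AnyRef], numEntries: Int,
                  keyOrdering: Ordering[AnyRef]): Array[Integer] = {
  val indices = Array.tabulate[Integer](numEntries)(i => i)
  java.util.Arrays.sort(indices, new Comparator[Integer] {
    def compare(a: Integer, b: Integer): Int =
      keyOrdering.compare(data(2 * a), data(2 * b))
  })
  indices
}
{noformat}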





[jira] [Commented] (SPARK-993) Don't reuse Writable objects in SequenceFile by default

2014-04-25 Thread Arun Ramakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980747#comment-13980747
 ] 

Arun Ramakrishnan commented on SPARK-993:
-

How does one reproduce this issue?

I tried a few things in the Spark shell locally:

{noformat}
import java.io.File
import com.google.common.io.Files
import org.apache.hadoop.io._

val tempDir = Files.createTempDir()
val outputDir = new File(tempDir, "output").getAbsolutePath
val num = 100
val nums = sc.makeRDD(1 to num).map(x => ("a" * x, x))
nums.saveAsSequenceFile(outputDir)

// Read back through the implicit converters.
val output = sc.sequenceFile[String, Int](outputDir)
assert(output.collect().toSet.size == num)

// Read back as raw Writables, converting before collect().
val t = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable])
assert(t.map { case (k, v) => (k.toString, v.get) }.collect().toSet.size == num)
{noformat}

But the asserts pass, presumably because both reads copy or convert the records before collect().

 Don't reuse Writable objects in SequenceFile by default
 ---

 Key: SPARK-993
 URL: https://issues.apache.org/jira/browse/SPARK-993
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
  Labels: Starter

 Right now we reuse them as an optimization, which leads to weird results when 
 you call collect() on a file with distinct items. We should instead make that 
 behavior optional through a flag.
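
 A sketch of how the reuse could surface (hypothetical repro, reusing 
 outputDir from the snippet above; the raw Writables are collected 
 without copying):

{noformat}
import org.apache.hadoop.io.{IntWritable, Text}

// Hadoop's reader reuses a single Text/IntWritable instance per
// partition, so without a defensive copy every element of the
// collected array may point at the same (last) record.
val raw = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable])
val weird = raw.collect()  // may show one repeated record per partition

// Copying each record first gives the expected distinct values.
val ok = raw
  .map { case (k, v) => (new Text(k), new IntWritable(v.get)) }
  .collect()
{noformat}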





[jira] [Commented] (SPARK-1438) Update RDD.sample() API to make seed parameter optional

2014-04-23 Thread Arun Ramakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977874#comment-13977874
 ] 

Arun Ramakrishnan commented on SPARK-1438:
--

NEW PR at https://github.com/apache/spark/pull/477

 Update RDD.sample() API to make seed parameter optional
 ---

 Key: SPARK-1438
 URL: https://issues.apache.org/jira/browse/SPARK-1438
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Priority: Blocker
  Labels: Starter
 Fix For: 1.0.0


 When a seed is not given, it should pick one based on Math.random().
 This needs to be done in Java and Python as well.
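
 One possible shape for the change (a hypothetical sketch over a plain 
 Seq, not the actual RDD signature; the seed defaults to a 
 Math.random()-derived value as the description suggests):

{noformat}
import scala.util.Random

// Hypothetical sketch: make the seed optional via a default argument,
// deriving one from Math.random() when the caller omits it.
def sample[T](items: Seq[T],
              fraction: Double,
              seed: Long = (Math.random() * Long.MaxValue).toLong): Seq[T] = {
  val rng = new Random(seed)
  items.filter(_ => rng.nextDouble() < fraction)
}
{noformat}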





[jira] [Commented] (SPARK-1438) Update RDD.sample() API to make seed parameter optional

2014-04-21 Thread Arun Ramakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975486#comment-13975486
 ] 

Arun Ramakrishnan commented on SPARK-1438:
--

pull request at https://github.com/apache/spark/pull/462

 Update RDD.sample() API to make seed parameter optional
 ---

 Key: SPARK-1438
 URL: https://issues.apache.org/jira/browse/SPARK-1438
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Priority: Blocker
  Labels: Starter
 Fix For: 1.0.0


 When a seed is not given, it should pick one based on Math.random().
 This needs to be done in Java and Python as well.


