[jira] [Commented] (SPARK-6599) Improve reliability and usability of Kinesis-based Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650054#comment-14650054 ]

Arun Ramakrishnan commented on SPARK-6599:
------------------------------------------

[~tdas] Curious about the design docs for this.

Improve reliability and usability of Kinesis-based Spark Streaming
-------------------------------------------------------------------

Key: SPARK-6599
URL: https://issues.apache.org/jira/browse/SPARK-6599
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das

Currently, the KinesisReceiver can lose some data in the case of certain failures (receiver and driver failures). Using the write ahead logs can mitigate some of the problem, but it is not ideal because WALs don't work with S3 (eventual consistency, etc.), which is the most likely file system to be used in the EC2 environment. Hence, we have to take a different approach to improving reliability for Kinesis. A detailed design doc on how this can be achieved will be added later.
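For context, the WAL mitigation the description mentions is a configuration switch on receiver-based streams. A minimal sketch, assuming a hypothetical app name and checkpoint path; this shows the existing mitigation the ticket calls insufficient on S3, not the new design:

{noformat}
// Sketch only: enabling the receiver write ahead log, the mitigation the
// ticket says is not ideal on S3. App name and paths are hypothetical.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kinesis-wal-example") // hypothetical app name
  // WAL for receiver-based streams; needs a fault-tolerant checkpoint dir
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// The WAL is written under the checkpoint directory; on EC2 that directory is
// often on S3, where eventual consistency undermines the WAL -- the ticket's
// core problem.
ssc.checkpoint("hdfs:///checkpoints/kinesis-app") // hypothetical path
{noformat}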
[jira] [Commented] (SPARK-2047) Use less memory in AppendOnlyMap.destructiveSortedIterator
[ https://issues.apache.org/jira/browse/SPARK-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049435#comment-14049435 ]

Arun Ramakrishnan commented on SPARK-2047:
------------------------------------------

I can work on this. [~matei] Can you assign this to me?

Use less memory in AppendOnlyMap.destructiveSortedIterator
-----------------------------------------------------------

Key: SPARK-2047
URL: https://issues.apache.org/jira/browse/SPARK-2047
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Matei Zaharia

This method tries to sort the key-value pairs in the map in place but ends up allocating a Tuple2 object for each one, which allocates a nontrivial amount of memory (32 or more bytes per entry on a 64-bit JVM). We could instead try to sort the objects in place within the data array, or allocate an int array with the indices and sort those using a custom comparator. The latter is probably easiest to begin with; a sketch of that approach follows.
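To make the index-array option concrete, here is a minimal sketch, not Spark's actual implementation. It assumes the map's entries have already been compacted to the front of the flat data array (key at data(2*i), value at data(2*i+1)) and that a keyOrdering over keys is available; only one Int per entry is allocated instead of a Tuple2.

{noformat}
// Sketch of the "int array of indices + custom comparator" idea.
// Assumes entries are compacted to the front of `data`; `keyOrdering` is a
// hypothetical ordering over the map's keys.
def destructiveSortedIndices(
    data: Array[AnyRef],          // alternating key/value slots
    numEntries: Int,              // number of occupied key/value pairs
    keyOrdering: Ordering[AnyRef]): Array[Int] = {

  // One Int per entry instead of one Tuple2 (saves 32+ bytes/entry on a
  // 64-bit JVM, per the description).
  val indices = Array.tabulate(numEntries)(identity)

  def swap(i: Int, j: Int): Unit = {
    val t = indices(i); indices(i) = indices(j); indices(j) = t
  }

  // In-place quicksort over the indices, comparing the keys they point to.
  def sort(lo: Int, hi: Int): Unit = {
    if (lo >= hi) return
    val pivotKey = data(2 * indices((lo + hi) / 2))
    var i = lo
    var j = hi
    while (i <= j) {
      while (keyOrdering.compare(data(2 * indices(i)), pivotKey) < 0) i += 1
      while (keyOrdering.compare(data(2 * indices(j)), pivotKey) > 0) j -= 1
      if (i <= j) { swap(i, j); i += 1; j -= 1 }
    }
    sort(lo, j)
    sort(i, hi)
  }

  sort(0, numEntries - 1)
  indices
}
{noformat}

Sorting a primitive Int array by hand like this avoids the per-entry boxing that a Comparator-based java.util.Arrays.sort over Integer would reintroduce.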
[jira] [Commented] (SPARK-993) Don't reuse Writable objects in SequenceFile by default
[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980747#comment-13980747 ]

Arun Ramakrishnan commented on SPARK-993:
-----------------------------------------

How does one reproduce this issue? I tried a few things on the Spark shell locally:

{noformat}
import java.io.File
import com.google.common.io.Files
import org.apache.hadoop.io._

val tempDir = Files.createTempDir()
val outputDir = new File(tempDir, "output").getAbsolutePath
val num = 100
val nums = sc.makeRDD(1 to num).map(x => ("a" * x, x))
nums.saveAsSequenceFile(outputDir)

val output = sc.sequenceFile[String, Int](outputDir)
assert(output.collect().toSet.size == num)

val t = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable])
assert(t.map { case (k, v) => (k.toString, v.get) }.collect().toSet.size == num)
{noformat}

But the asserts seem to be fine.

Don't reuse Writable objects in SequenceFile by default
-------------------------------------------------------

Key: SPARK-993
URL: https://issues.apache.org/jira/browse/SPARK-993
Project: Spark
Issue Type: Improvement
Reporter: Matei Zaharia
Labels: Starter

Right now we reuse them as an optimization, which leads to weird results when you call collect() on a file with distinct items. We should instead make that behavior optional through a flag.
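A hedged guess at why those asserts pass, not confirmed in this thread: the map(...) calls copy each record into fresh immutable values (k.toString, v.get) before collect(), which hides the reuse. Holding the mutable Writables without copying, for example via cache(), should expose it. A sketch continuing the hypothetical shell session above (same sc and outputDir):

{noformat}
// Cache the raw (Text, IntWritable) pairs without copying them first.
val raw = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable]).cache()
raw.count() // materialize the in-memory cache of object references

// If the RecordReader reused one Text instance for every record, all cached
// pairs now point at the same object, holding only the last record read.
val distinctKeys = raw.map { case (k, _) => k.toString }.collect().toSet
println(distinctKeys.size) // may be far less than num if instances were reused
{noformat}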
[jira] [Commented] (SPARK-1438) Update RDD.sample() API to make seed parameter optional
[ https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977874#comment-13977874 ]

Arun Ramakrishnan commented on SPARK-1438:
------------------------------------------

NEW PR at https://github.com/apache/spark/pull/477

Update RDD.sample() API to make seed parameter optional
--------------------------------------------------------

Key: SPARK-1438
URL: https://issues.apache.org/jira/browse/SPARK-1438
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Matei Zaharia
Priority: Blocker
Labels: Starter
Fix For: 1.0.0

When a seed is not given, it should pick one based on Math.random(). This needs to be done in Java and Python as well.
[jira] [Commented] (SPARK-1438) Update RDD.sample() API to make seed parameter optional
[ https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975486#comment-13975486 ]

Arun Ramakrishnan commented on SPARK-1438:
------------------------------------------

Pull request at https://github.com/apache/spark/pull/462
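The change the ticket describes is small on the Scala side. A minimal sketch, using an illustrative trait rather than Spark's real RDD class, with the default derived from Math.random() as the description suggests; the merged PRs above may differ:

{noformat}
// Illustrative shape only -- not Spark's real RDD API or the merged patch.
trait SampleableRDD[T] {
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      // Default seed based on Math.random(), per the ticket description.
      // A Scala default argument is re-evaluated on each call, so every call
      // that omits the seed gets a fresh one.
      seed: Long = (Math.random() * Long.MaxValue).toLong
  ): SampleableRDD[T]
}
{noformat}

Java has no default arguments, so covering it as the ticket requires would mean adding an overload without the seed parameter; in Python a seed=None default with a random draw inside the function does the same job.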