[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001320#comment-15001320 ]

Akshat Aranya commented on SPARK-7708:
--------------------------------------

Kryo 3.x will resolve this issue, but it can't be used (yet) because Spark relies on Chill, and Chill is pegged to Kryo 2.2.1.

> Incorrect task serialization with Kryo closure serializer
> ---------------------------------------------------------
>
>                 Key: SPARK-7708
>                 URL: https://issues.apache.org/jira/browse/SPARK-7708
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.2
>            Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with
> Spark 1.2, and it seems like I've hit upon a bug. When a task is serialized
> before scheduling, the following log message is generated:
>
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342,
> host, PROCESS_LOCAL, 302 bytes)
>
> This message comes from TaskSetManager, which serializes the task using the
> closure serializer. Before the message is sent out, the TaskDescription
> (which includes the original task as a byte array) is serialized again into
> a byte array with the closure serializer. I added a log message for this in
> CoarseGrainedSchedulerBackend, which produces the following output:
>
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
>
> The serialized size of the TaskDescription (132 bytes) turns out to be
> _smaller_ than the serialized task that it contains (302 bytes). This
> implies that TaskDescription.buffer is not getting serialized correctly.
> On the executor side, deserialization produces a null value for
> TaskDescription.buffer.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564919#comment-14564919 ]

Akshat Aranya commented on SPARK-7708:
--------------------------------------

Thanks, Josh. I'll look into it. I can't spend all my time on this either, but I'll continue with my PR when I get the time.
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563786#comment-14563786 ]

Akshat Aranya commented on SPARK-7708:
--------------------------------------

[~joshrosen] I tried the test once again with your new code merged in, and it seems like it's not a problem with resetting the Kryo object. In my test, I serialize the same object twice with the same KryoSerializerInstance, but I end up with two different serialized buffers:

{noformat}
serialized.limit=369
serialized.limit=366
{noformat}

Clearly, there is some state inside the serializer that isn't reset even after calling {{reset()}}.
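The two different limits above ({{serialized.limit=369}} vs {{serialized.limit=366}}) are the fingerprint of state that survives inside a serializer instance across calls. The effect can be reproduced with nothing but the JDK: a single ObjectOutputStream remembers every object it has already written and emits only a short back-reference the second time, so repeat serializations of the same object yield different byte counts. This is a stdlib-only analogy to the Kryo behavior, not the Spark code path:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class StatefulSerializerDemo {
    // Hypothetical stand-in for the partition being serialized in the test.
    static class Partition implements Serializable {
        String path = "/foo:0+100";
    }

    public static void main(String[] args) throws IOException {
        Partition p = new Partition();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);

        out.writeObject(p);
        out.flush();
        int firstSize = bytes.size();

        // Writing the same object again emits only a back-reference,
        // because the stream carries a handle table across calls --
        // just as a Kryo instance carries registration/reference state
        // across serialize() calls.
        out.writeObject(p);
        out.flush();
        int secondSize = bytes.size() - firstSize;

        System.out.println(firstSize + " then " + secondSize);
        // The second write is much smaller than the first.
    }
}
```

A serializer that produces different bytes for the same input depending on call history is exactly what makes size mismatches like the 369/366 pair above possible.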
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564092#comment-14564092 ]

Akshat Aranya commented on SPARK-7708:
--------------------------------------

Wow, that's some serious sleuthing! I will try the newer version of Kryo and see if the rest of my serialization problems go away.
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559549#comment-14559549 ]

Akshat Aranya commented on SPARK-7708:
--------------------------------------

[~joshrosen] I am tracking down a problem with Kryo serialization of closures, and I have distilled it to a unit test that has been failing for me: https://github.com/coolfrood/spark/blob/topic/kryo-closure-test/core/src/test/scala/org/apache/spark/serializer/KryoClosureSerializerSuite.scala

In this test, I serialize and deserialize a class containing a {{SerializableWritable}} twice with the same serializer instance. The first time around, the ser/deser works fine. The second time around, it fails due to some previous state contained in the serializer:

{noformat}
00:00 TRACE: [kryo] Object graph complete.
00:00 TRACE: [kryo] Register class name: org.apache.spark.serializer.TestPartition (com.esotericsoftware.kryo.serializers.FieldSerializer)
00:00 TRACE: [kryo] Write class name: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Write initial object reference 0: org.apache.spark.serializer.TestPartition
00:00 DEBUG: [kryo] Write: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Write field: foobar (org.apache.spark.serializer.TestPartition)
00:00 TRACE: [kryo] Write class 31: org.apache.spark.SerializableWritable
00:00 TRACE: [kryo] Write initial object reference 1: /foo:0+100
00:00 DEBUG: [kryo] Write: /foo:0+100
00:00 TRACE: [kryo] Object graph complete.
serialized.limit=369
00:00 TRACE: [kryo] Read class name: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Read initial object reference 0: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Read field: foobar (org.apache.spark.serializer.TestPartition)
00:00 TRACE: [kryo] Read class 31: org.apache.spark.SerializableWritable
00:00 TRACE: [kryo] Read initial object reference 1: org.apache.spark.SerializableWritable
00:00 DEBUG: [kryo] Read: /foo:0+100
00:00 DEBUG: [kryo] Read: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Object graph complete.
is=/foo:0+100
00:00 TRACE: [kryo] Object graph complete.
00:00 TRACE: [kryo] Write class name: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Write initial object reference 0: org.apache.spark.serializer.TestPartition
00:00 DEBUG: [kryo] Write: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Write field: foobar (org.apache.spark.serializer.TestPartition)
00:00 TRACE: [kryo] Write class 31: org.apache.spark.SerializableWritable
00:00 TRACE: [kryo] Write initial object reference 1: /foo:0+100
00:00 DEBUG: [kryo] Write: /foo:0+100
00:00 TRACE: [kryo] Object graph complete.
serialized.limit=366
00:00 TRACE: [kryo] Read class name: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Read initial object reference 0: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Read field: foobar (org.apache.spark.serializer.TestPartition)
00:00 TRACE: [kryo] Read initial object reference 1: org.apache.spark.SerializableWritable
00:00 TRACE: [kryo] Object graph complete.
[info] - serialize various things using kryo closure serializer *** FAILED *** (175 milliseconds)
[info]   com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
[info] Serialization trace:
[info] foobar (org.apache.spark.serializer.TestPartition)
[info]   at com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
[info]   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
[info]   at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
[info]   at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
[info]   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
[info]   at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:193)
[info]   at org.apache.spark.serializer.KryoClosureSerializerSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(KryoClosureSerializerSuite.scala:62)
[info]   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
[info]   at org.apache.spark.serializer.KryoClosureSerializerSuite$$anonfun$1.apply$mcV$sp(KryoClosureSerializerSuite.scala:59)
[info]   at org.apache.spark.serializer.KryoClosureSerializerSuite$$anonfun$1.apply(KryoClosureSerializerSuite.scala:51)
[info]   at org.apache.spark.serializer.KryoClosureSerializerSuite$$anonfun$1.apply(KryoClosureSerializerSuite.scala:51)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at
{noformat}
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556634#comment-14556634 ]

Akshat Aranya commented on SPARK-7708:
--------------------------------------

[~joshrosen], here are some fixes for getting Kryo serialization working for closures. It's not fully working yet; I found problems while running the word count example that I'm trying to figure out, but I thought I'd post this to get some feedback.
[jira] [Comment Edited] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554797#comment-14554797 ]

Akshat Aranya edited comment on SPARK-7708 at 5/21/15 6:22 PM:
---------------------------------------------------------------

I was able to get this working with a couple of fixes:

1. Implementing serialization methods for Kryo in SerializableBuffer. An alternative is to register SerializableBuffer with JavaSerialization in Kryo, but that defeats the purpose.

2. The second part is a bit hokey because tasks within one executor process are deserialized from a shared broadcast variable. Kryo deserialization modifies the input buffer, so it isn't thread-safe (https://code.google.com/p/kryo/issues/detail?id=128). I worked around this by copying the broadcast buffer to a local buffer before deserializing.

These fixes are for 1.2, so I'll see if I can port them to master and write a test for them.


was (Author: aaranya):
I was able to get this working with a couple of fixes:

1. Implementing serialization methods for Kryo in SerializableBuffer. An alternative is to register SerializableBuffer with JavaSerialization in Kryo, but that defeats the purpose.

2. The second part is a bit hokey because tasks within one executor process are deserialized from a shared broadcast variable. Kryo deserialization modifies the input buffer, so it isn't thread-safe (https://code.google.com/p/kryo/issues/detail?id=128). I worked around this by copying the broadcast buffer to a local buffer before deserializing.
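Fix 2 above (copying the shared broadcast buffer to a local buffer before deserializing) can be illustrated without Kryo. In the sketch below, {{deserializeInPlace}} is a hypothetical stand-in for a deserializer that mutates its input, which is the hazard the linked Kryo issue describes; copying into a task-local array first keeps the shared bytes intact for other tasks in the same process:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SharedBufferCopyDemo {
    // Hypothetical stand-in for a deserializer that mutates its input
    // buffer (the behavior reported against Kryo): here it just
    // clobbers the bytes to make the mutation visible.
    static void deserializeInPlace(byte[] buf) {
        Arrays.fill(buf, (byte) 0);
    }

    public static void main(String[] args) {
        // A broadcast-style buffer shared by several tasks in one executor.
        ByteBuffer shared = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});

        // Work on a private copy so the mutating deserializer cannot
        // corrupt the shared bytes. duplicate() reads through an
        // independent position, leaving the shared buffer untouched.
        byte[] local = new byte[shared.remaining()];
        shared.duplicate().get(local);

        deserializeInPlace(local);

        System.out.println(shared.get(0) + " " + local[0]);  // → 1 0
    }
}
```

The copy costs one allocation and one memcpy per task, which is the trade-off the workaround accepts in exchange for thread safety.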
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554797#comment-14554797 ]

Akshat Aranya commented on SPARK-7708:
--------------------------------------

I was able to get this working with a couple of fixes:

1. Implementing serialization methods for Kryo in SerializableBuffer. An alternative is to register SerializableBuffer with JavaSerialization in Kryo, but that defeats the purpose.

2. The second part is a bit hokey because tasks within one executor process are deserialized from a shared broadcast variable. Kryo deserialization modifies the input buffer, so it isn't thread-safe (https://code.google.com/p/kryo/issues/detail?id=128). I worked around this by copying the broadcast buffer to a local buffer before deserializing.
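Fix 1 boils down to spelling out, by hand, how the wrapped ByteBuffer travels: write its length, then its bytes, and rebuild the buffer on the other side. That is the shape a hand-written Kryo write/read pair for SerializableBuffer takes; the sketch below shows the same shape with plain JDK streams and hypothetical helper names, not the actual Spark class:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class BufferRoundTripDemo {
    // What an explicit write() for a buffer wrapper must do: emit the
    // length, then the bytes, so read() knows how much to consume.
    static void write(DataOutputStream out, ByteBuffer buf) throws IOException {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes);  // copy without disturbing buf's position
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    static ByteBuffer read(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return ByteBuffer.wrap(bytes);
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer original = ByteBuffer.wrap("task bytes".getBytes("UTF-8"));

        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        write(new DataOutputStream(sink), original);

        ByteBuffer restored = read(new DataInputStream(
                new ByteArrayInputStream(sink.toByteArray())));

        System.out.println(new String(restored.array(), "UTF-8"));  // → task bytes
    }
}
```

Without such explicit handling, a field-walking serializer has nothing to write for a wrapper whose buffer lives in a transient field, which is exactly the 132-byte-vs-302-byte symptom in the issue description.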
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554857#comment-14554857 ]

Akshat Aranya commented on SPARK-7708:
--------------------------------------

I don't have the numbers with me right now, but I _did_ see that serialization of byte arrays was still faster with native Kryo serialization as compared to JavaSerialization. I'll clean up my fixes and submit a PR.
[jira] [Created] (SPARK-7795) Speed up task serialization in standalone mode
Akshat Aranya created SPARK-7795:
------------------------------------

             Summary: Speed up task serialization in standalone mode
                 Key: SPARK-7795
                 URL: https://issues.apache.org/jira/browse/SPARK-7795
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 1.3.1
            Reporter: Akshat Aranya

My experiments with scheduling very short tasks in standalone cluster mode indicated that a significant amount of time was being spent in scheduling the tasks (500ms for 256 tasks). I found that most of the time was being spent in creating a new instance of serializer for each task. Changing this to just one serializer brought down the scheduling time to 8ms.
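The 500ms-to-8ms improvement comes down to amortizing construction cost: build the serializer once and reuse it for every task, instead of once per task. A toy before/after sketch, where the hypothetical {{FakeSerializer}} stands in for an expensive-to-construct SerializerInstance:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SerializerReuseDemo {
    static final AtomicInteger constructed = new AtomicInteger();

    // Hypothetical stand-in for a serializer whose constructor is costly.
    static class FakeSerializer {
        FakeSerializer() { constructed.incrementAndGet(); }
        byte[] serialize(String task) { return task.getBytes(); }
    }

    public static void main(String[] args) {
        // Before: one serializer constructed per task (256 constructions).
        constructed.set(0);
        for (int i = 0; i < 256; i++) {
            new FakeSerializer().serialize("task-" + i);
        }
        int perTask = constructed.get();

        // After: a single instance serializes every task on this thread.
        constructed.set(0);
        FakeSerializer shared = new FakeSerializer();
        for (int i = 0; i < 256; i++) {
            shared.serialize("task-" + i);
        }
        int reused = constructed.get();

        System.out.println(perTask + " constructions vs " + reused);  // → 256 constructions vs 1
    }
}
```

One caveat the JIRA thread itself raises: serializer instances are generally not thread-safe, so reuse has to be confined to a single thread (or mediated by something like a ThreadLocal) rather than shared globally.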
[jira] [Created] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
Akshat Aranya created SPARK-7708:
------------------------------------

             Summary: Incorrect task serialization with Kryo closure serializer
                 Key: SPARK-7708
                 URL: https://issues.apache.org/jira/browse/SPARK-7708
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.2.2
            Reporter: Akshat Aranya
[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer
[ https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548282#comment-14548282 ]

Akshat Aranya commented on SPARK-7708:
--------------------------------------

This happens because the TaskDescription contains a SerializableBuffer, which has no fields from the point of view of Kryo. It only has a transient var and a def. When this is serialized by Kryo, it finds no member fields to serialize, so it doesn't write out the underlying buffer.
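The failure mode described above is easy to check with reflection: a class whose only state is a transient field exposes zero fields that a field-walking serializer (such as Kryo's FieldSerializer) would write. {{BufferHolder}} below is a hypothetical stand-in with the same shape as SerializableBuffer (a transient var plus an accessor), not the Spark class itself:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.nio.ByteBuffer;

public class FieldScanDemo {
    // Hypothetical stand-in mirroring SerializableBuffer's shape:
    // the only state is transient, and the payload is reached via a method.
    static class BufferHolder {
        private transient ByteBuffer buffer;
        ByteBuffer value() { return buffer; }
    }

    public static void main(String[] args) {
        int serializableFields = 0;
        for (Field f : BufferHolder.class.getDeclaredFields()) {
            int m = f.getModifiers();
            // Field-walking serializers skip transient and static fields.
            if (!Modifier.isTransient(m) && !Modifier.isStatic(m)) {
                serializableFields++;
            }
        }
        // Nothing to write, so the wrapped buffer silently vanishes
        // and deserialization hands back null.
        System.out.println(serializableFields);  // → 0
    }
}
```

Java serialization avoids this trap only because SerializableBuffer implements custom writeObject/readObject logic, which field-walking Kryo does not invoke.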
[jira] [Created] (SPARK-4478) totalRegisteredExecutors not updated properly
Akshat Aranya created SPARK-4478:
------------------------------------

             Summary: totalRegisteredExecutors not updated properly
                 Key: SPARK-4478
                 URL: https://issues.apache.org/jira/browse/SPARK-4478
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.1.0
            Reporter: Akshat Aranya

I'm trying to write a new scheduler backend that relies on org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend. Specifically, I want to use the field totalRegisteredExecutors, but this field is only ever incremented; it is not decremented when an executor is lost.
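The fix the report implies is symmetric bookkeeping: decrement on executor loss as well as incrementing on registration, so the counter tracks live executors rather than all executors ever seen. A minimal sketch with hypothetical callback names (not the actual CoarseGrainedSchedulerBackend API):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ExecutorCounterDemo {
    // AtomicInteger because registration and loss events can arrive
    // on different RPC threads.
    static final AtomicInteger totalRegisteredExecutors = new AtomicInteger();

    static void onExecutorRegistered() { totalRegisteredExecutors.incrementAndGet(); }

    // The missing half described in the report: without this,
    // the counter only ever grows.
    static void onExecutorLost()       { totalRegisteredExecutors.decrementAndGet(); }

    public static void main(String[] args) {
        onExecutorRegistered();
        onExecutorRegistered();
        onExecutorLost();
        System.out.println(totalRegisteredExecutors.get());  // → 1
    }
}
```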
[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173904#comment-14173904 ]

Akshat Aranya commented on SPARK-2365:
--------------------------------------

This looks great! I have been using IndexedRDD for a while, to great effect. I have one suggestion: it would be nice to override setName() in IndexedRDDLike

{code}
override def setName(_name: String): this.type = {
  partitionsRDD.setName(_name)
  this
}
{code}

so that the IndexedRDD shows up with friendly names in the storage UI, just like regular, cached RDDs do.

> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
>                 Key: SPARK-2365
>                 URL: https://issues.apache.org/jira/browse/SPARK-2365
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, Spark Core
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>         Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This
> imposes minimal requirements on the storage layer, which only needs to
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient
> support for point lookups would enable serving data out of RDDs, but it
> currently requires iterating over an entire partition to find the desired
> element. Point updates similarly require copying an entire iterator. Joins
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing
> key uniqueness and pre-indexing the entries for efficient joins and point
> lookups, updates, and deletions. It would be implemented by (1)
> hash-partitioning the entries by key, (2) maintaining a hash index within
> each partition, and (3) using purely functional (immutable and efficiently
> updatable) data structures to enable efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements
> a limited form of this functionality in VertexRDD. We envision a variety of
> other uses for IndexedRDD, including streaming updates to RDDs, direct
> serving from RDDs, and as an execution strategy for Spark SQL.