[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-11-11 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001320#comment-15001320
 ] 

Akshat Aranya commented on SPARK-7708:
--

Kryo 3.x will resolve this issue, but it can't be used (yet) because Spark 
relies on Chill, and Chill is pegged to Kryo 2.21. 

> Incorrect task serialization with Kryo closure serializer
> -
>
> Key: SPARK-7708
> URL: https://issues.apache.org/jira/browse/SPARK-7708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.2
>Reporter: Akshat Aranya
>
> I've been investigating the use of Kryo for closure serialization with Spark 
> 1.2, and it seems like I've hit upon a bug:
> When a task is serialized before scheduling, the following log message is 
> generated:
> [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
> host, PROCESS_LOCAL, 302 bytes)
> This message comes from TaskSetManager which serializes the task using the 
> closure serializer.  Before the message is sent out, the TaskDescription 
> (which includes the original task as a byte array) is serialized again into 
> a byte array with the closure serializer.  I added a log message for this in 
> CoarseGrainedSchedulerBackend, which produces the following output:
> [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
> The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
> than the serialized task that it contains (302 bytes). This implies that 
> TaskDescription.buffer is not getting serialized correctly.
> On the executor side, the deserialization produces a null value for 
> TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-29 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564919#comment-14564919
 ] 

Akshat Aranya commented on SPARK-7708:
--

Thanks, Josh. I'll look into it. I can't spend all my time on this either, but 
I'll continue with my PR when I get the time.




[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-28 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563786#comment-14563786
 ] 

Akshat Aranya commented on SPARK-7708:
--

[~joshrosen] I tried the test once again with your new code merged in, and it 
seems like it's not a problem with resetting the Kryo object.  In my test, I 
serialize the same object twice with the same KryoSerializerInstance, but I end up 
with two different serialized buffers:

{noformat}
serialized.limit=369
serialized.limit=366
{noformat}

Clearly, there is some state inside the serializer that isn't reset even after 
calling {{reset()}}.
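The kind of cross-call state at work here can be seen with the standard library alone. The sketch below is not Kryo; it uses plain Java serialization, which keeps per-stream state much the way Kryo keeps registration/reference state: writing the same object twice to one stream produces different byte counts, because the second write is only a back-reference.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Not Kryo, but the same phenomenon: a serializer that keeps state across
// writes emits different (smaller) bytes the second time it sees the same
// object, because descriptors and object handles are written only once.
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)

oos.writeObject("some task payload"); oos.flush()
val firstSize = bos.size()

oos.writeObject("some task payload"); oos.flush()
val secondSize = bos.size() - firstSize

println(s"first=$firstSize second=$secondSize")  // second write is much smaller
```

A serializer whose `reset()` does not clear all such state will show exactly the symptom above: byte-for-byte different output for identical input.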




[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-28 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564092#comment-14564092
 ] 

Akshat Aranya commented on SPARK-7708:
--

Wow, that's some serious sleuthing! I will try the newer version of Kryo and 
see if the rest of my serialization problems go away.




[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-26 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559549#comment-14559549
 ] 

Akshat Aranya commented on SPARK-7708:
--

[~joshrosen] I am tracking down a problem with Kryo serialization of closures, 
and I have distilled it to a unit test that has been failing for me:

https://github.com/coolfrood/spark/blob/topic/kryo-closure-test/core/src/test/scala/org/apache/spark/serializer/KryoClosureSerializerSuite.scala

In this test, I serialize and deserialize a class containing a 
{{SerializableWritable}} twice with the same serializer instance.  The first 
time around, the ser/deser works fine.  The second time around, it fails due to 
some previous state contained in the serializer:

{noformat}
00:00 TRACE: [kryo] Object graph complete.
00:00 TRACE: [kryo] Register class name: 
org.apache.spark.serializer.TestPartition 
(com.esotericsoftware.kryo.serializers.FieldSerializer)
00:00 TRACE: [kryo] Write class name: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Write initial object reference 0: 
org.apache.spark.serializer.TestPartition
00:00 DEBUG: [kryo] Write: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Write field: foobar 
(org.apache.spark.serializer.TestPartition)
00:00 TRACE: [kryo] Write class 31: org.apache.spark.SerializableWritable
00:00 TRACE: [kryo] Write initial object reference 1: /foo:0+100
00:00 DEBUG: [kryo] Write: /foo:0+100
00:00 TRACE: [kryo] Object graph complete.
serialized.limit=369
00:00 TRACE: [kryo] Read class name: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Read initial object reference 0: 
org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Read field: foobar 
(org.apache.spark.serializer.TestPartition)
00:00 TRACE: [kryo] Read class 31: org.apache.spark.SerializableWritable
00:00 TRACE: [kryo] Read initial object reference 1: 
org.apache.spark.SerializableWritable
00:00 DEBUG: [kryo] Read: /foo:0+100
00:00 DEBUG: [kryo] Read: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Object graph complete.
is=/foo:0+100
00:00 TRACE: [kryo] Object graph complete.
00:00 TRACE: [kryo] Write class name: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Write initial object reference 0: 
org.apache.spark.serializer.TestPartition
00:00 DEBUG: [kryo] Write: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Write field: foobar 
(org.apache.spark.serializer.TestPartition)
00:00 TRACE: [kryo] Write class 31: org.apache.spark.SerializableWritable
00:00 TRACE: [kryo] Write initial object reference 1: /foo:0+100
00:00 DEBUG: [kryo] Write: /foo:0+100
00:00 TRACE: [kryo] Object graph complete.
serialized.limit=366
00:00 TRACE: [kryo] Read class name: org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Read initial object reference 0: 
org.apache.spark.serializer.TestPartition
00:00 TRACE: [kryo] Read field: foobar 
(org.apache.spark.serializer.TestPartition)
00:00 TRACE: [kryo] Read initial object reference 1: 
org.apache.spark.SerializableWritable
00:00 TRACE: [kryo] Object graph complete.
[info] - serialize various things using kryo closure serializer *** FAILED *** 
(175 milliseconds)
[info]   com.esotericsoftware.kryo.KryoException: Error during Java 
deserialization.
[info] Serialization trace:
[info] foobar (org.apache.spark.serializer.TestPartition)
[info]   at 
com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
[info]   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
[info]   at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
[info]   at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
[info]   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
[info]   at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:193)
[info]   at 
org.apache.spark.serializer.KryoClosureSerializerSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(KryoClosureSerializerSuite.scala:62)
[info]   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
[info]   at 
org.apache.spark.serializer.KryoClosureSerializerSuite$$anonfun$1.apply$mcV$sp(KryoClosureSerializerSuite.scala:59)
[info]   at 
org.apache.spark.serializer.KryoClosureSerializerSuite$$anonfun$1.apply(KryoClosureSerializerSuite.scala:51)
[info]   at 
org.apache.spark.serializer.KryoClosureSerializerSuite$$anonfun$1.apply(KryoClosureSerializerSuite.scala:51)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at 

[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-22 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556634#comment-14556634
 ] 

Akshat Aranya commented on SPARK-7708:
--

[~joshrosen], here are some fixes for getting Kryo serialization working for 
closures. It's not fully working yet; I found problems while running the word 
count example that I'm trying to figure out, but I thought I'd post this to get 
some feedback.




[jira] [Comment Edited] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-21 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554797#comment-14554797
 ] 

Akshat Aranya edited comment on SPARK-7708 at 5/21/15 6:22 PM:
---

I was able to get this working with a couple of fixes:

1. Implementing serialization methods for Kryo in SerializableBuffer.  An 
alternative is to register SerializableBuffer with JavaSerialization in Kryo, 
but that defeats the purpose.
2. The second part is a bit hokey because tasks within one executor process are 
deserialized from a shared broadcast variable.  Kryo deserialization modifies 
the input buffer, so it isn't thread-safe 
(https://code.google.com/p/kryo/issues/detail?id=128).  I worked around this by 
copying the broadcast buffer to a local buffer before deserializing.

These fixes are for 1.2, so I'll see if I can port them to master and write a 
test for them.


was (Author: aaranya):
I was able to get this working with a couple of fixes:

1. Implementing serialization methods for Kryo in SerializableBuffer.  An 
alternative is to register SerializableBuffer with JavaSerialization in Kryo, 
but that defeats the purpose.
2. The second part is a bit hokey because tasks within one executor process are 
deserialized from a shared broadcast variable.  Kryo deserialization modifies 
the input buffer, so it isn't thread-safe 
(https://code.google.com/p/kryo/issues/detail?id=128).  I worked around this by 
copying the broadcast buffer to a local buffer before deserializing.




[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-21 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554797#comment-14554797
 ] 

Akshat Aranya commented on SPARK-7708:
--

I was able to get this working with a couple of fixes:

1. Implementing serialization methods for Kryo in SerializableBuffer.  An 
alternative is to register SerializableBuffer with JavaSerialization in Kryo, 
but that defeats the purpose.
2. The second part is a bit hokey because tasks within one executor process are 
deserialized from a shared broadcast variable.  Kryo deserialization modifies 
the input buffer, so it isn't thread-safe 
(https://code.google.com/p/kryo/issues/detail?id=128).  I worked around this by 
copying the broadcast buffer to a local buffer before deserializing.
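The second workaround can be sketched in isolation. Nothing below is Spark's actual code; `deserializeSafely` and its `deserialize` callback are hypothetical stand-ins illustrating the defensive copy of the shared broadcast buffer:

```scala
import java.nio.ByteBuffer

// Hypothetical sketch of fix #2: copy the shared bytes into a private buffer
// before deserializing, because Kryo mutates the buffer it reads from.
def deserializeSafely[T](shared: ByteBuffer)(deserialize: ByteBuffer => T): T = {
  val local = ByteBuffer.allocate(shared.remaining())
  local.put(shared.duplicate())  // duplicate() so the shared position is untouched
  local.flip()                   // make the copy readable from the start
  deserialize(local)
}

val shared = ByteBuffer.wrap(Array[Byte](1, 2, 3, 4))
val out = deserializeSafely(shared) { buf =>
  val arr = new Array[Byte](buf.remaining()); buf.get(arr); arr.toSeq
}
println(out)                     // the payload round-trips
println(shared.remaining())      // still 4: the shared buffer was not consumed
```

Each task pays for an extra copy of the task bytes, but the shared broadcast buffer is never mutated, so concurrent deserialization within one executor stays safe.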




[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-21 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554857#comment-14554857
 ] 

Akshat Aranya commented on SPARK-7708:
--

I don't have the numbers with me right now, but I _did_ see that serialization 
of byte arrays was still faster with native Kryo serialization as compared to 
JavaSerialization.

I'll clean up my fixes and submit a PR.




[jira] [Created] (SPARK-7795) Speed up task serialization in standalone mode

2015-05-21 Thread Akshat Aranya (JIRA)
Akshat Aranya created SPARK-7795:


 Summary: Speed up task serialization in standalone mode
 Key: SPARK-7795
 URL: https://issues.apache.org/jira/browse/SPARK-7795
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Akshat Aranya


My experiments with scheduling very short tasks in standalone cluster mode 
indicated that a significant amount of time was being spent in scheduling the 
tasks (500ms for 256 tasks). I found that most of the time was being spent in 
creating a new instance of serializer for each task. Changing this to just one 
serializer brought down the scheduling time to 8ms.
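A minimal sketch of the proposed change, with all names hypothetical and a counting stub in place of the real closure serializer, hoists instance creation out of the per-task loop:

```scala
// Hypothetical stub standing in for the scheduler's closure serializer; it
// only counts how many instances get created.
object Counter { var n = 0 }
class CountingSerializer {
  Counter.n += 1
  def serialize(task: String): Array[Byte] = task.getBytes("UTF-8")
}

val tasks = (1 to 256).map(i => s"task-$i")

// Before: a fresh serializer instance per task (the slow path).
Counter.n = 0
tasks.foreach(t => new CountingSerializer().serialize(t))
val perTask = Counter.n     // one instantiation per task

// After: one serializer shared across the whole batch.
Counter.n = 0
val shared = new CountingSerializer()
tasks.foreach(shared.serialize)
val batched = Counter.n     // a single instantiation

println(s"perTask=$perTask batched=$batched")
```

In Spark itself the shared instance must not be used from multiple threads at once, which is why the real change is more involved than this sketch.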






[jira] [Created] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-18 Thread Akshat Aranya (JIRA)
Akshat Aranya created SPARK-7708:


 Summary: Incorrect task serialization with Kryo closure serializer
 Key: SPARK-7708
 URL: https://issues.apache.org/jira/browse/SPARK-7708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.2
Reporter: Akshat Aranya


I've been investigating the use of Kryo for closure serialization with Spark 
1.2, and it seems like I've hit upon a bug:

When a task is serialized before scheduling, the following log message is 
generated:

[info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
host, PROCESS_LOCAL, 302 bytes)

This message comes from TaskSetManager which serializes the task using the 
closure serializer.  Before the message is sent out, the TaskDescription (which 
includes the original task as a byte array) is serialized again into a byte 
array with the closure serializer.  I added a log message for this in 
CoarseGrainedSchedulerBackend, which produces the following output:

[info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132

The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
than the serialized task that it contains (302 bytes). This implies that 
TaskDescription.buffer is not getting serialized correctly.

On the executor side, the deserialization produces a null value for 
TaskDescription.buffer.
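The arithmetic above can be stated as an invariant, sketched here with plain Java serialization and a hypothetical `TaskDescriptionLike` (not Spark's class): a correctly serialized wrapper must come out at least as large as the byte payload it carries, so 132 < 302 can only mean the payload was dropped.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for TaskDescription: an id plus the already-serialized
// task bytes.
case class TaskDescriptionLike(taskId: Long, buffer: Array[Byte]) extends Serializable

def javaSerialize(o: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(o)
  oos.close()
  bos.toByteArray
}

val payload = new Array[Byte](302)   // the 302-byte serialized task
val wrapped = javaSerialize(TaskDescriptionLike(342L, payload))
println(wrapped.length)              // comfortably above 302; 132 would mean lost data
```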






[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-18 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548282#comment-14548282
 ] 

Akshat Aranya commented on SPARK-7708:
--

This happens because the TaskDescription contains a SerializableBuffer, which 
has no fields from the point of view of Kryo.  It only has a transient var and 
a def.  When this is serialized by Kryo, it finds no member fields to 
serialize, so it doesn't write out the underlying buffer.
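This is easy to see with reflection alone. The class below is a hypothetical mirror of SerializableBuffer's shape (a transient var plus a def), not the real thing; a field-walking serializer like Kryo's FieldSerializer considers only non-transient, non-static instance fields, and here finds none:

```scala
import java.lang.reflect.Modifier
import java.nio.ByteBuffer

// Hypothetical mirror of SerializableBuffer's shape: the payload lives only in
// a transient var, exposed through a def; its custom Java (de)serialization
// methods are never called by a field-based serializer.
class SerializableBufferLike(@transient var buffer: ByteBuffer) extends Serializable {
  def value: ByteBuffer = buffer
}

// What a field-walking serializer sees: non-transient, non-static fields.
val visibleFields = classOf[SerializableBufferLike].getDeclaredFields.toSeq
  .filterNot(f => Modifier.isTransient(f.getModifiers))
  .filterNot(f => Modifier.isStatic(f.getModifiers))

println(visibleFields.map(_.getName))  // empty: nothing for Kryo to write
```

With no visible fields, the serialized form carries only class metadata, which matches the undersized TaskDescription and the null buffer seen on the executor.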




[jira] [Created] (SPARK-4478) totalRegisteredExecutors not updated properly

2014-11-18 Thread Akshat Aranya (JIRA)
Akshat Aranya created SPARK-4478:


 Summary: totalRegisteredExecutors not updated properly
 Key: SPARK-4478
 URL: https://issues.apache.org/jira/browse/SPARK-4478
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Akshat Aranya


I'm trying to write a new scheduler backend that relies on 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.  
Specifically, I want to use the field totalRegisteredExecutors, but this field 
is only incremented, never decremented when an executor is lost.






[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2014-10-16 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173904#comment-14173904
 ] 

Akshat Aranya commented on SPARK-2365:
--

This looks great!  I have been using IndexedRDD for a while, to great effect.  
I have one suggestion: it would be nice to override setName() in IndexedRDDLike

{code}
override def setName(_name: String): this.type = {
  partitionsRDD.setName(_name)
  this
}
{code}

so that the IndexedRDD shows up with friendly names in the storage UI, just 
like regular, cached RDDs do.


 Add IndexedRDD, an efficient updatable key-value store
 --

 Key: SPARK-2365
 URL: https://issues.apache.org/jira/browse/SPARK-2365
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, Spark Core
Reporter: Ankur Dave
Assignee: Ankur Dave
 Attachments: 2014-07-07-IndexedRDD-design-review.pdf


 RDDs currently provide a bulk-updatable, iterator-based interface. This 
 imposes minimal requirements on the storage layer, which only needs to 
 support sequential access, enabling on-disk and serialized storage.
 However, many applications would benefit from a richer interface. Efficient 
 support for point lookups would enable serving data out of RDDs, but it 
 currently requires iterating over an entire partition to find the desired 
 element. Point updates similarly require copying an entire iterator. Joins 
 are also expensive, requiring a shuffle and local hash joins.
 To address these problems, we propose IndexedRDD, an efficient key-value 
 store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
 uniqueness and pre-indexing the entries for efficient joins and point 
 lookups, updates, and deletions.
 It would be implemented by (1) hash-partitioning the entries by key, (2) 
 maintaining a hash index within each partition, and (3) using purely 
 functional (immutable and efficiently updatable) data structures to enable 
 efficient modifications and deletions.
 GraphX would be the first user of IndexedRDD, since it currently implements a 
 limited form of this functionality in VertexRDD. We envision a variety of 
 other uses for IndexedRDD, including streaming updates to RDDs, direct 
 serving from RDDs, and as an execution strategy for Spark SQL.


