[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892016#comment-15892016 ]

Giambattista commented on SPARK-17931:
--------------------------------------

Thanks, I just opened SPARK-19796 and added the required details.

> taskScheduler has some unneeded serialization
> ---------------------------------------------
>
>                 Key: SPARK-17931
>                 URL: https://issues.apache.org/jira/browse/SPARK-17931
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Guoqiang Li
>            Assignee: Kay Ousterhout
>             Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization involved in
> sending a task from the scheduler to an executor:
> - A Task object is serialized.
> - The Task object is copied to a byte buffer that also contains serialized
>   information about any additional JARs, files, and Properties needed for
>   the task to execute. This byte buffer is stored as the member variable
>   serializedTask in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized task +
>   JARs, the TaskDescription class contains the task ID and other metadata)
>   and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that the JAR,
> file, and Property info can be deserialized prior to deserializing the Task
> object, the third layer of serialization is unnecessary (this is a result
> of SPARK-2521). We should eliminate a layer of serialization by moving the
> JARs, files, and Properties into the TaskDescription class.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
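[Editor's sketch] The consolidation described in the issue — moving JARs, files, and Properties into TaskDescription and hand-encoding them ahead of the task bytes — can be illustrated with a minimal, hypothetical encoder (class and method names here are illustrative, not Spark's actual code; Spark's real encoder lives in TaskDescription.scala):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;

public class TaskDescriptionSketch {
    // Encode task metadata, dependency maps, and the already-serialized task
    // body into a single buffer. The executor can decode the JAR and property
    // maps first, before deserializing the task body at the end.
    static byte[] encode(long taskId, int attemptNumber,
                         Map<String, Long> jars, Map<String, String> properties,
                         byte[] serializedTask) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeLong(taskId);
        out.writeInt(attemptNumber);
        // Maps are written with an explicit count so decoding knows where
        // they end and the serialized Task begins.
        out.writeInt(jars.size());
        for (Map.Entry<String, Long> e : jars.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeLong(e.getValue());
        }
        out.writeInt(properties.size());
        for (Map.Entry<String, String> e : properties.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        // The serialized Task object is appended last, untouched; no second
        // serialization pass wraps this buffer.
        out.write(serializedTask);
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = encode(42L, 0,
                Map.of("app.jar", 1L),
                Map.of("spark.job.description", "example"),
                new byte[]{1, 2, 3});
        System.out.println(buf.length > 0);  // true
    }
}
```

Note that `writeUTF` is used for the keys and values here, which is exactly where the 64 KB issue reported later in this thread comes from.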
[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890749#comment-15890749 ]

Imran Rashid commented on SPARK-17931:
--------------------------------------

[~gbloisi] thanks for reporting the issue. Can you go ahead and open another
jira for this? Please ping me on it.

Since you have a reproduction, it would also be helpful if you could tell us
what the offending property is that is going over 64KB. E.g., you could do
something like this (untested):

{code}
if (value.length > 16 * 1024) {
  val f = File.createTempFile(s"long_property_$key", ".txt")
  logWarning(s"Value for $key has length ${value.length}, writing to $f")
  val out = new PrintWriter(f)
  out.println(value)
  out.close()
}
{code}

and then attach the generated file.

Your workaround looks pretty dangerous -- if that property were actually
important, then just randomly truncating it would be a big problem. The
serialization should be made safe for long strings, but we might also want
to find the source of that long string and avoid it.
[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890226#comment-15890226 ]

Giambattista commented on SPARK-17931:
--------------------------------------

I just wanted to report that after this change Spark is failing to execute
long SQL statements (in my case they were long INSERT INTO table statements).
The problem I was facing is very well described in this article:
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/

Eventually, I was able to get them working again with the change below.

{code}
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
     }
{code}
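[Editor's sketch] The limit Giambattista hit comes from java.io.DataOutputStream.writeUTF, which prefixes the string's modified-UTF-8 encoding with an unsigned 16-bit length and therefore throws UTFDataFormatException for anything whose encoding exceeds 65535 bytes. A minimal demonstration, together with a length-safe alternative that writes the raw UTF-8 bytes behind an int length prefix (an illustrative workaround, not necessarily the fix Spark adopted):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

public class WriteUtfLimit {
    // writeUTF uses a 2-byte length prefix, so strings whose modified-UTF-8
    // encoding exceeds 65535 bytes cannot be written at all.
    static boolean fitsInWriteUtf(String s) {
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Length-safe alternative: an int length prefix allows values up to 2 GB.
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    public static void main(String[] args) throws IOException {
        String longSql = "x".repeat(70_000);          // e.g. a huge INSERT statement
        System.out.println(fitsInWriteUtf("short"));  // true
        System.out.println(fitsInWriteUtf(longSql));  // false: encoding > 64 KB

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeLongString(new DataOutputStream(buf), longSql);  // no exception
        System.out.println(buf.size() == 70_000 + 4);         // true
    }
}
```

This also shows why the truncating workaround above is lossy: any property value longer than the cap is silently cut, which is what Imran warns about in his reply.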
[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704407#comment-15704407 ]

Apache Spark commented on SPARK-17931:
--------------------------------------

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/16053

> taskScheduler has some unneeded serialization
> ---------------------------------------------
>
>                 Key: SPARK-17931
>                 URL: https://issues.apache.org/jira/browse/SPARK-17931
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Guoqiang Li
>
>
> When taskScheduler instantiates TaskDescription, it calls
> `Task.serializeWithDependencies(task, sched.sc.addedFiles,
> sched.sc.addedJars, ser)`, which serializes the task and its dependencies.
> But after SPARK-2521 was merged into master, the ResultTask and
> ShuffleMapTask classes no longer contain rdd and closure objects. The
> TaskDescription class can be changed as below:
> {noformat}
> class TaskDescription[T](
>     val taskId: Long,
>     val attemptNumber: Int,
>     val executorId: String,
>     val name: String,
>     val index: Int,
>     val task: Task[T]) extends Serializable
> {noformat}
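[Editor's sketch] The proposal above removes the inner byte buffer: instead of serializing the Task, storing the bytes inside TaskDescription, and serializing that wrapper again, the scheduler would serialize a TaskDescription that holds the Task directly, in one pass. The two shapes can be contrasted with plain Java serialization (illustrative classes, not Spark's):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class DoubleSerialization {
    static class Task implements Serializable { int partition = 7; }

    // Old shape: the description carries an already-serialized task, so the
    // task bytes are produced in one pass and then re-serialized in another.
    static class WrappedDesc implements Serializable {
        long taskId;
        byte[] serializedTask;
    }

    // Proposed shape: the description carries the Task itself and is
    // serialized in a single pass.
    static class DirectDesc implements Serializable {
        long taskId;
        Task task;
    }

    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        WrappedDesc wrapped = new WrappedDesc();
        wrapped.taskId = 1L;
        wrapped.serializedTask = toBytes(new Task());  // first pass
        byte[] twoPass = toBytes(wrapped);             // second pass

        DirectDesc direct = new DirectDesc();
        direct.taskId = 1L;
        direct.task = new Task();
        byte[] onePass = toBytes(direct);              // single pass

        System.out.println(twoPass.length > 0 && onePass.length > 0);  // true
    }
}
```

The merged change (PR 16053) went further than this sketch, hand-encoding the fields rather than relying on Java serialization, but the single-pass structure is the same idea.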
[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579590#comment-15579590 ]

Apache Spark commented on SPARK-17931:
--------------------------------------

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/15505