[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2017-03-02 Thread Giambattista (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892016#comment-15892016
 ] 

Giambattista commented on SPARK-17931:
--

Thanks, I just opened SPARK-19796 and added required details.

> taskScheduler has some unneeded serialization
> ---------------------------------------------
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>Assignee: Kay Ousterhout
> Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of serialization is
> unnecessary (a result of SPARK-2521). We should eliminate a layer
> of serialization by moving the JARs, files, and Properties into the
> TaskDescription class.
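The layering described above can be illustrated with a minimal sketch. This is a hypothetical Java analogue, not Spark's actual TaskDescription code (all names here are invented for illustration): the files/JARs/properties are written as explicit entries ahead of the opaque serialized-task bytes, so an executor can decode the metadata without deserializing the Task itself, and nothing is wrapped in a second serialization pass.

```java
import java.io.*;
import java.util.*;

public class TaskDescriptionSketch {
    // Encode metadata first, then the opaque task bytes, in one pass.
    static byte[] encode(long taskId, Map<String, String> props, byte[] taskBytes)
            throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeLong(taskId);
        out.writeInt(props.size());
        for (Map.Entry<String, String> e : props.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        out.writeInt(taskBytes.length);   // length-prefixed opaque payload
        out.write(taskBytes);
        out.flush();
        return bos.toByteArray();
    }

    // Decode the metadata without touching the task payload's own format.
    static Object[] decode(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        long taskId = in.readLong();
        Map<String, String> props = new HashMap<>();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            props.put(in.readUTF(), in.readUTF());
        }
        byte[] task = new byte[in.readInt()];
        in.readFully(task);
        return new Object[] { taskId, props, task };
    }
}
```

The point of the sketch is that the metadata and the task bytes share one stream, so the former TaskDescription-around-serializedTask wrapper layer disappears.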



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2017-03-01 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890749#comment-15890749
 ] 

Imran Rashid commented on SPARK-17931:
--

[~gbloisi] thanks for reporting the issue.  Can you go ahead and open another 
jira for this?  Please ping me on it.  Since you have a reproduction, it would 
also be helpful if you could tell us which property it is that is going over 
64KB.  E.g., you could do something like this (untested):

{code}
import java.io.{File, PrintWriter}

if (value.length > 16 * 1024) {
  val f = File.createTempFile(s"long_property_$key", ".txt")
  logWarning(s"Value for $key has length ${value.length}, writing to $f")
  val out = new PrintWriter(f)
  try out.println(value) finally out.close()
}
{code}

and then attach the generated file.

Your workaround looks pretty dangerous -- if that property were actually 
important, then silently truncating it would be a big problem.  The approach 
above should be safe with long strings, but we might also want to find the 
source of that long string and avoid it.




[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2017-03-01 Thread Giambattista (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890226#comment-15890226
 ] 

Giambattista commented on SPARK-17931:
--

I just wanted to report that after this change Spark fails to execute long 
SQL statements (in my case they were long INSERT INTO table statements).
The problem I was facing is very well described in this article: 
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
Eventually, I was able to get them working again with the change below.

--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
 dataOut.writeInt(taskDescription.properties.size())
 taskDescription.properties.asScala.foreach { case (key, value) =>
   dataOut.writeUTF(key)
-  dataOut.writeUTF(value)
+  dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
 }
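The limit behind the failure is `java.io.DataOutputStream.writeUTF`, which stores the encoded length in an unsigned 16-bit prefix and therefore throws `UTFDataFormatException` once the modified-UTF-8 encoding of the string exceeds 65535 bytes. A small standalone sketch (plain Java, not a patch against Spark) demonstrating the limit, together with a limit-free alternative that writes an int length prefix plus raw UTF-8 bytes instead of truncating the value:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class WriteUtfLimit {
    // Returns true if writeUTF accepts the string, false if it hits the
    // 65535-byte modified-UTF-8 limit.
    static boolean writeUtfOk(String s) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(s);   // length prefix is an unsigned short
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        }
    }

    // Limit-free alternative: int length prefix + raw UTF-8 bytes.
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Unlike the truncating workaround above, the length-prefixed form round-trips the full property value.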






[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704407#comment-15704407
 ] 

Apache Spark commented on SPARK-17931:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/16053

> taskScheduler has some unneeded serialization
> ---------------------------------------------
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> When the taskScheduler instantiates a TaskDescription, it calls 
> `Task.serializeWithDependencies(task, sched.sc.addedFiles, 
> sched.sc.addedJars, ser)`, which serializes the task together with its 
> dependencies. But after SPARK-2521 was merged into master, the ResultTask 
> and ShuffleMapTask classes no longer contain rdd and closure objects. 
> The TaskDescription class can be changed as below:
> {noformat}
> class TaskDescription[T](
> val taskId: Long,
> val attemptNumber: Int,
> val executorId: String,
> val name: String,
> val index: Int, 
> val task: Task[T]) extends Serializable
> {noformat}
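Under that proposal, the whole description (task object included) goes through a single serialization pass. A minimal, hypothetical Java analogue of the proposed class (not Spark's actual code; names and field types are simplified for illustration):

```java
import java.io.*;

// Hypothetical analogue of the proposed TaskDescription: one Serializable
// object holding the task, so everything serializes in a single pass.
public class SimpleTaskDescription implements Serializable {
    final long taskId;
    final int attemptNumber;
    final String executorId;
    final String name;
    final int index;
    final Serializable task;   // stands in for Task[T]

    SimpleTaskDescription(long taskId, int attemptNumber, String executorId,
                          String name, int index, Serializable task) {
        this.taskId = taskId;
        this.attemptNumber = attemptNumber;
        this.executorId = executorId;
        this.name = name;
        this.index = index;
        this.task = task;
    }

    byte[] toBytes() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(this);   // one pass for the whole description
        }
        return bos.toByteArray();
    }

    static SimpleTaskDescription fromBytes(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SimpleTaskDescription) ois.readObject();
        }
    }
}
```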






[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2016-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579590#comment-15579590
 ] 

Apache Spark commented on SPARK-17931:
--

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/15505
