[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364712#comment-14364712
 ] 

Apache Spark commented on SPARK-5523:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5064

 TaskMetrics and TaskInfo have innumerable copies of the hostname string
 ---

 Key: SPARK-5523
 URL: https://issues.apache.org/jira/browse/SPARK-5523
 Project: Spark
 Issue Type: Bug
 Components: Spark Core, Streaming
 Reporter: Tathagata Das

 TaskMetrics and TaskInfo objects have the hostname associated with the task. 
 As these are created (directly or through deserialization of RPC messages), 
 each of them has a separate String object for the hostname even though most 
 of them hold the same string data. This results in thousands of 
 string objects, increasing the memory requirement of the driver. 
 This can be easily deduped when deserializing a TaskMetrics object, or when 
 creating a TaskInfo object.
 This affects streaming particularly badly due to the rate of job/stage/task 
 generation. 
 For a solution, see how this dedup is done for StorageLevel. 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226
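
 To make the duplication concrete, here is a minimal sketch, not Spark code. 
 It assumes a hypothetical {{TaskMessage}} class standing in for any 
 serialized object that carries a hostname, and uses plain Java serialization 
 to count how many distinct String instances survive deserialization.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.{Collections, IdentityHashMap}

// Hypothetical stand-in for a message that carries a hostname; not a Spark class.
case class TaskMessage(taskId: Long, host: String)

object DuplicateHostnames {
  // Serialize and deserialize through fresh streams, as separate RPC messages would be.
  def roundTrip(msg: TaskMessage): TaskMessage = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(msg)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[TaskMessage] finally in.close()
  }

  // Count String instances by reference (not by content) across `count` messages
  // that all name the same host.
  def distinctHostInstances(count: Int): Int = {
    val seen = Collections.newSetFromMap(new IdentityHashMap[String, java.lang.Boolean]())
    (0 until count).foreach { i =>
      seen.add(roundTrip(TaskMessage(i.toLong, "host-1.example.com")).host)
    }
    seen.size()
  }
}
```

 Every round trip allocates its own hostname String, so the count of distinct 
 instances grows with the number of messages even though the content is 
 identical.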
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-03-16 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364279#comment-14364279
 ] 

Tathagata Das commented on SPARK-5523:
--

As long as the hostname object is short-lived, it's cool. That's the same
strategy used for StorageLevel, so it is fine.

On Mon, Mar 16, 2015 at 12:36 AM, Saisai Shao (JIRA) j...@apache.org






[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-03-16 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362867#comment-14362867
 ] 

Saisai Shao commented on SPARK-5523:


I investigated the implementation details of the host field in 
{{TaskMetrics}} and {{TaskInfo}} a little. 

I think for {{TaskInfo}}, Spark already uses {{hostPortParseResults}} in 
Utils.scala to cache the host name, so we don't need to do this again. Also, 
{{TaskInfo}} only resides in the driver, so there is no RPC-related object 
recreation.

For {{TaskMetrics}}, the problem lies in deserialization. We could 
rewrite the {{readObject()}} function to use a cached host name, like this:

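{code}
  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
    // Restore all fields; this allocates a fresh _hostname String.
    in.defaultReadObject()
    // Swap the fresh String for the shared instance held in the pool.
    _hostname = getHostFromPool(_hostname)
  }
{code}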

The problem is that we still create a {{_hostname}} String object, though a 
very short-lived one. I'm not sure whether there is a way to reuse the cached 
hostname without creating even a very short-lived one. Any suggestions? Thanks a lot.
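
For discussion, the proposal can be tried outside Spark with the minimal sketch below. Everything except the name {{getHostFromPool}} is an assumption: {{HostnamePool}}, {{MetricsSketch}} (a stand-in for {{TaskMetrics}} reduced to the hostname field) and {{RoundTrip}} are hypothetical, and {{Utils.tryOrIOException}} is left out so the block stays self-contained.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, IOException, ObjectInputStream, ObjectOutputStream}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical pool; only the method name comes from the proposal above.
object HostnamePool {
  private val pool = new ConcurrentHashMap[String, String]()

  // Return the pooled instance equal to `host`, registering `host` if it is new.
  def getHostFromPool(host: String): String = {
    if (host == null) return null // ConcurrentHashMap rejects null keys
    val existing = pool.putIfAbsent(host, host)
    if (existing == null) host else existing
  }
}

// Hypothetical stand-in for TaskMetrics, reduced to the hostname field.
class MetricsSketch(private var _hostname: String) extends Serializable {
  def hostname: String = _hostname

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    // The freshly deserialized String is unreferenced once it has been swapped.
    _hostname = HostnamePool.getHostFromPool(_hostname)
  }
}

// Helper that pushes an object through Java serialization and back.
object RoundTrip {
  def apply[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

In this sketch the temporary String still gets allocated by {{defaultReadObject()}}, which is exactly the short-lived object asked about above; only the long-lived reference is shared.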




[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-03-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362681#comment-14362681
 ] 

Saisai Shao commented on SPARK-5523:


Hi [~tdas], will this large number of strings add heavy overhead to GC or 
anything else? Assuming these objects become garbage very soon, they will 
stay in the young generation and be collected quickly, so I'm not sure how 
much overhead this actually adds.




[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-03-09 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353847#comment-14353847
 ] 

Tathagata Das commented on SPARK-5523:
--

[~jerryshao] You can take a crack at this if you want. 






[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-03-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354078#comment-14354078
 ] 

Saisai Shao commented on SPARK-5523:


Hi [~tdas], I will take a look at this issue and try to find a solution. It looks 
like a basic solution is to use a HashMap to cache the data, as is done for {{StorageLevel}}.
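
As a rough illustration of that caching idea, here is a minimal generic sketch. It assumes the cache follows a putIfAbsent-then-get pattern on a concurrent map; the class name {{CanonicalCache}} and its methods are hypothetical and are not taken from Spark.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical canonicalizing cache: maps every value to the first equal
// instance that was registered, so callers can drop their own copy.
final class CanonicalCache[T <: AnyRef] {
  private val cache = new ConcurrentHashMap[T, T]()

  // Register `value` if no equal instance exists, then return the shared one.
  def canonical(value: T): T = {
    cache.putIfAbsent(value, value)
    cache.get(value)
  }

  // Number of distinct values currently held.
  def entryCount: Int = cache.size()
}
```

Note that this map is unbounded: it only stays small when the set of distinct values is small, which should hold for hostnames in a cluster of fixed size.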




[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-02-02 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301089#comment-14301089
 ] 

Tathagata Das commented on SPARK-5523:
--

That is a little risky. In Java 6, interned strings go to the PermGen
space, which is already oversubscribed thanks to the huge set of
dependencies of Spark (Hadoop, Hive, etc.). It might be better to use
a very specific solution whose performance penalty we can bound
(a bounded HashMap for the hostname cache).
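
One possible shape for such a bounded cache is sketched below, assuming least-recently-used eviction on top of {{java.util.LinkedHashMap}}. The class name {{BoundedHostnameCache}}, the {{maxEntries}} parameter and the coarse locking are illustrative assumptions, not a proposal for the actual patch.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Hypothetical bounded hostname cache with LRU eviction.
final class BoundedHostnameCache(maxEntries: Int) {
  require(maxEntries > 0, "maxEntries must be positive")

  // accessOrder = true keeps the least recently used entry first in iteration order.
  private val cache = new JLinkedHashMap[String, String](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[String, String]): Boolean =
      this.size() > maxEntries
  }

  // Return the cached instance equal to `host`, caching `host` itself on a miss.
  def canonical(host: String): String = cache.synchronized {
    val existing = cache.get(host)
    if (existing != null) existing
    else {
      cache.put(host, host)
      host
    }
  }

  // Number of hostnames currently cached; never exceeds maxEntries.
  def entryCount: Int = cache.synchronized(cache.size())
}
```

An evicted hostname simply gets cached again on its next use, so the worst case is a few duplicate strings rather than unbounded growth.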






[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-02-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301073#comment-14301073
 ] 

Sean Owen commented on SPARK-5523:
--

How about `String.intern()` here, as a native implementation of the flyweight 
pattern? You pay some extra overhead for consulting the interned string pool 
every time, but potentially save memory.
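
A small sketch of what that buys, using only `java.lang.String` (the object and method names are made up for the example):

```scala
object InternDemo {
  // Build two separate String objects with the same content, then check that
  // interning both yields one shared instance.
  def sharedAfterIntern(host: String): Boolean = {
    val first = new String(host)
    val second = new String(host)
    val distinctBefore = first ne second          // two separate heap objects
    val sharedAfter = first.intern() eq second.intern()
    distinctBefore && sharedAfter
  }
}
```

Equal strings always intern to the same instance, so each hostname would be stored once for the whole JVM.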
