[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364712#comment-14364712 ]

Apache Spark commented on SPARK-5523:
-------------------------------------

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5064

TaskMetrics and TaskInfo have innumerable copies of the hostname string
-----------------------------------------------------------------------

                Key: SPARK-5523
                URL: https://issues.apache.org/jira/browse/SPARK-5523
            Project: Spark
         Issue Type: Bug
         Components: Spark Core, Streaming
           Reporter: Tathagata Das

TaskMetrics and TaskInfo objects have the hostname associated with the task. As these are created (directly or through deserialization of RPC messages), each of them has a separate String object for the hostname, even though most of them hold the same string data. This results in thousands of String objects, increasing the memory requirement of the driver.

This can easily be deduped when deserializing a TaskMetrics object, or when creating a TaskInfo object.

This affects streaming particularly badly due to the rate of job/stage/task generation.

For a solution, see how this dedup is done for StorageLevel.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364279#comment-14364279 ]

Tathagata Das commented on SPARK-5523:
--------------------------------------

As long as the hostname object is short-lived, it's cool. That's the same strategy used for StorageLevel, so it is fine.

(In reply to Saisai Shao (JIRA) j...@apache.org, Mon, Mar 16, 2015 at 12:36 AM.)
[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362867#comment-14362867 ]

Saisai Shao commented on SPARK-5523:
------------------------------------

After investigating the implementation details of the host in {{TaskMetrics}} and {{TaskInfo}}, I think that for {{TaskInfo}}, Spark already uses {{hostPortParseResults}} in Utils.scala to cache the host name, so we don't need to do this again. Also, {{TaskInfo}} only resides in the driver, so it is never recreated through RPC deserialization.

For {{TaskMetrics}}, the problem lies in deserialization. We could rewrite the {{readObject()}} function to use a cached host name, like this:

{code}
import java.io.{IOException, ObjectInputStream}

@throws(classOf[IOException])
private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
  in.defaultReadObject()
  // Swap the freshly deserialized String for the pooled instance.
  _hostname = getHostFromPool(_hostname)
}
{code}

The problem is that we still create a {{_hostname}} String object, though a very short-lived one. I'm not sure whether there is a way to reuse the cached hostname without creating even a short-lived one. Any suggestions?

Thanks a lot.
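A minimal, self-contained sketch of the readObject() approach from the comment above. It is an illustration under assumptions, not the code from the pull request: Metrics is a simplified stand-in for TaskMetrics, HostPool and its ConcurrentHashMap are hypothetical, and getHostFromPool is only the name used in the comment. Spark's Utils.tryOrIOException wrapper is left out so the snippet runs without Spark.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, IOException,
  ObjectInputStream, ObjectOutputStream}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical pool: maps each hostname to one canonical String instance.
object HostPool {
  private val pool = new ConcurrentHashMap[String, String]()

  def getHostFromPool(host: String): String = {
    // putIfAbsent returns the previously stored instance, or null if none existed.
    val previous = pool.putIfAbsent(host, host)
    if (previous == null) host else previous
  }
}

// Simplified stand-in for TaskMetrics (assumption: only the hostname field matters here).
class Metrics(private var _hostname: String) extends Serializable {
  def hostname: String = _hostname

  // Java serialization hook: runs on every deserialization of a Metrics object.
  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    // The String read from the stream is replaced by the pooled one and
    // becomes garbage immediately afterwards.
    _hostname = HostPool.getHostFromPool(_hostname)
  }
}

object RoundTrip {
  // Serializes and deserializes a Metrics object, as an RPC hop would.
  def roundTrip(m: Metrics): Metrics = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(m)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[Metrics] finally in.close()
  }
}
```

Two objects with equal hostnames that pass through RoundTrip.roundTrip end up referencing one pooled String. A temporary String is still created by defaultReadObject() before the lookup, which is the short-lived object the comment asks about.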
[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362681#comment-14362681 ]

Saisai Shao commented on SPARK-5523:
------------------------------------

Hi [~tdas], will this large number of strings put heavy overhead on GC, or on anything else? Assuming these objects become obsolete very soon, they will stay in the young generation and be garbage-collected quickly, so I'm not sure how much overhead this actually adds.
[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353847#comment-14353847 ]

Tathagata Das commented on SPARK-5523:
--------------------------------------

[~jerryshao] You can take a crack at this if you want.
[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354078#comment-14354078 ]

Saisai Shao commented on SPARK-5523:
------------------------------------

Hi [~tdas], I will take a look at this issue and try to find a solution. It looks like a basic solution is to use a HashMap to cache the data, as is done for {{StorageLevel}}.
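The map-based cache mentioned in this comment could look roughly like the sketch below. It assumes a concurrent map keyed by the hostname itself, so that equal strings collapse to the first instance seen. HostnameCache and canonical are hypothetical names chosen for illustration, not Spark APIs, and the map here is unbounded.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical helper, not part of Spark.
object HostnameCache {
  // Key and value are the same String: the map only serves to find
  // the canonical instance for a given hostname.
  private val cache = new ConcurrentHashMap[String, String]()

  def canonical(host: String): String = {
    // putIfAbsent is atomic, so concurrent callers agree on one instance.
    val previous = cache.putIfAbsent(host, host)
    if (previous == null) host else previous
  }
}

object HostnameCacheDemo {
  def main(args: Array[String]): Unit = {
    // new String(...) forces two distinct objects with equal contents,
    // similar to what separate deserializations would produce.
    val a = new String("worker-1.example.com")
    val b = new String("worker-1.example.com")
    println(a eq b)                                                 // reference equality before dedup
    println(HostnameCache.canonical(a) eq HostnameCache.canonical(b)) // reference equality after dedup
  }
}
```

Callers would store the result of canonical(...) instead of the original String, so the duplicate becomes unreachable and can be collected.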
[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301089#comment-14301089 ]

Tathagata Das commented on SPARK-5523:
--------------------------------------

That is a little risky. In Java 6, interned strings go to the PermGen space, which is already oversubscribed thanks to the huge set of dependencies of Spark (Hadoop, Hive, etc.). It might be better to use a very specific solution whose performance penalty we can bound (a bounded HashMap as the hostname cache).
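One possible shape for the bounded hostname cache suggested here, sketched with java.util.LinkedHashMap in access order so that the least recently used entry is evicted once a limit is exceeded. BoundedHostnameCache, canonical, entries, and the maxEntries parameter are all hypothetical, and LRU eviction is an assumption, since the comment does not say how the bound should be enforced.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Hypothetical bounded cache; maxEntries is an illustrative parameter.
class BoundedHostnameCache(maxEntries: Int) {
  // accessOrder = true keeps entries ordered from least to most recently used.
  private val cache = new JLinkedHashMap[String, String](16, 0.75f, true) {
    // Called by LinkedHashMap after each insertion; returning true evicts
    // the eldest (least recently used) entry.
    override def removeEldestEntry(eldest: JMap.Entry[String, String]): Boolean =
      this.size() > maxEntries
  }

  // Returns the cached instance for an equal hostname, caching it if absent.
  // In access order even get() reorders entries, so all access is synchronized.
  def canonical(host: String): String = cache.synchronized {
    val cached = cache.get(host)
    if (cached != null) cached
    else {
      cache.put(host, host)
      host
    }
  }

  // Current number of cached hostnames; never exceeds maxEntries.
  def entries: Int = cache.synchronized(cache.size())
}
```

The bound caps memory use at roughly maxEntries strings. The cost is that an evicted hostname is stored again as a new instance the next time it is seen, so older objects may keep referring to a different copy.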
[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301073#comment-14301073 ]

Sean Owen commented on SPARK-5523:
----------------------------------

How about `String.intern()` here, as a native implementation of the flyweight pattern? You pay some extra overhead for consulting the interned string pool every time, but potentially save memory.
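For reference, a small sketch of what the `String.intern()` suggestion does. InternDemo and dedup are hypothetical names used only for this illustration. The strings are built with new String(...) to force distinct objects, standing in for hostnames that arrive through separate deserializations.

```scala
object InternDemo {
  // Returns the JVM-wide canonical instance for the given contents.
  def dedup(host: String): String = host.intern()

  def main(args: Array[String]): Unit = {
    val a = new String("worker-1.example.com")
    val b = new String("worker-1.example.com")
    println(a == b)               // content equality
    println(a eq b)               // reference equality of the raw objects
    println(dedup(a) eq dedup(b)) // reference equality after interning
  }
}
```

Unlike an application-level map, the interned pool is owned by the JVM, so the application cannot bound its size or clear it.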