[jira] [Comment Edited] (SPARK-22805) Use aliases for StorageLevel in event logs
[ https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293836#comment-16293836 ] Sergei Lebedev edited comment on SPARK-22805 at 12/16/17 11:17 PM:
---
I've emulated the effect of SPARK-20923 by removing all {{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event log. The table below compares the sizes of this log with and without the patch proposed in this issue:

||Mode||Size||
|Decompressed|157M|
|Decompressed with patch|155M|

*Update*: it turns out {{SparkTaskEndEvent}} carries the list of updated blocks twice (!): as part of {{"Accumulables"}} and in {{"Task Metrics"}}. [~andrewor14], [~srowen], do you know if there is a reason for that? It looks like a bug to me.

*Update*: I'm recomputing the numbers for a fully updatedBlockStatuses-free log.

*Update*: the effect of SPARK-20923 is much more noticeable than I initially thought. Removing {{"internal.metrics.updatedBlockStatuses"}} from {{"Accumulables"}} and {{"Updated Blocks"}} from {{"Task Metrics"}} reduced the log size to 160M. The storage-level compression now shaves off only a few MB (see the updated table).

was (Author: lebedev): I've emulated the effect of SPARK-20923 by removing all {{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event log. The table below compares uncompressed/compressed sizes of this log with and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|

*Update*: it turns out {{SparkTaskEndEvent}} carries the list of updated blocks twice (!): as part of {{"Accumulables"}} and in {{"Task Metrics"}}. [~andrewor14], [~srowen], do you know if there is a reason for that? It looks like a bug to me.

*Update*: I'm recomputing the numbers for a fully updatedBlockStatuses-free log.
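The pruning described in the comment above can be sketched in Python (a hypothetical illustration of the SPARK-20923 effect, not the actual patch; the {{"Accumulables"}} and {{"Updated Blocks"}} field names follow the event-log JSON layout as quoted in this thread):

```python
import json

def strip_updated_block_statuses(line: str) -> str:
    """Drop updated-block data from one event-log JSON line,
    emulating the effect of SPARK-20923."""
    event = json.loads(line)
    # Updated blocks appear once in Task Info's accumulables...
    task_info = event.get("Task Info")
    if task_info and "Accumulables" in task_info:
        task_info["Accumulables"] = [
            acc for acc in task_info["Accumulables"]
            if acc.get("Name") != "internal.metrics.updatedBlockStatuses"
        ]
    # ...and a second time inside Task Metrics.
    metrics = event.get("Task Metrics")
    if metrics is not None:
        metrics.pop("Updated Blocks", None)
    return json.dumps(event)
```

Streaming an event log through this line by line and comparing the output size is the experiment reported in the tables.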
> Use aliases for StorageLevel in event logs
> ------------------------------------------
>
> Key: SPARK-22805
> URL: https://issues.apache.org/jira/browse/SPARK-22805
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.2, 2.2.1
> Reporter: Sergei Lebedev
> Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, so the list of predefined levels is not extendable (by users).
> Fact 2: The event-log format uses a redundant representation for storage levels:
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with many partitions, because every partition carries a storage-level field that is 60-70 bytes larger than it needs to be.
> Suggested quick win: use the names of the predefined levels to identify them in the event log.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
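The suggested quick win can be sketched as follows (a minimal hypothetical illustration in Python; Spark's JsonProtocol is Scala, and the function and dictionary names below are assumptions, not the actual patch — only the flag combinations for the predefined levels are taken from Spark):

```python
# Map of (use_disk, use_memory, deserialized, replication) -> predefined alias.
# Only a few of Spark's predefined levels are listed here for illustration.
PREDEFINED = {
    (True, False, False, 1): "DISK_ONLY",
    (False, True, True, 1): "MEMORY_ONLY",
    (True, True, True, 1): "MEMORY_AND_DISK",
}

FIELDS = ("Use Disk", "Use Memory", "Deserialized", "Replication")

def to_json(use_disk, use_memory, deserialized, replication):
    """Emit the short alias when the level is predefined,
    falling back to the verbose form for custom levels."""
    alias = PREDEFINED.get((use_disk, use_memory, deserialized, replication))
    if alias is not None:
        return alias  # 9 bytes instead of ~79
    return dict(zip(FIELDS, (use_disk, use_memory, deserialized, replication)))

def from_json(value):
    """Accept both forms, so pre-patch logs stay readable."""
    if isinstance(value, str):  # alias form
        flags = next(k for k, v in PREDEFINED.items() if v == value)
        return dict(zip(FIELDS, flags))
    return value  # verbose form
```

Because the reader accepts both the alias and the verbose object, the change stays backward compatible, and levels built via {{StorageLevel.apply}} that match no predefined level still round-trip through the verbose form.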
[ https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293836#comment-16293836 ] Sergei Lebedev edited comment on SPARK-22805 at 12/16/17 10:54 PM:
---
I've emulated the effect of SPARK-20923 by removing all {{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event log. The table below compares uncompressed/compressed sizes of this log with and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|

*Update*: it turns out {{SparkTaskEndEvent}} carries the list of updated blocks twice (!): as part of {{"Accumulables"}} and in {{"Task Metrics"}}. [~andrewor14], [~srowen], do you know if there is a reason for that? It looks like a bug to me.

*Update*: I'm recomputing the numbers for a fully updatedBlockStatuses-free log.

was (Author: lebedev): I've emulated the effect of SPARK-20923 by removing all {{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event log. The table below compares uncompressed/compressed sizes of this log with and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|

*Update*: it turns out {{SparkTaskEndEvent}} carries the list of updated blocks twice (!): as part of {{"Accumulables"}} and in {{"Task Metrics"}}. [~andrewor14], [~srowen], do you know if there is a reason for that? It looks like a bug to me.
[ https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293836#comment-16293836 ] Sergei Lebedev edited comment on SPARK-22805 at 12/16/17 8:37 PM:
--
I've emulated the effect of SPARK-20923 by removing all {{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event log. The table below compares uncompressed/compressed sizes of this log with and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|

*Update*: it turns out {{SparkTaskEndEvent}} carries the list of updated blocks twice (!): as part of {{"Accumulables"}} and in {{"Task Metrics"}}. [~andrewor14], [~srowen], do you know if there is a reason for that? It looks like a bug to me.

was (Author: lebedev): I've emulated the effect of SPARK-20923 by removing all {{"internal.metrics.updatedBlockStatuses"}} entries from the original 79G event log. The table below compares uncompressed/compressed sizes of this log with and without the patch proposed in this issue:

||Mode||Size||
|LZ4-compressed|2.3G|
|Decompressed|25G|
|LZ4-compressed with patch|2.3G|
|Decompressed with patch|16G|
[ https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292758#comment-16292758 ] Sergei Lebedev edited comment on SPARK-22805 at 12/15/17 9:44 PM:
--
I have a patch which preserves backward compatibility. Will post some numbers a bit later. Also, note that while the format is flexible "in theory", in practice it always contains one of the predefined levels.

*Update*: it turns out there's {{StorageLevel.apply}}, so "always" above should read "almost always".

was (Author: lebedev): I have a patch which preserves backward compatibility. Will post some numbers a bit later. Also, note that while the format is flexible "in theory", in practice it always contains one of the predefined levels.

**Update**: it turns out there's {{StorageLevel.apply}}, so "always" above should read "almost always".
[ https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292758#comment-16292758 ] Sergei Lebedev edited comment on SPARK-22805 at 12/15/17 9:43 PM:
--
I have a patch which preserves backward compatibility. Will post some numbers a bit later. Also, note that while the format is flexible "in theory", in practice it always contains one of the predefined levels.

**Update**: it turns out there's {{StorageLevel.apply}}, so "always" above should read "almost always".

was (Author: lebedev): I have a patch which preserves backward compatibility. Will post some numbers a bit later. Also, note that while the format is flexible "in theory", in practice it always contains one of the predefined levels.
[ https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292848#comment-16292848 ] Sergei Lebedev edited comment on SPARK-22805 at 12/15/17 5:25 PM:
--
Here are the results for a single application with 6K partitions. Admittedly, this is not generalizable to every application, but it gives an idea of the redundancy due to {{StorageLevel}}:

||Mode||Size||
|LZ4-compressed|8.1G|
|Decompressed|79G|
|LZ4-compressed with patch|7.2G|
|Decompressed with patch|49G|

was (Author: lebedev): Here are the results for a single application with 6K partitions:

||Mode||Size||
|LZ4-compressed|8.1G|
|Decompressed|79G|
|LZ4-compressed with patch|7.2G|
|Decompressed with patch|49G|
[ https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292758#comment-16292758 ] Sergei Lebedev edited comment on SPARK-22805 at 12/15/17 4:28 PM:
--
I have a patch which preserves backward compatibility. Will post some numbers a bit later. Also, note that while the format is flexible "in theory", in practice it always contains one of the predefined levels.

was (Author: lebedev): I have a patch which preserves backward compatibility. Will post some numbers a bit later.