[GitHub] spark issue #21811: [SPARK-24801][CORE] Avoid memory waste by empty byte[] a...

2018-07-23 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21811 @kiszk the situation "before" is well understood. In the respective SPARK-24801 ticket I present a fragment from the analysis of this heap dump by jxray (www.jxray.com). It shows t

[GitHub] spark issue #21811: [SPARK-24801][CORE] Avoid memory waste by empty byte[] a...

2018-07-23 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21811 Thank you very much for your responses, @squito. I agree with all you said. @kiszk the heap dump that prompted me to make this change was obtained from a customer, who probably ran

[GitHub] spark issue #21811: [SPARK-24801][CORE] Avoid memory waste by empty byte[] a...

2018-07-18 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21811 Yes. On Wed, Jul 18, 2018 at 4:43 PM, UCB AMPLab wrote: > Can one of the admins verify this patch?

[GitHub] spark pull request #21811: [SPARK-24801][CORE] Avoid memory waste by empty b...

2018-07-18 Thread countmdm
GitHub user countmdm opened a pull request: https://github.com/apache/spark/pull/21811 [SPARK-24801][CORE] Avoid memory waste by empty byte[] arrays in SaslEncryption$EncryptedMessage ## What changes were proposed in this pull request? Initialize SaslEncryption

[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

2018-06-02 Thread countmdm
Github user countmdm commented on a diff in the pull request: https://github.com/apache/spark/pull/21456#discussion_r192576365 --- Diff: common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolverSuite.java --- @@ -135,4 +136,23 @@ public

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 Just modified the code to use regexp and pushed the updates.

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 Ok, if you believe this is not a performance problem here, then it's fine with me. To save us some possible further bouncing of this review, can you please share here your pattern/regex code? I

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 @squito regarding the added code for path normalization: exactly as you say. This is just a precaution in case Spark (or even some code that sits above Spark) ends up generating pathnames that contain

[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

2018-05-31 Thread countmdm
Github user countmdm commented on a diff in the pull request: https://github.com/apache/spark/pull/21456#discussion_r192217237 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java --- @@ -272,6 +273,57 @@ void close

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 If we don't do normalization ourselves, we may potentially run into the following: path = ... // Produces "foo//bar" path = path.intern(); // Ok, no separate copies of
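
The concern in the message above can be sketched in Java. This is only an illustration, not the code from the pull request: the class name PathInternDemo and the helper normalize are hypothetical, and the regex is one possible way to collapse repeated separators. The idea is that java.io.File normalizes its path on construction, so interning a string such as "foo//bar" may not help, because File may store a freshly created normalized copy instead. Normalizing first and interning the result avoids that. Whether File keeps the same String instance is a JDK and platform implementation detail, so the sketch prints the identity check rather than asserting it.

```java
import java.io.File;

public class PathInternDemo {
    // Hypothetical helper: collapse runs of '/' into a single '/'.
    // Assumes Unix-style separators; not the PR's actual implementation.
    static String normalize(String path) {
        return path.replaceAll("/{2,}", "/");
    }

    // Reports whether File stored the very same String instance it was given.
    // The reference comparison (==) is intentional here.
    static boolean fileKeepsSameInstance(String path) {
        return new File(path).getPath() == path;
    }

    public static void main(String[] args) {
        String raw = "foo//bar".intern();
        String clean = normalize("foo//bar").intern();

        // Result depends on the JDK and platform; shown for inspection only.
        System.out.println("raw kept by File:   " + fileKeepsSameInstance(raw));
        System.out.println("clean kept by File: " + fileKeepsSameInstance(clean));
    }
}
```

If File replaces the interned string with its own normalized copy, every File instance carries a separate duplicate, which is the situation the normalization step is meant to prevent.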

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 @srowen yes, I am pretty sure that this code generates all these duplicate objects. I've analyzed a heap dump from a real customer, so I cannot publish the entire jxray report, since it may

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-31 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 I confirm that the -XX:+UseStringDeduplication option is available only with G1 GC, and it is off by default. So if we decide to use it, I guess we won't be able to enforce it reliably, especially
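
Because the option is off by default, code cannot assume it is active. A minimal sketch of how a running JVM could check whether the flag was passed explicitly is shown below. The class name StringDedupCheck and the method dedupFlagPassed are hypothetical; the sketch uses the standard RuntimeMXBean.getInputArguments() call and only sees arguments that the JVM itself reports, so it is a hint rather than a reliable enforcement mechanism.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class StringDedupCheck {
    // Hypothetical helper: true if the deduplication flag appears verbatim
    // in the given JVM argument list.
    static boolean dedupFlagPassed(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+UseStringDeduplication");
    }

    public static void main(String[] args) {
        // Arguments the current JVM reports as having been passed to it.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("String deduplication requested: " + dedupFlagPassed(jvmArgs));
    }
}
```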

[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread countmdm
Github user countmdm commented on the issue: https://github.com/apache/spark/pull/21456 Yes. On Tue, May 29, 2018 at 1:18 PM, UCB AMPLab wrote: > Can one of the admins verify this patch?

[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

2018-05-29 Thread countmdm
GitHub user countmdm opened a pull request: https://github.com/apache/spark/pull/21456 [SPARK-24356] [CORE] Duplicate strings in File.path managed by FileSegmentManagedBuffer This patch eliminates duplicate strings that come from the 'path' field of java.io.File objects created