[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18343 thanks, merging to master/2.2! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled an

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18343 Agreed. The `hugeBlockSizes` map is not supposed to hold many records, only a few huge blocks. LGTM

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/18343 I ran a local test and a `Map[Int, Byte]` with 2000 elements serialized with Kryo ends up at a little less than 14kB. That would be `4 + 4 * 2000 + 1 * 2000 = 10004B` using the custom serialization. T
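The byte count in the comment above can be checked with a small sketch. It assumes the custom layout is a 4-byte entry count followed by a 4-byte key and a 1-byte value per entry, which is how the arithmetic reads; the class and method names (`CustomMapSize`, `serializedSize`, `sampleMap`) are hypothetical and are not Spark code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Spark code: writes a map as an int count
// followed by (int key, byte value) pairs and reports the byte count.
final class CustomMapSize {

    static int serializedSize(Map<Integer, Byte> blocks) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(blocks.size());          // 4 bytes for the entry count
            for (Map.Entry<Integer, Byte> e : blocks.entrySet()) {
                out.writeInt(e.getKey());         // 4 bytes per key
                out.writeByte(e.getValue());      // 1 byte per value
            }
        } catch (IOException e) {
            // Not expected for an in-memory buffer; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Builds a map with n distinct keys (0 .. n-1) and arbitrary byte values.
    static Map<Integer, Byte> sampleMap(int n) {
        Map<Integer, Byte> m = new HashMap<>();
        for (int i = 0; i < n; i++) {
            m.put(i, (byte) (i % 128));
        }
        return m;
    }
}
```

Under that assumed layout, `CustomMapSize.serializedSize(CustomMapSize.sampleMap(2000))` returns 4 + 5 * 2000 = 10004 bytes.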

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/18343 True. I guess since the same serializer instance is not reused, you don't get the benefits of the optimizations that don't require sending the class name after it first shows up. But back t

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18343 I was talking about the classname for the internal members.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/18343 > since the custom logic doesn't need to write the full classname out Hmmm... from http://docs.oracle.com/javase/8/docs/api/java/io/Externalizable.html: "Only the identity of the class o

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18343 It's obvious that custom serialization will reduce the data size, since the custom logic doesn't need to write out the full classname, which the Java default one does. I don't think Kryo knows w
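As a rough illustration of the size claim for the Java default serializer only, the sketch below serializes the same map twice: once with `ObjectOutputStream` and once with a hand-written layout (count, then key/value pairs). The names `SerializedSizeComparison`, `javaDefaultSize` and `customSize` are hypothetical, the layout is an assumption, and the sketch says nothing about Kryo, which is not used here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical comparison, not Spark code.
final class SerializedSizeComparison {

    // Builds a map with n distinct Integer keys and Byte values.
    static HashMap<Integer, Byte> sampleMap(int n) {
        HashMap<Integer, Byte> m = new HashMap<>();
        for (int i = 0; i < n; i++) {
            m.put(i, (byte) (i % 128));
        }
        return m;
    }

    // Size when the map goes through default Java serialization, which
    // writes stream and class metadata plus per-object overhead for every
    // boxed key and value.
    static int javaDefaultSize(HashMap<Integer, Byte> map) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(map);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Size with an assumed hand-written layout: int count, then
    // (int key, byte value) per entry, with no class metadata at all.
    static int customSize(Map<Integer, Byte> map) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(map.size());
            for (Map.Entry<Integer, Byte> e : map.entrySet()) {
                out.writeInt(e.getKey());
                out.writeByte(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }
}
```

Comparing `javaDefaultSize(sampleMap(2000))` with `customSize(sampleMap(2000))` shows the hand-written layout coming out smaller for this input; how Kryo compares would need a separate measurement.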

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Merged build finished. Test PASSed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78254 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78254/testReport)** for PR 18343 at commit [`7a4e6ec`](https://github.com/apache/spark/commit/7

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78254/ Test PASSed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78254 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78254/testReport)** for PR 18343 at commit [`7a4e6ec`](https://github.com/apache/spark/commit/7a

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Merged build finished. Test FAILed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78251/ Test FAILed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78251/testReport)** for PR 18343 at commit [`e045bef`](https://github.com/apache/spark/commit/e

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18343 I don't quibble with custom serialization logic, but you can do that with `Serializable` too. And Kryo has its own marker interface too. I wonder what the purpose of `Externalizable` is then. Actuall

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18343 It seems `Externalizable` is somewhat abused in Spark; we should benchmark and make sure that this "customized serialization logic" is faster than the default of the Java serializer. For

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18343 OK I get it. Hm, I wonder why some classes in the code extend `Externalizable` instead of `Serializable`? I see a comment about controlling serialization, but `Serializable` also lets you do that.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78251/testReport)** for PR 18343 at commit [`e045bef`](https://github.com/apache/spark/commit/e0

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78248/ Test FAILed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78248 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78248/testReport)** for PR 18343 at commit [`facca95`](https://github.com/apache/spark/commit/f

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Merged build finished. Test FAILed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78248/testReport)** for PR 18343 at commit [`facca95`](https://github.com/apache/spark/commit/fa

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Merged build finished. Test PASSed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78239/ Test PASSed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78239/testReport)** for PR 18343 at commit [`e2816ec`](https://github.com/apache/spark/commit/e

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18343 @wangyum Can you also add a test for this?

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18343 @wangyum Thanks for updating. Can you disable Kryo and try it again, so we can verify it?

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18343 Because we write/read `hugeBlockSizes` in `writeExternal`/`readExternal`, it seems to me that it is intended to be serialized. So I think removing `transient` should be ok. LGTM cc @cloud-fa
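The point about `writeExternal`/`readExternal` can be illustrated with plain Java serialization. In the sketch below, a hypothetical class (`BlockSizes`, a stand-in and not the Spark class) keeps its map in a `transient` field yet writes and reads it explicitly, so the field survives a round trip through `ObjectOutputStream`/`ObjectInputStream`. This only covers the Java serializer; a serializer that does not call `readExternal` would behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a status object; not the Spark class.
final class ExternalizableDemo {

    static final class BlockSizes implements Externalizable {
        // transient on purpose: the explicit hooks below still carry it.
        private transient Map<Integer, Byte> hugeBlockSizes = new HashMap<>();

        public BlockSizes() { }  // Externalizable needs a public no-arg constructor

        BlockSizes(Map<Integer, Byte> sizes) {
            this.hugeBlockSizes = new HashMap<>(sizes);
        }

        Map<Integer, Byte> sizes() {
            return hugeBlockSizes;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(hugeBlockSizes.size());
            for (Map.Entry<Integer, Byte> e : hugeBlockSizes.entrySet()) {
                out.writeInt(e.getKey());
                out.writeByte(e.getValue());
            }
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            int count = in.readInt();
            Map<Integer, Byte> restored = new HashMap<>();
            for (int i = 0; i < count; i++) {
                int block = in.readInt();
                byte size = in.readByte();
                restored.put(block, size);
            }
            hugeBlockSizes = restored;
        }
    }

    // Serializes and deserializes with the Java serializer only.
    static BlockSizes roundTrip(BlockSizes original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (BlockSizes) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The design choice shown is that, for an `Externalizable` class under Java serialization, the hooks decide what is written, so the `transient` modifier alone does not keep the map out of the stream.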

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread wangyum
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18343 @viirya Yes, I'm using `org.apache.spark.serializer.KryoSerializer`, [master branch](https://github.com/apache/spark/tree/ce49428ef7d640c1734e91ffcddc49dbc8547ba7) still has this issue, error logs:

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18343 I think this should be addressed before 2.2. I have already brought it to the attention of other committers on the dev mailing list.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18343 @wangyum Are you using the Kryo serializer? I think that is why you hit this issue. Once you use Kryo, I think the `readExternal` in `HighlyCompressedMapStatus` won't be used to deserialize the ob

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18343 Is this still reproducible in the current codebase? In the error message above, there is a call to `MapOutputTrackerMaster.getSerializedMapOutputStatuses`; however, this method is removed in recent c

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78239/testReport)** for PR 18343 at commit [`e2816ec`](https://github.com/apache/spark/commit/e2

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Merged build finished. Test PASSed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78230/ Test PASSed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78230 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78230/testReport)** for PR 18343 at commit [`75a9bf1`](https://github.com/apache/spark/commit/7

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread wangyum
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18343 @jinxing64 `big_table` may need to be big enough; my `big_table` is 270.7 G: ```sql spark-sql -e " set spark.sql.shuffle.partitions=2001; drop table if exists spark_hcms_npe; cr

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78230 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78230/testReport)** for PR 18343 at commit [`75a9bf1`](https://github.com/apache/spark/commit/75

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread jinxing64
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/18343 Thanks for the ping. If I understand correctly, `HighlyCompressedMapStatus` is initialized in 2 situations: 1. Creating `MapStatus` during shuffle write when the number of reduce partitions is over 2000;

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78224/ Test FAILed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18343 Merged build finished. Test FAILed.

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78224 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78224/testReport)** for PR 18343 at commit [`4cf3532`](https://github.com/apache/spark/commit/4

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread wangyum
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18343 cc @jinxing64

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18343 **[Test build #78224 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78224/testReport)** for PR 18343 at commit [`4cf3532`](https://github.com/apache/spark/commit/4c

[GitHub] spark issue #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeE...

2017-06-18 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18343 I'm not sure that's a valid fix. This makes the field serialize when it wasn't intended to. It's either supposed to be recreated on demand, or else the code needs to deal with it not existing.