[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18343 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/18343#discussion_r122637320 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -141,8 +143,8 @@ private[spark] class HighlyCompressedMapStatus private ( private[this] var numNonEmptyBlocks: Int, private[this] var emptyBlocks: RoaringBitmap, private[this] var avgSize: Long, -@transient private var hugeBlockSizes: Map[Int, Byte]) - extends MapStatus with Externalizable { +@transient private[this] var hugeBlockSizes: Map[Int, Byte]) + extends MapStatus with Externalizable with KryoSerializable { --- End diff -- OK, I have manual tests `remove @transient` and `Extends KryoSerializable`, both worked fine, I'm doing UT. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18343#discussion_r122636117 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -141,8 +143,8 @@ private[spark] class HighlyCompressedMapStatus private ( private[this] var numNonEmptyBlocks: Int, private[this] var emptyBlocks: RoaringBitmap, private[this] var avgSize: Long, -@transient private var hugeBlockSizes: Map[Int, Byte]) - extends MapStatus with Externalizable { +@transient private[this] var hugeBlockSizes: Map[Int, Byte]) + extends MapStatus with Externalizable with KryoSerializable { --- End diff -- I think the previous version already worked... When we have `writeExternal` and `readExternal`, `@transient` doesn't matter for java serializer, so removing `@transient` to make it work with kryo is a valid fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/18343#discussion_r122633016 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -175,6 +175,7 @@ class KryoSerializer(conf: SparkConf) kryo.register(None.getClass) kryo.register(Nil.getClass) kryo.register(Utils.classForName("scala.collection.immutable.$colon$colon")) + kryo.register(Utils.classForName("scala.collection.immutable.Map$EmptyMap$")) --- End diff -- Because this test failed: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78224/testReport/org.apache.spark.serializer/KryoSerializerSuite/registration_of_HighlyCompressedMapStatus/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18343#discussion_r122625325 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -175,6 +175,7 @@ class KryoSerializer(conf: SparkConf) kryo.register(None.getClass) kryo.register(Nil.getClass) kryo.register(Utils.classForName("scala.collection.immutable.$colon$colon")) + kryo.register(Utils.classForName("scala.collection.immutable.Map$EmptyMap$")) --- End diff -- why `Map$EmptyMap$`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18343#discussion_r122625202 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private ( private[this] var numNonEmptyBlocks: Int, private[this] var emptyBlocks: RoaringBitmap, private[this] var avgSize: Long, -@transient private var hugeBlockSizes: Map[Int, Byte]) +private[this] var hugeBlockSizes: Map[Int, Byte]) --- End diff -- oh seems it is now, then LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18343#discussion_r122623575 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private ( private[this] var numNonEmptyBlocks: Int, private[this] var emptyBlocks: RoaringBitmap, private[this] var avgSize: Long, -@transient private var hugeBlockSizes: Map[Int, Byte]) +private[this] var hugeBlockSizes: Map[Int, Byte]) --- End diff -- if you can figure out a way to make it serializable with kryo and still keep the customized serialization logic for java serializer, I'm ok with it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18343#discussion_r122623298 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private ( private[this] var numNonEmptyBlocks: Int, private[this] var emptyBlocks: RoaringBitmap, private[this] var avgSize: Long, -@transient private var hugeBlockSizes: Map[Int, Byte]) +private[this] var hugeBlockSizes: Map[Int, Byte]) --- End diff -- actually we can do better: use a bitmap to track which block id has size info, and a byte array to store these size data. so the format can be: `[num blocks] [bit map], [size array]` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18343#discussion_r122621682 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private ( private[this] var numNonEmptyBlocks: Int, private[this] var emptyBlocks: RoaringBitmap, private[this] var avgSize: Long, -@transient private var hugeBlockSizes: Map[Int, Byte]) +private[this] var hugeBlockSizes: Map[Int, Byte]) --- End diff -- Sounds good to me. However, the customized serialization logic looks similar to kyro's default way to serialize map. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18343#discussion_r122620862 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private ( private[this] var numNonEmptyBlocks: Int, private[this] var emptyBlocks: RoaringBitmap, private[this] var avgSize: Long, -@transient private var hugeBlockSizes: Map[Int, Byte]) +private[this] var hugeBlockSizes: Map[Int, Byte]) --- End diff -- we do want to serialize `hugeBlockSizes`, but with customized logic, that why we marked it `@transient`. I think the corrected fix is, make this class implements `KryoSerializable`, and copy the customized serialization logic for `hugeBlockSizes` to kryo serialization hooks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/18343 [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeExternal throws NPE ## What changes were proposed in this pull request? Fix HighlyCompressedMapStatus#writeExternal NPE: ``` 17/06/18 15:00:27 ERROR Utils: Exception encountered java.lang.NullPointerException at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303) at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310) at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303) ... 17 more 17/06/18 15:00:27 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 10.17.47.20:50188 17/06/18 15:00:27 ERROR Utils: Exception