[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18343


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-19 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/18343#discussion_r122637320
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -141,8 +143,8 @@ private[spark] class HighlyCompressedMapStatus private (
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
 private[this] var avgSize: Long,
-@transient private var hugeBlockSizes: Map[Int, Byte])
-  extends MapStatus with Externalizable {
+@transient private[this] var hugeBlockSizes: Map[Int, Byte])
+  extends MapStatus with Externalizable with KryoSerializable {
--- End diff --

OK, I have manual tests `remove @transient` and `Extends KryoSerializable`, 
both worked fine,  I'm doing UT.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18343#discussion_r122636117
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -141,8 +143,8 @@ private[spark] class HighlyCompressedMapStatus private (
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
 private[this] var avgSize: Long,
-@transient private var hugeBlockSizes: Map[Int, Byte])
-  extends MapStatus with Externalizable {
+@transient private[this] var hugeBlockSizes: Map[Int, Byte])
+  extends MapStatus with Externalizable with KryoSerializable {
--- End diff --

I think the previous version already worked...  When we have 
`writeExternal` and `readExternal`, `@transient` doesn't matter for java 
serializer, so removing `@transient` to make it work with kryo is a valid fix


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-19 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/18343#discussion_r122633016
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -175,6 +175,7 @@ class KryoSerializer(conf: SparkConf)
 kryo.register(None.getClass)
 kryo.register(Nil.getClass)
 
kryo.register(Utils.classForName("scala.collection.immutable.$colon$colon"))
+
kryo.register(Utils.classForName("scala.collection.immutable.Map$EmptyMap$"))
--- End diff --

Because this test failed: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78224/testReport/org.apache.spark.serializer/KryoSerializerSuite/registration_of_HighlyCompressedMapStatus/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18343#discussion_r122625325
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -175,6 +175,7 @@ class KryoSerializer(conf: SparkConf)
 kryo.register(None.getClass)
 kryo.register(Nil.getClass)
 
kryo.register(Utils.classForName("scala.collection.immutable.$colon$colon"))
+
kryo.register(Utils.classForName("scala.collection.immutable.Map$EmptyMap$"))
--- End diff --

why `Map$EmptyMap$`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18343#discussion_r122625202
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private (
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
 private[this] var avgSize: Long,
-@transient private var hugeBlockSizes: Map[Int, Byte])
+private[this] var hugeBlockSizes: Map[Int, Byte])
--- End diff --

oh seems it is now, then LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18343#discussion_r122623575
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private (
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
 private[this] var avgSize: Long,
-@transient private var hugeBlockSizes: Map[Int, Byte])
+private[this] var hugeBlockSizes: Map[Int, Byte])
--- End diff --

if you can figure out a way to make it serializable with kryo and still 
keep the customized serialization logic for java serializer, I'm ok with it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18343#discussion_r122623298
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private (
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
 private[this] var avgSize: Long,
-@transient private var hugeBlockSizes: Map[Int, Byte])
+private[this] var hugeBlockSizes: Map[Int, Byte])
--- End diff --

actually we can do better: use a bitmap to track which block id has size 
info, and a byte array to store these size data. so the format can be: `[num 
blocks] [bit map], [size array]`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-18 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18343#discussion_r122621682
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private (
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
 private[this] var avgSize: Long,
-@transient private var hugeBlockSizes: Map[Int, Byte])
+private[this] var hugeBlockSizes: Map[Int, Byte])
--- End diff --

Sounds good to me. However, the customized serialization logic looks 
similar to kyro's default way to serialize map.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18343#discussion_r122620862
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -141,7 +141,7 @@ private[spark] class HighlyCompressedMapStatus private (
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
 private[this] var avgSize: Long,
-@transient private var hugeBlockSizes: Map[Int, Byte])
+private[this] var hugeBlockSizes: Map[Int, Byte])
--- End diff --

we do want to serialize `hugeBlockSizes`, but with customized logic, that 
why we marked it `@transient`.

I think the corrected fix is, make this class implements 
`KryoSerializable`, and copy the customized serialization logic for 
`hugeBlockSizes` to kryo serialization hooks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18343: [SPARK-21133][CORE] Fix HighlyCompressedMapStatus...

2017-06-18 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/18343

[SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeExternal throws NPE

## What changes were proposed in this pull request?

Fix HighlyCompressedMapStatus#writeExternal NPE:
```
17/06/18 15:00:27 ERROR Utils: Exception encountered
java.lang.NullPointerException
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
at 
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/06/18 15:00:27 ERROR MapOutputTrackerMaster: 
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
at 
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
... 17 more
17/06/18 15:00:27 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
output locations for shuffle 0 to 10.17.47.20:50188
17/06/18 15:00:27 ERROR Utils: Exception