Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18343
thanks, merging to master/2.2!
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18343
Agreed. The `hugeBlockSizes` map is not supposed to hold many records, only the
few huge blocks.
LGTM
---
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/18343
I ran a local test, and a `Map[Int, Byte]` with 2000 elements serialized
with Kryo ends up at a little less than 14 kB. That would be `4 + 4 * 2000 + 1 *
2000 = 10004B` using the custom serialization. T
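For reference, a rough way to reproduce that comparison locally. This is an illustrative sketch, not part of the PR: the `SizeCheck` object is made up, the custom layout is assumed to be an Int count followed by an Int key and a Byte value per entry, and the exact Kryo figure depends on the serializer configuration in use.
```scala
// Hypothetical size check mirroring the comparison above.
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

object SizeCheck {
  def main(args: Array[String]): Unit = {
    val map: Map[Int, Byte] = (0 until 2000).map(i => i -> 1.toByte).toMap

    // Size under Spark's Kryo-backed serializer.
    val ser = new KryoSerializer(new SparkConf()).newInstance()
    val kryoBytes = ser.serialize(map).remaining()
    println(s"Kryo serialized size: $kryoBytes bytes")

    // Size under a hand-rolled layout:
    // 4-byte count + (4-byte key + 1-byte value) per entry.
    val customBytes = 4 + 2000 * (4 + 1)
    println(s"Custom layout size: $customBytes bytes") // 10004
  }
}
```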
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/18343
True. I guess since the same serializer instance is not reused, you don't
get the optimization of not re-sending the class name after it first shows up.
But back t
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/18343
I was talking about the classname for the internal members.
---
Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/18343
> since the custom logic doesn't need to write the full classname out
Hmmm... from
http://docs.oracle.com/javase/8/docs/api/java/io/Externalizable.html:
"Only the identity of the class o
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/18343
It's obvious that custom serialization will reduce the data size, since the
custom logic doesn't need to write out the full class name, which the default
Java serialization does.
I don't think Kryo knows w
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78254 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78254/testReport)**
for PR 18343 at commit
[`7a4e6ec`](https://github.com/apache/spark/commit/7
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78254/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78254 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78254/testReport)**
for PR 18343 at commit
[`7a4e6ec`](https://github.com/apache/spark/commit/7a
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78251/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78251 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78251/testReport)**
for PR 18343 at commit
[`e045bef`](https://github.com/apache/spark/commit/e
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18343
I don't quibble with custom serialization logic, but you can do that with
`Serializable` too, and Kryo has its own marker interface as well. I wonder what
the purpose of `Externalizable` is, then. Actuall
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/18343
It seems `Externalizable` is somewhat overused in Spark; we should benchmark
and make sure this "customized serialization logic" is actually faster than the
Java serializer's default.
For
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18343
OK, I get it. Hmm, I wonder why some classes in the code implement
`Externalizable` instead of `Serializable`. I see a comment about controlling
serialization, but `Serializable` also lets you do that.
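For reference, a minimal sketch of the two options being discussed: both interfaces allow a custom wire format under Java serialization. The class and field names below are made up for illustration; this is not the actual `MapStatus` code.
```scala
// Illustrative only: two ways to customize Java serialization.
import java.io.{Externalizable, ObjectInput, ObjectInputStream, ObjectOutput, ObjectOutputStream}

// Serializable: private writeObject/readObject hooks override the default format.
class StatusA(var sizes: Array[Long]) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeInt(sizes.length)
    sizes.foreach(out.writeLong(_))
  }
  private def readObject(in: ObjectInputStream): Unit = {
    sizes = Array.fill(in.readInt())(in.readLong())
  }
}

// Externalizable: the class controls the whole format (only the class identity
// is written automatically) and must provide a public no-arg constructor.
class StatusB(var sizes: Array[Long]) extends Externalizable {
  def this() = this(null)
  override def writeExternal(out: ObjectOutput): Unit = {
    out.writeInt(sizes.length)
    sizes.foreach(out.writeLong(_))
  }
  override def readExternal(in: ObjectInput): Unit = {
    sizes = Array.fill(in.readInt())(in.readLong())
  }
}
```
Kryo's own hook is the separate `KryoSerializable` interface (the one srowen mentions above), independent of both.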
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78251 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78251/testReport)**
for PR 18343 at commit
[`e045bef`](https://github.com/apache/spark/commit/e0
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78248/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78248 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78248/testReport)**
for PR 18343 at commit
[`facca95`](https://github.com/apache/spark/commit/f
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78248 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78248/testReport)**
for PR 18343 at commit
[`facca95`](https://github.com/apache/spark/commit/fa
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78239/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78239 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78239/testReport)**
for PR 18343 at commit
[`e2816ec`](https://github.com/apache/spark/commit/e
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18343
@wangyum Can you also add a test for this?
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18343
@wangyum Thanks for updating. Can you try disabling Kryo and trying it again,
so we can verify it?
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18343
Because we write/read `hugeBlockSizes` in `writeExternal`/`readExternal`,
it seems to me that it is intended to be serialized. So I think removing
`transient` should be ok.
LGTM cc @cloud-fa
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/18343
@viirya Yes, I'm using `org.apache.spark.serializer.KryoSerializer`. The [master
branch](https://github.com/apache/spark/tree/ce49428ef7d640c1734e91ffcddc49dbc8547ba7)
still has this issue; error logs:
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18343
I think this should be addressed before 2.2. I have already asked other
committers on the dev mailing list to take notice.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18343
@wangyum Are you using the Kryo serializer? I think that is why you hit this
issue.
Once you use Kryo, I think the `readExternal` in
`HighlyCompressedMapStatus` won't be used to deserialize the ob
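A hedged, self-contained illustration of that point (not Spark code; the class name, field, and plain-Kryo setup below are assumptions): Kryo's `FieldSerializer` copies fields directly, skips `transient` ones, and never calls `readExternal`, whereas routing the class through Kryo's `JavaSerializer` does run the `Externalizable` hooks.
```scala
// Illustrative only: why a @transient field can come back null when a class is
// serialized field-by-field with Kryo instead of via writeExternal/readExternal.
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import com.esotericsoftware.kryo.serializers.{FieldSerializer, JavaSerializer => KryoJavaSerializer}
import java.io.{ByteArrayOutputStream, Externalizable, ObjectInput, ObjectOutput}

class Demo extends Externalizable {
  @transient var sizes: Map[Int, Byte] = null
  override def writeExternal(out: ObjectOutput): Unit = out.writeObject(sizes)
  override def readExternal(in: ObjectInput): Unit =
    sizes = in.readObject().asInstanceOf[Map[Int, Byte]]
}

object Demo {
  private def roundTrip(kryo: Kryo, d: Demo): Demo = {
    val buffer = new ByteArrayOutputStream()
    val out = new Output(buffer)
    kryo.writeClassAndObject(out, d)
    out.close()
    kryo.readClassAndObject(new Input(buffer.toByteArray)).asInstanceOf[Demo]
  }

  def main(args: Array[String]): Unit = {
    val d = new Demo
    d.sizes = Map(1 -> 1.toByte)

    // Field-by-field serialization: transient fields are skipped and
    // readExternal never runs, so the map is null after deserialization.
    val fieldKryo = new Kryo()
    fieldKryo.register(classOf[Demo], new FieldSerializer[Demo](fieldKryo, classOf[Demo]))
    println(roundTrip(fieldKryo, d).sizes) // null

    // Java serialization inside Kryo: the Externalizable hooks do run.
    val javaKryo = new Kryo()
    javaKryo.register(classOf[Demo], new KryoJavaSerializer())
    println(roundTrip(javaKryo, d).sizes) // Map(1 -> 1)
  }
}
```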
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18343
Is this still reproducible in the current codebase? In the error message
above, there is a call to
`MapOutputTrackerMaster.getSerializedMapOutputStatuses`; however, this method
was removed in recent c
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78239 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78239/testReport)**
for PR 18343 at commit
[`e2816ec`](https://github.com/apache/spark/commit/e2
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78230/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78230 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78230/testReport)**
for PR 18343 at commit
[`75a9bf1`](https://github.com/apache/spark/commit/7
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/18343
@jinxing64 `big_table` may need to be big enough; my `big_table` is 270.7 GB:
```sql
spark-sql -e "
set spark.sql.shuffle.partitions=2001;
drop table if exists spark_hcms_npe;
cr
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78230 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78230/testReport)**
for PR 18343 at commit
[`75a9bf1`](https://github.com/apache/spark/commit/75
Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/18343
Thanks for the ping.
If I understand correctly, `HighlyCompressedMapStatus` is created in two
situations:
1. Creating a `MapStatus` at shuffle-write time when the number of reduce
partitions is over 2000;
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78224/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/18343
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78224 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78224/testReport)**
for PR 18343 at commit
[`4cf3532`](https://github.com/apache/spark/commit/4
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/18343
cc @jinxing64
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/18343
**[Test build #78224 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78224/testReport)**
for PR 18343 at commit
[`4cf3532`](https://github.com/apache/spark/commit/4c
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/18343
I'm not sure that's a valid fix. It makes this field serializable when it
wasn't intended to be. The field is either supposed to be recreated on demand,
or else the code needs to deal with it not existing.
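A hypothetical sketch of that second alternative, with made-up names (not the actual `HighlyCompressedMapStatus` code): keep the field `@transient` and make every reader tolerate it being absent after deserialization.
```scala
// Illustrative only: a reader-side guard for a @transient map that may be
// null after deserialization, instead of serializing the field itself.
class StatusLike(avgSize: Long) extends Serializable {
  @transient private var hugeBlockSizes: Map[Int, Byte] = Map.empty

  private def hugeBlockSizesOrEmpty: Map[Int, Byte] =
    if (hugeBlockSizes == null) Map.empty else hugeBlockSizes

  // Falls back to the average size when the block is not tracked
  // (or the map is gone after deserialization).
  def sizeHint(reduceId: Int): Long =
    hugeBlockSizesOrEmpty.get(reduceId).map(_.toLong).getOrElse(avgSize)
}
```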
---