Github user thunterdb commented on the issue:
https://github.com/apache/spark/pull/17419
I am going to close this PR, since this is being taken over by
@WeichenXu123 in #18798 .
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/17419
The copy problem is fixed in https://github.com/apache/spark/pull/18483 , I
think we can remove this workaround in `ObjectHashAggregateExec`.
---
Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/17419
@WeichenXu123 and I did some profiling using `jvisualvm` and found that 40%
of the time is spent in the copy performed by [this `safeProjection`][1]. This
is a known issue used to fight against
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17419
As the DataFrame version is much slower than the RDD version (currently tested
against vectors of size 1),
I also guess there is some performance issue in
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75406 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75406/testReport)**
for PR 17419 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75406/
Test FAILed.
---
Github user thunterdb commented on the issue:
https://github.com/apache/spark/pull/17419
I looked a bit deeper into the performance aspect. Here are some quick
insights:
- there was an immediate bottleneck in `VectorUDT`; fixing it already improves
performance by 3x
- it is
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75406 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75406/testReport)**
for PR 17419 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/17419
> RDD = [2482 ~ 46150 ~ 48354] records / milli
Why do the numbers vary so much?
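One possible reading of the figures quoted above, assuming the three values are the minimum, median, and maximum throughput over repeated runs (the thread does not confirm this): a single slow first run, for example a cold start, would pull the minimum far below the median. A minimal sketch in plain Python of how such a summary could be computed; the function name and the timings are hypothetical and are not taken from the PR:

```python
from statistics import median


def throughput_summary(n_records, durations_ms):
    """Return (min, median, max) throughput in records per millisecond.

    Hypothetical helper for illustration; not part of Spark or the PR.
    """
    if not durations_ms:
        raise ValueError("need at least one timed run")
    # One throughput figure per timed run, sorted from slowest to fastest.
    rates = sorted(n_records / d for d in durations_ms)
    return rates[0], median(rates), rates[-1]


# Made-up timings: the first run is slow (cold start), later runs are steady.
low, mid, high = throughput_summary(100_000, [40.0, 2.5, 2.0, 2.0, 2.5])
print(f"[{low:.0f} ~ {mid:.0f} ~ {high:.0f}] records / milli")
```

With timings shaped like these, the median and maximum sit close together while the minimum is an outlier, which is the same pattern as in the quoted line.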
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75285/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75285 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75285/testReport)**
for PR 17419 at commit
Github user thunterdb commented on the issue:
https://github.com/apache/spark/pull/17419
I have added a small perf test to find the performance bottlenecks. Note
that this test works on the worst case (vectors of size 1) from the perspective
of overhead. Here are the numbers I
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75285 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75285/testReport)**
for PR 17419 at commit
Github user thunterdb commented on the issue:
https://github.com/apache/spark/pull/17419
@sethah it would have been nice, but I do not think we should merge it this
late into the release cycle.
---
Github user sethah commented on the issue:
https://github.com/apache/spark/pull/17419
Is this being targeted for Spark 2.2?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75268/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75268 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75268/testReport)**
for PR 17419 at commit
Github user dbtsai commented on the issue:
https://github.com/apache/spark/pull/17419
Will be really interested to see the performance benchmark during the QA
period so users can know when to use the DataFrame APIs or the existing RDD APIs.
Thanks.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75268 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75268/testReport)**
for PR 17419 at commit
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/17419
I'll take a look later today. Also CCing @dbtsai: I know you worked on the
old summarizer, so I thought I'd ping you here. + @yanboliang and @holdenk,
since you commented on the design doc.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #3614 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3614/testReport)**
for PR 17419 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #3614 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3614/testReport)**
for PR 17419 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75190/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75190 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75190/testReport)**
for PR 17419 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75189/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75189 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75189/testReport)**
for PR 17419 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75187/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75187 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75187/testReport)**
for PR 17419 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75188/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17419
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75188 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75188/testReport)**
for PR 17419 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75190 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75190/testReport)**
for PR 17419 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17419
**[Test build #75189 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75189/testReport)**
for PR 17419 at commit