Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/14817
Thanks for reporting it!
After CBO, the relation size is no longer used only for deciding whether a table
can be broadcast. Maybe we can close this PR now?
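The broadcast decision mentioned here can be sketched as follows. This is illustrative Python, not Spark's actual (Scala) implementation; the threshold value mirrors the default of Spark's `spark.sql.autoBroadcastJoinThreshold` setting:

```python
# Sketch (not Spark source): a relation is eligible for a broadcast join
# only when its estimated size fits under a configured threshold.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # bytes; Spark's default (10 MB)

def can_broadcast(relation_size_bytes: int,
                  threshold: int = AUTO_BROADCAST_JOIN_THRESHOLD) -> bool:
    """Return True when the estimated relation size fits under the threshold."""
    return 0 <= relation_size_bytes <= threshold
```

This is why an inflated size estimate (e.g. a default of Long.MaxValue when statistics are missing) silently disables broadcast joins, which is what the fallback-to-HDFS sizing discussed in this thread tries to avoid.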
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well.
Github user Parth-Brahmbhatt commented on the issue:
https://github.com/apache/spark/pull/14817
@hvanhovell We have tables with 5-6 partition columns and data going back
4-5 years, and since our data is stored in S3 the listing is paginated.
If you want to wait until the CBO work
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14817
Can one of the admins verify this patch?
---
Github user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/14817
@Parth-Brahmbhatt I am very curious why you have millions of partitions.
What is the use case? You will be in a world of hurt as soon as you do any
listing.
I am not going to merge this
Github user Parth-Brahmbhatt commented on the issue:
https://github.com/apache/spark/pull/14817
Requesting review one more time.
---
Github user Parth-Brahmbhatt commented on the issue:
https://github.com/apache/spark/pull/14817
Can someone please review this PR? Thanks.
---
Github user Parth-Brahmbhatt commented on the issue:
https://github.com/apache/spark/pull/14817
@hvanhovell I looked at AlterTableRecoverPartitionsCommand; while the
parallelism in listing could help, it will still cause a huge perf penalty. We
have tables with millions of partitions and
Github user Parth-Brahmbhatt commented on the issue:
https://github.com/apache/spark/pull/14817
@hvanhovell I will take a look at it and update this PR.
---
Github user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/14817
@Parth-Brahmbhatt would the approach taken in
`AlterTableRecoverPartitionsCommand` help?
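The approach in `AlterTableRecoverPartitionsCommand` referred to here is to parallelize the filesystem listing across partition directories. A hedged sketch of the idea, using stdlib Python with illustrative names rather than Spark's actual Scala API:

```python
# Sketch: fan filesystem listing out over a thread pool instead of walking
# partition directories sequentially. On S3 each listing call may itself be
# paginated, which is where the per-partition cost comes from.
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List

def list_partition(path: str) -> List[str]:
    # One listing call per partition directory.
    return [os.path.join(path, f) for f in os.listdir(path)]

def list_partitions_parallel(paths: List[str], workers: int = 8) -> List[str]:
    # Threads suit this workload because listing is I/O-bound, not CPU-bound.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(list_partition, paths)
    return [f for files in results for f in files]
```

As the follow-up comment notes, parallelism reduces wall-clock time but not the total number of listing calls, so with millions of partitions the cost remains substantial.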
---
Github user Parth-Brahmbhatt commented on the issue:
https://github.com/apache/spark/pull/14817
@hvanhovell it's because of listing, and it gets worse as the amount of data increases.
---
Github user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/14817
@Parth-Brahmbhatt here is the CBO ticket:
https://issues.apache.org/jira/browse/SPARK-16026
Could you explain why this is so slow? Is this because of listing the
files? Or because of
Github user Parth-Brahmbhatt commented on the issue:
https://github.com/apache/spark/pull/14817
Can one of the committers take a look at this PR?
---
Github user Parth-Brahmbhatt commented on the issue:
https://github.com/apache/spark/pull/14817
@hvanhovell can you also point me at the design doc/discuss thread for CBO
work? Thanks.
---
Github user Parth-Brahmbhatt commented on the issue:
https://github.com/apache/spark/pull/14817
@hvanhovell The behavior when fallbackToHdfs is not enabled (and by default it
is not enabled, for performance reasons) is to return the value
specified via
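The fallback behavior described here can be sketched as a three-way decision. This is an illustrative Python sketch with hypothetical function and parameter names, not Spark's actual code; the last branch corresponds to the configured default size the comment refers to:

```python
# Sketch of the size-estimation fallback: prefer catalog statistics; if they
# are missing, either compute the size from the filesystem (expensive: lists
# every file) when fallback is enabled, or return a configured default.
from typing import Callable, Optional

def estimate_table_size(catalog_size: Optional[int],
                        fallback_to_hdfs: bool,
                        compute_size_from_fs: Callable[[], int],
                        default_size_in_bytes: int) -> int:
    if catalog_size is not None:
        return catalog_size            # trust catalog statistics when present
    if fallback_to_hdfs:
        return compute_size_from_fs()  # accurate, but pays the listing cost
    return default_size_in_bytes       # cheap, but typically pessimistic
```

The trade-off driving this thread: the pessimistic default disables broadcast joins, while the accurate path is prohibitively slow on tables with millions of partitions.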
Github user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/14817
@Parth-Brahmbhatt we are currently working on Cost-Based Optimization in
Spark. An important input will be the actual size of the table. Having partial
statistics (what you are suggesting) will not
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14817
Can one of the admins verify this patch?
---