[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19560 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19560 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91092/ Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19560 **[Test build #91092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91092/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19560 **[Test build #91092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91092/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19560 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88775/ Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19560 **[Test build #88775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88775/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19560 **[Test build #88775 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88775/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19560 @wangyum Makes sense. You can also try the approach in this PR. If there are many (tens of thousands of) ETL jobs in the warehouse, we cannot afford to add that many hints or fix all the inaccurate table properties in the metastore.
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/19560
I also hit this issue:
```sql
select * from A join B on a.key = b.key
```
Table A is small, but table B is big and its stats are incorrect, so Spark broadcasts table B. I tried to use a broadcast hint to work around it:
```sql
select /*+ MAPJOIN(A) */ * from A join B on a.key = b.key
```
But it doesn't work. I created a PR to fix it: https://github.com/apache/spark/pull/19714
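To illustrate why incorrect stats lead to the wrong side being broadcast, here is a toy model in Python. It is only a sketch under stated assumptions: `pick_broadcast_side`, `THRESHOLD`, and the sample sizes are hypothetical, and the selection rule (smallest reported size under the threshold) is a simplification, not Spark's actual planner logic.

```python
# Assumed threshold, set to 10 MB (the documented default of
# spark.sql.autoBroadcastJoinThreshold).
THRESHOLD = 10 * 1024 * 1024


def pick_broadcast_side(reported_sizes, threshold=THRESHOLD):
    """Toy rule: broadcast the table with the smallest reported size
    below the threshold; return None if no table qualifies."""
    candidates = {t: s for t, s in reported_sizes.items() if s < threshold}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical sizes: A really is 1 MB, B really is 50 GB.
correct_stats = {"A": 1 * 1024**2, "B": 50 * 1024**3}
# The metastore entry for B is stale and claims only 1 KB.
stale_stats = {"A": 1 * 1024**2, "B": 1024}

print(pick_broadcast_side(correct_stats))  # -> A
print(pick_broadcast_side(stale_stats))    # -> B (the 50 GB table)
```

In this model the planner only ever sees the reported sizes, so a stale entry for B is enough to make the large table look like the best broadcast candidate.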
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19560 I can see the value and also the potential extra overhead (more expensive for object stores), although this does not resolve the root cause. Until we provide adaptive runtime optimization and incremental online stats collection, this might be a workaround for avoiding some OOM cases. Let us leave this PR open and see the feedback from others.
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19560
> My main concern is that we'd better not put the burden on Spark to deal with metastore failures

I think this makes sense. I was also thinking about this when proposing this PR, and I agree with you to some extent. But in our production environment, the reasons for failing to update the stats vary, and we find it hard to build a robust redo mechanism.
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19560 My main concern is that we'd better not put the burden on Spark to deal with metastore failures, because Spark doesn't have control over metastores. The system using Spark and the metastore should be responsible for consistency.
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19560
> Users usually do not know that there is an error in the stats.

Aren't there any exceptions or error messages when updating the table/stats fails? I suppose the system can detect it through logging or guard against it with a redo mechanism?

> It helps avoid OOM caused by broadcast joins. It's for stability.

This config can't avoid OOM anyway, because file size is different from memory usage in a broadcast join.
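The gap between file size and memory usage can be shown with a small Python sketch. It assumes a compressed on-disk format and uses the uncompressed bytes as a rough stand-in for the in-memory footprint; the function name and sample rows are hypothetical, and Spark's real in-memory size depends on its own row format.

```python
import zlib


def on_disk_and_in_memory_sizes(rows):
    """Return (compressed_bytes, raw_bytes) for a list of text rows.

    compressed_bytes stands in for the file size a filesystem reports;
    raw_bytes stands in for the size once the data is decoded in memory.
    """
    raw = "\n".join(rows).encode("utf-8")
    compressed = zlib.compress(raw)
    return len(compressed), len(raw)


# Highly repetitive rows compress very well, so the two numbers diverge.
rows = ["user_id=42,country=US,status=active"] * 10000
disk_bytes, memory_bytes = on_disk_and_in_memory_sizes(rows)
print(disk_bytes, memory_bytes)
```

A size check against the filesystem therefore bounds the bytes on disk, not the memory a broadcast of the decoded data would need.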
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19560 @wzhfy Thanks for the comment; I see your point. In my cluster, the namenode is under heavy pressure, and errors in stats are quite likely. Users usually do not know that there is an error in the stats. That's why I propose this config. Users can choose to turn on this config by default. Yes, it will hurt performance, but only for joins where the `totalSize` from the metastore is below `spark.sql.autoBroadcastJoinThreshold`, which I think is acceptable. As I mentioned in the description, this is a defensive measure. It helps avoid OOM caused by broadcast joins. It's for stability.
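The condition described above can be sketched in Python. This is only an illustration, not the code in the PR: `effective_size`, `size_on_filesystem`, the `verify_enabled` flag, and the local-directory listing are all hypothetical stand-ins, and the sketch assumes the filesystem size simply replaces the metastore value when the check runs.

```python
import os
import tempfile

# 10 MB, the documented default of spark.sql.autoBroadcastJoinThreshold.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024


def size_on_filesystem(table_dir):
    """Sum the sizes of all files under table_dir (stand-in for a
    listing against HDFS or another filesystem)."""
    total = 0
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def effective_size(metastore_total_size, table_dir, in_join, verify_enabled,
                   threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """Return the size to use when deciding whether to broadcast."""
    # Pay for the listing only when the check is enabled, the relation is
    # in a join, and the metastore value would permit a broadcast.
    if verify_enabled and in_join and metastore_total_size < threshold:
        return size_on_filesystem(table_dir)
    return metastore_total_size


def demo():
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "part-00000"), "wb") as f:
            f.write(b"x" * 4096)
        stale = 100  # the metastore claims 100 bytes; the file holds 4096
        return (effective_size(stale, d, True, True),
                effective_size(stale, d, True, False),
                effective_size(stale, d, False, True))


print(demo())  # -> (4096, 100, 100)
```

Tables whose metastore size is already above the threshold never trigger the listing, which is why the extra cost is limited to broadcast candidates.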
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19560 I wonder when this config should be used. If the user knows there is an error in the stats, why not just analyze the table (specifying "noscan" if only the size is needed)? That fixes the problem instead of verifying the stats every time a query is analyzed. If the user doesn't know, then they also can't be sure when to enable it, because stats verification can hurt performance.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19560 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83008/ Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19560 **[Test build #83008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83008/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19560 **[Test build #83008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83008/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19560 @viirya Thanks a lot for the comments.
1. In the current change, I verify the stats from the file system only when the relation is part of a join.
2. I added a warning when the size from the file system is smaller than the size from the metastore.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19560 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83002/ Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19560 **[Test build #83002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83002/testReport)** for PR 19560 at commit [`bf59c27`](https://github.com/apache/spark/commit/bf59c27d0a8a01dc0572cf148f40b6337799f241). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19560 @gatorsmile @dongjoon-hyun Thanks a lot for looking into this. This PR aims to avoid OOM when the metastore fails to update table properties after the data has already been produced. With the config in this PR enabled, we check the size on the filesystem only when `totalSize` is below `spark.sql.autoBroadcastJoinThreshold`, so I think the cost is acceptable. Yes, the storage can be other filesystems. I refined the name. Please take another look when you have time.