[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19560
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19560
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91092/


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19560
  
**[Test build #91092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91092/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19560
  
**[Test build #91092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91092/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2018-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19560
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2018-03-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19560
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88775/


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2018-03-31 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19560
  
**[Test build #88775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88775/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2018-03-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19560
  
**[Test build #88775 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88775/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-11-14 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19560
  
@wangyum 
Makes sense.
You can also try the approach in this PR. 
If there are many (tens of thousands of) ETL jobs in the warehouse, we cannot 
afford to add that many hints or fix all the inaccurate table properties in 
the metastore.


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-11-14 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/19560
  
I also hit this issue:
```sql
select * from A join B on a.key = b.key
```
Table A is small but table B is big, and table B's stats are incorrect, so 
Spark will broadcast table B.

I tried to use a broadcast hint to solve this issue:
```sql
select /*+ MAPJOIN(A) */ * from A join B on a.key = b.key
```
But it doesn't work. I created a PR to fix it: 
https://github.com/apache/spark/pull/19714
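
The failure mode in the first query can be illustrated with a small model. This is only a sketch under stated assumptions, not Spark's planner code: it assumes the planner broadcasts the first join side whose reported size is at or below the threshold, and that it considers the right side before the left. `TableStats` and `pick_broadcast_side` are hypothetical names, and the table sizes are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
THRESHOLD = 10 * 1024 * 1024


@dataclass
class TableStats:
    name: str
    reported_size: int  # size recorded in the metastore, in bytes
    actual_size: int    # size of the data actually stored, in bytes


def pick_broadcast_side(left: TableStats, right: TableStats,
                        threshold: int = THRESHOLD) -> Optional[str]:
    """Return the name of the table to broadcast, or None for no broadcast.

    Assumption: the right side is considered first, then the left side,
    and only the reported (metastore) size is consulted.
    """
    for side in (right, left):
        if side.reported_size <= threshold:
            return side.name
    return None


a = TableStats("A", reported_size=1 * 1024 ** 2, actual_size=1 * 1024 ** 2)
# B really holds 50 GB, but its stale stats claim 2 MB.
b_stale = TableStats("B", reported_size=2 * 1024 ** 2, actual_size=50 * 1024 ** 3)
# The same table with accurate stats.
b_fixed = TableStats("B", reported_size=50 * 1024 ** 3, actual_size=50 * 1024 ** 3)

choice_with_stale_stats = pick_broadcast_side(a, b_stale)
choice_with_fixed_stats = pick_broadcast_side(a, b_fixed)
print(choice_with_stale_stats, choice_with_fixed_stats)
```

In this model the stale stats make the large table B the broadcast candidate, while accurate stats fall back to the small table A.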


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19560
  
I can see the value and also the potential extra overhead (more expensive 
for object stores), although this does not resolve the root cause. 

Until we provide adaptive runtime optimization and incremental online 
stats collection, this might be a workaround for avoiding some OOM 
cases. Let us leave this PR open and see the feedback from others.


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19560
  
> My main concern is that we'd better not put the burden on Spark to deal with 
metastore failures

I think this makes sense. I was also thinking about this when proposing this 
PR. I do agree with you on some level. But in the production environment, the 
reasons for failing to update the stats seem varied, and we find it hard to 
build a robust redo mechanism. 


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19560
  
My main concern is that we'd better not put the burden on Spark to deal with 
metastore failures, because Spark doesn't have control over metastores. The 
system using Spark and the metastore should be responsible for consistency.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19560
  
> Users usually do not know there are errors in the stats.

Aren't there any exceptions or error messages when updating the table/stats 
fails? I suppose the system is able to detect it through logging, or guard 
against it with a redo mechanism?

> Help to avoid OOM caused by broadcast join. It's for stability.

This config can't avoid OOM anyway, because file size is different from 
memory usage during a broadcast join.
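
The gap between file size and memory usage can be made concrete with some arithmetic. This is a sketch only: the expansion factor of 4 is a made-up number standing in for decompression and in-memory representation overhead, and `estimated_memory_bytes` is a hypothetical helper, not a Spark API.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
THRESHOLD = 10 * 1024 * 1024


def estimated_memory_bytes(file_bytes: int, expansion_factor: float) -> int:
    """Estimate in-memory size from on-disk size (hypothetical model)."""
    return int(file_bytes * expansion_factor)


file_bytes = 8 * 1024 * 1024  # size of the files on disk
# Assumption: the data grows 4x once decompressed and deserialized.
memory_bytes = estimated_memory_bytes(file_bytes, expansion_factor=4.0)

# The file size passes the threshold check...
passes_size_check = file_bytes <= THRESHOLD
# ...but the in-memory size is well above the same threshold.
memory_within_threshold = memory_bytes <= THRESHOLD
print(passes_size_check, memory_within_threshold)
```

Under these assumed numbers, a table that passes a check on file size still needs several times the threshold in memory, so a file-size check narrows the risk of OOM without removing it.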


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19560
  
@wzhfy
Thanks for the comment;
I see your point.
In my cluster, the namenode is under heavy pressure, so errors in stats are 
quite likely. Users usually do not know there are errors in the stats. That's 
why I propose this config. Users can choose to turn on this config by default. 
Yes, it will hurt performance, but only for joins where the `totalSize` from 
the metastore is below `spark.sql.autoBroadcastJoinThreshold`, which I think is 
acceptable.
As I mentioned in the description, this is for defense. It helps to avoid OOM 
caused by broadcast join. It's for stability.
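
The cost argument above can be sketched as follows. This is an illustration rather than the code in the PR: `size_for_join_planning` and `fake_fs_size` are hypothetical names, and the sketch assumes that the filesystem size replaces the metastore value whenever the check runs.

```python
from typing import Callable, List

# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
THRESHOLD = 10 * 1024 * 1024


def size_for_join_planning(metastore_size: int,
                           threshold: int,
                           verify_enabled: bool,
                           fs_size: Callable[[], int]) -> int:
    """Return the size to use when deciding whether to broadcast.

    The filesystem is consulted only when verification is enabled AND the
    metastore already claims the table is small enough to broadcast.
    """
    if not verify_enabled or metastore_size > threshold:
        return metastore_size
    # Assumption: the filesystem size wins whenever it is checked.
    return fs_size()


fs_calls: List[int] = []


def fake_fs_size() -> int:
    """Stand-in for a filesystem listing; records each call."""
    fs_calls.append(1)
    return 50 * 1024 ** 3  # the data is really 50 GB


# Metastore says 20 MB: above the threshold, so no filesystem call.
big = size_for_join_planning(20 * 1024 ** 2, THRESHOLD, True, fake_fs_size)
# Metastore says 2 MB: below the threshold, so the filesystem is checked.
small = size_for_join_planning(2 * 1024 ** 2, THRESHOLD, True, fake_fs_size)
# Verification disabled: the metastore value is trusted as-is.
unchecked = size_for_join_planning(2 * 1024 ** 2, THRESHOLD, False, fake_fs_size)
print(big, small, unchecked, len(fs_calls))
```

Only one of the three calls touches the filesystem, which is the point of gating the check on the threshold.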


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19560
  
I wonder when this config should be used. If the user knows there's some error 
in the stats, why not just analyze the table (specify "noscan" if only the size 
is needed)? This fixes the problem instead of verifying the stats every time 
a query is analyzed. If the user doesn't know, then they are also not sure when 
to turn it on, because stats verification can hurt performance.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19560
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19560
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83008/


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19560
  
**[Test build #83008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83008/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19560
  
**[Test build #83008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83008/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19560
  
@viirya
Thanks a lot for the comments.
1. In the current change, I verify the stats against the file system only when 
the relation is part of a join.
2. I added a warning when the size from the file system is smaller than the 
size from the metastore.
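
The second point can be sketched on a local directory. This is not the PR's implementation: a temporary local directory stands in for the table location, `directory_size` and `verified_size` are hypothetical helpers, and returning the filesystem size in both cases is an assumption.

```python
import logging
import os
import tempfile

logger = logging.getLogger("stats_check")


def directory_size(path: str) -> int:
    """Sum the sizes of all files under path, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def verified_size(metastore_size: int, table_path: str) -> int:
    """Return the size found on the file system, warning on a shortfall."""
    fs_size = directory_size(table_path)
    if fs_size < metastore_size:
        logger.warning(
            "Size on file system (%d bytes) is smaller than size in "
            "metastore (%d bytes) for %s",
            fs_size, metastore_size, table_path)
    # Assumption: the file system value is used either way.
    return fs_size


with tempfile.TemporaryDirectory() as table_dir:
    # Write a single 1024-byte "partition file" as test data.
    with open(os.path.join(table_dir, "part-00000"), "wb") as f:
        f.write(b"x" * 1024)
    size_when_fs_larger = verified_size(100, table_dir)    # no warning
    size_when_fs_smaller = verified_size(4096, table_dir)  # logs a warning
```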


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19560
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83002/


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19560
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19560
  
**[Test build #83002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83002/testReport)** for PR 19560 at commit [`bf59c27`](https://github.com/apache/spark/commit/bf59c27d0a8a01dc0572cf148f40b6337799f241).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2017-10-23 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19560
  
@gatorsmile @dongjoon-hyun 

Thanks a lot for looking into this.
This PR aims to avoid OOM if the metastore fails to update table properties 
after the data has already been produced. With the config in this PR enabled, 
we check the size on the filesystem only when `totalSize` is below 
`spark.sql.autoBroadcastJoinThreshold`, so I think the cost is acceptable.

Yes, the storage can be other filesystems. I refined the name. Please take 
another look when you have time.


---
