[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-10-05 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 @gatorsmile Hive tables don't support case sensitive column names, so I use data source tables in the added test cases. See [the followup pr](https://github.com/apache/spark/pull/15360) --- If your

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-10-04 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 @rxin Thanks, I'll fix them in the followup pr. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-10-03 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15090 LGTM except for one minor comment regarding ndv config document.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-10-01 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15090 Another test case for Unicode column names in ANALYZE COLUMN: ```Scala // scalastyle:off // non ascii characters are not allowed in the source code, so we disable the

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-10-01 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66227 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66227/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-10-01 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66227 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66227/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66226/ Test FAILed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-10-01 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66226 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66226/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66188/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66150/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66150 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66150/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-29 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66150 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66150/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66131/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-29 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66131 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66131/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-29 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66131 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66131/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66053/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66053/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-28 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 When encoding, I convert the InternalRow(UnsafeRow) into a byte array and use Base64 to encode as a string; when decoding, use Base64 to decode from a string and convert the byte array to an
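The Base64 round trip described in the comment above can be sketched as follows. This is a minimal illustration and not the PR's code: `StatBytesCodec` is a hypothetical name, and a plain byte array stands in for the bytes of the `UnsafeRow`, since the snippet does not show how the PR obtains them.

```scala
import java.util.Base64

// Hypothetical helper: a plain byte array stands in for the serialized row bytes.
object StatBytesCodec {
  // Encode raw bytes as a Base64 string, suitable for a string-valued property.
  def encode(bytes: Array[Byte]): String =
    Base64.getEncoder.encodeToString(bytes)

  // Decode the Base64 string back into the original bytes.
  def decode(encoded: String): Array[Byte] =
    Base64.getDecoder.decode(encoded)
}
```

Base64 keeps arbitrary binary data safe inside a text field, at the cost of the stored value no longer being human-readable.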

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-28 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #66053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66053/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-27 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 I have a problem in converting between InternalRow and String, as now we are using InternalRow to represent ColumnStat. Since we want to persist ColumnStat into metastore and we use table

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65994/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65994 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65994/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65994 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65994/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65950/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65950 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65950/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65950/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65941/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-26 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65941 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65941/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-26 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65941 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65941/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test FAILed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65935/ Test FAILed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-26 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65935/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-26 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65935 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65935/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-25 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 To help us choose a better design, we need to first clarify the usage of column stats. A simple example may look like this (e.g. predicate: col < 5): ```java filter.condition match {
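As a rough illustration of how column statistics could feed a predicate such as `col < 5`, the sketch below estimates the fraction of matching rows from a column's minimum and maximum. It is not taken from the PR: `RangeSelectivity` is a hypothetical name, and the uniform-distribution assumption is an illustrative simplification.

```scala
// Hypothetical sketch: estimate the fraction of rows satisfying `col < value`
// from the column's min and max, assuming values are uniformly distributed.
object RangeSelectivity {
  def lessThan(min: Double, max: Double, value: Double): Double =
    if (value <= min) 0.0        // nothing can be below the minimum
    else if (value > max) 1.0    // every value is below the bound
    else (value - min) / (max - min) // here min < value <= max, so max > min
}
```

For a column ranging from 0 to 10, `RangeSelectivity.lessThan(0, 10, 5)` yields 0.5 under this assumption.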

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-24 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15090 Had an offline discussion with @wzhfy; here is the result: 1. The current `ColumnStats` is hard to use because most of its fields are `Option`, some are `Option[Any]`, and we may need a

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-23 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 - For point 1, if we use a different schema for every type, we need to do type matching and convert to the corresponding typed row every time we want to use some column statistics, which would be tedious

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-23 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15090 In this PR, we use `ColumnStats` to represent the column statistics in memory, and persist it to the Hive metastore by converting it to a string with format `a=1,b=2`. This brings 2 problems:
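The `a=1,b=2` string format mentioned above can be sketched as a flat key/value encoding. This is an illustrative assumption about the general shape, not the PR's implementation; `KeyValueStats` and its methods are hypothetical names.

```scala
// Hypothetical sketch of a flat "key=value" encoding; not the PR's code.
object KeyValueStats {
  // Render a map as "k1=v1,k2=v2", sorted by key so the output is stable.
  def serialize(stats: Map[String, String]): String =
    stats.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }.mkString(",")

  // Parse "k1=v1,k2=v2" back into a map. Every value comes back as a string,
  // so the caller has to know each field's type in order to convert it.
  def deserialize(s: String): Map[String, String] =
    if (s.isEmpty) Map.empty
    else s.split(",").map { pair =>
      val i = pair.indexOf('=') // split on the first '=' only
      pair.substring(0, i) -> pair.substring(i + 1)
    }.toMap
}
```

In this sketch, a value that itself contains a comma would not survive the round trip, and all type information is lost in the string form.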

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65810/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65810 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65810/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65810 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65810/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65793/ Test FAILed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test FAILed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65793 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65793/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65793/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 @gatorsmile OK, I'll rebase and add some tests to handle negative cases.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15090 Although no conflict is detected, we should still fetch and merge the latest master. Then, the changes made in `DataSourceStrategy.scala` will disappear.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65754/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-22 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65754/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15090 The test suite `StatisticsColumnSuite` is missing negative cases. For example, so far we do not allow users to analyze temporary tables. Ideally, all the exceptions the code could

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65754/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65749/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65749 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65749/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65746/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65746 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65746/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65749 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65749/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65746 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65746/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 retest this please

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test FAILed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65745/ Test FAILed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65745/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 The latest update includes the following changes: 1. move the test suite for column stats into sql/core; 2. extract the computing logic in `AnalyzeColumnCommand` as a separate method; 3.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65745/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread srinathshankar
Github user srinathshankar commented on the issue: https://github.com/apache/spark/pull/15090 Actually, I didn't mean to approve immediately, sorry.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 @gatorsmile OK, I'll modify the test suite.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15090 A SQL function. : ) I might have underestimated the effort. As @hvanhovell said, how about adding a test suite in sql/core for verifying [`updateStats

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread hvanhovell
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15090 I would argue strongly against creating a single aggregate function to calculate all statistics, for a couple of reasons: 1. This would create a tonne of duplicated code (integrating

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15090 What do you mean by a built-in function? A SQL function, or just a normal Scala function? Sorry for asking, but it is vague given the context we are in.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15090 @rxin @wzhfy Just my 2 cents. To address @rxin's comment, we can implement a built-in function, `compute_stats`, as Hive does. The actual implementation of `AnalyzeColumnCommand` can be

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 How about changing the returned result of the `ANALYZE` command, as @gatorsmile suggested in the [comment](https://github.com/apache/spark/pull/15090#discussion_r79547634)? Then we can compare collected

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15090 I looked at your test cases. Are there any that actually depend on things in the Hive module? There is also an in-memory catalog for sql/core that you can use. In addition, unfortunately

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 @rxin Because we want to test storing/loading these stats from the metastore, and make sure they are right after we load them into our catalogTable.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15090 @wzhfy one question: why are the test suites in sql/hive? Can't they live in sql/core?

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-21 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 This PR has been updated. Can you take another look? @rxin @hvanhovell @cloud-fan @gatorsmile

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65690/ Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65690 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65690/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65690 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65690/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-20 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 > Do you still remember the test case I showed you in table-level statistics? A table with zero columns. Can you add a test case for that scenario? What's the purpose of adding this test?

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Merged build finished. Test PASSed.

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15090 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65633/ Test PASSed. ---

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65633 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65633/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-20 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15090 Do you still remember the test case I showed you in table-level statistics? A table with zero columns. Can you add a test case for that scenario?

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-19 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 This pr has been updated based on all the above comments, changes are as follows: 1. Modify analyze syntax a little bit: `identifierSeq` is now non-optional, i.e. users must specify column names

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-19 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65633 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65633/consoleFull)** for PR 15090 at commit

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-19 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 @gatorsmile Yeah, my latest code contains lots of changes based on the comments; I'll list them when I finish. And I'll also create a separate pr for the bug fix. Thanks for the advice!

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-19 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15090 @wzhfy You also fixed a bug in this PR. Could you create a separate PR? Then, we can backport it easily if needed. Sometimes, if you fix a bug in a huge PR like this, we might be hard to

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-19 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15090 @wzhfy In the implementation, you have a few limitations. Could you improve/update your PR description? It can help the future code maintainers understand what you did in this PR and why you did

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-17 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 @rxin Yeah, I think it's better to move histograms into ColumnStats than to maintain two members like BasicColStats and Histograms. Let me rename `BasicColStats` as `ColumnStats` so that all the

[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-17 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15090 What is "basic" about it? Are we going to have something that's not basic in the future (e.g. histogram)? If yes, should those go into a separate class or just in ColumnStats?
