[GitHub] spark issue #14625: [SPARK-17045] [SQL] Build/move Join-related test cases i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14625 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64120/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14625: [SPARK-17045] [SQL] Build/move Join-related test cases i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14625 Merged build finished. Test PASSed.
[GitHub] spark issue #14625: [SPARK-17045] [SQL] Build/move Join-related test cases i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14625 **[Test build #64120 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64120/consoleFull)** for PR 14625 at commit [`bf55624`](https://github.com/apache/spark/commit/bf556240e0f01cdd12f53a9407d8811ec30380d4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14635: [SPARK-17052] [SQL] Remove Duplicate Test Cases auto_joi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14635 cc @cloud-fan @rxin Could you check whether this PR is reasonable? Thanks!
[GitHub] spark issue #14727: [SPARK-17166] [SQL] Store Table Properties in CTAS that ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14727 **[Test build #64125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64125/consoleFull)** for PR 14727 at commit [`bffc412`](https://github.com/apache/spark/commit/bffc412b4ce50ffc63da0f6b05d82f7dd52a97fd).
[GitHub] spark issue #14727: [SPARK-17166] [SQL] Store Table Properties in CTAS that ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14727 cc @cloud-fan @yhuai This is what we discussed in another PR. Could you please review whether this is the right fix? Thanks!
[GitHub] spark issue #14577: [SPARK-16986][WEB UI] Make 'Started' time, 'Completed' t...
Github user Sherry302 commented on the issue: https://github.com/apache/spark/pull/14577 Hi, @srowen Thanks a lot for the comments. Sorry for the late reply. You are right. I will check how other pages format the date.
[GitHub] spark pull request #14727: [SPARK-17166] [SQL] Store Table Properties Specif...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/14727 [SPARK-17166] [SQL] Store Table Properties Specified in CTAS after Conversion to Data Source Tables ## What changes were proposed in this pull request? CTAS lost table properties after conversion to data source tables. For example, ```SQL CREATE TABLE t TBLPROPERTIES('prop1' = 'c', 'prop2' = 'd') AS SELECT 1 as a, 1 as b ``` The output of `DESC FORMATTED t` does not have the related properties. ``` |Table Parameters: | | | | rawDataSize |-1 | | | numFiles |1 | | | transient_lastDdlTime |1471670983 | | | totalSize |496 | | | spark.sql.sources.provider|parquet | | | EXTERNAL |FALSE | | | COLUMN_STATS_ACCURATE |false | | | numRows |-1 | | || | | |# Storage Information | | | |SerDe Library: |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | | |InputFormat: |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | | |OutputFormat: |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | | |Compressed: |No | | |Storage Desc Parameters:| | | | serialization.format |1 | | | path |file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-f3aa2927-6464-4a35-a715-1300dde6c614/t| | ``` After the fix, the properties specified by users are stored as serde properties, since the table properties are used for storing table schemas and system generated properties. ``` |Table Parameters: | | | | rawDataSize |-1 | | | numFiles |1 | | | transient_lastDdlTime |1471672182 | | | totalSize |496 | | | spark.sql.sources.provider|parquet | | | EXTERNAL |FALSE | | | COLUMN_STATS_ACCURATE |false | | | numRows |-1 | | ||
[GitHub] spark issue #14682: [SPARK-17104][SQL] LogicalRelation.newInstance should fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14682 **[Test build #64124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64124/consoleFull)** for PR 14682 at commit [`e7fe68b`](https://github.com/apache/spark/commit/e7fe68b002594a294b199317be3e2d8fc250eb4e).
[GitHub] spark issue #14697: [SPARK-17124][SQL] RelationalGroupedDataset.agg should p...
Github user petermaxlee commented on the issue: https://github.com/apache/spark/pull/14697 I updated the description.
[GitHub] spark pull request #14682: [SPARK-17104][SQL] LogicalRelation.newInstance sh...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14682#discussion_r75573056 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala --- @@ -79,11 +79,18 @@ case class LogicalRelation( /** Used to lookup original attribute capitalization */ val attributeMap: AttributeMap[AttributeReference] = AttributeMap(output.map(o => (o, o))) - def newInstance(): this.type = + /** + * Returns a new instance of this LogicalRelation. According to the semantics of + * MultiInstanceRelation, this method should returns a copy of this object with + * unique expression ids. Thus we don't respect the `expectedOutputAttributes` and --- End diff -- Done.
[GitHub] spark pull request #14726: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/14726#discussion_r75573049 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java --- @@ -22,15 +22,21 @@ import com.google.common.io.ByteStreams; import com.google.common.io.Closeables; +import org.apache.spark.SparkEnv; import org.apache.spark.serializer.SerializerManager; import org.apache.spark.storage.BlockId; import org.apache.spark.unsafe.Platform; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; /** * Reads spill files written by {@link UnsafeSorterSpillWriter} (see that class for a description * of the file format). */ public final class UnsafeSorterSpillReader extends UnsafeSorterIterator implements Closeable { + private static final Logger logger = LoggerFactory.getLogger(UnsafeSorterSpillReader.class); + private static final int DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024; // 1 MB --- End diff -- @rxin : In response to [0], I have changed to 1 MB. As per my experiments, 1 MB gave good perf and we are using it as default for all prod jobs. One concern / proposal: With the change, UnsafeSorterSpillReader would consume more memory than before as the buffer would increase from 8k to 1 MB. Overall per UnsafeSorterSpillReader object footprint would grow from 2.5 MB to 3.6 MB (I have profiled to the number. See [1]). In case of job(s) which spill a lot, there would be lot of these spill readers created (in the screenshot, there were 400+ readers). Current merging approach is to open all the spill files at the same time and merge them all at once using a priority queue. Having lots of these objects in memory can lead to OOMs as there is no accounting for buffers allocated inside UnsafeSorterSpillReader (even without this change, snappy already had its own buffers for compressed and uncompressed data). 
Also, from disk point of view, having lots of files open at the same time would lead to random seeks and won't play well with OS's cache for disk reads. One might say that users should tune the job so that the spills are lesser but it might not be obvious for people who do not understand the system internals. Also, for pipelines the data changes everyday and one setting may not work everytime. Should we add some kinda hierarchical merging wherein spill files are iteratively merged in batches ? It could be turned on when there are say more than 100 spill files to be merged. AFAIK, Hadoop has this. [0] : https://github.com/apache/spark/pull/14475#discussion_r75440822 [1] : https://postimg.org/image/cs5zr6lyx/
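The hierarchical merge proposed above can be sketched in plain Java. This is a minimal illustration of the idea, not Spark's actual spill-merge code: each inner list stands in for a sorted spill file, `mergeOnce` mirrors today's merge-everything-at-once priority-queue approach, and `hierarchicalMerge` folds the runs together in batches of `fanIn` so at most `fanIn` "files" are open at a time, at the cost of materializing intermediate runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class BatchedMerge {
    // Merge a batch of sorted runs into one sorted run with a priority queue,
    // keyed by each run's current head element (heap entries: {value, runIdx, pos}).
    static List<Integer> mergeOnce(List<List<Integer>> runs) {
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) heap.add(new int[]{runs.get(i).get(0), i, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(e[0]);
            int next = e[2] + 1;
            if (next < runs.get(e[1]).size()) {
                heap.add(new int[]{runs.get(e[1]).get(next), e[1], next});
            }
        }
        return out;
    }

    // Iteratively merge runs in batches of `fanIn`, bounding how many
    // runs (and hence read buffers) are live in any single merge pass.
    static List<Integer> hierarchicalMerge(List<List<Integer>> runs, int fanIn) {
        List<List<Integer>> current = new ArrayList<>(runs);
        while (current.size() > 1) {
            List<List<Integer>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += fanIn) {
                next.add(mergeOnce(current.subList(i, Math.min(i + fanIn, current.size()))));
            }
            current = next;
        }
        return current.isEmpty() ? new ArrayList<>() : current.get(0);
    }
}
```

The trade-off the comment raises is visible here: batching caps peak memory and open file handles, but intermediate runs get rewritten, so each element may be read and written multiple times.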
[GitHub] spark issue #14697: [SPARK-17124][SQL] RelationalGroupedDataset.agg should p...
Github user petermaxlee commented on the issue: https://github.com/apache/spark/pull/14697 For example, run both count and sum for a column. Let me update the description.
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14475 Merged build finished. Test PASSed.
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14475 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64118/ Test PASSed.
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14475 **[Test build #64118 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64118/consoleFull)** for PR 14475 at commit [`950bb21`](https://github.com/apache/spark/commit/950bb21d1f8f3e98b6a8ef00606c9b6c3e30f659). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14709: [SPARK-17150][SQL] Support SQL generation for inl...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14709
[GitHub] spark issue #14709: [SPARK-17150][SQL] Support SQL generation for inline tab...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14709 thanks, merging to master/2.0
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14475 Merged build finished. Test PASSed.
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14475 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64117/ Test PASSed.
[GitHub] spark issue #14697: [SPARK-17124][SQL] RelationalGroupedDataset.agg should p...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14697 what do you mean by `allow multiple aggregates per column` in the title?
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14475 **[Test build #64117 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64117/consoleFull)** for PR 14475 at commit [`6b8fc48`](https://github.com/apache/spark/commit/6b8fc487dd5324ae589d75d271da18c54110cf4a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14692: [SPARK-17115] [SQL] decrease the threshold when s...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14692#discussion_r75572921 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -584,15 +584,18 @@ class CodegenContext { * @param expressions the codes to evaluate expressions. */ def splitExpressions(row: String, expressions: Seq[String]): String = { -if (row == null) { +if (row == null || currentVars != null) { --- End diff -- When will `row == null`? I understand `currentVars != null` means we are in whole stage codegen.
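For background on the diff above: `splitExpressions` exists because the JVM caps a single method at 64KB of bytecode, so generated code for many expressions is split across helper methods; under whole-stage codegen the expressions reference local variables (`currentVars`) that would not be visible from a separate method, hence the early-out. A rough Java sketch of the splitting idea (illustrative only, not Spark's actual codegen; the `InternalRow` parameter type and `apply_N` naming are assumptions):

```java
import java.util.List;

public class ExprSplitter {
    // Chunk expression snippets and wrap each chunk in a helper method that
    // takes the row as a parameter, so no single generated method grows too large.
    // Returns the call sites followed by the helper method definitions.
    static String split(String row, List<String> expressions, int chunkSize) {
        StringBuilder calls = new StringBuilder();
        StringBuilder helpers = new StringBuilder();
        for (int i = 0; i < expressions.size(); i += chunkSize) {
            String name = "apply_" + (i / chunkSize);
            calls.append(name).append("(").append(row).append(");\n");
            helpers.append("private void ").append(name)
                   .append("(InternalRow ").append(row).append(") {\n");
            for (String e : expressions.subList(i, Math.min(i + chunkSize, expressions.size()))) {
                helpers.append("  ").append(e).append("\n");
            }
            helpers.append("}\n");
        }
        return calls + "\n" + helpers;
    }
}
```

This only works when every expression reads its inputs from the row object being passed along; values held in local variables of the caller (the whole-stage-codegen case) cannot cross the method boundary this way.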
[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user Sherry302 commented on the issue: https://github.com/apache/spark/pull/14659 Hi, @srowen . Thank you so much for the review. Sorry for the test failure and late update. The failure reasons are that "jobID" was none or there was no "spark.app.name" in sparkConf. I have updated the PR to set default values for "jobID" and "spark.app.name". When a real application runs on Spark, it will always have "jobID" and "spark.app.name". What's the use case for this? When users run Spark applications on Yarn on HDFS, Spark's caller contexts will be written into hdfs-audit.log. The Spark caller contexts are JobID_stageID_stageAttemptId_taskID_attemptNumber and the applications' names. The caller context can help users better diagnose and understand how specific applications impact parts of the Hadoop system and the potential problems they may be creating (e.g. overloading NN). As HDFS mentioned in HDFS-9184, for a given HDFS operation, it's very helpful to track which upper level job issues it.
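A caller context of the shape described above can be assembled as a plain string. This is a hypothetical sketch: the prefix, delimiters, and field names below are illustrative assumptions, not the exact format the PR writes into hdfs-audit.log.

```java
public class SparkCallerContext {
    // Builds a task-level caller context string carrying the fields named above:
    // jobID, stageID, stageAttemptId, taskID, attemptNumber, plus the app name.
    // The concrete layout here (prefix and separators) is invented for illustration.
    static String taskContext(String appName, int jobId, int stageId,
                              int stageAttemptId, long taskId, int attemptNumber) {
        return String.format("SPARK_%s_JId_%d_SId_%d_%d_TId_%d_%d",
            appName, jobId, stageId, stageAttemptId, taskId, attemptNumber);
    }
}
```

The point of such a string is that the HDFS NameNode logs it alongside each audit entry, so an operator can map a hot HDFS operation back to the exact Spark job, stage, and task attempt that issued it.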
[GitHub] spark issue #14726: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14726 **[Test build #64123 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64123/consoleFull)** for PR 14726 at commit [`c4f37b6`](https://github.com/apache/spark/commit/c4f37b6c8d3f1a8a565b1f215f55a501edece778).
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/14475 Continuing to https://github.com/apache/spark/pull/14726
[GitHub] spark pull request #14726: [SPARK-16862] Configurable buffer size in `Unsafe...
GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/14726 [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader` ## What changes were proposed in this pull request? Jira: https://issues.apache.org/jira/browse/SPARK-16862 `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This PR makes it configurable to improve on disk reads. I have made the default value to be 1 MB as with that value I observed improved performance. ## How was this patch tested? I am relying on the existing unit tests. ## Performance After deploying this change to prod and setting the config to 1 MB, there was a 12% reduction in the CPU time and 19.5% reduction in CPU reservation time. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tejasapatil/spark spill_buffer_2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14726.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14726 commit c4f37b6c8d3f1a8a565b1f215f55a501edece778 Author: Tejas Patil Date: 2016-08-20T05:06:03Z [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`
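The core of the change described above can be sketched as follows: wrap the spill file's stream in a `BufferedInputStream` whose size comes from configuration, falling back to a default and clamping out-of-range values. This is a minimal sketch assuming a plain `Map` stands in for `SparkConf`; the config key name and the clamping bounds are illustrative, not necessarily the PR's exact choices.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

public class SpillReaderBuffer {
    static final int DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024;      // 1 MB default
    static final int MAX_BUFFER_SIZE_BYTES = 16 * 1024 * 1024;     // cap misconfigured values

    // Resolve the read-buffer size from config, falling back to the default
    // when the key is absent or the value is out of a sane range.
    static int bufferSize(Map<String, String> conf) {
        String raw = conf.get("spark.unsafe.sorter.spill.reader.buffer.size");
        int size = (raw == null) ? DEFAULT_BUFFER_SIZE_BYTES : Integer.parseInt(raw);
        if (size <= 0 || size > MAX_BUFFER_SIZE_BYTES) {
            return DEFAULT_BUFFER_SIZE_BYTES;
        }
        return size;
    }

    // Open a spill file with the configured buffer size instead of the 8k default.
    static InputStream openSpill(File spillFile, Map<String, String> conf) throws IOException {
        return new BufferedInputStream(new FileInputStream(spillFile), bufferSize(conf));
    }
}
```

A larger buffer turns many small random reads into fewer sequential ones, which is where the reported CPU savings come from; the memory-accounting concern raised in the review discussion is the flip side, since each open spill reader now pins a full buffer.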
[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14155 LGTM
[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14712 a high-level question: Looks like the current design depends on some features of hive metastore, e.g. the `STATS_GENERATED_VIA_STATS_TASK` flag. Is it possible that we just treat hive metastore as a persistence layer? So that the statistics can still work if Spark SQL has its own metastore in the future.
[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14155 **[Test build #64122 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64122/consoleFull)** for PR 14155 at commit [`38b838a`](https://github.com/apache/spark/commit/38b838a9d27d5e11bad5f5e7040fe2d6d2e56216).
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75572799 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -88,14 +90,30 @@ case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { } }.getOrElse(0L) -// Update the Hive metastore if the total size of the table is different than the size +val needUpdate = new mutable.HashMap[String, String]() +if (newTotalSize > 0 && newTotalSize != oldTotalSize) { + needUpdate += (AnalyzeTableCommand.TOTAL_SIZE_FIELD -> newTotalSize.toString) +} +if (!noscan) { + val oldRowCount = tableParameters.get(AnalyzeTableCommand.ROW_COUNT).map(_.toLong) +.getOrElse(-1L) + val newRowCount = sparkSession.table(tableName).count() + + if (newRowCount >= 0 && newRowCount != oldRowCount) { +needUpdate += (AnalyzeTableCommand.ROW_COUNT -> newRowCount.toString) + } +} +// Update the Hive metastore if the above parameters of the table is different than those // recorded in the Hive metastore. // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats(). -if (newTotalSize > 0 && newTotalSize != oldTotalSize) { +if (needUpdate.nonEmpty) { + // need to set this parameter so that we can store other parameters like "numRows" into + // Hive metastore + needUpdate.put(AnalyzeTableCommand.STATS_GENERATED_VIA_STATS_TASK, --- End diff -- @viirya yeah, thanks for the advice
[GitHub] spark pull request #14682: [SPARK-17104][SQL] LogicalRelation.newInstance sh...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14682#discussion_r75572798 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala --- @@ -79,11 +79,18 @@ case class LogicalRelation( /** Used to lookup original attribute capitalization */ val attributeMap: AttributeMap[AttributeReference] = AttributeMap(output.map(o => (o, o))) - def newInstance(): this.type = + /** + * Returns a new instance of this LogicalRelation. According to the semantics of + * MultiInstanceRelation, this method should returns a copy of this object with + * unique expression ids. Thus we don't respect the `expectedOutputAttributes` and --- End diff -- update the doc?
[GitHub] spark pull request #14724: [SPARK-17162] Range does not support SQL generati...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14724#discussion_r75572751 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -205,6 +205,9 @@ class SQLBuilder private ( case p: ScriptTransformation => scriptTransformationToSQL(p) +case Range(start, end, step, numPartitions, output) => + s"SELECT id AS `${output.head.name}` FROM range($start, $end, $step, $numPartitions)" --- End diff -- while you are at it, can you move this into a toSQL function in Range?
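For readers following the review, here is a hedged sketch of what the suggestion above might look like: moving the string construction out of `SQLBuilder` and into a `toSQL` method on the `Range` operator itself. The field names come from the quoted pattern match; the class signature and the delegating case are assumptions, not the actual patch.

```scala
// Hypothetical sketch: Range renders its own SQL, and SQLBuilder delegates to it.
case class Range(
    start: Long,
    end: Long,
    step: Long,
    numPartitions: Int,
    output: Seq[Attribute]) extends LeafNode {

  // The same string SQLBuilder produced inline, now owned by the operator.
  def toSQL(): String =
    s"SELECT id AS `${output.head.name}` FROM range($start, $end, $step, $numPartitions)"
}

// In SQLBuilder, the case then collapses to:
//   case r: Range => r.toSQL()
```

This keeps SQL generation next to the operator it describes, so `SQLBuilder` does not need to know `Range`'s field layout.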
[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14155#discussion_r75572741 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -175,7 +127,8 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log } else { val qualifiedTable = MetastoreRelation( - qualifiedTableName.database, qualifiedTableName.name)(table, client, sparkSession) + qualifiedTableName.database, qualifiedTableName.name)( + table.copy(provider = Some("hive")), client, sparkSession) --- End diff -- Then we will restore table metadata from table properties twice. As this class will be removed soon, I don't want to change too much.
[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14155#discussion_r75572713 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -200,22 +375,77 @@ private[spark] class HiveExternalCatalog(client: HiveClient, hadoopConf: Configu * Alter a table whose name that matches the one specified in `tableDefinition`, * assuming the table exists. * - * Note: As of now, this only supports altering table properties, serde properties, - * and num buckets! + * Note: As of now, this doesn't support altering table schema, partition column names and bucket + * specification. We will ignore them even if users do specify different values for these fields. */ override def alterTable(tableDefinition: CatalogTable): Unit = withClient { assert(tableDefinition.identifier.database.isDefined) val db = tableDefinition.identifier.database.get requireTableExists(db, tableDefinition.identifier.table) -client.alterTable(tableDefinition) +verifyTableProperties(tableDefinition) + +if (tableDefinition.provider == Some("hive") || tableDefinition.tableType == VIEW) { + client.alterTable(tableDefinition) +} else { + val oldDef = client.getTable(db, tableDefinition.identifier.table) + // Sets the `schema`, `partitionColumnNames` and `bucketSpec` from the old table definition, + // to retain the spark specific format if it is. + // Also add table meta properties to table properties, to retain the data source table format. 
+ val newDef = tableDefinition.copy( +schema = oldDef.schema, +partitionColumnNames = oldDef.partitionColumnNames, +bucketSpec = oldDef.bucketSpec, +properties = tableMetadataToProperties(tableDefinition) ++ tableDefinition.properties) + + client.alterTable(newDef) +} } override def getTable(db: String, table: String): CatalogTable = withClient { -client.getTable(db, table) +restoreTableMetadata(client.getTable(db, table)) } override def getTableOption(db: String, table: String): Option[CatalogTable] = withClient { -client.getTableOption(db, table) +client.getTableOption(db, table).map(restoreTableMetadata) + } + + /** + * Restores table metadata from the table properties if it's a datasouce table. This method is + * kind of a opposite version of [[createTable]]. + * + * It reads table schema, provider, partition column names and bucket specification from table + * properties, and filter out these special entries from table properties. + */ + private def restoreTableMetadata(table: CatalogTable): CatalogTable = { +if (table.tableType == VIEW) { + table +} else { + getProviderFromTableProperties(table).map { provider => +// SPARK-15269: Persisted data source tables always store the location URI as a storage +// property named "path" instead of standard Hive `dataLocation`, because Hive only +// allows directory paths as location URIs while Spark SQL data source tables also +// allows file paths. So the standard Hive `dataLocation` is meaningless for Spark SQL +// data source tables. +// Spark SQL may also save external data source in Hive compatible format when +// possible, so that these tables can be directly accessed by Hive. For these tables, +// `dataLocation` is still necessary. Here we also check for input format because only +// these Hive compatible tables set this field. 
+val storage = if (table.tableType == EXTERNAL && table.storage.inputFormat.isEmpty) { + table.storage.copy(locationUri = None) +} else { + table.storage +} +table.copy( + storage = storage, + schema = getSchemaFromTableProperties(table), + provider = Some(provider), + partitionColumnNames = getPartitionColumnsFromTableProperties(table), + bucketSpec = getBucketSpecFromTableProperties(table), + properties = getOriginalTableProperties(table)) --- End diff -- The previous code also stored options in serde properties. I'm not going to fix everything in this PR, and I'm not sure if it's a real problem, but let's continue the discussion in a follow-up.
[GitHub] spark pull request #11502: [SPARK-13659] Refactor BlockStore put*() APIs to ...
Github user pzz2011 commented on a diff in the pull request: https://github.com/apache/spark/pull/11502#discussion_r75572673 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -432,98 +432,105 @@ private[spark] class BlockManager( logDebug(s"Block $blockId was not found") None case Some(info) => -val level = info.level -logDebug(s"Level for block $blockId is $level") - -// Look for the block in memory -if (level.useMemory) { - logDebug(s"Getting block $blockId from memory") - val result = if (asBlockResult) { -memoryStore.getValues(blockId).map { iter => - val ci = CompletionIterator[Any, Iterator[Any]](iter, releaseLock(blockId)) - new BlockResult(ci, DataReadMethod.Memory, info.size) -} - } else { -memoryStore.getBytes(blockId) - } - result match { -case Some(values) => - return result -case None => - logDebug(s"Block $blockId not found in memory") - } +doGetLocal(blockId, info, asBlockResult) +} + } + + private def doGetLocal( + blockId: BlockId, + info: BlockInfo, + asBlockResult: Boolean): Option[Any] = { +val level = info.level +logDebug(s"Level for block $blockId is $level") + +// Look for the block in memory +if (level.useMemory) { + logDebug(s"Getting block $blockId from memory") + val result = if (asBlockResult) { --- End diff -- Excuse me @JoshRosen, may I ask a question (it may be a little stupid)? What does `asBlockResult` mean, and why is it used here?
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13680 **[Test build #64121 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64121/consoleFull)** for PR 13680 at commit [`f418f9c`](https://github.com/apache/spark/commit/f418f9cf7a35ef8c2c8ed93cf73487aac275e772).
[GitHub] spark issue #14625: [SPARK-17045] [SQL] Build/move Join-related test cases i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14625 **[Test build #64120 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64120/consoleFull)** for PR 14625 at commit [`bf55624`](https://github.com/apache/spark/commit/bf556240e0f01cdd12f53a9407d8811ec30380d4).
[GitHub] spark issue #14683: [SPARK-16968]Document additional options in jdbc Writer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14683 Merged build finished. Test PASSed.
[GitHub] spark issue #14683: [SPARK-16968]Document additional options in jdbc Writer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14683 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64119/ Test PASSed.
[GitHub] spark issue #14683: [SPARK-16968]Document additional options in jdbc Writer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14683 **[Test build #64119 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64119/consoleFull)** for PR 14683 at commit [`8595ece`](https://github.com/apache/spark/commit/8595ece40d18611b003b70f4e62d90c615349abd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user junyangq commented on the issue: https://github.com/apache/spark/pull/14705 Thanks @felixcheung!
[GitHub] spark issue #14683: [SPARK-16968]Document additional options in jdbc Writer
Github user GraceH commented on the issue: https://github.com/apache/spark/pull/14683 @srowen I have updated the patch accordingly. Please take a look and let me know if anything is missing.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75572040 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -168,6 +170,57 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("generate table-level statistics") { +def checkTableStats( +statsSeq: Seq[Statistics], +sizeInBytes: Int, +estimatedSize: Int, +rowCount: Int): Unit = { + assert(statsSeq.size === 1) + assert(statsSeq.head.sizeInBytes === BigInt(sizeInBytes)) + assert(statsSeq.head.estimatedSize === Some(BigInt(estimatedSize))) + assert(statsSeq.head.rowCount === Some(BigInt(rowCount))) +} + +sql("CREATE TABLE analyzeTable (key STRING, value STRING)").collect() +sql("CREATE TABLE parquetTable (key STRING, value STRING) STORED AS PARQUET").collect() +sql("CREATE TABLE orcTable (key STRING, value STRING) STORED AS ORC").collect() + +sql("INSERT INTO TABLE analyzeTable SELECT * FROM src").collect() +sql("INSERT INTO TABLE parquetTable SELECT * FROM src").collect() +sql("INSERT INTO TABLE orcTable SELECT * FROM src").collect() +sql("INSERT INTO TABLE orcTable SELECT * FROM src").collect() + +sql("ANALYZE TABLE analyzeTable COMPUTE STATISTICS") +sql("ANALYZE TABLE parquetTable COMPUTE STATISTICS") +sql("ANALYZE TABLE orcTable COMPUTE STATISTICS") + +var df = sql("SELECT * FROM analyzeTable") +var stats = df.queryExecution.analyzed.collect { case mr: MetastoreRelation => + mr.statistics +} +checkTableStats(stats, 5812, 5812, 500) + +// test statistics of LogicalRelation inherited from MetastoreRelation +df = sql("SELECT * FROM parquetTable") +stats = df.queryExecution.analyzed.collect { case rel: LogicalRelation => + rel.statistics +} +checkTableStats(stats, 4236, 4236, 500) + +sql("SET spark.sql.hive.convertMetastoreOrc=true").collect() --- End diff -- Please use `withSQLConf` . 
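As a hedged illustration of the `withSQLConf` suggestion above: the test-utility form below scopes the config change to a block and restores the previous value on exit, instead of mutating the session with a `SET` statement. The expected statistics values are omitted because the quoted diff truncates before showing them.

```scala
// Sketch: scope the config change instead of issuing a session-wide SET.
// withSQLConf restores the previous value of the key when the block exits,
// so later tests in the suite are not affected by the override.
withSQLConf("spark.sql.hive.convertMetastoreOrc" -> "true") {
  val df = sql("SELECT * FROM orcTable")
  val stats = df.queryExecution.analyzed.collect { case rel: LogicalRelation =>
    rel.statistics
  }
  // checkTableStats(stats, /* expected sizes and row count elided in the diff */)
}
```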
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64116/ Test FAILed.
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Merged build finished. Test FAILed.
[GitHub] spark issue #14683: [SPARK-16968]Document additional options in jdbc Writer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14683 **[Test build #64119 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64119/consoleFull)** for PR 14683 at commit [`8595ece`](https://github.com/apache/spark/commit/8595ece40d18611b003b70f4e62d90c615349abd).
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14719 **[Test build #64116 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64116/consoleFull)** for PR 14719 at commit [`91cb915`](https://github.com/apache/spark/commit/91cb915b4e6c3c4d24fab3f1e772e7e361d4c088). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75571934 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -342,7 +342,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log logicalRelation } -result.copy(expectedOutputAttributes = Some(metastoreRelation.output)) +val logicalRel = result.copy(expectedOutputAttributes = Some(metastoreRelation.output)) +logicalRel.inheritedStats = Some(metastoreRelation.statistics) --- End diff -- I agree with @cloud-fan. This looks hacky.
[GitHub] spark issue #14682: [SPARK-17104][SQL] LogicalRelation.newInstance should fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14682 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64114/ Test PASSed.
[GitHub] spark issue #14682: [SPARK-17104][SQL] LogicalRelation.newInstance should fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14682 Merged build finished. Test PASSed.
[GitHub] spark issue #14682: [SPARK-17104][SQL] LogicalRelation.newInstance should fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14682 **[Test build #64114 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64114/consoleFull)** for PR 14682 at commit [`e243323`](https://github.com/apache/spark/commit/e243323cb04880c20fb40e1aed8b4a28022d5540). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14475: [SPARK-16862] Configurable buffer size in `Unsafe...
Github user tejasapatil closed the pull request at: https://github.com/apache/spark/pull/14475
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14475 **[Test build #64118 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64118/consoleFull)** for PR 14475 at commit [`950bb21`](https://github.com/apache/spark/commit/950bb21d1f8f3e98b6a8ef00606c9b6c3e30f659).
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75571723 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -88,14 +90,30 @@ case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { } }.getOrElse(0L) -// Update the Hive metastore if the total size of the table is different than the size +val needUpdate = new mutable.HashMap[String, String]() +if (newTotalSize > 0 && newTotalSize != oldTotalSize) { + needUpdate += (AnalyzeTableCommand.TOTAL_SIZE_FIELD -> newTotalSize.toString) +} +if (!noscan) { + val oldRowCount = tableParameters.get(AnalyzeTableCommand.ROW_COUNT).map(_.toLong) +.getOrElse(-1L) + val newRowCount = sparkSession.table(tableName).count() + + if (newRowCount >= 0 && newRowCount != oldRowCount) { +needUpdate += (AnalyzeTableCommand.ROW_COUNT -> newRowCount.toString) + } +} +// Update the Hive metastore if the above parameters of the table is different than those // recorded in the Hive metastore. // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats(). -if (newTotalSize > 0 && newTotalSize != oldTotalSize) { +if (needUpdate.nonEmpty) { + // need to set this parameter so that we can store other parameters like "numRows" into + // Hive metastore + needUpdate.put(AnalyzeTableCommand.STATS_GENERATED_VIA_STATS_TASK, --- End diff -- It looks like `STATS_GENERATED_VIA_STATS_TASK` only needs to be set when `noscan` is `false`. Would it be better to move this into the `if (!noscan)` block above?
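A hedged sketch of the restructuring proposed above, for concreteness. The value written for `STATS_GENERATED_VIA_STATS_TASK` is truncated in the quoted diff, so the `"true"` below is an assumption, and whether the marker is really only needed on this path is still under discussion in the thread.

```scala
// Sketch: set the marker only on the path that actually writes the row count.
if (!noscan) {
  val oldRowCount = tableParameters.get(AnalyzeTableCommand.ROW_COUNT)
    .map(_.toLong).getOrElse(-1L)
  val newRowCount = sparkSession.table(tableName).count()

  if (newRowCount >= 0 && newRowCount != oldRowCount) {
    needUpdate += (AnalyzeTableCommand.ROW_COUNT -> newRowCount.toString)
    // The marker is only required so that "numRows" can be stored in the
    // Hive metastore, so set it alongside the row count it guards.
    needUpdate.put(AnalyzeTableCommand.STATS_GENERATED_VIA_STATS_TASK, "true") // assumed value
  }
}
```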
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75571695 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -88,14 +90,30 @@ case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { } }.getOrElse(0L) -// Update the Hive metastore if the total size of the table is different than the size +val needUpdate = new mutable.HashMap[String, String]() +if (newTotalSize > 0 && newTotalSize != oldTotalSize) { + needUpdate += (AnalyzeTableCommand.TOTAL_SIZE_FIELD -> newTotalSize.toString) +} +if (!noscan) { + val oldRowCount = tableParameters.get(AnalyzeTableCommand.ROW_COUNT).map(_.toLong) +.getOrElse(-1L) + val newRowCount = sparkSession.table(tableName).count() + + if (newRowCount >= 0 && newRowCount != oldRowCount) { +needUpdate += (AnalyzeTableCommand.ROW_COUNT -> newRowCount.toString) + } +} +// Update the Hive metastore if the above parameters of the table is different than those // recorded in the Hive metastore. // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats(). -if (newTotalSize > 0 && newTotalSize != oldTotalSize) { +if (needUpdate.nonEmpty) { + // need to set this parameter so that we can store other parameters like "numRows" into + // Hive metastore + needUpdate.put(AnalyzeTableCommand.STATS_GENERATED_VIA_STATS_TASK, --- End diff -- I saw it. The value of `numRows` is `-1`, as shown in https://github.com/apache/spark/pull/14712#discussion_r75540560
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75571686 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -88,14 +90,30 @@ case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { } }.getOrElse(0L) -// Update the Hive metastore if the total size of the table is different than the size +val needUpdate = new mutable.HashMap[String, String]() +if (newTotalSize > 0 && newTotalSize != oldTotalSize) { + needUpdate += (AnalyzeTableCommand.TOTAL_SIZE_FIELD -> newTotalSize.toString) +} +if (!noscan) { + val oldRowCount = tableParameters.get(AnalyzeTableCommand.ROW_COUNT).map(_.toLong) +.getOrElse(-1L) + val newRowCount = sparkSession.table(tableName).count() + + if (newRowCount >= 0 && newRowCount != oldRowCount) { +needUpdate += (AnalyzeTableCommand.ROW_COUNT -> newRowCount.toString) + } +} +// Update the Hive metastore if the above parameters of the table is different than those // recorded in the Hive metastore. // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats(). -if (newTotalSize > 0 && newTotalSize != oldTotalSize) { +if (needUpdate.nonEmpty) { + // need to set this parameter so that we can store other parameters like "numRows" into + // Hive metastore + needUpdate.put(AnalyzeTableCommand.STATS_GENERATED_VIA_STATS_TASK, --- End diff -- The code comment could be improved a little; the comment above reads better than the one in the current code change.
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14475 **[Test build #64117 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64117/consoleFull)** for PR 14475 at commit [`6b8fc48`](https://github.com/apache/spark/commit/6b8fc487dd5324ae589d75d271da18c54110cf4a).
[GitHub] spark pull request #14721: [SPARK-17158][SQL] Change error message for out o...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14721
[GitHub] spark issue #14635: [SPARK-17052] [SQL] Remove Duplicate Test Cases auto_joi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14635 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64115/ Test PASSed.
[GitHub] spark issue #14635: [SPARK-17052] [SQL] Remove Duplicate Test Cases auto_joi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14635 Merged build finished. Test PASSed.
[GitHub] spark issue #14635: [SPARK-17052] [SQL] Remove Duplicate Test Cases auto_joi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14635 **[Test build #64115 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64115/consoleFull)** for PR 14635 at commit [`8b8725c`](https://github.com/apache/spark/commit/8b8725cb28f8f4564f4ee0b168363282df09564f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14721: [SPARK-17158][SQL] Change error message for out of range...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14721 Merging in master/2.0. Thanks!
[GitHub] spark issue #13428: [SPARK-12666][CORE] SparkSubmit packages fix for when 'd...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13428 LGTM; I'll merge when I get home tonight.
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/14475 Yeah, I have been stuck with other things, so I could not clean it up. Will try again. In the worst case, I'll close this PR and send a new one.
[GitHub] spark issue #14721: [SPARK-17158][SQL] Change error message for out of range...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14721 Merged build finished. Test PASSed.
[GitHub] spark issue #14721: [SPARK-17158][SQL] Change error message for out of range...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14721 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64112/ Test PASSed.
[GitHub] spark issue #14721: [SPARK-17158][SQL] Change error message for out of range...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14721 **[Test build #3230 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3230/consoleFull)** for PR 14721 at commit [`19582ff`](https://github.com/apache/spark/commit/19582ff633932c3ec0a6804bec9314a3390a6404). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14721: [SPARK-17158][SQL] Change error message for out of range...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14721 **[Test build #64112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64112/consoleFull)** for PR 14721 at commit [`19582ff`](https://github.com/apache/spark/commit/19582ff633932c3ec0a6804bec9314a3390a6404). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14719 **[Test build #64116 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64116/consoleFull)** for PR 14719 at commit [`91cb915`](https://github.com/apache/spark/commit/91cb915b4e6c3c4d24fab3f1e772e7e361d4c088).
[GitHub] spark issue #14635: [SPARK-17052] [SQL] Remove Duplicate Test Cases auto_joi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14635 **[Test build #64115 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64115/consoleFull)** for PR 14635 at commit [`8b8725c`](https://github.com/apache/spark/commit/8b8725cb28f8f4564f4ee0b168363282df09564f).
[GitHub] spark issue #14682: [SPARK-17104][SQL] LogicalRelation.newInstance should fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14682 **[Test build #64114 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64114/consoleFull)** for PR 14682 at commit [`e243323`](https://github.com/apache/spark/commit/e243323cb04880c20fb40e1aed8b4a28022d5540).
[GitHub] spark issue #14682: [SPARK-17104][SQL] LogicalRelation.newInstance should fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14682 retest this please.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75569715 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -168,6 +170,57 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("generate table-level statistics") { +def checkTableStats( +statsSeq: Seq[Statistics], +sizeInBytes: Int, +estimatedSize: Int, +rowCount: Int): Unit = { + assert(statsSeq.size === 1) + assert(statsSeq.head.sizeInBytes === BigInt(sizeInBytes)) + assert(statsSeq.head.estimatedSize === Some(BigInt(estimatedSize))) + assert(statsSeq.head.rowCount === Some(BigInt(rowCount))) +} + +sql("CREATE TABLE analyzeTable (key STRING, value STRING)").collect() --- End diff -- I'll modify unit tests based on your comments, thanks!
[GitHub] spark issue #14724: [SPARK-17162] Range does not support SQL generation
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14724 Can you make the options for logical plan optional?
[GitHub] spark pull request #14708: [SPARK-17149][SQL] array.sql for testing array re...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14708
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75569609 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala --- @@ -141,7 +142,16 @@ private[hive] case class MetastoreRelation( sparkSession.sessionState.conf.defaultSizeInBytes }) } - ) +val tableParams = hiveQlTable.getParameters +val rowCount = tableParams.get(AnalyzeTableCommand.ROW_COUNT) +if (rowCount != null && rowCount.toLong >=0) { --- End diff -- OK, thanks, I'll fix this.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75569605 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -108,4 +126,8 @@ case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { object AnalyzeTableCommand { val TOTAL_SIZE_FIELD = "totalSize" + // same as org.apache.hadoop.hive.common.StatsSetupConst + val ROW_COUNT = "numRows" + val STATS_GENERATED_VIA_STATS_TASK = "STATS_GENERATED_VIA_STATS_TASK" + val TRUE = "true" --- End diff -- OK, I'll remove this. I just copied it from StatsSetupConst :)
[GitHub] spark issue #14708: [SPARK-17149][SQL] array.sql for testing array related f...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14708 Merging in master/2.0.
[GitHub] spark issue #14517: [SPARK-16931][PYTHON] PySpark APIS for bucketBy and sort...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14517 **[Test build #64113 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64113/consoleFull)** for PR 14517 at commit [`dfef36b`](https://github.com/apache/spark/commit/dfef36b6fafd24369b94a492285a48e7b4aad12c). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14517: [SPARK-16931][PYTHON] PySpark APIS for bucketBy and sort...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14517 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64113/ Test FAILed.
[GitHub] spark issue #14517: [SPARK-16931][PYTHON] PySpark APIS for bucketBy and sort...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14517 Merged build finished. Test FAILed.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75569351 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -33,7 +34,7 @@ import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} * Right now, it only supports Hive tables and it only updates the size of a Hive table * in the Hive metastore. */ -case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { +case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand { --- End diff -- Recalculation incurs a high cost, so it should be triggered by users such as DBAs. We can add a mechanism to incrementally update the stats in the future, but that will need some well-designed algorithms (especially for histograms) and a definition of the confidence interval.
[GitHub] spark issue #14724: [SPARK-17162] Range does not support SQL generation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14724 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64110/ Test PASSed.
[GitHub] spark issue #14724: [SPARK-17162] Range does not support SQL generation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14724 Merged build finished. Test PASSed.
[GitHub] spark issue #14724: [SPARK-17162] Range does not support SQL generation
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14724 **[Test build #64110 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64110/consoleFull)** for PR 14724 at commit [`e0e12e3`](https://github.com/apache/spark/commit/e0e12e36de949cb2715e1aad893b3eeb0007b0f0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75569106 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -88,14 +90,30 @@ case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { } }.getOrElse(0L) -// Update the Hive metastore if the total size of the table is different than the size +val needUpdate = new mutable.HashMap[String, String]() +if (newTotalSize > 0 && newTotalSize != oldTotalSize) { + needUpdate += (AnalyzeTableCommand.TOTAL_SIZE_FIELD -> newTotalSize.toString) +} +if (!noscan) { + val oldRowCount = tableParameters.get(AnalyzeTableCommand.ROW_COUNT).map(_.toLong) +.getOrElse(-1L) + val newRowCount = sparkSession.table(tableName).count() + + if (newRowCount >= 0 && newRowCount != oldRowCount) { +needUpdate += (AnalyzeTableCommand.ROW_COUNT -> newRowCount.toString) + } +} +// Update the Hive metastore if the above parameters of the table is different than those // recorded in the Hive metastore. // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats(). -if (newTotalSize > 0 && newTotalSize != oldTotalSize) { +if (needUpdate.nonEmpty) { + // need to set this parameter so that we can store other parameters like "numRows" into + // Hive metastore + needUpdate.put(AnalyzeTableCommand.STATS_GENERATED_VIA_STATS_TASK, --- End diff -- If we don't set this parameter, "numRows" cannot be stored in the Hive metastore. We need to do this in Spark so that we can persist our statistics in the metastore.
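The update-only-when-changed flow in the diff above can be sketched as follows. This is a hedged illustration in Python, not Spark's actual Scala implementation; the function name `stats_to_update` and the plain-dict representation of metastore parameters are invented for the example. The constant names mirror those defined in `AnalyzeTableCommand`.

```python
# Hypothetical sketch of the stats-update logic: collect changed
# statistics into a dict, and only when something actually changed,
# add the marker parameter that allows extra stats (like "numRows")
# to be persisted in the metastore.
TOTAL_SIZE = "totalSize"
ROW_COUNT = "numRows"
STATS_MARKER = "STATS_GENERATED_VIA_STATS_TASK"


def stats_to_update(old_params, new_total_size, new_row_count=None):
    """Return the metastore parameters that need updating, if any.

    new_row_count is None when the command ran with NOSCAN, i.e. no
    full table scan (and hence no row count) was performed.
    """
    need_update = {}
    old_total = int(old_params.get(TOTAL_SIZE, -1))
    if new_total_size > 0 and new_total_size != old_total:
        need_update[TOTAL_SIZE] = str(new_total_size)
    if new_row_count is not None:
        old_rows = int(old_params.get(ROW_COUNT, -1))
        if new_row_count >= 0 and new_row_count != old_rows:
            need_update[ROW_COUNT] = str(new_row_count)
    if need_update:
        # Marker so stats written outside a Hive StatsTask are accepted.
        need_update[STATS_MARKER] = "true"
    return need_update
```

Note that the marker is added only when there is something to write, matching the `needUpdate.nonEmpty` guard in the diff.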
[GitHub] spark issue #14517: [SPARK-16931][PYTHON] PySpark APIS for bucketBy and sort...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14517 **[Test build #64113 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64113/consoleFull)** for PR 14517 at commit [`dfef36b`](https://github.com/apache/spark/commit/dfef36b6fafd24369b94a492285a48e7b4aad12c).
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75568863 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -99,9 +99,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder { ctx.identifier.getText.toLowerCase == "noscan") { --- End diff -- @cloud-fan noscan won't scan files; it only collects statistics like total size. Without noscan, we will collect other stats like row count and column-level stats.
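The NOSCAN branching described above can be illustrated with a small sketch. This is a hypothetical Python stand-in for the parser decision, not `SparkSqlAstBuilder` itself; the function `parse_analyze` and its return shape are invented for illustration. Only the literal identifier `noscan` (case-insensitive) is accepted, matching the quoted check.

```python
# Hypothetical sketch: an ANALYZE TABLE command may carry an optional
# trailing identifier. Only "noscan" is recognized; it decides whether
# a full scan (and thus a row count) is performed.
def parse_analyze(trailing_identifier):
    if trailing_identifier is None:
        # Full scan: collect size stats and row count.
        return {"command": "AnalyzeTable", "noscan": False}
    if trailing_identifier.lower() == "noscan":
        # Cheap path: collect only size-like stats, no scan.
        return {"command": "AnalyzeTable", "noscan": True}
    raise ValueError("Unsupported ANALYZE option: %s" % trailing_identifier)
```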
[GitHub] spark issue #14475: [SPARK-16862] Configurable buffer size in `UnsafeSorterS...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14475 Looks like the diff is messed up?
[GitHub] spark issue #14721: [SPARK-17158][SQL] Change error message for out of range...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14721 **[Test build #3230 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3230/consoleFull)** for PR 14721 at commit [`19582ff`](https://github.com/apache/spark/commit/19582ff633932c3ec0a6804bec9314a3390a6404).
[GitHub] spark issue #14721: [SPARK-17158][SQL] Change error message for out of range...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14721 **[Test build #64112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64112/consoleFull)** for PR 14721 at commit [`19582ff`](https://github.com/apache/spark/commit/19582ff633932c3ec0a6804bec9314a3390a6404).
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14426 Merged build finished. Test PASSed.
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14426 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64108/ Test PASSed.
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14426 **[Test build #64108 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64108/consoleFull)** for PR 14426 at commit [`71954e2`](https://github.com/apache/spark/commit/71954e21ba63dc019103a060f7a4ba63a69ce0c9). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode `
[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/14579#discussion_r75567101
--- Diff: python/pyspark/rdd.py ---
@@ -188,6 +188,12 @@ def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSeri
         self._id = jrdd.id()
         self.partitioner = None
+    def __enter__(self):
--- End diff --
yes, also known as the "If you don't know what to do, raise an Error" approach :p
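The context-manager idea under review here can be sketched roughly as follows. This is a minimal standalone illustration, not the actual PySpark implementation: the `Dataset` class and its `persisted` flag are stand-ins invented for the example, replacing the real RDD/DataFrame machinery.

```python
# Sketch of the persist()/cache() context-manager pattern: entering the
# `with` block persists the dataset, leaving it unpersists, even if the
# body raises. "Dataset" is a hypothetical stand-in, NOT the PySpark API.

class Dataset:
    def __init__(self, data):
        self.data = data
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self  # returning self lets persist() be used in `with`

    def unpersist(self):
        self.persisted = False
        return self

    # Context-manager protocol.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always release the cache on exit; returning False propagates
        # any exception raised inside the block.
        self.unpersist()
        return False


ds = Dataset([1, 2, 3])
with ds.persist():
    assert ds.persisted      # cached inside the block
assert not ds.persisted      # automatically unpersisted on exit
```

The design question raised in the thread is what `__enter__` should do when called on an object that was never persisted; the "raise an Error" approach would make `__enter__` fail loudly in that case rather than guess.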
[GitHub] spark issue #14637: [WIP] [SPARK-16967] move mesos to module
Github user mgummelt commented on the issue: https://github.com/apache/spark/pull/14637 mima seems to be upset about my removal of MESOS_REGEX from `SparkMasterRegex`, but I don't understand why, as it's a private class. Should I add an entry to MimaExcludes?
[GitHub] spark issue #14725: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14725 Merged build finished. Test FAILed.
[GitHub] spark issue #14725: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14725 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64111/ Test FAILed.
[GitHub] spark issue #14725: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14725 **[Test build #64111 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64111/consoleFull)** for PR 14725 at commit [`f9672bf`](https://github.com/apache/spark/commit/f9672bfe34b1b5f5ea14700d2aaaee055f5323f8). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.