GitHub user thunterdb opened a pull request: https://github.com/apache/spark/pull/17419
[SPARK-19634][ML][WIP] Multivariate summarizer - dataframes API

## What changes were proposed in this pull request?

This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of `MultivariateOnlineSummarizer`, it also allows the user to select a subset of the metrics. This should resolve some performance issues related to computing unrequested metrics. Furthermore, it uses the BLAS API to the extent possible, so the code should be efficient for the dense case.

## How was this patch tested?

This patch includes most of the tests of the RDD-based implementation. It compares results against the existing `MultivariateOnlineSummarizer` and adds more tests. This patch also includes some documentation for low-level constructs such as `TypedImperativeAggregate`.

## Performance

I have not run performance tests against the existing implementation. However, this patch uses the recommended low-level SQL APIs, so it would be interesting to compare the two implementations in that respect.

## WIP

Marked as WIP because some debugging comments are still present in the code.

Thanks to @hvanhovell and Cheng Liang for suggestions on SparkSQL.
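To illustrate the idea of computing only the requested metrics, here is a minimal, self-contained sketch in plain Python. It is not the code in this patch and does not use Spark or BLAS: the class `SelectiveSummarizer`, the helper `summarize`, and the metric names are all hypothetical. It assumes an update/merge/result life cycle similar in spirit to an imperative aggregate, and it assumes the sample (n - 1) variance; buffers for metrics that were not requested are never allocated or updated.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical metric names for this sketch; not taken from the patch.
SUPPORTED_METRICS = ("count", "mean", "variance", "min", "max")


class SelectiveSummarizer:
    """Hypothetical mergeable summarizer that tracks only requested metrics."""

    def __init__(self, metrics: Sequence[str], dim: int) -> None:
        unknown = set(metrics) - set(SUPPORTED_METRICS)
        if unknown:
            raise ValueError(f"unsupported metrics: {sorted(unknown)}")
        self.metrics = tuple(metrics)
        self.dim = dim
        self.n = 0
        # Allocate a buffer only when some requested metric needs it.
        need_mean = "mean" in metrics or "variance" in metrics
        self._mean = [0.0] * dim if need_mean else None
        self._m2 = [0.0] * dim if "variance" in metrics else None
        self._min = [math.inf] * dim if "min" in metrics else None
        self._max = [-math.inf] * dim if "max" in metrics else None

    def update(self, row: Sequence[float]) -> "SelectiveSummarizer":
        """Fold one dense row into the buffers (Welford's online update)."""
        if len(row) != self.dim:
            raise ValueError("row has the wrong dimension")
        self.n += 1
        for i, x in enumerate(row):
            if self._mean is not None:
                delta = x - self._mean[i]
                self._mean[i] += delta / self.n
                if self._m2 is not None:
                    self._m2[i] += delta * (x - self._mean[i])
            if self._min is not None:
                self._min[i] = min(self._min[i], x)
            if self._max is not None:
                self._max[i] = max(self._max[i], x)
        return self

    def merge(self, other: "SelectiveSummarizer") -> "SelectiveSummarizer":
        """Combine two partial summaries (pairwise mean/variance merge)."""
        if other.metrics != self.metrics or other.dim != self.dim:
            raise ValueError("summarizers are not compatible")
        if other.n == 0:
            return self
        n = self.n + other.n
        for i in range(self.dim):
            if self._mean is not None:
                delta = other._mean[i] - self._mean[i]
                if self._m2 is not None:
                    self._m2[i] += other._m2[i] + delta * delta * self.n * other.n / n
                self._mean[i] += delta * other.n / n
            if self._min is not None:
                self._min[i] = min(self._min[i], other._min[i])
            if self._max is not None:
                self._max[i] = max(self._max[i], other._max[i])
        self.n = n
        return self

    def result(self) -> Dict[str, object]:
        """Return only the metrics that were requested."""
        out: Dict[str, object] = {}
        for name in self.metrics:
            if name == "count":
                out[name] = self.n
            elif name == "mean":
                out[name] = list(self._mean)
            elif name == "variance":
                denom = self.n - 1  # sample variance; 0.0 when n <= 1
                out[name] = [m2 / denom if denom > 0 else 0.0 for m2 in self._m2]
            elif name == "min":
                out[name] = list(self._min)
            elif name == "max":
                out[name] = list(self._max)
        return out


def summarize(partitions: List[List[Sequence[float]]],
              metrics: Sequence[str], dim: int) -> Dict[str, object]:
    """Summarize each partition independently, then merge the partials."""
    total = SelectiveSummarizer(metrics, dim)
    for rows in partitions:
        partial = SelectiveSummarizer(metrics, dim)
        for row in rows:
            partial.update(row)
        total.merge(partial)
    return total.result()
```

For example, `summarize([[[1.0, 2.0], [3.0, 4.0]], [[5.0, 9.0]]], ["mean", "variance"], dim=2)` returns a dictionary with only the `mean` and `variance` keys; no min/max buffers are created or touched along the way.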
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thunterdb/spark 19634

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17419

----
commit f3fa6580bca70f3307d70e938ef8531c928d958b
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-03T18:36:02Z

    work

commit 7539835dad863a6b73d88d79983342f9ddb7fb9d
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-06T22:38:41Z

    work on the test suite

commit 673943f334b94e5d1ecd8874cb82bbc875d739c6
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-07T00:01:30Z

    last work

commit 202b672afec127f4e0885cf3a58f4dfc97031fc6
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-13T22:48:47Z

    work on using imperative aggregators

commit be019813f241d0ad3559b4d84339f1bb1055cbc4
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-17T21:44:40Z

    Merge remote-tracking branch 'upstream/master' into 19634

commit a983284cfeddabd017792e3991cf99a7d3ab1e16
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-18T00:14:40Z

    more work on summarizer

commit 647a4fecb17d478c3c8cd68d40f2a9456eb10c66
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T17:47:30Z

    work

commit 3c4bef772a3cbc759e43223af658a357c5ca6bc2
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T18:54:16Z

    changes

commit 56390ccc456c67b2f7a08c1271fa50408518da0f
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T18:54:19Z

    Merge remote-tracking branch 'upstream/master' into 19634

commit c3f236c4422031ae818cb6bbec2415b3f1bf7b70
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T19:03:07Z

    cleanup

commit ef955c00275705f14342f3e4ed970a78f0f3c141
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T22:42:42Z

    debugging

commit a04f923913ca1118a61d66bd53b8514af62594d7
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-21T23:14:23Z

    work

commit 946d490c8b29e55ec0e6d40785122269063894ad
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-22T21:14:29Z

    Merge remote-tracking branch 'upstream/master' into 19634

commit 201eb7712054967cd5093d3a908f4ebbd73f30a8
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-22T21:19:57Z

    debug

commit f4dec88a49d0a20e1b328617fd721633fd8c201a
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-23T18:27:19Z

    trying to debug serialization issue

commit 4af0f47d326ef91d7cf9ccaf6a45ee3f904b191f
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-23T23:16:10Z

    better tests

commit 9f29030f75089884156bdc4ee634857b3730114d
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T00:12:28Z

    changes

commit e9877dc2f08d393f079bdf6fbbf1b9b9abaa21da
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T21:04:32Z

    debugging

commit 3a11d0265ef665a63cd070eeb1ae4ac25bc89908
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T22:14:06Z

    more tests and debugging

commit 6d26c17d0bd4ab18d564ee7f37916780211702d5
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T23:12:19Z

    fixed tests

commit 35eaeb0d02ae9cc29ae559231fe4858935315477
Author: Timothy Hunter <timhun...@databricks.com>
Date:   2017-03-24T23:23:15Z

    doc

----