[GitHub] spark pull request #18503: [SPARK-21271][SQL] Ensure Unsafe.sizeInBytes is a...

2017-07-06 Thread ooq
Github user ooq commented on a diff in the pull request: https://github.com/apache/spark/pull/18503#discussion_r125992343 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/FixedLengthRowBasedKeyValueBatch.java --- @@ -62,7 +62,7 @@ public UnsafeRow

[GitHub] spark issue #15193: [SQL]RowBasedKeyValueBatch reuse valueRow too

2017-02-13 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/15193 @yaooqinn Do you have any benchmarks on the performance difference? I think pointTo() is pretty cheap. And does the patch pass the tests? I think valueRow is not updated correctly in your patch

[GitHub] spark issue #15193: [SQL]RowBasedKeyValueBatch reuse valueRow too

2017-02-08 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/15193 Thanks for the patch @yaooqinn . As per comments for `getValueRow`, because `getValueRow(id)` is always called after `getKeyRow(id)` with the same id, we use `getValueFromKey(id) to retrieve value row

[GitHub] spark pull request #14349: [SPARK-16524][SQL] Add RowBatch and RowBasedHashM...

2017-02-04 Thread ooq
Github user ooq commented on a diff in the pull request: https://github.com/apache/spark/pull/14349#discussion_r99481369 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/FixedLengthRowBasedKeyValueBatch.java --- @@ -0,0 +1,174

[GitHub] spark issue #15016: [SPARK-17405] RowBasedKeyValueBatch should use default p...

2016-09-08 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/15016 LGTM. This is good. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes

[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

2016-09-05 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14176 Thanks. I will take a look tonight. > On Sep 5, 2016, at 4:46 PM, Josh Rosen <notificati...@github.com> wrote: > > @ooq @sameeragarwal, it looks like this patch is the

[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

2016-08-29 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14176 Thanks for the comments @clockfly ! As per discussion with @sameeragarwal, I think the plan is to give users the option to turn on/off two-level hashmap. This is why we have this first level logic

[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

2016-08-25 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14176 @davies I guess there is still benefit to make it public? If the user knows that their workload would always run faster with single-level, e.g., many distinct keys. I thought about

[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

2016-08-24 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14176 Thanks for the comments @davies @sameeragarwal . This PR has been updated. Basically the only public boolean flag now is called `spark.sql.codegen.aggregate.map.twolevel.enable`. There is a separate

[GitHub] spark issue #14513: [SPARK-16928][SQL] Recursive call of ColumnVector::getIn...

2016-08-05 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14513 @davies Updated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

2016-08-05 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14176 @davies @sameeragarwal I updated more results in the benchmark PR https://github.com/apache/spark/pull/14266 . --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #14266: [SPARK-16526][SQL] Benchmarking Performance for Fast Has...

2016-08-05 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14266 @davies Added some test results with larger number of distinct keys. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #14513: [SPARK-16928][SQL] Recursive call of ColumnVector::getIn...

2016-08-05 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14513 @davies Makes sense. I'll update the PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #14513: [SPARK-16928][SQL] Recursive call of ColumnVector::getIn...

2016-08-05 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14513 @davies It should be fine right? Now we don't have recursions. We don't need to force `dictionaryIds` instance outside of `ColumnVector` class to use that method. --- If your project is set up

[GitHub] spark pull request #14513: [SPARK-16928][SQL] Recursive call of ColumnVector...

2016-08-05 Thread ooq
GitHub user ooq opened a pull request: https://github.com/apache/spark/pull/14513 [SPARK-16928][SQL] Recursive call of ColumnVector::getInt() breaks JIT inlining ## What changes were proposed in this pull request? In both `OnHeapColumnVector` and `OffHeapColumnVector`, we

[GitHub] spark issue #14513: [SPARK-16928][SQL] Recursive call of ColumnVector::getIn...

2016-08-05 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14513 cc @davies --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

2016-08-03 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14176 Added the explicit SQL tests for both hash map implementations. The test suites extend `DataFrameAggregateSuite` and reuse all tests there. For the two bugs that failed previous builds: the length bug

[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

2016-08-01 Thread ooq
Github user ooq commented on a diff in the pull request: https://github.com/apache/spark/pull/14176#discussion_r73050331 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -279,9 +280,15 @@ case class HashAggregateExec

[GitHub] spark pull request #13960: [SPARK-16269][SQL] Support null handling for vect...

2016-07-27 Thread ooq
Github user ooq closed the pull request at: https://github.com/apache/spark/pull/13960 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark issue #13960: [SPARK-16269][SQL] Support null handling for vectorized ...

2016-07-27 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/13960 closing this because the patch degrade query performance when there are no nulls --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #14370: [SPARK-16713][SQL] Check codegen method size ≤ 8K on c...

2016-07-26 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14370 Thanks for the PR @lw-lin . I think this check would be useful and ideally `WARN_IF_EXCEEDS_JIT_LIMIT` should be on by default if the overhead is minimal. --- If your project is set up for it, you can

[GitHub] spark issue #14349: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-25 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14349 This PR is a cleaner version for 14174: https://github.com/apache/spark/pull/14174. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-25 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14174 Closing this now. Open another PR for cleaner commit history: https://github.com/apache/spark/pull/14349 --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashM...

2016-07-25 Thread ooq
Github user ooq closed the pull request at: https://github.com/apache/spark/pull/14174 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature

[GitHub] spark pull request #14349: [SPARK-16524][SQL] Add RowBatch and RowBasedHashM...

2016-07-25 Thread ooq
GitHub user ooq opened a pull request: https://github.com/apache/spark/pull/14349 [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator ## What changes were proposed in this pull request? This PR is the first step for the following feature: For hash

[GitHub] spark issue #14337: [SPARK-16699][SQL] Fix performance bug in hash aggregate...

2016-07-24 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14337 cc @rxin @sameeragarwal --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark pull request #14337: [SPARK-16699][SQL] Fix performance bug in hash ag...

2016-07-24 Thread ooq
GitHub user ooq opened a pull request: https://github.com/apache/spark/pull/14337 [SPARK-16699][SQL] Fix performance bug in hash aggregate on long string keys ## What changes were proposed in this pull request? In the following code in `VectorizedHashMapGenerator.scala

[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-19 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14174 hi @viirya , you can find the benchmark numbers here in this PR: https://github.com/apache/spark/pull/14266 --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-19 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14174 Thanks for the comments @sameeragarwal . I addressed all of them and also refactored both generators to extend an abstract class `HashMapGenerator`. --- If your project is set up for it, you can reply

[GitHub] spark pull request #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashM...

2016-07-19 Thread ooq
Github user ooq commented on a diff in the pull request: https://github.com/apache/spark/pull/14174#discussion_r71402778 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatch.java --- @@ -0,0 +1,322 @@ +/* + * Licensed

[GitHub] spark pull request #14266: [SPARK-16526] [SQL] Benchmarking Performance for ...

2016-07-19 Thread ooq
GitHub user ooq opened a pull request: https://github.com/apache/spark/pull/14266 [SPARK-16526] [SQL] Benchmarking Performance for Fast HashMap Implementations and Set Knobs ## What changes were proposed in this pull request? The 3rd PR in its series to resolve SPARK-16523

[GitHub] spark pull request #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashM...

2016-07-18 Thread ooq
Github user ooq commented on a diff in the pull request: https://github.com/apache/spark/pull/14174#discussion_r71253773 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatch.java --- @@ -0,0 +1,322 @@ +/* + * Licensed

[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-17 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14174 hey @sameeragarwal, the diff of the generated code can be found here (with the left side for vectorized hashmap and right side for the new, row-based one): https://gist.github.com/ooq

[GitHub] spark issue #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGener...

2016-07-13 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14174 cc @sameeragarwal @davies @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #14176: [SPARK-16525][SQL] Enable Row Based HashMap in HashAggre...

2016-07-13 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/14176 cc @sameeragarwal @davies @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark pull request #14176: [SPARK-16525][SQL] Enable Row Based HashMap in Ha...

2016-07-13 Thread ooq
GitHub user ooq opened a pull request: https://github.com/apache/spark/pull/14176 [SPARK-16525][SQL] Enable Row Based HashMap in HashAggregateExec ## What changes were proposed in this pull request? This PR is the second step for the following feature: For hash

[GitHub] spark pull request #14174: [SPARK-16524][SQL] Add RowBatch and RowBasedHashM...

2016-07-13 Thread ooq
GitHub user ooq opened a pull request: https://github.com/apache/spark/pull/14174 [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator ## What changes were proposed in this pull request? This PR is the first step for the following feature: For hash

[GitHub] spark issue #13960: [SQL] Support null handling for vectorized hashmap durin...

2016-06-28 Thread ooq
Github user ooq commented on the issue: https://github.com/apache/spark/pull/13960 JIRA will be added after server is up. cc @davies @sameeragarwal @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark pull request #13960: [SQL] Support null handling for vectorized hashma...

2016-06-28 Thread ooq
GitHub user ooq opened a pull request: https://github.com/apache/spark/pull/13960 [SQL] Support null handling for vectorized hashmap during hash aggregate ## What changes were proposed in this pull request? The current impl of vectorized hashmap does not support null keys