Github user ooq commented on a diff in the pull request:
https://github.com/apache/spark/pull/18503#discussion_r125992343
--- Diff:
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/FixedLengthRowBasedKeyValueBatch.java
---
@@ -62,7 +62,7 @@ public UnsafeRow
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/15193
@yaooqinn Do you have any benchmarks on the performance difference? I think
`pointTo()` is pretty cheap. And does the patch pass the tests? I think valueRow
is not updated correctly in your patch.
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/15193
Thanks for the patch @yaooqinn . As per the comments for `getValueRow`: because
`getValueRow(id)` is always called after `getKeyRow(id)` with the same id, we
use `getValueFromKey(id)` to retrieve the value row
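To illustrate the contract discussed above, here is a minimal, hypothetical sketch (not Spark's actual `FixedLengthRowBasedKeyValueBatch`; the class and method names below are invented for illustration). With fixed-length records laid out as `[key bytes | value bytes]`, the value's offset follows directly from the key's, so retrieving the value after `getKeyRow(id)` needs only an addition rather than a second lookup or `pointTo()`:

```java
// Hypothetical sketch: fixed-length key/value records stored back-to-back.
// keyOffset(id) plays the role of getKeyRow(id) locating the key; the value
// offset is derived from it, mirroring the getValueFromKey(id) idea.
public class FixedLengthBatchSketch {
    private final int keyLen;
    private final int valueLen;
    private final byte[] data;
    private int numRows = 0;

    public FixedLengthBatchSketch(int keyLen, int valueLen, int capacity) {
        this.keyLen = keyLen;
        this.valueLen = valueLen;
        this.data = new byte[(keyLen + valueLen) * capacity];
    }

    // Append one record; returns its id.
    public int append(byte[] key, byte[] value) {
        int id = numRows++;
        int base = id * (keyLen + valueLen);
        System.arraycopy(key, 0, data, base, keyLen);
        System.arraycopy(value, 0, data, base + keyLen, valueLen);
        return id;
    }

    // Analogous to getKeyRow(id): locate the key of record id.
    public int keyOffset(int id) {
        return id * (keyLen + valueLen);
    }

    // Analogous to getValueFromKey(id): the value sits right after the key.
    public int valueOffsetFromKey(int id) {
        return keyOffset(id) + keyLen;
    }
}
```

This is why the value accessor is cheap only when it is called right after the key accessor for the same id: the value position is defined relative to the key's.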
Github user ooq commented on a diff in the pull request:
https://github.com/apache/spark/pull/14349#discussion_r99481369
--- Diff:
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/FixedLengthRowBasedKeyValueBatch.java
---
@@ -0,0 +1,174
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/15016
LGTM. This is good. Thanks.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14176
Thanks. I will take a look tonight.
> On Sep 5, 2016, at 4:46 PM, Josh Rosen <notificati...@github.com> wrote:
>
> @ooq @sameeragarwal, it looks like this patch is the
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14176
Thanks for the comments @clockfly ! As per discussion with @sameeragarwal,
I think the plan is to give users the option to turn the two-level hashmap
on/off. This is why we have this first-level logic
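As an illustration of the two-level idea discussed above (a hypothetical sketch, not Spark's generated aggregate code; class and method names are invented), a small fixed-capacity first-level map can absorb the common case cheaply, with overflow keys falling through to a general-purpose second level:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a two-level aggregation map: the first level has a
// small fixed capacity; once it is full, new keys go to the second level.
public class TwoLevelCounter {
    private static final int FIRST_LEVEL_CAPACITY = 2;
    private final Map<String, Long> firstLevel = new HashMap<>();
    private final Map<String, Long> secondLevel = new HashMap<>();

    public void add(String key) {
        if (firstLevel.containsKey(key) || firstLevel.size() < FIRST_LEVEL_CAPACITY) {
            firstLevel.merge(key, 1L, Long::sum);   // fast path
        } else {
            secondLevel.merge(key, 1L, Long::sum);  // fallback path
        }
    }

    // A lookup consults both levels; a key lives in exactly one of them.
    public long count(String key) {
        return firstLevel.getOrDefault(key, 0L) + secondLevel.getOrDefault(key, 0L);
    }
}
```

The trade-off the comment alludes to follows from this shape: with many distinct keys the first level fills immediately and becomes pure overhead, which is why exposing a switch to users can make sense.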
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14176
@davies I guess there is still benefit to making it public? If the user knows
that their workload would always run faster with single-level, e.g., with many
distinct keys. I thought about
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14176
Thanks for the comments @davies @sameeragarwal . This PR has been updated.
Basically the only public boolean flag now is called
`spark.sql.codegen.aggregate.map.twolevel.enable`. There is a separate
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14513
@davies Updated.
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14176
@davies @sameeragarwal I updated more results in the benchmark PR
https://github.com/apache/spark/pull/14266 .
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14266
@davies Added some test results with larger number of distinct keys.
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14513
@davies Makes sense. I'll update the PR.
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14513
@davies It should be fine, right? Now we don't have recursion, and we don't
need to force `dictionaryIds` instances outside of the `ColumnVector` class to
use that method.
GitHub user ooq opened a pull request:
https://github.com/apache/spark/pull/14513
[SPARK-16928][SQL] Recursive call of ColumnVector::getInt() breaks JIT
inlining
## What changes were proposed in this pull request?
In both `OnHeapColumnVector` and `OffHeapColumnVector`, we
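A hedged sketch of the general shape of such an issue (simplified and hypothetical; this is not Spark's `ColumnVector` code): if an accessor like `getInt()` resolves dictionary-encoded values by calling `getInt()` on another vector of the same class, the JIT sees a recursive call site and may refuse to inline it. Reading the backing storage directly keeps the hot accessor a non-recursive leaf that inlines cleanly:

```java
// Hypothetical sketch: an int column that may be dictionary-encoded.
// Before the fix described above, the dictionary branch would call back
// into a getInt()-style method (a recursive call site); here the backing
// arrays are read directly instead.
public class IntVectorSketch {
    private final int[] values;      // decoded values, or dictionary ids
    private final int[] dictionary;  // null when not dictionary-encoded

    public IntVectorSketch(int[] values, int[] dictionary) {
        this.values = values;
        this.dictionary = dictionary;
    }

    public int getInt(int rowId) {
        if (dictionary == null) {
            return values[rowId];
        }
        // Direct array read instead of a recursive getInt() call.
        return dictionary[values[rowId]];
    }
}
```

The observable behavior is identical either way; the difference the PR title points at is purely in whether the JIT can inline the accessor on the hot path.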
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14513
cc @davies
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14176
Added the explicit SQL tests for both hash map implementations. The test
suites extend `DataFrameAggregateSuite` and reuse all tests there. For the two
bugs that failed previous builds: the length bug
Github user ooq commented on a diff in the pull request:
https://github.com/apache/spark/pull/14176#discussion_r73050331
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
---
@@ -279,9 +280,15 @@ case class HashAggregateExec
Github user ooq closed the pull request at:
https://github.com/apache/spark/pull/13960
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/13960
Closing this because the patch degrades query performance when there are no
nulls.
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14370
Thanks for the PR @lw-lin . I think this check would be useful and ideally
`WARN_IF_EXCEEDS_JIT_LIMIT` should be on by default if the overhead is minimal.
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14349
This PR is a cleaner version of 14174:
https://github.com/apache/spark/pull/14174.
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14174
Closing this now. Opened another PR for a cleaner commit history:
https://github.com/apache/spark/pull/14349
Github user ooq closed the pull request at:
https://github.com/apache/spark/pull/14174
GitHub user ooq opened a pull request:
https://github.com/apache/spark/pull/14349
[SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator
## What changes were proposed in this pull request?
This PR is the first step for the following feature:
For hash
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14337
cc @rxin @sameeragarwal
GitHub user ooq opened a pull request:
https://github.com/apache/spark/pull/14337
[SPARK-16699][SQL] Fix performance bug in hash aggregate on long string keys
## What changes were proposed in this pull request?
In the following code in `VectorizedHashMapGenerator.scala
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14174
Hi @viirya, you can find the benchmark numbers in this PR:
https://github.com/apache/spark/pull/14266
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14174
Thanks for the comments @sameeragarwal . I addressed all of them and also
refactored both generators to extend an abstract class `HashMapGenerator`.
Github user ooq commented on a diff in the pull request:
https://github.com/apache/spark/pull/14174#discussion_r71402778
--- Diff:
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatch.java
---
@@ -0,0 +1,322 @@
+/*
+ * Licensed
GitHub user ooq opened a pull request:
https://github.com/apache/spark/pull/14266
[SPARK-16526] [SQL] Benchmarking Performance for Fast HashMap
Implementations and Set Knobs
## What changes were proposed in this pull request?
The 3rd PR in its series to resolve SPARK-16523
Github user ooq commented on a diff in the pull request:
https://github.com/apache/spark/pull/14174#discussion_r71253773
--- Diff:
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatch.java
---
@@ -0,0 +1,322 @@
+/*
+ * Licensed
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14174
hey @sameeragarwal, the diff of the generated code can be found here (with
the left side for vectorized hashmap and right side for the new, row-based
one):
https://gist.github.com/ooq
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14174
cc @sameeragarwal @davies @rxin
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/14176
cc @sameeragarwal @davies @rxin
GitHub user ooq opened a pull request:
https://github.com/apache/spark/pull/14176
[SPARK-16525][SQL] Enable Row Based HashMap in HashAggregateExec
## What changes were proposed in this pull request?
This PR is the second step for the following feature:
For hash
GitHub user ooq opened a pull request:
https://github.com/apache/spark/pull/14174
[SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator
## What changes were proposed in this pull request?
This PR is the first step for the following feature:
For hash
Github user ooq commented on the issue:
https://github.com/apache/spark/pull/13960
JIRA will be added after server is up. cc @davies @sameeragarwal @rxin
GitHub user ooq opened a pull request:
https://github.com/apache/spark/pull/13960
[SQL] Support null handling for vectorized hashmap during hash aggregate
## What changes were proposed in this pull request?
The current impl of vectorized hashmap does not support null keys
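One common way to add null-key support to a map whose layout cannot hash a null is to reserve a dedicated slot for the null key outside the hashed storage. The sketch below is purely illustrative (hypothetical names; it is not Spark's vectorized hashmap, and it uses a plain `HashMap` as a stand-in for the hashed storage):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: null keys bypass the hashed storage entirely and
// accumulate in a dedicated slot, so the hash path never sees a null.
public class NullFriendlyCounter {
    private final Map<String, Long> map = new HashMap<>();
    private long nullCount = 0;
    private boolean hasNull = false;

    public void add(String key) {
        if (key == null) {
            hasNull = true;
            nullCount += 1;   // dedicated null slot
        } else {
            map.merge(key, 1L, Long::sum);
        }
    }

    public long count(String key) {
        if (key == null) {
            return hasNull ? nullCount : 0L;
        }
        return map.getOrDefault(key, 0L);
    }
}
```

Under this design, lookups and probes over the hashed storage stay branch-free with respect to nulls; only the entry points check for the null key.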