[jira] [Commented] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2018-07-19 Thread Qifan Pu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549504#comment-16549504
 ] 

Qifan Pu commented on SPARK-24838:
--

Thanks for the PR, [~maurits]! Should we also fix this for Aggregate in the same change?

> Support uncorrelated IN/EXISTS subqueries for more operators 
> -
>
> Key: SPARK-24838
> URL: https://issues.apache.org/jira/browse/SPARK-24838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Qifan Pu
>Priority: Major
>  Labels: spree
>
> Currently, CheckAnalysis allows IN/EXISTS subqueries only in Filter operators. 
> Running the query:
> {{select name in (select * from valid_names)}}
> {{from all_names}}
> returns the error:
> {code:java}
> Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries 
> can only be used in a Filter
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2018-07-17 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-24838:


 Summary: Support uncorrelated IN/EXISTS subqueries for more 
operators 
 Key: SPARK-24838
 URL: https://issues.apache.org/jira/browse/SPARK-24838
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Qifan Pu


Currently, CheckAnalysis allows IN/EXISTS subqueries only in Filter operators. 
Running the query:

{{select name in (select * from valid_names)}}
{{from all_names}}

returns the error:
{code:java}
Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries can 
only be used in a Filter
{code}
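
As a minimal sketch of the failing shape and a join-based workaround (the sample data and session setup are illustrative, not from the original report; note the join flag treats a non-match as plain false, whereas SQL IN uses three-valued logic for NULLs):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("in-subquery").getOrCreate()
import spark.implicits._

Seq("alice", "bob", "carol").toDF("name").createOrReplaceTempView("all_names")
Seq("alice", "carol").toDF("name").createOrReplaceTempView("valid_names")

// Fails analysis in 2.3.1: the IN subquery sits in a Project, not a Filter.
// spark.sql("SELECT name IN (SELECT name FROM valid_names) FROM all_names").show()

// Workaround: compute the membership flag with an outer join instead.
spark.sql("""
  SELECT a.name, v.name IS NOT NULL AS is_valid
  FROM all_names a
  LEFT OUTER JOIN (SELECT DISTINCT name FROM valid_names) v
    ON a.name = v.name
""").show()
{code}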





[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-07 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472037#comment-15472037
 ] 

Qifan Pu commented on SPARK-17405:
--

[~joshrosen] Yes, running local[32] will reproduce the exception. 

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it consistently OOMs:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
> at 
> org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:783)
> {code}

[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-07 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472009#comment-15472009
 ] 

Qifan Pu commented on SPARK-17405:
--

One quick fix is to set the memory capacity in the configuration to ensure 
memory_capacity > x * cores (where x is some value larger than 64MB).
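
Concretely, something like the following avoids the OOM in local mode (a sketch; the 8g heap is an assumption sized so execution memory comfortably exceeds 64MB per core):

{code:scala}
// Sketch only: spark.driver.memory must take effect before the driver JVM
// starts, so in practice pass it on the command line, e.g.
//   spark-shell --master local[32] --driver-memory 8g
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[32]")                  // 32 task threads => 32 * 64MB of upfront pages
  .config("spark.driver.memory", "8g")  // only honored if set before JVM launch
  .getOrCreate()
{code}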


[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-07 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471747#comment-15471747
 ] 

Qifan Pu commented on SPARK-17405:
--

[~joshrosen]
Yes, likely. The new hashmap asks for 64MB per task up front, and the default 
single-node setting has only a few hundred MB of memory in total.
We decided on 64MB because the design uses a single memory page, for simplicity 
and performance, on the assumption that in production 64MB * cores << 
memory_capacity holds.
Maybe we should increase the default memory a bit? Or is such an upfront cost 
of 64MB bad in general?
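
To make the arithmetic concrete, a back-of-the-envelope sketch (the heap size and memory fractions are assumptions based on Spark 2.x defaults, not measurements from this report):

{code:scala}
// All numbers are illustrative defaults, not measured values.
val pagePerTaskMB = 64                                  // upfront hashmap page per task
val cores         = 32                                  // e.g. --master local[32]
val heapMB        = 1024                                // a small local-mode heap
val reservedMB    = 300                                 // unified memory manager reserve
val executionMB   = ((heapMB - reservedMB) * 0.6).toInt // spark.memory.fraction = 0.6
println(s"needed = ${pagePerTaskMB * cores} MB, available ~ $executionMB MB")
// needed = 2048 MB, available ~ 434 MB: concurrent tasks cannot all get a page.
{code}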


[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-07 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471640#comment-15471640
 ] 

Qifan Pu commented on SPARK-17405:
--

[~joshrosen] [~jlaskowski] Thanks for the comments and suggestions. I ran both 
of your queries on 03d77af9ec4ce9a42affd6ab4381ae5bd3c79a5a and both finished 
without any exceptions. 
I'll do some static code analysis based on the log from [~jlaskowski].


[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-06 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15466676#comment-15466676
 ] 

Qifan Pu commented on SPARK-17405:
--

[~joshrosen] Thanks for reporting. I haven't been able to reproduce this 
because of a Catalyst bug I'm currently hitting: `Error: 
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
foldable on unresolved object, tree: 'TIMESTAMP(2012-10-19 00:00:00.0) 
(state=,code=0)`
I will look into this more.
How much memory is configured for this specific test? One thing to note is that 
we added memory management through MemoryConsumer for the generated hashmap, so 
that part of memory usage is now correctly accounted for, which makes an OOM 
more likely to be thrown.
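
As a toy model of that accounting idea (this is not Spark's MemoryConsumer API, only the concept): every allocation must first be granted by a shared pool, so exhaustion surfaces as an explicit OOM rather than silent overuse:

{code:scala}
// Toy model of tracked allocation; NOT Spark's actual MemoryConsumer API.
class MemoryPool(private var freeBytes: Long) {
  def acquire(bytes: Long): Boolean = synchronized {
    if (bytes <= freeBytes) { freeBytes -= bytes; true } else false
  }
}

class TrackedConsumer(pool: MemoryPool) {
  def allocatePage(bytes: Long): Array[Byte] =
    if (pool.acquire(bytes)) new Array[Byte](bytes.toInt)
    else throw new OutOfMemoryError(s"Unable to acquire $bytes bytes of memory, got 0")
}
{code}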


[jira] [Commented] (SPARK-17053) DROP statement should not require IF EXISTS

2016-08-14 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420443#comment-15420443
 ] 

Qifan Pu commented on SPARK-17053:
--

[~dongjoon] Sorry, it was an accidental click.

> DROP statement should not require IF EXISTS
> ---
>
> Key: SPARK-17053
> URL: https://issues.apache.org/jira/browse/SPARK-17053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gokhan Civan
>
> In version 1.6.1, the following does not throw an exception:
> create table a as select 1; drop table a; drop table a;
> In version 2.0.0, the second drop fails; this is not compatible with Hive.
> The same problem exists for views.





[jira] [Resolved] (SPARK-17053) DROP statement should not require IF EXISTS

2016-08-14 Thread Qifan Pu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qifan Pu resolved SPARK-17053.
--
Resolution: Won't Fix

> DROP statement should not require IF EXISTS
> ---
>
> Key: SPARK-17053
> URL: https://issues.apache.org/jira/browse/SPARK-17053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gokhan Civan
>
> In version 1.6.1, the following does not throw an exception:
> create table a as select 1; drop table a; drop table a;
> In version 2.0.0, the second drop fails; this is not compatible with Hive.
> The same problem exists for views.
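
Given the Won't Fix resolution, the portable workaround is to guard the statement (a minimal sketch, assuming a SparkSession `spark` is in scope):

{code:scala}
spark.sql("CREATE TABLE a AS SELECT 1")
spark.sql("DROP TABLE a")
spark.sql("DROP TABLE IF EXISTS a")   // no-op; a plain DROP TABLE a would throw here
spark.sql("DROP VIEW IF EXISTS v")    // the same guard works for views
{code}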





[jira] [Created] (SPARK-16928) Recursive call of ColumnVector::getInt() breaks JIT inlining

2016-08-05 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-16928:


 Summary: Recursive call of ColumnVector::getInt() breaks JIT 
inlining
 Key: SPARK-16928
 URL: https://issues.apache.org/jira/browse/SPARK-16928
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Qifan Pu


In both OnHeapColumnVector and OffHeapColumnVector, we implemented getInt() 
with the following code pattern: 

{code:java}
  public int getInt(int rowId) {
    if (dictionary == null) {
      return intData[rowId];
    } else {
      return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
    }
  }
{code}

As dictionaryIds is also a ColumnVector, this results in a recursive call of 
getInt() and breaks JIT inlining. As a result, getInt() will not get inlined.
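
One way to restore inlining is to read the dictionary ids through a dedicated accessor that is itself non-recursive, so getInt() never calls getInt() on another vector. A Scala stand-in for the Java classes (the getDictId name and structure are a sketch, not necessarily the committed fix):

{code:scala}
// Scala stand-in for the Java ColumnVector classes; names and structure assumed.
trait Dictionary { def decodeToInt(id: Int): Int }

abstract class ColumnVector {
  def getDictId(rowId: Int): Int   // plain array read; never calls getInt()
  def getInt(rowId: Int): Int
}

class OnHeapColumnVector(intData: Array[Int],
                         dictionary: Dictionary,
                         dictionaryIds: ColumnVector) extends ColumnVector {
  override def getDictId(rowId: Int): Int = intData(rowId)

  // The dictionary branch goes through getDictId, so there is no
  // getInt() -> getInt() cycle to defeat the JIT's inlining heuristics.
  override def getInt(rowId: Int): Int =
    if (dictionary == null) intData(rowId)
    else dictionary.decodeToInt(dictionaryIds.getDictId(rowId))
}
{code}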






[jira] [Updated] (SPARK-16713) Limit codegen method size to 8KB

2016-07-25 Thread Qifan Pu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qifan Pu updated SPARK-16713:
-
Description: Ideally, codegen methods should be less than 8KB in bytecode size; 
beyond 8KB the JIT won't compile them, which can cause performance degradation. 
We have seen this for queries with a wide schema (30+ fields), where 
agg_doAggregateWithKeys() can exceed 8KB. This is also a major reason for 
performance regression when we enable the fast aggregate hashmap (such as using 
VectorizedHashMapGenerator.scala).  (was: Ideally, codegen methods should be 
less than 8KB in bytecode size; beyond 8KB the JIT won't compile them, which 
can cause performance degradation. We have seen this for queries with a wide 
schema (30+ fields). This is also a major reason for performance regression 
when we enable the fast aggregate hashmap (such as using 
VectorizedHashMapGenerator.scala).)
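
To check whether a generated method such as agg_doAggregateWithKeys() is oversized, the generated source can be dumped from a spark-shell session (a sketch assuming `spark` is in scope; the HotSpot flag mentioned is a standard JVM option whose default cutoff is 8000 bytecodes):

{code:scala}
import org.apache.spark.sql.execution.debug._

val df = spark.range(1000).groupBy("id").count()
df.debugCodegen()   // prints the generated Java source per whole-stage subtree

// To experiment with the JIT limit itself, relaunch the JVM with
//   -XX:-DontCompileHugeMethods
// which lifts the 8000-bytecode compilation cutoff.
{code}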

> Limit codegen method size to 8KB
> 
>
> Key: SPARK-16713
> URL: https://issues.apache.org/jira/browse/SPARK-16713
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>
> Ideally, codegen methods should be less than 8KB in bytecode size; beyond 8KB 
> the JIT won't compile them, which can cause performance degradation. We have 
> seen this for queries with a wide schema (30+ fields), where 
> agg_doAggregateWithKeys() can exceed 8KB. This is also a major reason for 
> performance regression when we enable the fast aggregate hashmap (such as 
> using VectorizedHashMapGenerator.scala).





[jira] [Created] (SPARK-16713) Limit codegen method size to 8KB

2016-07-25 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-16713:


 Summary: Limit codegen method size to 8KB
 Key: SPARK-16713
 URL: https://issues.apache.org/jira/browse/SPARK-16713
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.0.0
Reporter: Qifan Pu


Ideally, codegen methods should be less than 8KB in bytecode size; beyond 8KB 
the JIT won't compile them, which can cause performance degradation. We have 
seen this for queries with a wide schema (30+ fields). This is also a major 
reason for performance regression when we enable the fast aggregate hashmap 
(such as using VectorizedHashMapGenerator.scala).





[jira] [Commented] (SPARK-15258) Nested/Chained case statements generate codegen over 64k exception

2016-07-25 Thread Qifan Pu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392521#comment-15392521
 ] 

Qifan Pu commented on SPARK-15258:
--

Just want to leave a note that we might want to limit method size to < 8KB. 
Beyond 8KB the JIT won't compile a method, which can cause performance 
degradation. This can happen for queries with a wide schema.

> Nested/Chained case statements generate codegen over 64k exception
> --
>
> Key: SPARK-15258
> URL: https://issues.apache.org/jira/browse/SPARK-15258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jonathan Gray
> Attachments: NestedCases.scala
>
>
> Nested/chained case-when expressions generate code that goes beyond the 64KB 
> method-size limit, causing an exception.
> An attached test demonstrates this behaviour.
> I'd like to try and fix this but don't really know the best place to start.  
> Ideally, I'd like to avoid the codegen fallback as with large volumes this 
> hurts performance.
> This is similar(ish) to SPARK-13242 but I'd like to see if there are any 
> alternatives to the codegen fallback approach.





[jira] [Updated] (SPARK-16699) Fix performance bug in hash aggregate on long string keys

2016-07-24 Thread Qifan Pu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qifan Pu updated SPARK-16699:
-
Description: 
In the following code in `VectorizedHashMapGenerator.scala`:
```
def hashBytes(b: String): String = {
  val hash = ctx.freshName("hash")
  s"""
 |int $result = 0;
 |for (int i = 0; i < $b.length; i++) {
 |  ${genComputeHash(ctx, s"$b[i]", ByteType, hash)}
 |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + 
($result >>> 2);
 |}
   """.stripMargin
}

```
when b = input.getBytes(), the current 2.0 code results in getBytes() being 
called n times, n being the length of the input. getBytes() involves a memory 
copy and is thus expensive, causing a performance degradation.
The fix is to evaluate getBytes() once, before the for loop.

  was:
In the following code in `VectorizedHashMapGenerator.scala`:
```
def hashBytes(b: String): String = {
  val hash = ctx.freshName("hash")
  val bytes = ctx.freshName("bytes")
  s"""
 |int $result = 0;
 |byte[] $bytes = $b;
 |for (int i = 0; i < $bytes.length; i++) {
 |  ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)}
 |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + 
($result >>> 2);
 |}
   """.stripMargin
}

```
when b = input.getBytes(), the current 2.0 code results in getBytes() being 
called n times, n being the length of the input. getBytes() involves a memory 
copy and is thus expensive, causing a performance degradation.
The fix is to evaluate getBytes() once, before the for loop.
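
To see why the unhoisted form is quadratic, a stand-alone sketch (plain JVM strings standing in for Spark's UTF8String; the hash mixing is copied from the snippet above):

{code:scala}
// slowHash re-evaluates getBytes() every iteration: an O(n) copy per step, O(n^2) total.
def slowHash(s: String): Int = {
  var result = 0
  var i = 0
  while (i < s.getBytes("UTF-8").length) {   // copies the bytes each iteration
    result = (result ^ 0x9e3779b9) + s.getBytes("UTF-8")(i) + (result << 6) + (result >>> 2)
    i += 1
  }
  result
}

// fastHash hoists the copy out of the loop: one O(n) copy, O(n) total.
def fastHash(s: String): Int = {
  val bytes = s.getBytes("UTF-8")
  var result = 0
  for (b <- bytes) result = (result ^ 0x9e3779b9) + b + (result << 6) + (result >>> 2)
  result
}
{code}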


> Fix performance bug in hash aggregate on long string keys
> -
>
> Key: SPARK-16699
> URL: https://issues.apache.org/jira/browse/SPARK-16699
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
> Fix For: 2.0.0
>
>
> In the following code in `VectorizedHashMapGenerator.scala`:
> ```
> def hashBytes(b: String): String = {
>   val hash = ctx.freshName("hash")
>   s"""
>  |int $result = 0;
>  |for (int i = 0; i < $b.length; i++) {
>  |  ${genComputeHash(ctx, s"$b[i]", ByteType, hash)}
>  |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + 
> ($result >>> 2);
>  |}
>""".stripMargin
> }
> ```
> when b = input.getBytes(), the current 2.0 code results in getBytes() being 
> called n times, n being the length of the input. getBytes() involves a memory 
> copy and is thus expensive, causing a performance degradation.
> The fix is to evaluate getBytes() once, before the for loop.





[jira] [Created] (SPARK-16699) Fix performance bug in hash aggregate on long string keys

2016-07-24 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-16699:


 Summary: Fix performance bug in hash aggregate on long string keys
 Key: SPARK-16699
 URL: https://issues.apache.org/jira/browse/SPARK-16699
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Qifan Pu
 Fix For: 2.0.0


In the following code in `VectorizedHashMapGenerator.scala`:
```
def hashBytes(b: String): String = {
  val hash = ctx.freshName("hash")
  val bytes = ctx.freshName("bytes")
  s"""
 |int $result = 0;
 |byte[] $bytes = $b;
 |for (int i = 0; i < $bytes.length; i++) {
 |  ${genComputeHash(ctx, s"$bytes[i]", ByteType, hash)}
 |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + 
($result >>> 2);
 |}
   """.stripMargin
}

```
when b = input.getBytes(), the current 2.0 code results in getBytes() being 
called n times, n being the length of the input. getBytes() involves a memory 
copy and is thus expensive, causing a performance degradation.
The fix is to evaluate getBytes() once, before the for loop.





[jira] [Created] (SPARK-16526) Benchmarking Performance for Fast HashMap Implementations and Set Knobs

2016-07-13 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-16526:


 Summary: Benchmarking Performance for Fast HashMap Implementations 
and Set Knobs
 Key: SPARK-16526
 URL: https://issues.apache.org/jira/browse/SPARK-16526
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Qifan Pu


Add benchmark results for the two fast hashmap implementations, and set the 
rule for picking between the two (or falling back directly to the 2nd-level 
hashmap) based on those results.





[jira] [Created] (SPARK-16525) Enable Row Based HashMap in HashAggregateExec

2016-07-13 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-16525:


 Summary: Enable Row Based HashMap in HashAggregateExec
 Key: SPARK-16525
 URL: https://issues.apache.org/jira/browse/SPARK-16525
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Qifan Pu


Allow `RowBasedHashMapGenerator` to be used in `HashAggregateExec`, so that we 
can turn on the code-generated `RowBasedHashMap`.





[jira] [Created] (SPARK-16524) Add RowBatch and RowBasedHashMapGenerator

2016-07-13 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-16524:


 Summary: Add RowBatch and RowBasedHashMapGenerator
 Key: SPARK-16524
 URL: https://issues.apache.org/jira/browse/SPARK-16524
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Qifan Pu


This JIRA adds the implementations for `RowBatch` and 
`RowBasedHashMapGenerator`. 





[jira] [Created] (SPARK-16523) Support Row Based Aggregation HashMap

2016-07-13 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-16523:


 Summary: Support Row Based Aggregation HashMap
 Key: SPARK-16523
 URL: https://issues.apache.org/jira/browse/SPARK-16523
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Qifan Pu


For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as 
a "cache" that boosts aggregation performance. Previously, the hashmap was 
backed by a `ColumnarBatch`. This has performance issues when the aggregation 
table has a wide schema (a large number of key or value fields). In this JIRA, 
we support another fast hashmap implementation, backed by a `RowBatch`, and 
automatically pick between the two implementations based on certain knobs. 
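
A sketch of the kind of knob-based selection this describes (the width threshold and names are illustrative assumptions, not the actual implementation):

{code:scala}
// Illustrative selection logic only; thresholds and names are assumptions.
sealed trait FastHashMapKind
case object VectorizedMap extends FastHashMapKind   // ColumnarBatch-backed
case object RowBasedMap   extends FastHashMapKind   // RowBatch-backed

def pickFastHashMap(numKeyFields: Int, numValueFields: Int): FastHashMapKind = {
  val width = numKeyFields + numValueFields
  // Wide schemas blow up the vectorized map's generated code, so prefer rows.
  if (width > 16) RowBasedMap else VectorizedMap
}
{code}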





[jira] [Updated] (SPARK-16269) Support null handling for vectorized hashmap during hash aggregate

2016-06-28 Thread Qifan Pu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qifan Pu updated SPARK-16269:
-
External issue URL: https://github.com/apache/spark/pull/13960

> Support null handling for vectorized hashmap during hash aggregate
> --
>
> Key: SPARK-16269
> URL: https://issues.apache.org/jira/browse/SPARK-16269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>Priority: Minor
>
> The current implementation of the vectorized hashmap does not support null 
> keys. This patch fixes the problem by adding a 
> `generateFindOrInsertWithNullable()` method in 
> `VectorizedHashMapGenerator.scala`, which code-generates another version of 
> `findOrInsert` that handles null keys. We need null support so the aggregate 
> logic does not have to fall back to BytesToBytesMap. This would also allow us 
> to remove BytesToBytesMap completely.





[jira] [Created] (SPARK-16269) Support null handling for vectorized hashmap during hash aggregate

2016-06-28 Thread Qifan Pu (JIRA)
Qifan Pu created SPARK-16269:


 Summary: Support null handling for vectorized hashmap during hash 
aggregate
 Key: SPARK-16269
 URL: https://issues.apache.org/jira/browse/SPARK-16269
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Qifan Pu
Priority: Minor


The current implementation of the vectorized hashmap does not support null 
keys. This patch fixes the problem by adding a 
`generateFindOrInsertWithNullable()` method in 
`VectorizedHashMapGenerator.scala`, which code-generates another version of 
`findOrInsert` that handles null keys. We need null support so the aggregate 
logic does not have to fall back to BytesToBytesMap. This would also allow us 
to remove BytesToBytesMap completely.
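
A hand-written stand-in for what a nullable findOrInsert amounts to (the real code is generated Java specialized per key schema; this Scala sketch only illustrates the dedicated null slot):

{code:scala}
// Stand-in sketch; real codegen emits specialized Java per key schema.
class NullableIntKeyMap(capacity: Int) {
  private val keys = new Array[Int](capacity)
  private val used = new Array[Boolean](capacity)
  private var nullUsed = false

  /** Returns a slot index for the key; -1 models the fallback path. */
  def findOrInsert(key: java.lang.Integer): Int = {
    if (key == null) { nullUsed = true; return capacity } // reserved slot for null
    val k: Int = key
    var idx = Math.floorMod(k, capacity)
    var probes = 0
    while (probes < capacity) {
      if (!used(idx)) { used(idx) = true; keys(idx) = k; return idx }
      if (keys(idx) == k) return idx
      idx = (idx + 1) % capacity
      probes += 1
    }
    -1 // map full: caller falls back to the slow path (e.g. BytesToBytesMap)
  }
}
{code}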


