[jira] [Closed] (SPARK-21791) ORC should support column names with dot

2017-09-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-21791.
-
Resolution: Duplicate

SPARK-20682 includes this issue now.
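Until that fix lands, a workaround is to keep dots out of ORC column names
before writing; a minimal sketch (the underscore rename is an illustrative
choice, not from either ticket):

{code}
// Hypothetical workaround, not from this issue: rename dotted columns
// before writing ORC, then write and read back as usual.
val df = Seq(Some(1), None).toDF("col.dots")
val sanitized = df.columns.foldLeft(df) { (d, name) =>
  d.withColumnRenamed(name, name.replace('.', '_'))  // "col.dots" -> "col_dots"
}
sanitized.write.orc("/tmp/orc_nodot")
spark.read.orc("/tmp/orc_nodot").show()
{code}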

> ORC should support column names with dot
> 
>
> Key: SPARK-21791
> URL: https://issues.apache.org/jira/browse/SPARK-21791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> *PARQUET*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.parquet("/tmp/parquet_dot")
> scala> spark.read.parquet("/tmp/parquet_dot").show
> +--------+
> |col.dots|
> +--------+
> |       1|
> |    null|
> +--------+
> {code}
> *ORC*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.orc("/tmp/orc_dot")
> scala> spark.read.orc("/tmp/orc_dot").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '.' expecting ':'(line 1, pos 10)
> == SQL ==
> struct<col.dots:int>
> ----------^^^
> {code}






[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-09 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160062#comment-16160062
 ] 

Eyal Farago commented on SPARK-21907:
-

[~juliuszsompolski], I've followed the stack trace you attached (as far as 
possible with master's code), and I tend to agree with your assumption about 
UnsafeInMemorySorter.reset.
It seems that allocateArray failed with an OOM and triggered a nested spill. 
The problem is that by that point array still points to an already freed 
block, so the first time array is actually accessed (when TimSort invokes the 
comparator) it fails with an NPE (assuming asserts are really turned off in 
your environment).

I think the way to solve this is to temporarily set array (and the related 
pos, capacity, etc.) to null/zero/some other value indicating a currently 
unreadable/empty buffer.
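To illustrate the idea, a toy sketch (an assumption about the shape of the 
fix, not Spark's actual code):

{code}
// Toy model of the re-entrancy hazard: reset() frees its buffer, then calls
// an allocator that may itself trigger a nested spill(). Clearing the
// reference first lets the nested spill see an empty sorter instead of
// dereferencing a freed block.
class ToySorter(allocate: () => Array[Long], free: Array[Long] => Unit) {
  private var array: Array[Long] = allocate()
  private var pos: Int = 0

  def spill(): Unit = {
    if (array == null || pos == 0) return  // nested spill during reset(): no-op
    // ... TimSort over `array` would run here; the guard above keeps it
    // from touching a freed buffer ...
  }

  def reset(): Unit = {
    free(array)
    array = null        // mark the buffer unreadable *before* the risky call
    pos = 0
    array = allocate()  // may OOM and re-enter spill()
  }
}
{code}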

[~kiszk], what do you think?

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)

[jira] [Reopened] (SPARK-4131) Support "Writing data into the filesystem from queries"

2017-09-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-4131:


> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: jane
>Priority: Critical
> Fix For: 2.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Spark SQL does not support writing data into the filesystem from queries, 
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}
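For reference, given the 2.3.0 fix version, this should become expressible
through a Hive-enabled SparkSession; a hedged sketch reusing the table from
the example above:

{code}
// Sketch, assuming Spark 2.3.0+ with Hive support and an existing
// page_views table: write query results to a local directory via SQL.
spark.sql("""
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/page_views_out'
  SELECT * FROM page_views
""")
{code}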






[jira] [Assigned] (SPARK-4131) Support "Writing data into the filesystem from queries"

2017-09-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-4131:
--

Assignee: jane  (was: Fei Wang)

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: jane
>Priority: Critical
> Fix For: 2.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Spark SQL does not support writing data into the filesystem from queries, 
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}






[jira] [Resolved] (SPARK-4131) Support "Writing data into the filesystem from queries"

2017-09-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-4131.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: jane
>Priority: Critical
> Fix For: 2.3.0
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Spark SQL does not support writing data into the filesystem from queries, 
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}






[jira] [Assigned] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR

2017-09-09 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-21965:
---

Assignee: Yanbo Liang

> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
> ---
>
> Key: SPARK-21965
> URL: https://issues.apache.org/jira/browse/SPARK-21965
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.
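For reference, the Scala counterparts that these SparkR wrappers would
presumably mirror (the SparkR signatures themselves are not spelled out in
this issue):

{code}
// Global temp views live in the reserved global_temp database and are
// shared across sessions.
val df = spark.range(3).toDF("id")
df.createOrReplaceGlobalTempView("my_view")
spark.sql("SELECT * FROM global_temp.my_view").show()
spark.catalog.dropGlobalTempView("my_view")  // true if the view existed
{code}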






[jira] [Assigned] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR

2017-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21965:


Assignee: Apache Spark

> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
> ---
>
> Key: SPARK-21965
> URL: https://issues.apache.org/jira/browse/SPARK-21965
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.






[jira] [Commented] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR

2017-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159980#comment-16159980
 ] 

Apache Spark commented on SPARK-21965:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/19176

> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
> ---
>
> Key: SPARK-21965
> URL: https://issues.apache.org/jira/browse/SPARK-21965
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.






[jira] [Assigned] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR

2017-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21965:


Assignee: (was: Apache Spark)

> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
> ---
>
> Key: SPARK-21965
> URL: https://issues.apache.org/jira/browse/SPARK-21965
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.






[jira] [Created] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR

2017-09-09 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-21965:
---

 Summary: Add createOrReplaceGlobalTempView and dropGlobalTempView 
for SparkR
 Key: SPARK-21965
 URL: https://issues.apache.org/jira/browse/SPARK-21965
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Yanbo Liang


Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.






[jira] [Assigned] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21964:


Assignee: (was: Apache Spark)

> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
>
>
> In current versions, Spark SQL implements grouping analytics clauses (i.e., 
> cube, rollup and grouping sets) as a single Aggregate operator on a single 
> Expand operator. With this implementation, we can read the table only once.
> However, for many scenarios (e.g., high-dimensional cubes), the Expand operator 
> is too "heavy": its large number of projections results in a vast shuffle 
> write.
> In our production environment, we have encountered various such cases, 
> leading to low performance or even OOM issues for direct buffer memory. We 
> demonstrate the issue with the following real-world query of a 6-dimensional 
> cube.
>   
> {code:sql}
> SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
>   ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
>   ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
> sgametype
>   ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
>   ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
>   ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
> iacttypeid
>   ,SUM(iclickcnt) AS iclickcnt
> FROM p_day_advert
> WHERE  statedate = 20170810
> GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
> {code}
> For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
> Though the query reads only about 3 GB of data, it produces about 250 GB of 
> shuffle write. In our environment, the first stage takes about 2 hours.
> !https://issues.apache.org/jira/secure/attachment/12886247/before.png!
> The second stage easily hits an OOM error unless we enlarge some 
> configurations.
> !https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!
> Therefore, we propose another option that splits the heavyweight aggregate 
> into a number of lightweight aggregates, one per group. In effect, it 
> implements the grouping analytics as a Union and executes the aggregates one 
> by one. Though it reads the data multiple times, we can still achieve overall 
> high performance. With this implementation, the query completes in about 
> 20 minutes, with each aggregation taking 1~4 minutes.
> !https://issues.apache.org/jira/secure/attachment/12886245/Union.png!
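As a sketch of the proposed direction (an illustration only, not the PR's
implementation), restricted to three of the integer dimensions from the
query above:

{code}
import org.apache.spark.sql.functions.{col, lit, sum}

// A CUBE over N dimensions is 2^N grouping sets. Instead of one Aggregate
// over a 2^N-way Expand, run one lightweight aggregate per grouping set
// and union the results.
val dims = Seq("iFrom", "iSrcId", "igameid")   // 3 dims -> 2^3 = 8 sets
val sets = (0 to dims.length).flatMap(dims.combinations)
val base = spark.table("p_day_advert").where(col("statedate") === 20170810)
val parts = sets.map { g =>
  // Pad the rolled-up dimensions with -1 (mirroring the CASE expressions in
  // the query above) so every part shares one schema and can be unioned.
  val padded = dims.map(d => if (g.contains(d)) col(d) else lit(-1).as(d))
  base.groupBy(g.map(col): _*)
      .agg(sum("iclickcnt").as("iclickcnt"))
      .select(padded :+ col("iclickcnt"): _*)
}
val result = parts.reduce(_ union _)  // the union of all 2^3 group-bys
{code}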






[jira] [Commented] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159947#comment-16159947
 ] 

Apache Spark commented on SPARK-21964:
--

User 'DonnyZone' has created a pull request for this issue:
https://github.com/apache/spark/pull/19175

> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
>
>
> In current versions, Spark SQL implements grouping analytics clauses (i.e., 
> cube, rollup and grouping sets) as a single Aggregate operator on a single 
> Expand operator. With this implementation, we can read the table only once.
> However, for many scenarios (e.g., high-dimensional cubes), the Expand operator 
> is too "heavy": its large number of projections results in a vast shuffle 
> write.
> In our production environment, we have encountered various such cases, 
> leading to low performance or even OOM issues for direct buffer memory. We 
> demonstrate the issue with the following real-world query of a 6-dimensional 
> cube.
>   
> {code:sql}
> SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
>   ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
>   ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
> sgametype
>   ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
>   ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
>   ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
> iacttypeid
>   ,SUM(iclickcnt) AS iclickcnt
> FROM p_day_advert
> WHERE  statedate = 20170810
> GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
> {code}
> For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
> Though the query reads only about 3 GB of data, it produces about 250 GB of 
> shuffle write. In our environment, the first stage takes about 2 hours.
> !https://issues.apache.org/jira/secure/attachment/12886247/before.png!
> The second stage easily hits an OOM error unless we enlarge some 
> configurations.
> !https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!
> Therefore, we propose another option that splits the heavyweight aggregate 
> into a number of lightweight aggregates, one per group. In effect, it 
> implements the grouping analytics as a Union and executes the aggregates one 
> by one. Though it reads the data multiple times, we can still achieve overall 
> high performance. With this implementation, the query completes in about 
> 20 minutes, with each aggregation taking 1~4 minutes.
> !https://issues.apache.org/jira/secure/attachment/12886245/Union.png!






[jira] [Assigned] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21964:


Assignee: Apache Spark

> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
>Assignee: Apache Spark
> Attachments: before.png, OOMRetry.png, Union.png
>
>
> In current versions, Spark SQL implements grouping analytics clauses (i.e., 
> cube, rollup and grouping sets) as a single Aggregate operator on a single 
> Expand operator. With this implementation, we can read the table only once.
> However, for many scenarios (e.g., high-dimensional cubes), the Expand operator 
> is too "heavy": its large number of projections results in a vast shuffle 
> write.
> In our production environment, we have encountered various such cases, 
> leading to low performance or even OOM issues for direct buffer memory. We 
> demonstrate the issue with the following real-world query of a 6-dimensional 
> cube.
>   
> {code:sql}
> SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
>   ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
>   ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
> sgametype
>   ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
>   ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
>   ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
> iacttypeid
>   ,SUM(iclickcnt) AS iclickcnt
> FROM p_day_advert
> WHERE  statedate = 20170810
> GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
> {code}
> For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
> Though the query reads only about 3 GB of data, it produces about 250 GB of 
> shuffle write. In our environment, the first stage takes about 2 hours.
> !https://issues.apache.org/jira/secure/attachment/12886247/before.png!
> The second stage easily hits an OOM error unless we enlarge some 
> configurations.
> !https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!
> Therefore, we propose another option that splits the heavyweight aggregate 
> into a number of lightweight aggregates, one per group. In effect, it 
> implements the grouping analytics as a Union and executes the aggregates one 
> by one. Though it reads the data multiple times, we can still achieve overall 
> high performance. With this implementation, the query completes in about 
> 20 minutes, with each aggregation taking 1~4 minutes.
> !https://issues.apache.org/jira/secure/attachment/12886245/Union.png!






[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory. We demonstrate 
the issue with the following real-world query of a 6-dimensional cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours.


!https://issues.apache.org/jira/secure/attachment/12886247/before.png!


The second stage easily hits an OOM error unless we enlarge some 
configurations.


!https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours.

!https://issues.apache.org/jira/secure/attachment/12886247/before.png!

The second stage easily hits an OOM error unless we enlarge some 
configurations.

!https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours.

!https://issues.apache.org/jira/secure/attachment/12886247/before.png!

The second stage easily hits an OOM error unless we enlarge some 
configurations.

!https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours.

!https://issues.apache.org/jira/secure/attachment/12886247/before.png!

The second stage easily hits an OOM error unless we enlarge some 
configurations.
!https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours.

!https://issues.apache.org/jira/secure/attachment/12886247/before.png!

The second stage easily hits an OOM error unless we enlarge some 
configurations.
!https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours, and 
the second stage easily hits an OOM error unless we enlarge some 
configurations.

!https://issues.apache.org/jira/secure/attachment/12886247/before.png!
!https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours, and 
the second stage easily hits an OOM error unless we enlarge some 
configurations.

!https://issues.apache.org/jira/secure/attachment/12886247/before.png!
!https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours, and 
the second stage easily hits an OOM error unless we enlarge some 
configurations.

!https://issues.apache.org/jira/secure/attachment/12886247/before.png!
!https://issues.apache.org/jira/secure/attachment/12886245/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours, and 
the second stage easily hits an OOM error unless we enlarge some 
configurations.

!https://issues.apache.org/jira/secure/attachment/12886247/before.png!
!https://issues.apache.org/jira/secure/attachment/12886245/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours, and 
the second stage easily hits an OOM error unless we enlarge some 
configurations.

!https://issues.apache.org/jira/secure/attachment/12886245/before.png!
!https://issues.apache.org/jira/secure/attachment/12886245/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours, and 
the second stage easily hits an OOM error unless we enlarge some 
configurations.

!https://issues.apache.org/jira/secure/attachment/12886245/before.png!
!https://issues.apache.org/jira/secure/attachment/12886245/OOMRetry.png!

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours, and 
the second stage easily hits an OOM error unless we enlarge some 
configurations.

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours, and 
the second stage easily hits an OOM error unless we enlarge some 
configurations.

Therefore, we propose another option that splits the heavyweight aggregate 
into a number of lightweight aggregates, one per group. In effect, it 
implements the grouping analytics as a Union and executes the aggregates one 
by one. Though it reads the data multiple times, we can still achieve overall 
high performance. With this implementation, the query completes in about 
20 minutes, with each aggregation taking 1~4 minutes.

!https://issues.apache.org/jira/secure/attachment/12886245/Union.png!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., high-dimensional cubes), the Expand 
operator is too "heavy": its large number of projections results in a vast 
shuffle write.

In our production environment, we have encountered various such cases, leading 
to low performance or even OOM issues for direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
Though the query reads only about 3 GB of data, it produces about 250 GB of 
shuffle write. In our environment, the first stage takes about 2 hours, and 
the second stage easily hits an OOM error unless we enlarge some 
configurations.

Therefore, we tend to provide another choice which enables splitting the 
heavyweight aggregate into a number of lightweight aggregates for each group. 
Actually, it implements the grouping analytics as Union and executes the 
aggregates one by one. Though it reads the data many times, we can still 
achieve overall high performance. With such implementation, the query can be 
accomplished in about 20 mins, of which each aggregation takes 1~4 mins.

!Union.png|


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
>
>
> In current versions, Spark SQL implements grouping analytics clauses (

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance. With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.

!Union.png|thumbnail!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance. With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.

!Union.png|thumbnail!


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
>
>
> In current versions, Spark SQL implements grouping analytics clauses (i.e., 
> cube, rollup and grouping set

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance. With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.

!Union.png|thumbnail!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance. With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.

!Union.png|thumbnail!


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
>
>
> In current versions, Spark SQL implements grouping analytics clauses (i.e., 
> cube, rollup and grouping sets) as a si

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance. With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.

!Union.png|thumbnail!

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

!before.png|thumbnail!

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance.

With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
>
>
> In current versions, Spark SQL implements grouping analytics clauses (i.e., 
> cube, rollup and grouping s

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS 
sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS 
iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

!before.png|thumbnail!

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance.

With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END 
AS sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS 
iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS 
igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END 
AS iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

!before.png|thumbnail!

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance.

With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
>
>
> In current versions, Spark SQL implements grouping ana

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
  ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
  ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END 
AS sgametype
  ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS 
iOperId
  ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS 
igameid
  ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END 
AS iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

!before.png|thumbnail!

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance.

With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE
   WHEN grouping(iFrom) = 1 THEN -1
   ELSE iFrom
   END AS iFrom
  ,CASE
   WHEN grouping(iSrcId) = 1 THEN -1
   ELSE iSrcId
   END AS iSrcId
  ,CASE
   WHEN grouping(sgametype) = 1 THEN '-1'
   ELSE sgametype
   END AS sgametype
  ,CASE
   WHEN grouping(iOperId) = 1 THEN -1
   ELSE iOperId
   END AS iOperId
  ,CASE
   WHEN grouping(igameid) = 1 THEN -1
   ELSE igameid
   END AS igameid
  ,CASE
   WHEN grouping(iacttypeid) = 1 THEN -1
   ELSE iacttypeid
   END AS iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.
!before.png|thumbnail!


Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance.

With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2,

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Description: 
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE
   WHEN grouping(iFrom) = 1 THEN -1
   ELSE iFrom
   END AS iFrom
  ,CASE
   WHEN grouping(iSrcId) = 1 THEN -1
   ELSE iSrcId
   END AS iSrcId
  ,CASE
   WHEN grouping(sgametype) = 1 THEN '-1'
   ELSE sgametype
   END AS sgametype
  ,CASE
   WHEN grouping(iOperId) = 1 THEN -1
   ELSE iOperId
   END AS iOperId
  ,CASE
   WHEN grouping(igameid) = 1 THEN -1
   ELSE igameid
   END AS igameid
  ,CASE
   WHEN grouping(iacttypeid) = 1 THEN -1
   ELSE iacttypeid
   END AS iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.
!before.png|thumbnail!


Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance.

With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.

  was:
In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE
   WHEN grouping(iFrom) = 1 THEN -1
   ELSE iFrom
   END AS iFrom
  ,CASE
   WHEN grouping(iSrcId) = 1 THEN -1
   ELSE iSrcId
   END AS iSrcId
  ,CASE
   WHEN grouping(sgametype) = 1 THEN '-1'
   ELSE sgametype
   END AS sgametype
  ,CASE
   WHEN grouping(iOperId) = 1 THEN -1
   ELSE iOperId
   END AS iOperId
  ,CASE
   WHEN grouping(igameid) = 1 THEN -1
   ELSE igameid
   END AS igameid
  ,CASE
   WHEN grouping(iacttypeid) = 1 THEN -1
   ELSE iacttypeid
   END AS iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance.

With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.


> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark

[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21964:
-
Attachment: before.png
OOMRetry.png
Union.png

> Enable splitting the Aggregate (on Expand) into a number of Aggregates for 
> grouping analytics
> 
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
>
>
> In current versions, Spark SQL implements grouping analytics clauses (i.e., 
> cube, rollup and grouping sets) as a single Aggregate operator on a single 
> Expand operator. With this implementation, we can read the table only once.
> However, for many scenarios (e.g., a high-dimension cube), the Expand operator 
> is too "heavy": its large number of projections results in a vast shuffle write.
> In our production environment, we have encountered many such cases, leading to 
> low performance or even OOM errors in direct buffer memory.
> We demonstrate the issue with the following real-world query of a 
> 6-dimensional cube.
>   
> {code:sql}
> SELECT CASE
>WHEN grouping(iFrom) = 1 THEN -1
>ELSE iFrom
>END AS iFrom
>   ,CASE
>WHEN grouping(iSrcId) = 1 THEN -1
>ELSE iSrcId
>END AS iSrcId
>   ,CASE
>WHEN grouping(sgametype) = 1 THEN '-1'
>ELSE sgametype
>END AS sgametype
>   ,CASE
>WHEN grouping(iOperId) = 1 THEN -1
>ELSE iOperId
>END AS iOperId
>   ,CASE
>WHEN grouping(igameid) = 1 THEN -1
>ELSE igameid
>END AS igameid
>   ,CASE
>WHEN grouping(iacttypeid) = 1 THEN -1
>ELSE iacttypeid
>END AS iacttypeid
>   ,SUM(iclickcnt) AS iclickcnt
> FROM p_day_advert
> WHERE  statedate = 20170810
> GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
> {code}
> For such a query, the Expand operator generates 64 (i.e., 2^6) projections. 
> Though the query reads only about 3GB of data, it produces about 250GB of 
> shuffle-write data. In our environment, the first stage takes about 2 hours, and 
> the second stage easily hits an OOM error unless we enlarge some configurations.
> Therefore, we propose another option that splits the heavyweight Aggregate into 
> a number of lightweight Aggregates, one per grouping set. In effect, it 
> implements the grouping analytics as a Union and executes the aggregates one by 
> one. Though it reads the data multiple times, we can still achieve high overall 
> performance.
> With this implementation, the query completes in about 20 minutes, with each 
> aggregation taking 1~4 minutes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics

2017-09-09 Thread Feng Zhu (JIRA)
Feng Zhu created SPARK-21964:


 Summary: Enable splitting the Aggregate (on Expand) into a number 
of Aggregates for grouping analytics
 Key: SPARK-21964
 URL: https://issues.apache.org/jira/browse/SPARK-21964
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.0
Reporter: Feng Zhu


In current versions, Spark SQL implements grouping analytics clauses (i.e., 
cube, rollup and grouping sets) as a single Aggregate operator on a single 
Expand operator. With this implementation, we can read the table only once.

However, for many scenarios (e.g., a high-dimension cube), the Expand operator is too "heavy": its large number of projections results in a vast shuffle write.

In our production environment, we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory.

We demonstrate the issue with the following real-world query of a 6-dimensional 
cube.
  
{code:sql}
SELECT CASE
   WHEN grouping(iFrom) = 1 THEN -1
   ELSE iFrom
   END AS iFrom
  ,CASE
   WHEN grouping(iSrcId) = 1 THEN -1
   ELSE iSrcId
   END AS iSrcId
  ,CASE
   WHEN grouping(sgametype) = 1 THEN '-1'
   ELSE sgametype
   END AS sgametype
  ,CASE
   WHEN grouping(iOperId) = 1 THEN -1
   ELSE iOperId
   END AS iOperId
  ,CASE
   WHEN grouping(igameid) = 1 THEN -1
   ELSE igameid
   END AS igameid
  ,CASE
   WHEN grouping(iacttypeid) = 1 THEN -1
   ELSE iacttypeid
   END AS iacttypeid
  ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE  statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}

For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle-write data. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.

Therefore, we propose another option that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, we can still achieve high overall performance.

With this implementation, the query completes in about 20 minutes, with each aggregation taking 1~4 minutes.
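To see the Expand node the description refers to, one can inspect the plan of a small cube aggregation. A minimal sketch using the DataFrame API (column names made up for illustration):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("cube-plan").getOrCreate()

// Two grouping columns, so Expand emits 2^2 = 4 rows per input row.
val df = spark.range(0, 100).selectExpr("id % 2 AS a", "id % 3 AS b", "id AS x")
df.cube("a", "b").agg(sum("x")).explain()
// The physical plan shows an Expand node between the scan and the
// HashAggregate, with one projection per grouping set.
{code}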



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21963) created temp file should be deleted after use

2017-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21963:


Assignee: (was: Apache Spark)

> created temp file should be deleted after use
> ---
>
> Key: SPARK-21963
> URL: https://issues.apache.org/jira/browse/SPARK-21963
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> After you create a temporary file, you need to delete it; otherwise it will 
> leave behind a file with a name like ‘SPARK194465907929586320484966temp’



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21963) created temp file should be deleted after use

2017-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21963:


Assignee: Apache Spark

> created temp file should be deleted after use
> ---
>
> Key: SPARK-21963
> URL: https://issues.apache.org/jira/browse/SPARK-21963
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>
> After you create a temporary file, you need to delete it; otherwise it will 
> leave behind a file with a name like ‘SPARK194465907929586320484966temp’



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21963) created temp file should be deleted after use

2017-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159892#comment-16159892
 ] 

Apache Spark commented on SPARK-21963:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/19174

> created temp file should be deleted after use
> ---
>
> Key: SPARK-21963
> URL: https://issues.apache.org/jira/browse/SPARK-21963
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> After you create a temporary file, you need to delete it; otherwise it will 
> leave behind a file with a name like ‘SPARK194465907929586320484966temp’



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21963) created temp file should be deleted after use

2017-09-09 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21963:
-

 Summary: created temp file should be deleted after use
 Key: SPARK-21963
 URL: https://issues.apache.org/jira/browse/SPARK-21963
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 2.3.0
Reporter: caoxuewen


After you create a temporary file, you need to delete it; otherwise it will leave 
behind a file with a name like ‘SPARK194465907929586320484966temp’
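A minimal sketch of the cleanup pattern the report asks for, using only standard-library calls (the actual fix belongs in Spark's test code):

{code:scala}
import java.nio.file.{Files, Path}

// Create the temp file, use it, and guarantee it is deleted afterwards.
val tmp: Path = Files.createTempFile("SPARK", "temp")
try {
  Files.write(tmp, "some test data".getBytes("UTF-8"))
  // ... exercise the code under test ...
} finally {
  Files.deleteIfExists(tmp) // no SPARK...temp leftovers
}
{code}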




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type

2017-09-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21954.
--
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Fixed in https://github.com/apache/spark/pull/19167

> JacksonUtils should verify MapType's value type instead of key type
> ---
>
> Key: SPARK-21954
> URL: https://issues.apache.org/jira/browse/SPARK-21954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.1, 2.3.0
>
>
> {{JacksonUtils.verifySchema}} verifies if a data type can be converted to 
> JSON. For {{MapType}}, it now verifies the key type. However, in 
> {{JacksonGenerator}}, when converting a map to JSON, we only care about its 
> values and create a writer for the values. The keys in a map are treated as 
> strings by calling {{toString}} on the keys.
> Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type 
> of {{MapType}}.
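A sketch of the intended change (the method shape is assumed, not copied from the Spark source): the verifier recurses into a map's value type only, because keys are always rendered via {{toString}}:

{code:scala}
import org.apache.spark.sql.types._

// Simplified schema check: keys of a MapType are written as strings,
// so only the value type needs to be verified.
def verifyJsonConvertible(dt: DataType): Unit = dt match {
  case MapType(_, valueType, _)  => verifyJsonConvertible(valueType) // not the key type
  case ArrayType(elementType, _) => verifyJsonConvertible(elementType)
  case StructType(fields)        => fields.foreach(f => verifyJsonConvertible(f.dataType))
  case _                         => () // assume remaining (atomic) types are writable
}
{code}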



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type

2017-09-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21954:


Assignee: Liang-Chi Hsieh

> JacksonUtils should verify MapType's value type instead of key type
> ---
>
> Key: SPARK-21954
> URL: https://issues.apache.org/jira/browse/SPARK-21954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>
> {{JacksonUtils.verifySchema}} verifies if a data type can be converted to 
> JSON. For {{MapType}}, it now verifies the key type. However, in 
> {{JacksonGenerator}}, when converting a map to JSON, we only care about its 
> values and create a writer for the values. The keys in a map are treated as 
> strings by calling {{toString}} on the keys.
> Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type 
> of {{MapType}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21802) Make sparkR MLP summary() expose probability column

2017-09-09 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu closed SPARK-21802.
--
Resolution: Not A Problem

> Make sparkR MLP summary() expose probability column
> ---
>
> Key: SPARK-21802
> URL: https://issues.apache.org/jira/browse/SPARK-21802
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>
> Make sparkR MLP summary() expose probability column



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2017-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21856:


Assignee: (was: Apache Spark)

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> SPARK-12664 exposed the probability column in MultilayerPerceptronClassifier, so 
> the Python API also needs to be updated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2017-09-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159797#comment-16159797
 ] 

Apache Spark commented on SPARK-21856:
--

User 'chunshengji' has created a pull request for this issue:
https://github.com/apache/spark/pull/19172

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> SPARK-12664 exposed the probability column in MultilayerPerceptronClassifier, so 
> the Python API also needs to be updated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2017-09-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21856:


Assignee: Apache Spark

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-12664 exposed the probability column in MultilayerPerceptronClassifier, so 
> the Python API also needs to be updated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org