[jira] [Closed] (SPARK-21791) ORC should support column names with dot
[ https://issues.apache.org/jira/browse/SPARK-21791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-21791. - Resolution: Duplicate SPARK-20682 includes this issue now.
> ORC should support column names with dot
>
> Key: SPARK-21791
> URL: https://issues.apache.org/jira/browse/SPARK-21791
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.0, 2.2.0
> Reporter: Dongjoon Hyun
>
> *PARQUET*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.parquet("/tmp/parquet_dot")
> scala> spark.read.parquet("/tmp/parquet_dot").show
> +--------+
> |col.dots|
> +--------+
> |       1|
> |    null|
> +--------+
> {code}
> *ORC*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.orc("/tmp/orc_dot")
> scala> spark.read.orc("/tmp/orc_dot").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '.' expecting ':'(line 1, pos 10)
> == SQL ==
> struct<col.dots:int>
> ----------^^^
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
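The asymmetry above comes from how the reader interprets a dotted name: Parquet keeps the physical column name as-is, while the ORC read path in these versions appears to round-trip the schema through a string that the Catalyst parser re-parses, where an unquoted dot reads as struct-field access. At the SQL layer the usual escape is backtick quoting (e.g. selecting `` `col.dots` ``). The following is a toy Python sketch of that quoting rule only, not Spark's actual parser; `split_attribute` is a hypothetical helper invented for illustration:

```python
def split_attribute(name):
    """Split a column reference on unquoted dots; a backtick-quoted
    segment (`...`) is kept as one literal part, mirroring the rule
    Spark SQL applies to attribute names (hypothetical mini-version)."""
    parts, buf, quoted = [], [], False
    for ch in name:
        if ch == "`":
            quoted = not quoted          # toggle quoting, drop the backtick
        elif ch == "." and not quoted:
            parts.append("".join(buf))   # unquoted dot: segment boundary
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts
```

Under this rule, an unquoted `col.dots` resolves as field `dots` of column `col`, while `` `col.dots` `` stays a single column name, which is why the ParseException fires on the bare schema string.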
[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160062#comment-16160062 ] Eyal Farago commented on SPARK-21907: - [~juliuszsompolski], I've followed the stack trace you attached (as far as is possible with master's code) and I tend to agree with your assumption about UnsafeInMemorySorter.reset. It seems that allocateArray failed with an OOM and triggered a nested spill; the problem is that by this point `array` points to an already freed block, and the first time `array` is actually accessed (TimSort invokes the comparator) it fails with an NPE (assuming asserts are really turned off in your env). I think the way to solve this is to temporarily set `array` (and the relevant pos, capacity, etc.) to null/zero/some other value indicating a currently unreadable/empty buffer. [~kiszk], what do you think?
> NullPointerException in UnsafeExternalSorter.spill()
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
> at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
> at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
> at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
> at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
> at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
> at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
> at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
> at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
> at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
> at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
> at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
> at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
> at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
> at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
> at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
> at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegen
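The failure mode in the comment (a nested spill reading a pointer array that reset() has already freed) and the proposed guard can be sketched language-neutrally. The following is an illustrative Python model, not Spark's actual Java code: `ToySorter` and `pressured_allocator` are invented names, and the allocator that re-enters spill() merely stands in for TaskMemoryManager asking its consumers to spill under memory pressure.

```python
class ToySorter:
    """Loose model of UnsafeInMemorySorter: reset() frees the old pointer
    array and allocates a fresh one, and that allocation may itself
    re-enter spill() when memory is tight (a "nested spill")."""

    def __init__(self, allocate):
        self.allocate = allocate   # may call self.spill() before returning
        self.array = [0] * 16      # stands in for the array of record pointers

    def reset(self):
        # The guard proposed in the comment: mark the buffer empty/unreadable
        # BEFORE the allocation that can trigger a nested spill. Without this
        # line, self.array would still reference an already-freed block when
        # the nested spill tried to sort it (the reported NPE).
        self.array = None
        self.array = self.allocate(self)

    def spill(self):
        if self.array is None:
            # Nested spill caught mid-reset: nothing readable in memory,
            # so report 0 records spilled instead of touching freed memory.
            return 0
        spilled = len(self.array)
        self.reset()
        return spilled


nested_spill_results = []

def pressured_allocator(sorter):
    """Simulates the memory manager hitting OOM once: it asks the consumer
    to spill, records the result, then retries the allocation."""
    if not nested_spill_results:
        nested_spill_results.append(sorter.spill())
    return [0] * 16
```

With the guard in place, the nested spill sees the "empty" marker and returns 0 instead of dereferencing freed memory; the outer spill then completes normally.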
[jira] [Reopened] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-4131:
> Support "Writing data into the filesystem from queries"
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: XiaoJing wang
> Assignee: jane
> Priority: Critical
> Fix For: 2.3.0
>
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> Spark SQL does not support writing data into the filesystem from queries, e.g.:
> {code}
> insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views;
> {code}
[jira] [Assigned] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-4131: -- Assignee: jane (was: Fei Wang)
[jira] [Resolved] (SPARK-4131) Support "Writing data into the filesystem from queries"
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-4131. Resolution: Fixed Fix Version/s: 2.3.0
[jira] [Assigned] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
[ https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-21965: --- Assignee: Yanbo Liang
> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
>
> Key: SPARK-21965
> URL: https://issues.apache.org/jira/browse/SPARK-21965
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Affects Versions: 2.2.0
> Reporter: Yanbo Liang
> Assignee: Yanbo Liang
>
> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.
[jira] [Assigned] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
[ https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21965: Assignee: Apache Spark
[jira] [Commented] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
[ https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159980#comment-16159980 ] Apache Spark commented on SPARK-21965: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/19176
[jira] [Assigned] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
[ https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21965: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
Yanbo Liang created SPARK-21965: --- Summary: Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR Key: SPARK-21965 URL: https://issues.apache.org/jira/browse/SPARK-21965 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.2.0 Reporter: Yanbo Liang Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.
[jira] [Assigned] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21964: Assignee: (was: Apache Spark)
> Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
> Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
>
> In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, the table is read only once. However, in many scenarios (e.g., high-dimensional cubes), the Expand operator is too "heavy": it carries a large number of projections, resulting in a vast shuffle write.
> In our production environment, we have encountered various such cases, leading to low performance or even OOM errors in direct buffer memory. We demonstrate the issue with the following real-world query over a 6-dimensional cube.
> {code:sql}
> SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
>       ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
>       ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype
>       ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
>       ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
>       ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid
>       ,SUM(iclickcnt) AS iclickcnt
> FROM p_day_advert
> WHERE statedate = 20170810
> GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
> {code}
> For such a query, the Expand operator generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of shuffle write. In our environment, the first stage costs about 2 hours.
> !https://issues.apache.org/jira/secure/attachment/12886247/before.png!
> The second stage easily hits an OOM error unless we enlarge some configurations.
> !https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png!
> Therefore, we propose another choice, which enables splitting the heavyweight Aggregate into a number of lightweight Aggregates, one per group. Effectively, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data multiple times, it can still achieve higher overall performance. With this implementation, the query completes in about 20 mins, of which each aggregation takes 1~4 mins.
> !https://issues.apache.org/jira/secure/attachment/12886245/Union.png!
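The arithmetic behind the blow-up, and the proposed rewrite, can be illustrated outside Spark. The short Python sketch below models the two plan shapes only, not Spark's actual operators; the column names are reused from the query above, and `cube_grouping_sets` and `union_branches` are illustrative names, not Spark APIs.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """Every grouping set a CUBE over `columns` generates: all 2^n subsets."""
    return [list(c) for r in range(len(columns) + 1)
                    for c in combinations(columns, r)]

cols = ["iFrom", "iSrcId", "sgametype", "iOperId", "igameid", "iacttypeid"]
sets = cube_grouping_sets(cols)

# Single-Expand plan: every input row is emitted once per grouping set
# before aggregation, so shuffle volume scales with 2^n (64x here).
expand_factor = len(sets)

# Union-split plan (the proposal): one lightweight GROUP BY per grouping
# set, executed independently and unioned, with no per-row duplication.
union_branches = ["GROUP BY " + (", ".join(s) if s else "()") for s in sets]
```

This makes the trade-off concrete: one pass over the input with a 64x row blow-up, versus 64 passes that each shuffle only the aggregated data.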
[jira] [Commented] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159947#comment-16159947 ] Apache Spark commented on SPARK-21964: -- User 'DonnyZone' has created a pull request for this issue: https://github.com/apache/spark/pull/19175
[jira] [Assigned] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21964: Assignee: Apache Spark
[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-21964: - Description: edited; the full issue description is quoted in the notification above.
[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouing analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-21964: - Description: In current versions, Spark SQL implements the grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on top of a single Expand operator. With this implementation, the table is read only once. However, in many scenarios (e.g., high-dimensional cubes), the Expand operator becomes too "heavy", carrying a large number of projections and producing vast shuffle writes. In our production environment we have encountered many such cases, leading to low performance or even OOM errors in direct buffer memory. We demonstrate the issue with the following real-world query over a 6-dimensional cube.
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom
      ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId
      ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype
      ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId
      ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid
      ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid
      ,SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}
For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Although the query reads only about 3 GB of data, it produces about 250 GB of shuffle write. In our environment, the first stage takes about 2 hours. !https://issues.apache.org/jira/secure/attachment/12886247/before.png! The second stage easily hits an OOM error unless we enlarge some memory configurations. !https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png! Therefore, we propose an alternative that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per group.
It implements the grouping analytics as a Union and executes the aggregates one by one. Although this reads the data multiple times, the overall performance is still much higher: with this implementation the query completes in about 20 minutes, with each aggregation taking 1 to 4 minutes. !https://issues.apache.org/jira/secure/attachment/12886245/Union.png!
> Enable splitting the Aggregate (on Expand) into a number of Aggregates for
> grouing analytics
>
> Key: SPARK-21964
> URL: https://issues.apache.org/jira/browse/SPARK-21964
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
> Reporter: Feng Zhu
> Attachments: before.png, OOMRetry.png, Union.png
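For readers unfamiliar with the two physical strategies the ticket contrasts, here is a minimal pure-Python sketch (not Spark code; the toy rows, dimension names, and helper functions are illustrative assumptions). It shows why the Expand-based plan multiplies the data by 2^n for an n-dimension cube, and that a union of one small GROUP BY per grouping set produces the same answer:

```python
from itertools import combinations

# Toy input; real dimensions in the ticket are iFrom, iSrcId, sgametype, etc.
rows = [
    {"iFrom": 1, "iSrcId": 10, "iclickcnt": 5},
    {"iFrom": 1, "iSrcId": 20, "iclickcnt": 3},
    {"iFrom": 2, "iSrcId": 10, "iclickcnt": 7},
]
dims = ["iFrom", "iSrcId"]

def grouping_sets(dims):
    # All 2^n subsets of the dimensions -- what CUBE expands to.
    return [set(c) for k in range(len(dims) + 1) for c in combinations(dims, k)]

def cube_via_expand(rows, dims):
    # Expand-based plan: replicate each input row once per grouping set
    # (non-grouped dims nulled out), then aggregate everything in one pass.
    # The replication factor 2^n is the source of the huge shuffle write.
    expanded = []
    for row in rows:
        for gset in grouping_sets(dims):
            key = tuple(row[d] if d in gset else None for d in dims)
            expanded.append((key, row["iclickcnt"]))
    agg = {}
    for key, v in expanded:
        agg[key] = agg.get(key, 0) + v
    return agg, len(expanded)

def cube_via_union(rows, dims):
    # Proposed alternative: one lightweight GROUP BY per grouping set,
    # results unioned; the input is re-scanned once per grouping set.
    agg = {}
    for gset in grouping_sets(dims):
        for row in rows:
            key = tuple(row[d] if d in gset else None for d in dims)
            agg[key] = agg.get(key, 0) + row["iclickcnt"]
    return agg

expand_result, expanded_rows = cube_via_expand(rows, dims)
union_result = cube_via_union(rows, dims)
assert expand_result == union_result                 # identical results
assert expanded_rows == len(rows) * 2 ** len(dims)   # 3 rows -> 12 expanded
```

The trade-off is the one described in the description: the Expand plan scans the table once but shuffles 2^n times the input, while the Union plan scans 2^n times but each per-grouping-set aggregate is cheap, which is why the 6-dimension query above went from roughly 2 hours to about 20 minutes.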
[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouing analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-21964: - Description: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. !https://issues.apache.org/jira/secure/attachment/12886247/before.png! !https://issues.apache.org/jira/secure/attachment/12886246/OOMRetry.png! Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. 
Actually, it implements the grouping analytics as Union and executes the aggregates one by one. Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !https://issues.apache.org/jira/secure/attachment/12886245/Union.png! was: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. !https://issues.apache.org/jira/secure/attachment/12886247/before.png! 
!https://issues.apache.org/jira/secure/attachment/12886245/OOMRetry.png! Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. Actually, it implements the grouping analytics as Union and executes the aggregates one by one. Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !https://issues.apache.org/jira/secure/attachment/12886245/Union.png! > Enable splitting the Aggregate (on Expand) into a number of Aggregates for > grouing analytics > > > Key: SPARK-21964 > URL: https://issues.ap
[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouing analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-21964: - Description: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. !https://issues.apache.org/jira/secure/attachment/12886247/before.png! !https://issues.apache.org/jira/secure/attachment/12886245/OOMRetry.png! Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. 
Actually, it implements the grouping analytics as Union and executes the aggregates one by one. Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !https://issues.apache.org/jira/secure/attachment/12886245/Union.png! was: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. !https://issues.apache.org/jira/secure/attachment/12886245/before.png! 
!https://issues.apache.org/jira/secure/attachment/12886245/OOMRetry.png! Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. Actually, it implements the grouping analytics as Union and executes the aggregates one by one. Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !https://issues.apache.org/jira/secure/attachment/12886245/Union.png! > Enable splitting the Aggregate (on Expand) into a number of Aggregates for > grouing analytics > > > Key: SPARK-21964 > URL: https://issues.ap
[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouing analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-21964: - Description: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. !https://issues.apache.org/jira/secure/attachment/12886245/before.png! !https://issues.apache.org/jira/secure/attachment/12886245/OOMRetry.png! Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. 
Actually, it implements the grouping analytics as Union and executes the aggregates one by one. Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !https://issues.apache.org/jira/secure/attachment/12886245/Union.png! was: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. 
Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. Actually, it implements the grouping analytics as Union and executes the aggregates one by one. Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !https://issues.apache.org/jira/secure/attachment/12886245/Union.png! > Enable splitting the Aggregate (on Expand) into a number of Aggregates for > grouing analytics > > > Key: SPARK-21964 > URL: https://issues.apache.org/jira/browse/SPARK-21964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions
[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouing analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-21964: - Description: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. Actually, it implements the grouping analytics as Union and executes the aggregates one by one. 
Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !https://issues.apache.org/jira/secure/attachment/12886245/Union.png! was: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. 
Actually, it implements the grouping analytics as Union and executes the aggregates one by one. Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !Union.png| > Enable splitting the Aggregate (on Expand) into a number of Aggregates for > grouing analytics > > > Key: SPARK-21964 > URL: https://issues.apache.org/jira/browse/SPARK-21964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Feng Zhu > Attachments: before.png, OOMRetry.png, Union.png > > > In current versions, Spark SQL implements grouping analytics clauses (
[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouing analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-21964: - Description: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. Actually, it implements the grouping analytics as Union and executes the aggregates one by one. 
Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. thumbnail!!Union.png| was: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. 
Actually, it implements the grouping analytics as Union and executes the aggregates one by one. Though it reads the data many times, we can still achieve overall high performance. With such implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !Union.png|thumbnail! > Enable splitting the Aggregate (on Expand) into a number of Aggregates for > grouing analytics > > > Key: SPARK-21964 > URL: https://issues.apache.org/jira/browse/SPARK-21964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Feng Zhu > Attachments: before.png, OOMRetry.png, Union.png > > > In current versions, Spark SQL implements grouping analytics clauses (i.e., > cube, rollup and grouping set
[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouing analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-21964: - Description: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high dimensions cube), the Expand operator is too "heavy" with a large number of projections, resulting vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query of a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such query, the Expand operator will generates 64 (i.e., 64=2^6) projections. Though the query reads only about 3GB data, it produces about 250GB data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage is easy to get an OOM error unless we enlarge the some configurations. Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates for each group. Actually, it implements the grouping analytics as Union and executes the aggregates one by one. 
Though it reads the data many times, we can still achieve overall high performance. With such an implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !Union.png|thumbnail! was: In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high-dimensional cubes), the Expand operator is too "heavy" with a large number of projections, resulting in a vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query over a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations. Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates, one for each group. Actually, it implements the grouping analytics as a Union and executes the aggregates one by one. 
Though it reads the data many times, we can still achieve overall high performance. With such an implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. !Union.png|thumbnail! > Enable splitting the Aggregate (on Expand) into a number of Aggregates for > grouping analytics > > > Key: SPARK-21964 > URL: https://issues.apache.org/jira/browse/SPARK-21964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Feng Zhu > Attachments: before.png, OOMRetry.png, Union.png > > > In current versions, Spark SQL implements grouping analytics clauses (i.e., > cube, rollup and grouping sets) as a si
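As a sanity check on the 64-projection figure (plain Python, not Spark code): a CUBE over n grouping columns expands to one projection per subset of those columns, i.e. 2^n grouping sets.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """Enumerate every grouping set a CUBE over `columns` produces:
    one subset of the columns per set, including the empty set (grand total)."""
    sets = []
    for r in range(len(columns) + 1):
        sets.extend(combinations(columns, r))
    return sets

cols = ["iFrom", "iSrcId", "sgametype", "iOperId", "igameid", "iacttypeid"]
sets = cube_grouping_sets(cols)
print(len(sets))  # 64 grouping sets, i.e. 2^6 Expand projections
```

Each of those 64 projections duplicates every input row, which is how a ~3GB read turns into a ~250GB shuffle write.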
[jira] [Updated] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics
[ https://issues.apache.org/jira/browse/SPARK-21964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Zhu updated SPARK-21964: - Attachment: before.png OOMRetry.png Union.png > Enable splitting the Aggregate (on Expand) into a number of Aggregates for > grouping analytics > > > Key: SPARK-21964 > URL: https://issues.apache.org/jira/browse/SPARK-21964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Feng Zhu > Attachments: before.png, OOMRetry.png, Union.png > > > In current versions, Spark SQL implements grouping analytics clauses (i.e., > cube, rollup and grouping sets) as a single Aggregate operator on a single > Expand operator. With this implementation, we can read the table only once. > However, for many scenarios (e.g., high-dimensional cubes), the Expand operator > is too "heavy" with a large number of projections, resulting in a vast shuffle > write. > In our production environment, we have encountered various such cases, > leading to low performance or even OOM issues for direct buffer memory. > We demonstrate the issue with the following real-world query over a > 6-dimensional cube. > > {code:sql} > SELECT CASE >WHEN grouping(iFrom) = 1 THEN -1 >ELSE iFrom >END AS iFrom > ,CASE >WHEN grouping(iSrcId) = 1 THEN -1 >ELSE iSrcId >END AS iSrcId > ,CASE >WHEN grouping(sgametype) = 1 THEN '-1' >ELSE sgametype >END AS sgametype > ,CASE >WHEN grouping(iOperId) = 1 THEN -1 >ELSE iOperId >END AS iOperId > ,CASE >WHEN grouping(igameid) = 1 THEN -1 >ELSE igameid >END AS igameid > ,CASE >WHEN grouping(iacttypeid) = 1 THEN -1 >ELSE iacttypeid >END AS iacttypeid > ,SUM(iclickcnt) AS iclickcnt > FROM p_day_advert > WHERE statedate = 20170810 > GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE > {code} > For such a query, the Expand operator will generate 64 (i.e., 2^6) > projections. 
Though the query reads only about 3GB of data, it produces about > 250GB of data for shuffle write. In our environment, the first stage costs about > 2 hours, and the second stage easily hits an OOM error unless we enlarge > some configurations. > Therefore, we tend to provide another choice which enables splitting the > heavyweight aggregate into a number of lightweight aggregates, one for each group. > Actually, it implements the grouping analytics as a Union and executes the > aggregates one by one. Though it reads the data many times, we can still > achieve overall high performance. > With such an implementation, the query can be accomplished in about 20 mins, of > which each aggregation takes 1~4 mins. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21964) Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics
Feng Zhu created SPARK-21964: Summary: Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics Key: SPARK-21964 URL: https://issues.apache.org/jira/browse/SPARK-21964 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.0 Reporter: Feng Zhu In current versions, Spark SQL implements grouping analytics clauses (i.e., cube, rollup and grouping sets) as a single Aggregate operator on a single Expand operator. With this implementation, we can read the table only once. However, for many scenarios (e.g., high-dimensional cubes), the Expand operator is too "heavy" with a large number of projections, resulting in a vast shuffle write. In our production environment, we have encountered various such cases, leading to low performance or even OOM issues for direct buffer memory. We demonstrate the issue with the following real-world query over a 6-dimensional cube. {code:sql} SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom ,CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId ,CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype ,CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId ,CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid ,CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid ,SUM(iclickcnt) AS iclickcnt FROM p_day_advert WHERE statedate = 20170810 GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE {code} For such a query, the Expand operator generates 64 (i.e., 2^6) projections. Though the query reads only about 3GB of data, it produces about 250GB of data for shuffle write. In our environment, the first stage costs about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations. 
Therefore, we tend to provide another choice which enables splitting the heavyweight aggregate into a number of lightweight aggregates, one for each group. Actually, it implements the grouping analytics as a Union and executes the aggregates one by one. Though it reads the data many times, we can still achieve overall high performance. With such an implementation, the query can be accomplished in about 20 mins, of which each aggregation takes 1~4 mins. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
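The proposed rewrite is easiest to see as SQL generation: one plain GROUP BY per grouping set, glued together with UNION ALL. A minimal sketch in Python (not the actual Spark rule; the helper name and the uniform `-1` padding are illustrative assumptions — string columns would need `'-1'`, as in the report's CASE expressions):

```python
from itertools import combinations

def cube_as_union(table, group_cols, agg_expr):
    """Rewrite GROUP BY ... WITH CUBE as a UNION ALL of plain aggregates,
    one SELECT per grouping set; columns missing from a set are padded
    with -1, mirroring the grouping()-based defaults in the report."""
    queries = []
    for r in range(len(group_cols) + 1):
        for subset in combinations(group_cols, r):
            select = ", ".join(
                c if c in subset else f"-1 AS {c}" for c in group_cols
            )
            group_by = f" GROUP BY {', '.join(subset)}" if subset else ""
            queries.append(f"SELECT {select}, {agg_expr} FROM {table}{group_by}")
    return "\nUNION ALL\n".join(queries)

sql = cube_as_union("p_day_advert", ["iFrom", "iSrcId"],
                    "SUM(iclickcnt) AS iclickcnt")
print(sql)  # 4 SELECTs joined by UNION ALL, one per grouping set of 2 columns
```

Each generated SELECT is a lightweight aggregate over the grouping columns it actually needs, which is why the Union plan avoids the 64-way Expand blow-up at the cost of re-reading the input.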
[jira] [Assigned] (SPARK-21963) created temp file should be deleted after use
[ https://issues.apache.org/jira/browse/SPARK-21963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21963: Assignee: (was: Apache Spark) > created temp file should be deleted after use > --- > > Key: SPARK-21963 > URL: https://issues.apache.org/jira/browse/SPARK-21963 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.3.0 >Reporter: caoxuewen > > After you create a temporary table, you need to delete it; otherwise it will > leave behind a file with a name similar to ‘SPARK194465907929586320484966temp’ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21963) created temp file should be deleted after use
[ https://issues.apache.org/jira/browse/SPARK-21963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21963: Assignee: Apache Spark > created temp file should be deleted after use > --- > > Key: SPARK-21963 > URL: https://issues.apache.org/jira/browse/SPARK-21963 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.3.0 >Reporter: caoxuewen >Assignee: Apache Spark > > After you create a temporary table, you need to delete it; otherwise it will > leave behind a file with a name similar to ‘SPARK194465907929586320484966temp’ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21963) created temp file should be deleted after use
[ https://issues.apache.org/jira/browse/SPARK-21963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159892#comment-16159892 ] Apache Spark commented on SPARK-21963: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/19174 > created temp file should be deleted after use > --- > > Key: SPARK-21963 > URL: https://issues.apache.org/jira/browse/SPARK-21963 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.3.0 >Reporter: caoxuewen > > After you create a temporary table, you need to delete it; otherwise it will > leave behind a file with a name similar to ‘SPARK194465907929586320484966temp’ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21963) created temp file should be deleted after use
caoxuewen created SPARK-21963: - Summary: created temp file should be deleted after use Key: SPARK-21963 URL: https://issues.apache.org/jira/browse/SPARK-21963 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 2.3.0 Reporter: caoxuewen After you create a temporary table, you need to delete it; otherwise it will leave behind a file with a name similar to ‘SPARK194465907929586320484966temp’ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
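The fix asked for here amounts to tying the temp file's lifetime to its use. A generic sketch of the create-use-delete pattern (Python rather than the Scala test code; the prefix/suffix merely echo the leaked file name from the report):

```python
import os
import tempfile

# Create a temp file, use it, and guarantee it is removed afterwards —
# the pattern the report asks the tests to follow instead of leaking
# files like 'SPARK194465907929586320484966temp'.
fd, path = tempfile.mkstemp(prefix="SPARK", suffix="temp")
try:
    with os.fdopen(fd, "w") as f:
        f.write("scratch data")
finally:
    os.remove(path)  # clean up even if the write above fails

print(os.path.exists(path))  # False: nothing left behind
```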
[jira] [Resolved] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type
[ https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21954. -- Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Fixed in https://github.com/apache/spark/pull/19167 > JacksonUtils should verify MapType's value type instead of key type > --- > > Key: SPARK-21954 > URL: https://issues.apache.org/jira/browse/SPARK-21954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.2.1, 2.3.0 > > > {{JacksonUtils.verifySchema}} verifies if a data type can be converted to > JSON. For {{MapType}}, it now verifies the key type. However, in > {{JacksonGenerator}}, when converting a map to JSON, we only care about its > values and create a writer for the values. The keys in a map are treated as > strings by calling {{toString}} on the keys. > Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type > of {{MapType}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type
[ https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21954: Assignee: Liang-Chi Hsieh > JacksonUtils should verify MapType's value type instead of key type > --- > > Key: SPARK-21954 > URL: https://issues.apache.org/jira/browse/SPARK-21954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > > {{JacksonUtils.verifySchema}} verifies if a data type can be converted to > JSON. For {{MapType}}, it now verifies the key type. However, in > {{JacksonGenerator}}, when converting a map to JSON, we only care about its > values and create a writer for the values. The keys in a map are treated as > strings by calling {{toString}} on the keys. > Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type > of {{MapType}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
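The key/value asymmetry behind this fix is common to JSON encoders in general: JSON object keys must be strings, so map keys get a toString-style coercion, while only the values need type-aware conversion. A quick illustration with Python's json module (not Spark's JacksonGenerator):

```python
import json

# Non-string map keys are coerced to strings by the encoder, so only the
# value type determines whether the map can be serialized — the point of
# verifying MapType's value type rather than its key type.
encoded = json.dumps({1: [1.5, 2.5], 2: [3.0]}, sort_keys=True)
print(encoded)  # {"1": [1.5, 2.5], "2": [3.0]}
```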
[jira] [Closed] (SPARK-21802) Make sparkR MLP summary() expose probability column
[ https://issues.apache.org/jira/browse/SPARK-21802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu closed SPARK-21802. -- Resolution: Not A Problem > Make sparkR MLP summary() expose probability column > --- > > Key: SPARK-21802 > URL: https://issues.apache.org/jira/browse/SPARK-21802 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Priority: Minor > > Make sparkR MLP summary() expose probability column -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21856: Assignee: (was: Apache Spark) > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the > Python API also needs updating. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159797#comment-16159797 ] Apache Spark commented on SPARK-21856: -- User 'chunshengji' has created a pull request for this issue: https://github.com/apache/spark/pull/19172 > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the > Python API also needs updating. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21856: Assignee: Apache Spark > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Minor > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the > Python API also needs updating. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org