[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r513153232 ## File path: sql/core/src/test/resources/sql-tests/results/window.sql.out ## @@ -479,6 +479,38 @@ Anthony Bow6627Gerard Bondur Leslie Thompson5186Gerard Bondur +-- !query +SELECT +employee_name, +salary, +nth_value(employee_name, 2) OVER ( + ORDER BY salary DESC + ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) second_highest_salary +FROM +basic_pays +ORDER BY salary DESC +-- !query schema +struct +-- !query output +Larry Bott 11798 NULL +Gerard Bondur 11472 Gerard Bondur +Pamela Castillo11303 Gerard Bondur Review comment: The output comes from `hiveResultString` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r513138536 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala ## @@ -151,10 +173,93 @@ final class OffsetWindowFunctionFrame( } inputIndex += 1 } +} - override def currentLowerBound(): Int = throw new UnsupportedOperationException() +/** + * The unbounded offset window frame is an internal window frame just used to optimize the + * performance for the window function that returns the value of the input column offset + * by a number of rows within the partition and has specified ROWS BETWEEN UNBOUNDED PRECEDING + * AND UNBOUNDED FOLLOWING. The internal window frame is not a popular window frame cannot be + * specified and used directly by the users. + * The unbounded offset window frame calculates frames containing NTH_VALUE statements. + * The unbounded offset window frame return the same value for all rows in the window partition. + */ +class UnboundedOffsetWindowFunctionFrame( +target: InternalRow, +ordinal: Int, +expressions: Array[OffsetWindowSpec], +inputSchema: Seq[Attribute], +newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection, +offset: Int) + extends OffsetWindowFunctionFrameBase( +target, ordinal, expressions, inputSchema, newMutableProjection, offset) { - override def currentUpperBound(): Int = throw new UnsupportedOperationException() + override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { +input = rows +if (offset > input.length) { + fillDefaultValue(EmptyRow) +} else { + inputIterator = input.generateIterator() + // drain the first few rows if offset is larger than one + inputIndex = 0 + while (inputIndex < offset - 1) { +if (inputIterator.hasNext) inputIterator.next() +inputIndex += 1 + } + val r = WindowFunctionFrame.getNextOrNull(inputIterator) + projection(r) +} + } + + override def write(index: Int, current: InternalRow): Unit = { +// The results are the same for each row in the partition, and have been evaluated in prepare. +// Don't need to recalculate here. + } +} + +/** + * The unbounded preceding offset window frame is an internal window frame just used to optimize + * the performance for the window function that returns the value of the input column offset + * by a number of rows within the partition and has specified ROWS BETWEEN UNBOUNDED PRECEDING + * AND CURRENT ROW. The internal window frame is not a popular window frame cannot be specified + * and used directly by the users. + * The unbounded preceding offset window frame calculates frames containing NTH_VALUE statements. + * The unbounded preceding offset window frame return the same value for rows which index + * (starting from 1) equal to or greater than offset in the window partition. + */ +class UnboundedPrecedingOffsetWindowFunctionFrame( +target: InternalRow, +ordinal: Int, +expressions: Array[OffsetWindowSpec], +inputSchema: Seq[Attribute], +newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection, +offset: Int) + extends OffsetWindowFunctionFrameBase( +target, ordinal, expressions, inputSchema, newMutableProjection, offset) { + + var selectedRow: UnsafeRow = null + + override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { +input = rows +inputIterator = input.generateIterator() +// drain the first few rows if offset is larger than one +inputIndex = 0 +while (inputIndex < offset - 1) { + if (inputIterator.hasNext) inputIterator.next() + inputIndex += 1 +} +if (inputIndex >= 0 && inputIndex < input.length) { Review comment: Yes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r513133285 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala ## @@ -57,8 +57,12 @@ import org.apache.spark.sql.types.{CalendarIntervalType, DateType, IntegerType, * 3. CURRENT ROW AND 1 FOLLOWING * 4. 1 PRECEDING AND 1 FOLLOWING * 5. 1 FOLLOWING AND 2 FOLLOWING - * - Offset frame: The frame consist of one row, which is an offset number of rows away from the - * current row. Only [[OffsetWindowFunction]]s can be processed in an offset frame. + * - Offset frame: The frame consist of one row, which is an offset number of rows. There are three + * implement of offset frame. + * 1. [[FrameLessOffsetWindowFunction]] returns the value of the input column offset by a number + *of rows according to the current row. + * 2. [[UnboundedOffsetWindowFunctionFrame]] and [[UnboundedPrecedingOffsetWindowFunctionFrame]] + * returns the value of the input column offset by a number of rows within the partition. Review comment: Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r512481955 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala ## @@ -151,10 +169,69 @@ final class OffsetWindowFunctionFrame( } inputIndex += 1 } +} - override def currentLowerBound(): Int = throw new UnsupportedOperationException() +/** + * The unbounded offset window frame calculates frames containing NTH_VALUE statements. + * The unbounded offset window frame return the same value for all rows in the window partition. + */ +class UnboundedOffsetWindowFunctionFrame( +target: InternalRow, +ordinal: Int, +expressions: Array[OffsetWindowSpec], +inputSchema: Seq[Attribute], +newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection, +offset: Int) + extends OffsetWindowFunctionFrameBase( +target, ordinal, expressions, inputSchema, newMutableProjection, offset) { - override def currentUpperBound(): Int = throw new UnsupportedOperationException() + override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { +super.prepare(rows) +if (inputIndex >= 0 && inputIndex < input.length) { + val r = WindowFunctionFrame.getNextOrNull(inputIterator) Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r512478210 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -171,18 +178,42 @@ trait WindowExecBase extends UnaryExecNode { // Create the factory to produce WindowFunctionFrame. val factory = key match { - // Offset Frame - case ("OFFSET", _, IntegerLiteral(offset), _) => + // Frameless offset Frame + case ("FRAME_LESS_OFFSET", _, IntegerLiteral(offset), _) => target: InternalRow => - new OffsetWindowFunctionFrame( + new FrameLessOffsetWindowFunctionFrame( target, ordinal, // OFFSET frame functions are guaranteed be OffsetWindowFunctions. Review comment: OK ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -171,18 +178,42 @@ trait WindowExecBase extends UnaryExecNode { // Create the factory to produce WindowFunctionFrame. val factory = key match { - // Offset Frame - case ("OFFSET", _, IntegerLiteral(offset), _) => + // Frameless offset Frame + case ("FRAME_LESS_OFFSET", _, IntegerLiteral(offset), _) => target: InternalRow => - new OffsetWindowFunctionFrame( + new FrameLessOffsetWindowFunctionFrame( target, ordinal, // OFFSET frame functions are guaranteed be OffsetWindowFunctions. -functions.map(_.asInstanceOf[OffsetWindowFunction]), +functions.map(_.asInstanceOf[OffsetWindowSpec]), child.output, (expressions, schema) => MutableProjection.create(expressions, schema), offset) + case ("UNBOUNDED_OFFSET", _, IntegerLiteral(offset), _) => +target: InternalRow => { + new UnboundedOffsetWindowFunctionFrame( +target, +ordinal, +// OFFSET frame functions are guaranteed be OffsetWindowFunctions. Review comment: OK ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -171,18 +178,42 @@ trait WindowExecBase extends UnaryExecNode { // Create the factory to produce WindowFunctionFrame. val factory = key match { - // Offset Frame - case ("OFFSET", _, IntegerLiteral(offset), _) => + // Frameless offset Frame + case ("FRAME_LESS_OFFSET", _, IntegerLiteral(offset), _) => target: InternalRow => - new OffsetWindowFunctionFrame( + new FrameLessOffsetWindowFunctionFrame( target, ordinal, // OFFSET frame functions are guaranteed be OffsetWindowFunctions. -functions.map(_.asInstanceOf[OffsetWindowFunction]), +functions.map(_.asInstanceOf[OffsetWindowSpec]), child.output, (expressions, schema) => MutableProjection.create(expressions, schema), offset) + case ("UNBOUNDED_OFFSET", _, IntegerLiteral(offset), _) => +target: InternalRow => { + new UnboundedOffsetWindowFunctionFrame( +target, +ordinal, +// OFFSET frame functions are guaranteed be OffsetWindowFunctions. +functions.map(_.asInstanceOf[OffsetWindowSpec]), +child.output, +(expressions, schema) => + MutableProjection.create(expressions, schema), +offset - 1) +} + case ("UNBOUNDED_PRECEDING_OFFSET", _, IntegerLiteral(offset), _) => +target: InternalRow => { + new UnboundedPrecedingOffsetWindowFunctionFrame( +target, +ordinal, +// OFFSET frame functions are guaranteed be OffsetWindowFunctions. Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r512477822 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala ## @@ -58,7 +58,7 @@ import org.apache.spark.sql.types.{CalendarIntervalType, DateType, IntegerType, * 4. 1 PRECEDING AND 1 FOLLOWING * 5. 1 FOLLOWING AND 2 FOLLOWING * - Offset frame: The frame consist of one row, which is an offset number of rows away from the - * current row. Only [[OffsetWindowFunction]]s can be processed in an offset frame. + * current row. Only [[FrameLessOffsetWindowFunction]]s can be processed in an offset frame. Review comment: I will update the doc. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r512468539 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ## @@ -355,6 +344,36 @@ abstract class OffsetWindowFunction */ val offset: Expression + /** + * Default result value for the function when the `offset`th row does not exist. + */ + val default: Expression + + /** + * An optional specification that indicates the offset window function should skip null values in + * the determination of which row to use. + */ + val ignoreNulls: Boolean + + /** + * Whether the offset is starts with the current row. If `isRelative` is true, `offset` means + * the offset is start with the current row. otherwise, the offset is starts with the first + * row of the entire window frame. + */ + val isRelative: Boolean + + lazy val fakeFrame = SpecifiedWindowFrame(RowFrame, offset, offset) +} + +/** + * A frameless offset window function is a window function that cannot specify window frame and + * returns the value of the input column offset by a number of rows within the partition. Review comment: I make a mistake. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r512387980 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala ## @@ -151,10 +169,69 @@ final class OffsetWindowFunctionFrame( } inputIndex += 1 Review comment: `nth_value` not use `FrameLessOffsetWindowFunctionFrame`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r512387456 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala ## @@ -151,10 +169,69 @@ final class OffsetWindowFunctionFrame( } inputIndex += 1 } +} - override def currentLowerBound(): Int = throw new UnsupportedOperationException() +/** + * The unbounded offset window frame calculates frames containing NTH_VALUE statements. + * The unbounded offset window frame return the same value for all rows in the window partition. + */ +class UnboundedOffsetWindowFunctionFrame( +target: InternalRow, +ordinal: Int, +expressions: Array[OffsetWindowSpec], +inputSchema: Seq[Attribute], +newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection, +offset: Int) + extends OffsetWindowFunctionFrameBase( +target, ordinal, expressions, inputSchema, newMutableProjection, offset) { - override def currentUpperBound(): Int = throw new UnsupportedOperationException() + override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { +super.prepare(rows) +if (inputIndex >= 0 && inputIndex < input.length) { + val r = WindowFunctionFrame.getNextOrNull(inputIterator) + projection(r) +} else { + fillDefaultValue(EmptyRow) +} + } + + override def write(index: Int, current: InternalRow): Unit = { +// The results are the same for each row in the partition, and have been evaluated in prepare. +// Don't need to recalculate here. + } +} + +/** + * The unbounded preceding offset window frame calculates frames containing NTH_VALUE statements. + * The unbounded preceding offset window frame return the same value for rows which index + * (starting from 1) equal to or greater than offset in the window partition. + */ +class UnboundedPrecedingOffsetWindowFunctionFrame( +target: InternalRow, +ordinal: Int, +expressions: Array[OffsetWindowSpec], +inputSchema: Seq[Attribute], +newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection, +offset: Int) + extends OffsetWindowFunctionFrameBase( +target, ordinal, expressions, inputSchema, newMutableProjection, offset) { + + var selectedRow: UnsafeRow = null + + override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { +super.prepare(rows) +if (inputIndex >= 0 && inputIndex < input.length) { + selectedRow = WindowFunctionFrame.getNextOrNull(inputIterator) +} + } + + override def write(index: Int, current: InternalRow): Unit = { +if (index >= inputIndex && inputIndex < input.length && selectedRow != null) { Review comment: Thanks! Good eyesight! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r512381721 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -172,17 +179,41 @@ trait WindowExecBase extends UnaryExecNode { // Create the factory to produce WindowFunctionFrame. val factory = key match { // Offset Frame Review comment: OK. I update the comment of `FrameLessOffsetWindowFunctionFrame` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r512376963 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -136,8 +136,15 @@ trait WindowExecBase extends UnaryExecNode { val frame = spec.frameSpecification.asInstanceOf[SpecifiedWindowFrame] function match { case AggregateExpression(f, _, _, _, _) => collect("AGGREGATE", frame, e, f) +case f: FrameLessOffsetWindowFunction => collect("RELATIVE_OFFSET", frame, e, f) Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r512376471 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ## @@ -355,6 +344,35 @@ abstract class OffsetWindowFunction */ val offset: Expression + /** + * Default result value for the function when the `offset`th row does not exist. + */ + val default: Expression + + /** + * An optional specification that indicates the offset window function should skip null values in + * the determination of which row to use. + */ + val ignoreNulls: Boolean + + /** + * Whether the offset is starts with the current row. If `isRelative` is true, `offset` means + * the offset is start with the current row. otherwise, the offset is starts with the first + * row of the entire window frame. + */ + val isRelative: Boolean + + lazy val fakeFrame = SpecifiedWindowFrame(RowFrame, offset, offset) +} + +/** + * An offset window function is a window function that returns the value of the input column offset Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r510808186 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ## @@ -355,6 +344,35 @@ abstract class OffsetWindowFunction */ val offset: Expression + /** + * Default result value for the function when the `offset`th row does not exist. + */ + val default: Expression + + /** + * An optional specification that indicates the offset window function should skip null values in + * the determination of which row to use. + */ + val ignoreNulls: Boolean + + /** + * Whether the offset is starts with the current row. If `isRelative` is true, `offset` means + * the offset is start with the current row. otherwise, the offset is starts with the first + * row of the entire window frame. + */ + val isRelative: Boolean = true + + lazy val fakeFrame = SpecifiedWindowFrame(RowFrame, offset, offset) +} + +/** + * An offset window function is a window function that returns the value of the input column offset + * by a number of rows within the partition. For instance: an OffsetWindowfunction for value x with + * offset -2, will get the value of x 2 rows back in the partition. + */ +abstract class OffsetWindowFunction Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r510791572 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -136,8 +136,15 @@ trait WindowExecBase extends UnaryExecNode { val frame = spec.frameSpecification.asInstanceOf[SpecifiedWindowFrame] function match { case AggregateExpression(f, _, _, _, _) => collect("AGGREGATE", frame, e, f) +case f: OffsetWindowFunction => collect("RELATIVE_OFFSET", frame, e, f) +case f: OffsetWindowSpec if !f.ignoreNulls && + frame.frameType == RowFrame && frame.lower == UnboundedPreceding => + frame.upper match { +case UnboundedFollowing => collect("UNBOUNDED_OFFSET", f.fakeFrame, e, f) +case CurrentRow => collect("UNBOUNDED_PRECEDING_OFFSET", f.fakeFrame, e, f) +case _ => collect("AGGREGATE", frame, e, f) Review comment: OK. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r510781235 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ## @@ -355,6 +344,35 @@ abstract class OffsetWindowFunction */ val offset: Expression + /** + * Default result value for the function when the `offset`th row does not exist. + */ + val default: Expression + + /** + * An optional specification that indicates the offset window function should skip null values in + * the determination of which row to use. + */ + val ignoreNulls: Boolean + + /** + * Whether the offset is starts with the current row. If `isRelative` is true, `offset` means + * the offset is start with the current row. otherwise, the offset is starts with the first + * row of the entire window frame. + */ + val isRelative: Boolean = true Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r509983342 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -136,8 +136,15 @@ trait WindowExecBase extends UnaryExecNode { val frame = spec.frameSpecification.asInstanceOf[SpecifiedWindowFrame] function match { case AggregateExpression(f, _, _, _, _) => collect("AGGREGATE", frame, e, f) +case f: OffsetWindowSpec if f.isRelative => + collect("RELATIVE_OFFSET", frame, e, f) +case f: OffsetWindowSpec if !f.ignoreNulls && + frame.frameType == RowFrame && frame.lower == UnboundedPreceding => + frame.upper match { +case UnboundedFollowing => collect("UNBOUNDED_OFFSET", f.fakeFrame, e, f) +case _ => collect("UNBOUNDED_PRECEDING_OFFSET", f.fakeFrame, e, f) Review comment: I see. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r509957584 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -136,8 +136,15 @@ trait WindowExecBase extends UnaryExecNode { val frame = spec.frameSpecification.asInstanceOf[SpecifiedWindowFrame] function match { case AggregateExpression(f, _, _, _, _) => collect("AGGREGATE", frame, e, f) +case f: OffsetWindowSpec if f.isRelative => Review comment: Good idea. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r509933808 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ## @@ -327,25 +327,14 @@ object WindowFunctionType { } } - -/** - * An offset window function is a window function that returns the value of the input column offset - * by a number of rows within the partition. For instance: an OffsetWindowfunction for value x with - * offset -2, will get the value of x 2 rows back in the partition. - */ -abstract class OffsetWindowFunction - extends Expression with WindowFunction with Unevaluable with ImplicitCastInputTypes { +trait OffsetWindowSpec extends Expression { /** * Input expression to evaluate against a row which a number of rows below or above (depending on - * the value and sign of the offset) the current row. + * the value and sign of the offset) the starting row (current row if isRelative=true, or the + * first row of the window frame otherwise). */ val input: Expression - /** - * Default result value for the function when the `offset`th row does not exist. - */ - val default: Expression Review comment: Because for `Lead` and` Lag`, `default` is the third parameter. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r503782816 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ## @@ -327,38 +327,56 @@ object WindowFunctionType { } } - -/** - * An offset window function is a window function that returns the value of the input column offset - * by a number of rows within the partition. For instance: an OffsetWindowfunction for value x with - * offset -2, will get the value of x 2 rows back in the partition. - */ -abstract class OffsetWindowFunction - extends Expression with WindowFunction with Unevaluable with ImplicitCastInputTypes { +trait OffsetWindowSpec extends Expression { /** * Input expression to evaluate against a row which a number of rows below or above (depending on - * the value and sign of the offset) the current row. + * the value and sign of the offset) the starting row (current row if isRelative=true, or the + * first row of the window frame otherwise). */ val input: Expression + /** + * (Foldable) expression that contains the number of rows between the starting row (current row + * if isRelative=true, or the first row of the window frame otherwise) and the row where the + * input expression is evaluated. + */ + val inputOffset: Expression + /** * Default result value for the function when the `offset`th row does not exist. */ val default: Expression /** - * (Foldable) expression that contains the number of rows between the current row and the row - * where the input expression is evaluated. + * A new expression based on inputOffset, considering the direction of the `offset`. + */ + val offsetExpr: Expression Review comment: OK This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r503179138 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ## @@ -347,18 +341,51 @@ abstract class OffsetWindowFunction val default: Expression /** - * (Foldable) expression that contains the number of rows between the current row and the row - * where the input expression is evaluated. + * (Foldable) expression that contains the number of rows between the starting row (current row + * if isRelative=true, or the first row of the window frame otherwise) and the row where the + * input expression is evaluated. */ - val offset: Expression + val offsetExpr: Expression /** - * Direction of the number of rows between the current row and the row where the input expression - * is evaluated. + * Direction of the number of rows between the current row or the first row of the entire + * window partition and the row where the input expression is evaluated. */ val direction: SortDirection - override def children: Seq[Expression] = Seq(input, offset, default) + /** + * An optional specification that indicates the offset window function should skip null values in + * the determination of which row to use. + */ + val ignoreNulls: Boolean + + /** + * Whether the offset is starts with the current row. If `isRelative` is true, `offset` means + * the offset is start with the current row. otherwise, the offset is starts with the first + * row of the entire window frame. + */ + val isRelative: Boolean = true + + lazy val boundary = direction match { +case Ascending => offsetExpr +case Descending => UnaryMinus(offsetExpr) match { + case e: Expression if e.foldable => Literal.create(e.eval(EmptyRow), e.dataType) + case o => o +} + } + + lazy val fakeFrame = SpecifiedWindowFrame(RowFrame, boundary, boundary) Review comment: Just tell every one that this frame don't play the real role of the frame, just a fake. ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ## @@ -347,18 +341,51 @@ abstract class OffsetWindowFunction val default: Expression /** - * (Foldable) expression that contains the number of rows between the current row and the row - * where the input expression is evaluated. + * (Foldable) expression that contains the number of rows between the starting row (current row + * if isRelative=true, or the first row of the window frame otherwise) and the row where the + * input expression is evaluated. */ - val offset: Expression + val offsetExpr: Expression /** - * Direction of the number of rows between the current row and the row where the input expression - * is evaluated. + * Direction of the number of rows between the current row or the first row of the entire + * window partition and the row where the input expression is evaluated. */ val direction: SortDirection - override def children: Seq[Expression] = Seq(input, offset, default) + /** + * An optional specification that indicates the offset window function should skip null values in + * the determination of which row to use. + */ + val ignoreNulls: Boolean + + /** + * Whether the offset is starts with the current row. If `isRelative` is true, `offset` means + * the offset is start with the current row. otherwise, the offset is starts with the first + * row of the entire window frame. + */ + val isRelative: Boolean = true + + lazy val boundary = direction match { Review comment: I got it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r503178031 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala ## @@ -617,10 +639,16 @@ case class CumeDist() extends RowNumberLike with SizeBasedWindowFunction { group = "window_funcs") // scalastyle:on line.size.limit line.contains.tab case class NthValue(input: Expression, offsetExpr: Expression, ignoreNulls: Boolean) -extends AggregateWindowFunction with ImplicitCastInputTypes { +extends AggregateWindowFunction with OffsetWindowSpec with ImplicitCastInputTypes { def this(child: Expression, offset: Expression) = this(child, offset, false) + override lazy val default = Literal.create(null, input.dataType) + + override val direction = Ascending + + override val isRelative = false + override def children: Seq[Expression] = input :: offsetExpr :: Nil override val frame: WindowFrame = UnspecifiedFrame Review comment: `NthValue` extends `AggregateWindowFunction` and later override frame as: `override val frame: WindowFrame = SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow)` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r502165124 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -136,8 +136,15 @@ trait WindowExecBase extends UnaryExecNode { val frame = spec.frameSpecification.asInstanceOf[SpecifiedWindowFrame] function match { case AggregateExpression(f, _, _, _, _) => collect("AGGREGATE", frame, e, f) +case f: OffsetWindowSpec if f.isRelative => + collect("RELATIVE_OFFSET", frame, e, f) +case f: OffsetWindowSpec if !f.ignoreNulls && + frame.frameType == RowFrame && frame.lower == UnboundedPreceding => + frame.upper match { +case UnboundedFollowing => collect("UNBOUNDED_OFFSET", f.fakeFrame, e, f) Review comment: I use the name `UNBOUNDED_OFFSET` because there is `UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r502165124 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -136,8 +136,15 @@ trait WindowExecBase extends UnaryExecNode { val frame = spec.frameSpecification.asInstanceOf[SpecifiedWindowFrame] function match { case AggregateExpression(f, _, _, _, _) => collect("AGGREGATE", frame, e, f) +case f: OffsetWindowSpec if f.isRelative => + collect("RELATIVE_OFFSET", frame, e, f) +case f: OffsetWindowSpec if !f.ignoreNulls && + frame.frameType == RowFrame && frame.lower == UnboundedPreceding => + frame.upper match { +case UnboundedFollowing => collect("UNBOUNDED_OFFSET", f.fakeFrame, e, f) Review comment: I use the name `UNBOUNDED_OFFSET` because there is unbounded before and after. ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala ## @@ -151,10 +169,69 @@ final class OffsetWindowFunctionFrame( } inputIndex += 1 } +} - override def currentLowerBound(): Int = throw new UnsupportedOperationException() +/** + * The unbounded offset window frame calculates frames containing NTH_VALUE statements. + * The unbounded offset window frame return the same value for all rows in the window partition. + */ +class UnboundedOffsetWindowFunctionFrame( +target: InternalRow, +ordinal: Int, +expressions: Array[OffsetWindowSpec], +inputSchema: Seq[Attribute], +newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection, +offset: Int) + extends OffsetWindowFunctionFrameBase( +target, ordinal, expressions, inputSchema, newMutableProjection, offset) { - override def currentUpperBound(): Int = throw new UnsupportedOperationException() + override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { +super.prepare(rows) +if (inputIndex >= 0 && inputIndex < input.length) { + val r = WindowFunctionFrame.getNextOrNull(inputIterator) + projection(r) +} else { + fillDefaultValue(EmptyRow) +} + } + + override def write(index: Int, current: InternalRow): Unit = { +// The results are the same for each row in the partition, and have been evaluated in prepare. +// Don't need to recalculate here. + } +} + +/** + * The unbounded preceding offset window frame calculates frames containing NTH_VALUE statements. + * The unbounded preceding offset window frame return the same value for rows which index + * (starting from 1) equal to or greater than offset in the window partition. + */ +class UnboundedPrecedingOffsetWindowFunctionFrame( Review comment: Because the current `WindowFunctionFrame` is for data row by row. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r502165838 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala ## @@ -151,10 +169,69 @@ final class OffsetWindowFunctionFrame( } inputIndex += 1 } +} - override def currentLowerBound(): Int = throw new UnsupportedOperationException() +/** + * The unbounded offset window frame calculates frames containing NTH_VALUE statements. + * The unbounded offset window frame return the same value for all rows in the window partition. + */ +class UnboundedOffsetWindowFunctionFrame( +target: InternalRow, +ordinal: Int, +expressions: Array[OffsetWindowSpec], +inputSchema: Seq[Attribute], +newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection, +offset: Int) + extends OffsetWindowFunctionFrameBase( +target, ordinal, expressions, inputSchema, newMutableProjection, offset) { - override def currentUpperBound(): Int = throw new UnsupportedOperationException() + override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = { +super.prepare(rows) +if (inputIndex >= 0 && inputIndex < input.length) { + val r = WindowFunctionFrame.getNextOrNull(inputIterator) + projection(r) +} else { + fillDefaultValue(EmptyRow) +} + } + + override def write(index: Int, current: InternalRow): Unit = { +// The results are the same for each row in the partition, and have been evaluated in prepare. +// Don't need to recalculate here. + } +} + +/** + * The unbounded preceding offset window frame calculates frames containing NTH_VALUE statements. + * The unbounded preceding offset window frame return the same value for rows which index + * (starting from 1) equal to or greater than offset in the window partition. + */ +class UnboundedPrecedingOffsetWindowFunctionFrame( Review comment: Because the current `WindowFunctionFrame` is for data row by row. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #29800: [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
beliefer commented on a change in pull request #29800: URL: https://github.com/apache/spark/pull/29800#discussion_r502165124 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala ## @@ -136,8 +136,15 @@ trait WindowExecBase extends UnaryExecNode { val frame = spec.frameSpecification.asInstanceOf[SpecifiedWindowFrame] function match { case AggregateExpression(f, _, _, _, _) => collect("AGGREGATE", frame, e, f) +case f: OffsetWindowSpec if f.isRelative => + collect("RELATIVE_OFFSET", frame, e, f) +case f: OffsetWindowSpec if !f.ignoreNulls && + frame.frameType == RowFrame && frame.lower == UnboundedPreceding => + frame.upper match { +case UnboundedFollowing => collect("UNBOUNDED_OFFSET", f.fakeFrame, e, f) Review comment: I use the name `UNBOUNDED_OFFSET` because there is unbounded before and after. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org