[jira] [Updated] (SPARK-34989) Improve the performance of mapChildren and withNewChildren methods

2021-04-08 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-34989:
-
Description: 
One of the main performance bottlenecks in query compilation is the overly 
generic tree transformation methods, namely {{mapChildren}} and 
{{withNewChildren}} (defined in {{TreeNode}}). These methods iterate over the 
children generically and rely on reflection to create new instances. We have 
observed that, especially for queries with large query plans, a significant 
number of CPU cycles is wasted in these methods. In this PR we make these 
methods more efficient by delegating the iteration and instantiation to 
concrete node types. The benchmarks show significant improvements in total 
query compilation time for queries with large query plans (30% to 80%) and 
about 20% on average.
h4. Problem detail

The {{mapChildren}} method in {{TreeNode}} is overly generic and costly. To be 
more specific, this method:
 * iterates over all the fields of a node using Scala's product iterator. While 
the iteration is not reflection-based, thanks to the Scala compiler generating 
code for {{Product}}, we create many anonymous functions and visit many nested 
structures (recursive calls).
 The anonymous functions (presumably compiled to Java anonymous inner classes) 
also show up quite high in the object allocation profiles, so we are putting 
unnecessary pressure on the GC here.
 * performs many comparisons. For each element returned by the product 
iterator, we check whether it is a child (i.e., contained in the list of 
children) and only then transform it. We could avoid this by iterating over the 
children alone, but in the current implementation we need to gather all the 
fields (transforming only the children) so that we can instantiate the object 
using reflection.
 * creates objects using reflection, by delegating to the {{makeCopy}} method, 
which is several orders of magnitude slower than calling the constructor 
directly. A simplified sketch of this generic approach follows this list.
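
Below is that sketch: a minimal, self-contained illustration of what the 
generic approach amounts to. It is illustrative only; {{Node}} and 
{{genericMapChildren}} are hypothetical stand-ins, not Spark's actual 
{{TreeNode}} code.

{{case class Node(name: String, children: Seq[Node])}}

{{def genericMapChildren(node: Node, f: Node => Node): Node = {}}
{{  val childSet = node.children.toSet}}
{{  // Visit every constructor field via the Product iterator, not just the}}
{{  // children, and compare each element against the set of children.}}
{{  val newArgs = node.productIterator.map {}}
{{    case s: Seq[_] => s.map {}}
{{      case c: Node if childSet.contains(c) => f(c) // a child: transform it}}
{{      case other => other                          // not a child: keep it}}
{{    }}}
{{    case other => other}}
{{  }.map(_.asInstanceOf[AnyRef]).toArray}}
{{  // Rebuild the node reflectively (a stand-in for makeCopy); this is far}}
{{  // slower than calling the compiler-generated copy method directly.}}
{{  node.getClass.getConstructors.head.newInstance(newArgs: _*)}}
{{    .asInstanceOf[Node]}}
{{}}}

Every call allocates closures, walks all fields, and pays the reflection cost, 
which is exactly the overhead described above.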

h4. Solution

The proposed solution in this PR is rather straightforward: we rewrite the 
{{mapChildren}} method using the {{children}} and {{withNewChildren}} methods. 
The default {{withNewChildren}} method suffers from the same problems as 
{{mapChildren}}, so we make it more efficient by specializing it in concrete 
classes. Similar to how each concrete query plan node already defines its 
children, it should also define how it can be reconstructed given a new list 
of children. In most cases the implementation is a one-liner, thanks to the 
{{copy}} method present in Scala case classes. Note that we cannot abstract 
over the {{copy}} method; it is generated by the compiler for a case class 
only if no other type higher in the hierarchy defines it. For most concrete 
nodes, the implementation of {{withNewChildren}} looks like this:


{{override def withNewChildren(newChildren: Seq[LogicalPlan]): LogicalPlan =}}
{{  copy(children = newChildren)}}

The current {{withNewChildren}} method has two properties that we should 
preserve:
 * It returns the same instance if the provided children are the same as its 
children, i.e., it preserves referential equality.
 * It copies tags and maintains the origin links when a new copy is created.

These properties are hard to enforce in each concrete node type implementation. 
Therefore, we propose a template method {{withNewChildrenInternal}} that is 
overridden by the concrete classes, and we let the {{withNewChildren}} method 
take care of referential equality and tag copying:

{{override def withNewChildren(newChildren: Seq[LogicalPlan]): LogicalPlan = {}}
{{  if (childrenFastEquals(children, newChildren)) {}}
{{    this}}
{{  } else {}}
{{    CurrentOrigin.withOrigin(origin) {}}
{{      val res = withNewChildrenInternal(newChildren)}}
{{      res.copyTagsFrom(this)}}
{{      res}}
{{    }}}
{{  }}}
{{}}}

With the refactoring done in a previous PR 
([#31932|https://github.com/apache/spark/pull/31932]), most tree node types 
fall into one of the {{Leaf}}, {{Unary}}, {{Binary}}, or {{Ternary}} 
categories. These traits have a more efficient implementation of 
{{mapChildren}} and define a more specialized version of 
{{withNewChildrenInternal}} that avoids creating unnecessary lists. For 
example, the {{mapChildren}} method in {{UnaryLike}} is defined as follows:

{{override final def mapChildren(f: T => T): T = {}}
{{  val newChild = f(child)}}
{{  if (newChild fastEquals child) {}}
{{    this.asInstanceOf[T]}}
{{  } else {}}
{{    CurrentOrigin.withOrigin(origin) {}}
{{      val res = withNewChildInternal(newChild)}}
{{      res.copyTagsFrom(this.asInstanceOf[T])}}
{{      res}}
{{    }}}
{{  }}}
{{}}}
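
To see how concrete nodes plug into this, here is a minimal, self-contained 
sketch of the pattern. The {{Plan}}, {{UnaryPlan}}, {{Relation}}, and 
{{Filter}} names are hypothetical stand-ins; Spark's real {{UnaryLike}} 
additionally handles origins and tags, as shown above.

{{trait Plan { def children: Seq[Plan] }}}

{{trait UnaryPlan extends Plan {}}
{{  def child: Plan}}
{{  override def children: Seq[Plan] = child :: Nil}}
{{  // The template method: a concrete node only says how to rebuild itself.}}
{{  def withNewChildInternal(newChild: Plan): Plan}}
{{  final def mapChildren(f: Plan => Plan): Plan = {}}
{{    val newChild = f(child)}}
{{    if (newChild eq child) this // preserve referential equality}}
{{    else withNewChildInternal(newChild) // one constructor call, no reflection}}
{{  }}}
{{}}}

{{case class Relation(name: String) extends Plan {}}
{{  override def children: Seq[Plan] = Nil}}
{{}}}

{{case class Filter(condition: String, child: Plan) extends UnaryPlan {}}
{{  override def withNewChildInternal(newChild: Plan): Plan =}}
{{    copy(child = newChild) // the compiler-generated case class copy}}
{{}}}

With this in place, {{Filter("a > 1", Relation("t")).mapChildren(identity)}} 
returns the original instance unchanged, with no product iteration and no 
reflection.
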
h4. Results

With this PR, we have observed significant performance improvements in query 
compilation time, more specifically in the analysis and optimization phases. 
The biggest speedups, more than 25% in compilation time, are observed in 
TPC-DS queries with large query plans.

[jira] [Created] (SPARK-34989) Improve the performance of mapChildren and withNewChildren methods

2021-04-08 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-34989:


 Summary: Improve the performance of mapChildren and 
withNewChildren methods
 Key: SPARK-34989
 URL: https://issues.apache.org/jira/browse/SPARK-34989
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Ali Afroozeh



[jira] [Updated] (SPARK-34969) Followup for Refactor TreeNode's children handling methods into specialized traits (SPARK-34906)

2021-04-06 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-34969:
-
Summary: Followup for Refactor TreeNode's children handling methods into 
specialized traits (SPARK-34906)  (was: Followup for SPARK-34906)

> Followup for Refactor TreeNode's children handling methods into specialized 
> traits (SPARK-34906)
> 
>
> Key: SPARK-34969
> URL: https://issues.apache.org/jira/browse/SPARK-34969
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Priority: Major
>
> This is a followup for https://issues.apache.org/jira/browse/SPARK-34906
> In this PR we:
>  * Introduce the QuaternaryLike trait for node types with 4 children.
>  * Specialize more node types
>  * Fix a number of style errors that were introduced in the original PR.







[jira] [Updated] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-34906:
-
Description: 
Spark query plan node hierarchy has specialized traits (or abstract classes) 
for handling nodes with fixed number of children, for example UnaryExpression, 
UnaryNode and UnaryExec for representing an expression, a logical plan and a 
physical plan with only one child, respectively. This PR refactors the TreeNode 
hierarchy by extracting the children handling functionality into the following 
traits. UnaryExpression` and other similar classes now extend the corresponding 
new trait:

{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  override final def children: Seq[T] = Nil}}
{{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def child: T}}
{{  @transient override final lazy val children: Seq[T] = child :: Nil}}
{{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def left: T}}
{{  def right: T}}
{{  @transient override final lazy val children: Seq[T] = left :: right :: Nil}}
{{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def first: T}}
{{  def second: T}}
{{  def third: T}}
{{  @transient override final lazy val children: Seq[T] = first :: second :: third :: Nil}}
{{}}}

This refactoring, which is part of a bigger effort to make tree transformations 
in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of it being spread 
across specific subclasses, which will help future optimizations of tree 
traversals.
 * It allows these traits to be mixed into concrete node types that could not 
extend the previous classes. For example, expressions with one child that 
extend {{AggregateFunction}} cannot extend {{UnaryExpression}}, as 
{{AggregateFunction}} declares the {{foldable}} method final while 
{{UnaryExpression}} defines it as non-final. With the new traits, we can mix 
{{UnaryLike}} directly into such concrete classes. Classes with more specific 
child handling will make tree traversal methods faster. A minimal illustration 
follows this list.
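
The following is a hedged, minimal illustration of the second point. The types 
here ({{Expression}}, {{AggregateFunction}}, {{CollectItems}}) are simplified 
stand-ins, not Spark's real hierarchy:

{{abstract class Expression { def foldable: Boolean = false }}}

{{abstract class UnaryExpression extends Expression {}}
{{  def child: Expression}}
{{  override def foldable: Boolean = child.foldable // non-final override}}
{{}}}

{{abstract class AggregateFunction extends Expression {}}
{{  final override def foldable: Boolean = false // final: clashes with UnaryExpression}}
{{}}}

{{trait UnaryLike[T] { def child: T }}}

{{// An aggregate with one child must extend AggregateFunction, so it cannot}}
{{// also extend the UnaryExpression class, but the trait mixes in fine:}}
{{case class CollectItems(child: Expression) extends AggregateFunction}}
{{  with UnaryLike[Expression]}}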

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.



> Refactor TreeNode's children handling methods into specialized traits
> -
>
> Key: SPARK-34906
> URL: https://issues.apache.org/jira/browse/SPARK-34906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ali Afroozeh
>Priority: Major
>


[jira] [Created] (SPARK-34906) Refactor TreeNode's children handling methods into specialized traits

2021-03-30 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-34906:


 Summary: Refactor TreeNode's children handling methods into 
specialized traits
 Key: SPARK-34906
 URL: https://issues.apache.org/jira/browse/SPARK-34906
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Ali Afroozeh


Spark query plan node hierarchy has specialized traits (or abstract classes) for handling nodes with a fixed number of children, for example `UnaryExpression`, `UnaryNode` and `UnaryExec` for representing an expression, a logical plan and a physical plan with only one child, respectively. This PR refactors the `TreeNode` hierarchy by extracting the children handling functionality into the following traits. The former nodes such as `UnaryExpression` now extend the corresponding new trait:


{{trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  override final def children: Seq[T] = Nil}}
{{}}}

{{trait UnaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def child: T}}
{{  @transient override final lazy val children: Seq[T] = child :: Nil}}
{{}}}

{{trait BinaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def left: T}}
{{  def right: T}}
{{  @transient override final lazy val children: Seq[T] = left :: right :: Nil}}
{{}}}

{{trait TernaryLike[T <: TreeNode[T]] { self: TreeNode[T] =>}}
{{  def first: T}}
{{  def second: T}}
{{  def third: T}}
{{  @transient override final lazy val children: Seq[T] = first :: second :: third :: Nil}}
{{}}}

This refactoring, which is part of a bigger effort to make tree transformations in Spark more efficient, has two benefits:
 * It moves the children handling to a single place, instead of spreading it across specific subclasses, which will help future optimizations for tree traversals.
 * It allows these traits to be mixed into concrete node types that could not extend the previous classes. For example, expressions with one child that extend `AggregateFunction` cannot extend `UnaryExpression`, because `AggregateFunction` defines the `foldable` method as final while `UnaryExpression` defines it as non-final. With the new traits, the concrete class can directly extend `UnaryLike` in these cases. Classes with more specific child handling will make tree traversal methods faster.

In this PR we have also updated many concrete node types to extend these traits 
to benefit from more specific child handling.




[jira] [Created] (SPARK-32800) Remove ExpressionSet from the 2.13 branch

2020-09-04 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-32800:


 Summary: Remove ExpressionSet from the 2.13 branch
 Key: SPARK-32800
 URL: https://issues.apache.org/jira/browse/SPARK-32800
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Ali Afroozeh


ExpressionSet does not extend Scala's Set anymore and can therefore be removed from the 2.13 branch. This is a follow-up to https://issues.apache.org/jira/browse/SPARK-32755.






[jira] [Created] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet

2020-08-31 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-32755:


 Summary: Maintain the order of expressions in AttributeSet and 
ExpressionSet 
 Key: SPARK-32755
 URL: https://issues.apache.org/jira/browse/SPARK-32755
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Ali Afroozeh


Expression identity is based on the ExprId, which is an auto-incremented number. This means that the same query can yield a query plan with different expression ids in different runs. AttributeSet and ExpressionSet internally use a HashSet as the underlying data structure and therefore cannot guarantee a fixed order of elements across different runs. This can be problematic in cases where we want to check for plan changes across runs.

We make the following changes to AttributeSet and ExpressionSet to maintain the insertion order of the elements:
 * We change the underlying data structure of AttributeSet from HashSet to LinkedHashSet to maintain the insertion order (see the sketch after this list).
 * ExpressionSet already uses a list to keep track of the expressions; however, since it extends Scala's immutable.Set class, operations such as map and flatMap are delegated to immutable.Set itself. This means that the result of these operations is no longer an instance of ExpressionSet, but rather whatever Set implementation the parent class picks. We remove this inheritance from immutable.Set and implement the needed methods directly; ExpressionSet has very specific semantics, and it does not make sense for it to extend immutable.Set anyway.
 * We change the PlanStabilitySuite to not sort the attributes, so that it can catch changes in the order of expressions across runs.
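
As a small illustration of the LinkedHashSet change (plain Scala collections with made-up element strings, not Spark's AttributeSet):

{{import scala.collection.mutable}}

{{// A HashSet iterates in hash order, which can differ when expression ids differ}}
{{// across runs; a LinkedHashSet always iterates in insertion order.}}
{{object InsertionOrderDemo extends App {}}
{{  val hashed = mutable.HashSet.empty[String]}}
{{  val linked = mutable.LinkedHashSet.empty[String]}}
{{  Seq("c#3", "a#1", "b#2").foreach { e => hashed += e; linked += e }}}
{{  println(hashed.mkString(", "))  // order depends on hashing}}
{{  println(linked.mkString(", "))  // always: c#3, a#1, b#2}}
{{}}}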






[jira] [Updated] (SPARK-31721) Assert optimized plan is initialized before tracking the execution of planning

2020-05-15 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-31721:
-
Description: 
The {{QueryPlanningTracker}} in {{QueryExecution}} reports a planning time that also includes the optimization time. This happens because the {{optimizedPlan}} in {{QueryExecution}} is lazy and is only initialized when first accessed. When {{df.queryExecution.executedPlan}} is called, the tracker starts recording the planning time and then calls the optimized plan, so the planning measurement starts before optimization and therefore includes the optimization time.

This PR fixes this behavior by introducing a method {{assertOptimized}}, similar to {{assertAnalyzed}}, that explicitly initializes the optimized plan. This method is called before measuring the time for {{sparkPlan}} and {{executedPlan}}. We call it before {{sparkPlan}} because that also counts as planning time.
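
A toy sketch of the pattern (made-up names, not {{QueryExecution}} itself): forcing the lazy value before the timed section keeps the expensive initialization out of the measurement.

{{object LazyTimingDemo extends App {}}
{{  lazy val optimizedPlan: String = {}}
{{    Thread.sleep(100)  // stands in for expensive optimization}}
{{    "optimized"}}
{{  }}}

{{  def assertOptimized(): Unit = optimizedPlan  // explicitly force initialization}}
{{  assertOptimized()  // optimization runs outside the timed section}}

{{  val start = System.nanoTime()}}
{{  val executedPlan = optimizedPlan + " -> executed"  // only planning work is timed}}
{{  println(s"planning took ${(System.nanoTime() - start) / 1e6} ms")}}
{{}}}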



> Assert optimized plan is initialized before tracking the execution of planning
> --
>
> Key: SPARK-31721
> URL: https://issues.apache.org/jira/browse/SPARK-31721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Major
>



[jira] [Created] (SPARK-31721) Assert optimized plan is initialized before tracking the execution of planning

2020-05-15 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-31721:


 Summary: Assert optimized plan is initialized before tracking the 
execution of planning
 Key: SPARK-31721
 URL: https://issues.apache.org/jira/browse/SPARK-31721
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ali Afroozeh





[jira] [Created] (SPARK-31719) Refactor JoinSelection

2020-05-15 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-31719:


 Summary: Refactor JoinSelection
 Key: SPARK-31719
 URL: https://issues.apache.org/jira/browse/SPARK-31719
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ali Afroozeh


This PR extracts the logic for selecting the planned join type out of the `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst. This change both cleans up the code in `JoinSelection` and allows the logic to live in one place, where it can be reused by other rules that need to make decisions based on the join type before planning time.






[jira] [Created] (SPARK-31192) Introduce PushProjectThroughLimit

2020-03-19 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-31192:


 Summary: Introduce PushProjectThroughLimit
 Key: SPARK-31192
 URL: https://issues.apache.org/jira/browse/SPARK-31192
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ali Afroozeh


Currently the {{CollapseProject}} rule does many things: not only does it collapse stacked projects, it also pushes projects down into limits, windows, etc. In this PR we factored out of {{CollapseProject}} the rules that were pushing projects into limits and introduced a new rule called {{PushProjectThroughLimit}}.
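
The essence of such a rule can be sketched with toy plan nodes (made-up classes, not Catalyst's): a single rewrite that pushes a Project below a Limit.

{{sealed trait Plan}}
{{case class Scan(table: String) extends Plan}}
{{case class Project(columns: Seq[String], child: Plan) extends Plan}}
{{case class Limit(n: Int, child: Plan) extends Plan}}

{{object PushProjectDemo extends App {}}
{{  def pushProjectThroughLimit(p: Plan): Plan = p match {}}
{{    case Project(cols, Limit(n, child)) =>}}
{{      Limit(n, Project(cols, pushProjectThroughLimit(child)))}}
{{    case Project(cols, child) => Project(cols, pushProjectThroughLimit(child))}}
{{    case Limit(n, child) => Limit(n, pushProjectThroughLimit(child))}}
{{    case leaf => leaf}}
{{  }}}

{{  // Project(List(a),Limit(10,Scan(t))) becomes Limit(10,Project(List(a),Scan(t)))}}
{{  println(pushProjectThroughLimit(Project(Seq("a"), Limit(10, Scan("t")))))}}
{{}}}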






[jira] [Created] (SPARK-30798) Scope Session.active in QueryExecution

2020-02-12 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-30798:


 Summary: Scope Session.active in QueryExecution
 Key: SPARK-30798
 URL: https://issues.apache.org/jira/browse/SPARK-30798
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ali Afroozeh


SparkSession.active is a thread-local variable that points to the current thread's Spark session. It is important to note that the SQLConf.get method depends on SparkSession.active. In the current implementation it is possible that SparkSession.active points to a different session, which causes various problems. Most of these problems arise because part of the query processing is done using the configurations of a different session. For example, when creating a data frame using a new session, i.e., session.sql("..."), part of the data frame is constructed using the currently active Spark session, which can be a different session from the one used later for processing the query.

This PR scopes SparkSession.active to prevent the above-mentioned problems. A new method, withActive, is introduced on SparkSession; it restores the previously active Spark session after the given block of code is executed.
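
A minimal sketch of the withActive pattern (a toy thread-local holder with made-up names, not SparkSession's real implementation):

{{object ActiveSession {}}
{{  private val active = new ThreadLocal[String]  // stands in for a SparkSession}}
{{  def current: String = active.get()}}

{{  def withActive[A](session: String)(block: => A): A = {}}
{{    val previous = active.get()}}
{{    active.set(session)}}
{{    try block}}
{{    finally active.set(previous)  // restore even if the block throws}}
{{  }}}
{{}}}

{{object WithActiveDemo extends App {}}
{{  ActiveSession.withActive("sessionA") {}}
{{    println(ActiveSession.current)  // sessionA}}
{{    ActiveSession.withActive("sessionB") { println(ActiveSession.current) }  // sessionB}}
{{    println(ActiveSession.current)  // back to sessionA}}
{{  }}}
{{}}}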






[jira] [Updated] (SPARK-30072) Create dedicated planner for subqueries

2019-11-28 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-30072:
-
Description: 
This PR changes subquery planning by calling the planner and plan preparation rules on the subquery plan directly. Before, we were creating a QueryExecution instance for subqueries to get the executedPlan, which would re-run analysis and optimization on the subquery's plan. Running the analysis again on an optimized query plan can have unwanted consequences, as some rules, for example DecimalPrecision, are not idempotent.

As an example, consider the expression 1.7 * avg(a), which after applying the DecimalPrecision rule becomes:

promote_precision(1.7) * promote_precision(avg(a))

After the optimization, more specifically the constant folding rule, this expression becomes:

1.7 * promote_precision(avg(a))

Now if we run the analyzer on this optimized query again, we will get:

promote_precision(1.7) * promote_precision(promote_precision(avg(a)))

which will later be optimized to:

1.7 * promote_precision(promote_precision(avg(a)))

As can be seen, re-running the analysis and optimization on this expression results in an expression with extra nested promote_precision nodes. Adding unneeded nodes to the plan is problematic because it can eliminate situations where we can reuse the plan.

We opted to introduce dedicated planners for subqueries, instead of making the DecimalPrecision rule idempotent, because this eliminates this entire category of problems. Another benefit is that planning time for subqueries is reduced.
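
The effect can be reproduced with a toy expression tree (made-up case classes and rules; Spark's DecimalPrecision is more involved):

{{sealed trait Expr}}
{{case class Lit(v: Double) extends Expr}}
{{case class Avg(col: String) extends Expr}}
{{case class Promote(child: Expr) extends Expr}}
{{case class Mul(left: Expr, right: Expr) extends Expr}}

{{object NonIdempotentRuleDemo extends App {}}
{{  // "Analysis": wrap both sides of a multiplication, like DecimalPrecision does.}}
{{  def addPromotions(e: Expr): Expr = e match {}}
{{    case Mul(l, r) => Mul(Promote(addPromotions(l)), Promote(addPromotions(r)))}}
{{    case other => other}}
{{  }}}

{{  // "Constant folding": promoting a literal is a no-op, so drop the wrapper.}}
{{  def fold(e: Expr): Expr = e match {}}
{{    case Promote(l: Lit) => l}}
{{    case Mul(l, r) => Mul(fold(l), fold(r))}}
{{    case Promote(c) => Promote(fold(c))}}
{{    case other => other}}
{{  }}}

{{  val optimized = fold(addPromotions(Mul(Lit(1.7), Avg("a"))))}}
{{  println(optimized)  // Mul(Lit(1.7),Promote(Avg(a)))}}
{{  println(addPromotions(optimized))  // Mul(Promote(Lit(1.7)),Promote(Promote(Avg(a))))}}
{{}}}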

> Create dedicated planner for subqueries
> ---
>
> Key: SPARK-30072
> URL: https://issues.apache.org/jira/browse/SPARK-30072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>






[jira] [Updated] (SPARK-28836) Remove the canonicalize(attributes) method from PlanExpression

2019-08-23 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-28836:
-
Summary: Remove the canonicalize(attributes) method from PlanExpression  
(was: Remove the canonicalize() method )

> Remove the canonicalize(attributes) method from PlanExpression
> --
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>



[jira] [Updated] (SPARK-28836) Remove the canonicalize() method

2019-08-23 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-28836:
-
Summary: Remove the canonicalize() method   (was: Improve canonicalize API)

> Remove the canonicalize() method 
> -
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>



[jira] [Updated] (SPARK-28836) Improve canonicalize API

2019-08-23 Thread Ali Afroozeh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-28836:
-
Description: 
The canonicalize(attrs: AttributeSeq) method in PlanExpression is somewhat confusing. First, it is not clear why `PlanExpression` should have this method, and why the canonicalization is not handled by the canonicalized method of its parent, the Expression class. Second, QueryPlan.normalizeExpressionId is the only place where PlanExpression.canonicalized is being called.

This PR removes the canonicalize method from the PlanExpression class and delegates the normalization of expression ids to the QueryPlan.normalizedExpressionId method. Also, the name normalizedExpressions is more suitable for this method; therefore, the method has also been renamed.



> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>






[jira] [Created] (SPARK-28836) Introduce TPCDSSchema

2019-08-21 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-28836:


 Summary: Introduce TPCDSSchema
 Key: SPARK-28836
 URL: https://issues.apache.org/jira/browse/SPARK-28836
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ali Afroozeh


This PR extracts the schema information of the TPCDS tables into a separate class called `TPCDSSchema`, which can be reused for other testing purposes.






[jira] [Created] (SPARK-28835) Improve canonicalize API

2019-08-21 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-28835:


 Summary: Improve canonicalize API
 Key: SPARK-28835
 URL: https://issues.apache.org/jira/browse/SPARK-28835
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ali Afroozeh


This PR improves the `canonicalize` API by removing the method `def 
canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and 
taking care of normalizing expressions in `QueryPlan`.






[jira] [Updated] (SPARK-28716) Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans

2019-08-13 Thread Ali Afroozeh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-28716:
-
Description: 
Add an id to Exchange and Subquery's stringArgs method to make it easier to identify their reuses in query plans, for example:

{{ReusedExchange [d_date_sk#827], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) [id=#2710]}}

where {{2710}} is the id of the reused exchange.



> Add id to Exchange and Subquery's stringArgs method for easier identifying 
> their reuses in query plans
> --
>
> Key: SPARK-28716
> URL: https://issues.apache.org/jira/browse/SPARK-28716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ali Afroozeh
>Priority: Minor
>






[jira] [Updated] (SPARK-28715) Introduce collectInPlanAndSubqueries and subqueriesAll in QueryPlan

2019-08-13 Thread Ali Afroozeh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-28715:
-
Summary: Introduce collectInPlanAndSubqueries and subqueriesAll in 
QueryPlan  (was: Introduce `collectInPlanAndSubqueries` and `subqueriesAll` in 
`QueryPlan`)

> Introduce collectInPlanAndSubqueries and subqueriesAll in QueryPlan
> ---
>
> Key: SPARK-28715
> URL: https://issues.apache.org/jira/browse/SPARK-28715
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ali Afroozeh
>Priority: Minor
>



[jira] [Created] (SPARK-28715) Introduce `collectInPlanAndSubqueries` and `subqueriesAll` in `QueryPlan`

2019-08-13 Thread Ali Afroozeh (JIRA)
Ali Afroozeh created SPARK-28715:


 Summary: Introduce `collectInPlanAndSubqueries` and 
`subqueriesAll` in `QueryPlan`
 Key: SPARK-28715
 URL: https://issues.apache.org/jira/browse/SPARK-28715
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ali Afroozeh


Introduces the {{collectInPlanAndSubqueries}} and {{subqueriesAll}} methods in QueryPlan that consider all the plans in the query plan, including the ones in nested subqueries.
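
The idea can be sketched with a toy node type (made-up Node class, not QueryPlan's actual API):

{{case class Node(name: String, children: Seq[Node] = Nil, subqueries: Seq[Node] = Nil)}}

{{object CollectDemo extends App {}}
{{  // All nodes of a plan, including those inside nested subqueries at any depth.}}
{{  def allNodes(p: Node): Seq[Node] = p +: (p.children ++ p.subqueries).flatMap(allNodes)}}

{{  def collectInPlanAndSubqueries[A](p: Node)(pf: PartialFunction[Node, A]): Seq[A] =}}
{{    allNodes(p).collect(pf)}}

{{  val plan = Node("Filter", children = Seq(Node("Scan t1")),}}
{{    subqueries = Seq(Node("Aggregate", children = Seq(Node("Scan t2")))))}}
{{  println(collectInPlanAndSubqueries(plan) { case n if n.name.startsWith("Scan") => n.name })}}
{{  // List(Scan t1, Scan t2)}}
{{}}}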


