[jira] [Updated] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-16 Thread Michel Davit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michel Davit updated SPARK-23986:
-
Description: 
Consider the following code:
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg
// Needed for .toDF on an RDD of tuples; sparkSession is an existing SparkSession val.
import sparkSession.implicits._

val df1: DataFrame = sparkSession.sparkContext
  .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
  .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")

val df2: DataFrame = sparkSession.sparkContext
  .makeRDD(Seq((0, "val1", "val2")))
  .toDF("key", "dummy1", "dummy2")

// Six avg aggregations after the left outer join trigger the failure.
val agg = df1
  .join(df2, df1("key") === df2("key"), "leftouter")
  .groupBy(df1("key"))
  .agg(
    avg("col2").as("avg2"),
    avg("col3").as("avg3"),
    avg("col4").as("avg4"),
    avg("col1").as("avg1"),
    avg("col5").as("avg5"),
    avg("col6").as("avg6")
  )

val head = agg.take(1)
{code}
This logs the following exception:
{code:java}
ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 467, Column 28: Redefinition of parameter "agg_expr_11"
{code}
I am not a Spark expert, but after investigating I found that the generated 
{{doConsume}} method is responsible for the exception.

Indeed, {{avg}} calls 
{{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
several times: the first time with the 'avg' Expr and a second time for the 
base aggregation Expr (count and sum).

The problem comes from the generation of parameters in CodeGenerator:
{code:java}
  /**
   * Returns a term name that is unique within this instance of a `CodegenContext`.
   */
  def freshName(name: String): String = synchronized {
    val fullName = if (freshNamePrefix == "") {
      name
    } else {
      s"${freshNamePrefix}_$name"
    }
    if (freshNameIds.contains(fullName)) {
      val id = freshNameIds(fullName)
      freshNameIds(fullName) = id + 1
      s"$fullName$id"
    } else {
      freshNameIds += fullName -> 1
      fullName
    }
  }
{code}
The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the first call.
The second call is made with {{agg_expr_[1..12]}} and generates the following 
names:
{{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
conflict in the generated code: {{agg_expr_11}}. It is produced once from 
{{agg_expr_1}} with the id 1 appended, and once as the first use of the name 
{{agg_expr_11}} itself.

Appending the id directly, as in s"$fullName$id", to generate a unique term 
name is the source of the conflict. Maybe simply using an underscore can solve 
this issue: s"${fullName}_$id"
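
The collision described above can be reproduced outside Spark with a small stand-alone sketch. It is not Spark code: {{NameGenerator}}, {{FreshNameDemo}}, and the {{separator}} parameter are hypothetical names introduced only for illustration, and the class copies just the {{freshName}} logic quoted above. The call pattern (six names first, then twelve) is taken from the analysis in this description, not from tracing the real code generator.

```scala
import scala.collection.mutable

// Simplified stand-in for the quoted freshName logic.
// `separator` is a hypothetical knob used only to compare the two naming schemes.
final class NameGenerator(prefix: String, separator: String) {
  private val freshNameIds = mutable.HashMap.empty[String, Int]

  def freshName(name: String): String = synchronized {
    val fullName = if (prefix == "") name else s"${prefix}_$name"
    freshNameIds.get(fullName) match {
      case Some(id) =>
        // Name seen before: bump the counter and append the old id.
        freshNameIds(fullName) = id + 1
        s"$fullName$separator$id"
      case None =>
        // First use: register the name and return it unchanged.
        freshNameIds += fullName -> 1
        fullName
    }
  }
}

object FreshNameDemo {
  // The first call registers expr_1..expr_6; the second asks for expr_1..expr_12.
  def secondCallNames(separator: String): Seq[String] = {
    val gen = new NameGenerator("agg", separator)
    (1 to 6).foreach(i => gen.freshName(s"expr_$i"))
    (1 to 12).map(i => gen.freshName(s"expr_$i"))
  }

  // Names that occur more than once, sorted for stable output.
  def duplicates(names: Seq[String]): Seq[String] =
    names.groupBy(identity)
      .collect { case (n, occurrences) if occurrences.size > 1 => n }
      .toSeq
      .sorted

  def main(args: Array[String]): Unit = {
    // Current scheme, id appended directly: prints agg_expr_11
    println(duplicates(secondCallNames("")).mkString(", "))
    // Underscore before the id: no duplicates, prints an empty line
    println(duplicates(secondCallNames("_")).mkString(", "))
  }
}
```

With the empty separator, {{agg_expr_1}} becomes {{agg_expr_11}} and then clashes with the first use of {{agg_expr_11}}. The underscore variant avoids the clash for this input, but a caller that itself passes a name ending in {{_1}} could still collide, so it may not be a complete fix.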


> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: 

[jira] [Updated] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-16 Thread Michel Davit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michel Davit updated SPARK-23986:
-
Attachment: spark-generated.java

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Michel Davit
> Priority: Major
> Attachments: spark-generated.java
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-15 Thread Michel Davit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michel Davit updated SPARK-23986:
-
Priority: Major  (was: Minor)

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Michel Davit
> Priority: Major
>


