[jira] [Updated] (SPARK-44759) Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-13 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Component/s: Deploy
 Spark Core

> Do not combine multiple Generate operators in the same WholeStageCodeGen node 
> because it can  easily cause OOM failures if arrays are relatively large
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Optimizer, Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> This has been an issue since the WSCG implementation of the Generate node. 
> Because WSCG computes rows in batches, combining WSCG with the explode 
> operation consumes a lot of the executor's dedicated memory. This is even 
> more true when the WSCG node contains multiple explode nodes, which is the 
> case when flattening a nested array.
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input.
> The number of output rows is drastically higher still when flattening a 
> nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1- As you can see from the snapshots added in the comments, the rows created 
> in the nested loop are saved in a writer buffer. In this case, because the 
> rows were big, the job failed with an Out Of Memory error.
> 2- The generated WholeStageCodeGen code results in a nested loop that, for 
> each row, explodes the parent array and then explodes the inner array. The 
> rows accumulate in the writer buffer without any accounting for row size.
> Please view the attached Spark GUI and Spark DAG screenshots.
> In my case the WholeStageCodeGen node includes 2 explode nodes.
> Because the array elements are large, we end up with an Out Of Memory error.
>  
> I recommend that we do not merge multiple explode nodes into the same whole 
> stage code gen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the WSCG 
> code executed a nested for loop.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Description: 
The code sample below showcases the WholeStageCodeGen generated when exploding 
nested arrays. The data sample in the dataframe is quite small, so it will not 
trigger the Out Of Memory error.

However, if the array is larger and the rows are large, you will definitely 
end up with an OOM error.
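
To illustrate "larger", here is a hypothetical input along the same lines; the 
sizes below are invented for illustration only and are not figures from the 
original report:

// Hypothetical larger input: wide strings inside 100 x 100 nested arrays,
// so the rows buffered by the fused explodes become large.
val big = spark.range(1000).select(
  col("id").as("c1"),
  array((0 until 100).map(_ =>
    array((0 until 100).map(_ => lit("x" * 1000)): _*)): _*).as("c2"))
val bigExploded = big
  .select(col("c1"), explode(col("c2")))
  .select(col("c1"), explode(col("col")))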

 

Consider a scenario where you flatten a nested array.

// e.g. you can use the following steps to create the dataframe

//create a partClass case class
case class partClass(PARTNAME: String, PartNumber: String, PartPrice: Double)

//create a case class with nested (array-of-array) fields
case class array_array_class(
  col_int: Int,
  arr_arr_string: Seq[Seq[String]],
  arr_arr_bigint: Seq[Seq[Long]],
  col_string: String,
  parts_s: Seq[Seq[partClass]]
)

//create a dataframe
var df_array_array = sc.parallelize(
  Seq(
    (1, Seq(Seq("a","b","c","d"), Seq("aa","bb","cc","dd")),
      Seq(Seq(1000,2), Seq(3,-1)), "ItemPart1",
      Seq(Seq(partClass("PNAME1","P1",20.75), partClass("PNAME1_1","P1_1",30.75)))),
    (2, Seq(Seq("ab","bc","cd","de"), Seq("aab","bbc","ccd","dde"), Seq("aab")),
      Seq(Seq(-1000,-2,-1,-2), Seq(0,3,-1)), "ItemPart2",
      Seq(Seq(partClass("PNAME2","P2",170.75), partClass("PNAME2_1","P2_1",33.75), partClass("PNAME2_2","P2_2",73.75))))
  )
).toDF("c1","c2","c3","c4","c5")

//explode a nested array
var result = df_array_array
  .select(col("c1"), explode(col("c2")))
  .select('c1, explode('col))

result.explain

 

The physical plan for this query is shown below.

-
Physical plan 

== Physical Plan ==
*(1) Generate explode(col#27), [c1#17], false, [col#30]
+- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
   +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
      +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
         +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
            +- *(1) SerializeFromObject [...] (remainder of the serializer expression truncated in the archive)
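
To inspect the fused code behind this plan, the codegen explain mode 
(available in Spark 3.0 and later) prints the generated sources; both 
Generate (explode) operators appear inside the same *(1) stage:

// Print the generated WholeStageCodeGen sources for the query above.
result.explain("codegen")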

[jira] [Commented] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753035#comment-17753035
 ] 

Franck Tago commented on SPARK-44768:
-

!image-2023-08-10-20-32-55-684.png!

> Improve WSCG handling of row buffer by accounting for executor memory  .  
> Exploding nested arrays can easily lead to out of memory errors. 
> ---
>
> Key: SPARK-44768
> URL: https://issues.apache.org/jira/browse/SPARK-44768
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-20-32-55-684.png, 
> spark-jira_wscg_code.txt
>
>
> Consider a scenario where you flatten a nested array.
> // e.g. you can use the following steps to create the dataframe
> //create a partClass case class
> case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
> Double )
> //create a nested array array class
> case  class array_array_class (
>  col_int: Int,
>  arr_arr_string : Seq[Seq[String]],
>  arr_arr_bigint : Seq[Seq[Long]],
>  col_string     : String,
>  parts_s        : Seq[Seq[partClass]]
>  
> )
> //create a dataframe
> var df_array_array = sc.parallelize(
>  Seq(
>  (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
> Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
>   Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
>  ) ,
>  
>  (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
> ,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
> Seq(0,3,-1)),"ItemPart2",
>   
> Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
>  )
>   
>  )
> ).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")
> //explode a nested array
> var result = df_array_array.select(col("c1"), explode(col("c2"))).select('c1, explode('col))
> result.explain
>  
> The physical plan for this query is shown below.
> -
> Physical plan 
> == Physical Plan ==
> *(1) Generate explode(col#27), [c1#17], false, [col#30]
> +- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
>    +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
>       +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
>          +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
>             +- *(1) SerializeFromObject [...] (remainder of the serializer expression truncated in the archive)

[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Attachment: image-2023-08-10-20-32-55-684.png

> Improve WSCG handling of row buffer by accounting for executor memory  .  
> Exploding nested arrays can easily lead to out of memory errors. 
> ---
>
> Key: SPARK-44768
> URL: https://issues.apache.org/jira/browse/SPARK-44768
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-20-32-55-684.png, 
> spark-jira_wscg_code.txt
>
>
> Consider a scenario where you flatten a nested array.
> // e.g. you can use the following steps to create the dataframe
> //create a partClass case class
> case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
> Double )
> //create a nested array array class
> case  class array_array_class (
>  col_int: Int,
>  arr_arr_string : Seq[Seq[String]],
>  arr_arr_bigint : Seq[Seq[Long]],
>  col_string     : String,
>  parts_s        : Seq[Seq[partClass]]
>  
> )
> //create a dataframe
> var df_array_array = sc.parallelize(
>  Seq(
>  (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
> Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
>   Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
>  ) ,
>  
>  (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
> ,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
> Seq(0,3,-1)),"ItemPart2",
>   
> Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
>  )
>   
>  )
> ).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")
> //explode a nested array
> var result = df_array_array.select(col("c1"), explode(col("c2"))).select('c1, explode('col))
> result.explain
>  
> The physical plan for this query is shown below.
> -
> Physical plan 
> == Physical Plan ==
> *(1) Generate explode(col#27), [c1#17], false, [col#30]
> +- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
>    +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
>       +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
>          +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
>             +- *(1) SerializeFromObject [...] (remainder of the serializer expression truncated in the archive)

[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Description: 
Consider a scenario where you flatten a nested array.

// e.g. you can use the following steps to create the dataframe

//create a partClass case class
case class partClass(PARTNAME: String, PartNumber: String, PartPrice: Double)

//create a case class with nested (array-of-array) fields
case class array_array_class(
  col_int: Int,
  arr_arr_string: Seq[Seq[String]],
  arr_arr_bigint: Seq[Seq[Long]],
  col_string: String,
  parts_s: Seq[Seq[partClass]]
)

//create a dataframe
var df_array_array = sc.parallelize(
  Seq(
    (1, Seq(Seq("a","b","c","d"), Seq("aa","bb","cc","dd")),
      Seq(Seq(1000,2), Seq(3,-1)), "ItemPart1",
      Seq(Seq(partClass("PNAME1","P1",20.75), partClass("PNAME1_1","P1_1",30.75)))),
    (2, Seq(Seq("ab","bc","cd","de"), Seq("aab","bbc","ccd","dde"), Seq("aab")),
      Seq(Seq(-1000,-2,-1,-2), Seq(0,3,-1)), "ItemPart2",
      Seq(Seq(partClass("PNAME2","P2",170.75), partClass("PNAME2_1","P2_1",33.75), partClass("PNAME2_2","P2_2",73.75))))
  )
).toDF("c1","c2","c3","c4","c5")

//explode a nested array
var result = df_array_array
  .select(col("c1"), explode(col("c2")))
  .select('c1, explode('col))

result.explain

 

The physical plan for this query is shown below.

-
Physical plan 

== Physical Plan ==
*(1) Generate explode(col#27), [c1#17], false, [col#30]
+- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
   +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
      +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
         +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
            +- *(1) SerializeFromObject [...] (remainder of the serializer expression truncated in the archive)
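
As a sketch of the improvement this issue asks for, the writer buffer could 
track an estimated byte size and flush before exceeding a memory budget 
instead of growing unbounded. This is purely illustrative; the class name and 
flush strategy below are assumptions, not Spark internals:

// Illustrative only: a row buffer that estimates bytes held and flushes
// downstream before it exceeds a fixed memory budget.
final class SizeAwareRowBuffer(maxBytes: Long, flush: Seq[Array[Byte]] => Unit) {
  private val rows = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
  private var held = 0L
  def append(row: Array[Byte]): Unit = {
    if (rows.nonEmpty && held + row.length > maxBytes) {
      flush(rows.toSeq); rows.clear(); held = 0L  // spill before overflowing
    }
    rows += row
    held += row.length
  }
  def close(): Unit = if (rows.nonEmpty) { flush(rows.toSeq); rows.clear(); held = 0L }
}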

[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Attachment: spark-jira_wscg_code.txt

> Improve WSCG handling of row buffer by accounting for executor memory  .  
> Exploding nested arrays can easily lead to out of memory errors. 
> ---
>
> Key: SPARK-44768
> URL: https://issues.apache.org/jira/browse/SPARK-44768
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: spark-jira_wscg_code.txt
>
>
> Consider a scenario where you flatten a nested array.
> // e.g. you can use the following steps to create the dataframe
> //create a partClass case class
> case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
> Double )
> //create a nested array array class
> case  class array_array_class (
>  col_int: Int,
>  arr_arr_string : Seq[Seq[String]],
>  arr_arr_bigint : Seq[Seq[Long]],
>  col_string     : String,
>  parts_s        : Seq[Seq[partClass]]
>  
> )
> //create a dataframe
> var df_array_array = sc.parallelize(
>  Seq(
>  (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
> Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
>   Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
>  ) ,
>  
>  (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
> ,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
> Seq(0,3,-1)),"ItemPart2",
>   
> Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
>  )
>   
>  )
> ).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")
> //explode a nested array
> var result = df_array_array.select(col("c1"), explode(col("c2"))).select('c1, explode('col))
> result.explain
>  
> The physical plan for this query is shown below.
> -
> Physical plan 
> == Physical Plan ==
> *(1) Generate explode(col#27), [c1#17], false, [col#30]
> +- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
>    +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
>       +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
>          +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
>             +- *(1) SerializeFromObject [...] (remainder of the serializer expression truncated in the archive)
> 

[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44768:

Summary: Improve WSCG handling of row buffer by accounting for executor 
memory  .  Exploding nested arrays can easily lead to out of memory errors.   
(was: Improve WSCG handling of row buffer by accounting for executor memory)

> Improve WSCG handling of row buffer by accounting for executor memory  .  
> Exploding nested arrays can easily lead to out of memory errors. 
> ---
>
> Key: SPARK-44768
> URL: https://issues.apache.org/jira/browse/SPARK-44768
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
>
> Consider a scenario where you flatten a nested array.
> // e.g. you can use the following steps to create the dataframe
> //create a partClass case class
> case class partClass (PARTNAME: String , PartNumber: String , PartPrice : 
> Double )
> //create a nested array array class
> case  class array_array_class (
>  col_int: Int,
>  arr_arr_string : Seq[Seq[String]],
>  arr_arr_bigint : Seq[Seq[Long]],
>  col_string     : String,
>  parts_s        : Seq[Seq[partClass]]
>  
> )
> //create a dataframe
> var df_array_array = sc.parallelize(
>  Seq(
>  (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , 
> Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1",
>   Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75)))
>  ) ,
>  
>  (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" 
> ,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), 
> Seq(0,3,-1)),"ItemPart2",
>   
> Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75)))
>  )
>   
>  )
> ).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5")
> //explode a nested array
> var result = df_array_array.select(col("c1"), explode(col("c2"))).select('c1, explode('col))
> result.explain
>  
> The physical plan for this query is shown below.
> -
> Physical plan 
> == Physical Plan ==
> *(1) Generate explode(col#27), [c1#17], false, [col#30]
> +- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
>    +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
>       +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
>          +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
>             +- *(1) SerializeFromObject [...] (remainder of the serializer expression truncated in the archive)
> 

[jira] [Created] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory

2023-08-10 Thread Franck Tago (Jira)
Franck Tago created SPARK-44768:
---

 Summary: Improve WSCG handling of row buffer by accounting for 
executor memory
 Key: SPARK-44768
 URL: https://issues.apache.org/jira/browse/SPARK-44768
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.4.1, 3.4.0, 3.3.2
Reporter: Franck Tago


Consider a scenario where you flatten a nested array.

// e.g. you can use the following steps to create the dataframe

//create a partClass case class
case class partClass(PARTNAME: String, PartNumber: String, PartPrice: Double)

//create a case class with nested (array-of-array) fields
case class array_array_class(
  col_int: Int,
  arr_arr_string: Seq[Seq[String]],
  arr_arr_bigint: Seq[Seq[Long]],
  col_string: String,
  parts_s: Seq[Seq[partClass]]
)

//create a dataframe
var df_array_array = sc.parallelize(
  Seq(
    (1, Seq(Seq("a","b","c","d"), Seq("aa","bb","cc","dd")),
      Seq(Seq(1000,2), Seq(3,-1)), "ItemPart1",
      Seq(Seq(partClass("PNAME1","P1",20.75), partClass("PNAME1_1","P1_1",30.75)))),
    (2, Seq(Seq("ab","bc","cd","de"), Seq("aab","bbc","ccd","dde"), Seq("aab")),
      Seq(Seq(-1000,-2,-1,-2), Seq(0,3,-1)), "ItemPart2",
      Seq(Seq(partClass("PNAME2","P2",170.75), partClass("PNAME2_1","P2_1",33.75), partClass("PNAME2_2","P2_2",73.75))))
  )
).toDF("c1","c2","c3","c4","c5")

//explode a nested array
var result = df_array_array
  .select(col("c1"), explode(col("c2")))
  .select('c1, explode('col))

result.explain

 

The physical plan for this query is shown below.

-
Physical plan 

== Physical Plan ==
*(1) Generate explode(col#27), [c1#17], false, [col#30]
+- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
   +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
      +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
         +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
            +- *(1) SerializeFromObject [...] (remainder of the serializer expression truncated in the archive)
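
To make the row amplification concrete, counting input versus output rows on 
this tiny sample shows the effect of the two stacked explodes (the counts in 
the comments are what this sample yields):

// Each Generate multiplies the row count: the 2 input rows become 17 output
// rows (8 leaf strings under row 1's c2, 9 under row 2's).
val inputRows  = df_array_array.count()   // 2
val outputRows = result.count()           // 17
println(s"amplification: ${outputRows.toDouble / inputRows}x")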

[jira] [Updated] (SPARK-44759) Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
This has been an issue since the WSCG implementation of the Generate node.

Because WSCG computes rows in batches, combining WSCG with the explode 
operation consumes a lot of the executor's dedicated memory. This is even more 
true when the WSCG node contains multiple explode nodes, which is the case 
when flattening a nested array.

The Generate node used to flatten an array generally produces significantly 
more output rows than it receives as input.

The number of output rows is drastically higher still when flattening a 
nested array.

When we combine more than one Generate node in the same WholeStageCodeGen 
node, we run a high risk of running out of memory, for multiple reasons:

1- As you can see from the snapshots added in the comments, the rows created 
in the nested loop are saved in a writer buffer. In this case, because the 
rows were big, the job failed with an Out Of Memory error.

2- The generated WholeStageCodeGen code results in a nested loop that, for 
each row, explodes the parent array and then explodes the inner array. The 
rows accumulate in the writer buffer without any accounting for row size.
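
For intuition, here is a runnable Scala sketch of the shape of that fused 
loop. This is an analogy, not the actual generated code; all names are 
invented for illustration:

// Analogy for the fused codegen: one pass over the input drives two nested
// explode loops, and every produced row lands in an in-memory buffer with
// no size check.
object FusedExplodeSketch {
  def main(args: Array[String]): Unit = {
    val input: Seq[(Int, Seq[Seq[String]])] =
      Seq((1, Seq(Seq("a", "b"), Seq("c", "d"))))
    val writerBuffer = scala.collection.mutable.ArrayBuffer.empty[(Int, String)]
    for ((id, nested) <- input)        // input rows
      for (inner <- nested)            // first explode (outer array)
        for (elem <- inner)            // second explode (inner array)
          writerBuffer += ((id, elem)) // rows accumulate unchecked
    println(writerBuffer.mkString(", "))
  }
}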

Please view the attached Spark GUI and Spark DAG screenshots.

In my case the WholeStageCodeGen node includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

 

I recommend that we do not merge multiple explode nodes into the same whole 
stage code gen node. Doing so leads to potential memory issues.

In our case, the job execution failed with an OOM error because the WSCG code 
executed a nested for loop.
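
Until the planner stops fusing them, one blunt mitigation is to turn 
whole-stage codegen off, so each Generate runs as a separate iterator-based 
operator. This is only a workaround sketch, and it disables codegen for the 
whole session:

// Workaround sketch: disable whole-stage codegen so the two explodes are no
// longer compiled into one fused nested loop.
spark.conf.set("spark.sql.codegen.wholeStage", "false")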

 

 

 

  was:
This has been an issue since the WSCG implementation of the Generate node.

The Generate node used to flatten an array generally produces significantly 
more output rows than it receives as input.

The number of output rows is drastically higher still when flattening a 
nested array.

When we combine more than one Generate node in the same WholeStageCodeGen 
node, we run a high risk of running out of memory, for multiple reasons:

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit 
an Out Of Memory error. 2- The generated WholeStageCodeGen code results in a 
nested loop that, for each row, explodes the parent array and then explodes 
the inner array. This is prone to OutOfMemory errors.

 

Please view the attached Spark GUI and Spark DAG screenshots.

In my case the WholeStageCodeGen node includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes into the same whole 
stage code gen node. Doing so leads to potential memory issues.

In our case, the job execution failed with an OOM error because the WSCG code 
executed a nested for loop.

 

 

 


> Do not combine multiple Generate operators in the same WholeStageCodeGen node 
> because it can  easily cause OOM failures if arrays are relatively large
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> This has been an issue since the WSCG implementation of the Generate node.
> Because WSCG computes rows in batches, combining WSCG with the explode 
> operation consumes a lot of the executor's dedicated memory. This is even 
> more true when the WSCG node contains multiple explode nodes, which is the 
> case when flattening a nested array.
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input.
> The number of output rows is drastically higher still when flattening a 
> nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1- As you can see from the snapshots added in the comments, the rows created 
> in the nested loop are saved in a writer buffer. In this case 

[jira] [Updated] (SPARK-44759) Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Summary: Do not combine multiple Generate operators in the same 
WholeStageCodeGen node because it can  easily cause OOM failures if arrays are 
relatively large  (was: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen node because it can  easily cause OOM failures if arrays are 
relatively large)

> Do not combine multiple Generate operators in the same WholeStageCodeGen node 
> because it can  easily cause OOM failures if arrays are relatively large
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> This has been an issue since the WSCG implementation of the Generate node.
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input.
> The number of output rows is drastically higher still when flattening a 
> nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1- As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case, because the rows were big, I hit 
> an Out Of Memory error. 2- The generated WholeStageCodeGen code results in a 
> nested loop that, for each row, explodes the parent array and then explodes 
> the inner array. This is prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG screenshots.
> In my case the WholeStageCodeGen node includes 2 explode nodes.
> Because the array elements are large, we end up with an Out Of Memory error.
>  
> I recommend that we do not merge multiple explode nodes into the same whole 
> stage code gen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the WSCG 
> code executed a nested for loop.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Summary: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen node because it can  easily cause OOM failures if arrays are 
relatively large  (was: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen nodebecause it can  easily cause OOM failures if arrays are 
relatively large)

> Do not combine multiple Generate nodes in the same WholeStageCodeGen node 
> because it can  easily cause OOM failures if arrays are relatively large
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> This has been an issue since the WSCG implementation of the Generate node.
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input.
> The number of output rows is drastically higher still when flattening a 
> nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1- As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case, because the rows were big, I hit 
> an Out Of Memory error. 2- The generated WholeStageCodeGen code results in a 
> nested loop that, for each row, explodes the parent array and then explodes 
> the inner array. This is prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG screenshots.
> In my case the WholeStageCodeGen node includes 2 explode nodes.
> Because the array elements are large, we end up with an Out Of Memory error.
>  
> I recommend that we do not merge multiple explode nodes into the same whole 
> stage code gen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the WSCG 
> code executed a nested for loop.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
This has been an issue since the WSCG implementation of the Generate node.

The Generate node used to flatten an array generally produces significantly 
more output rows than it receives as input.

The number of output rows is drastically higher still when flattening a 
nested array.

When we combine more than one Generate node in the same WholeStageCodeGen 
node, we run a high risk of running out of memory, for multiple reasons:

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit 
an Out Of Memory error. 2- The generated WholeStageCodeGen code results in a 
nested loop that, for each row, explodes the parent array and then explodes 
the inner array. This is prone to OutOfMemory errors.

 

Please view the attached Spark GUI and Spark DAG screenshots.

In my case the WholeStageCodeGen node includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes into the same whole 
stage code gen node. Doing so leads to potential memory issues.

In our case, the job execution failed with an OOM error because the WSCG code 
executed a nested for loop.

 

 

 

  was:
The Generate node used to flatten an array generally produces significantly 
more output rows than it receives as input.

The number of output rows is drastically higher still when flattening a 
nested array.

When we combine more than one Generate node in the same WholeStageCodeGen 
node, we run a high risk of running out of memory, for multiple reasons:

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit 
an Out Of Memory error. 2- The generated WholeStageCodeGen code results in a 
nested loop that, for each row, explodes the parent array and then explodes 
the inner array. This is prone to OutOfMemory errors.

 

Please view the attached Spark GUI and Spark DAG screenshots.

In my case the WholeStageCodeGen node includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes into the same whole 
stage code gen node. Doing so leads to potential memory issues.

In our case, the job execution failed with an OOM error because the WSCG code 
executed a nested for loop.

 

 

 


> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> This has been an issue since the WSCG implementation of the Generate node.
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input.
> The number of output rows is drastically higher still when flattening a 
> nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1- As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case, because the rows were big, I hit 
> an Out Of Memory error. 2- The generated WholeStageCodeGen code results in a 
> nested loop that, for each row, explodes the parent array and then explodes 
> the inner array. This is prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG screenshots.
> In my case the WholeStageCodeGen node includes 2 explode nodes.
> Because the array elements are large, we end up with an Out Of Memory error.
>  
> I recommend that we do not merge multiple explode nodes into the same whole 
> stage code gen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the WSCG 
> code executed 

[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Affects Version/s: 3.4.1
   3.4.0
   3.3.2
   3.2.4
   3.2.3
   3.2.2
   3.2.1
   3.1.3
   3.2.0
   3.1.2
   3.1.1
   3.1.0
   3.0.3
   3.0.2
   3.0.1
   3.0.0

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> nodebecause it can  easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>
>
> The Generate node used to flatten an array generally produces significantly 
> more output rows than it receives as input.
> The number of output rows is drastically higher still when flattening a 
> nested array.
> When we combine more than one Generate node in the same WholeStageCodeGen 
> node, we run a high risk of running out of memory, for multiple reasons:
> 1- As you can see from the attachment, the rows created in the nested loop 
> are saved in the writer buffer. In my case, because the rows were big, I hit 
> an Out Of Memory error. 2- The generated WholeStageCodeGen code results in a 
> nested loop that, for each row, explodes the parent array and then explodes 
> the inner array. This is prone to OutOfMemory errors.
>  
> Please view the attached Spark GUI and Spark DAG screenshots.
> In my case the WholeStageCodeGen node includes 2 explode nodes.
> Because the array elements are large, we end up with an Out Of Memory error.
>  
> I recommend that we do not merge multiple explode nodes into the same whole 
> stage code gen node. Doing so leads to potential memory issues.
> In our case, the job execution failed with an OOM error because the WSCG 
> code executed a nested for loop.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
The Generate node used to flatten an array generally produces significantly 
more output rows than it receives as input.

The number of output rows is drastically higher still when flattening a 
nested array.

When we combine more than one Generate node in the same WholeStageCodeGen 
node, we run a high risk of running out of memory, for multiple reasons:

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit 
an Out Of Memory error. 2- The generated WholeStageCodeGen code results in a 
nested loop that, for each row, explodes the parent array and then explodes 
the inner array. This is prone to OutOfMemory errors.

 

Please view the attached Spark Gui and Spark Dag 

In my case the wholestagecodegen includes 2 explode nodes. 

Because the array elements are large , we end up with an Out Of Memory error. 

 

I recommend that we do not merge  multiple explode nodes in the same whole 
stage code gen node . Doing so leads to potential memory issues.

In our case , the job execution failed with an  OOM error because the the WSCG 
executed  into a nested for loop . 

 

 

 

  was:
The Generate node used to flatten an array generally produces an amount of output 
rows that is significantly higher than the input rows to the node. 

The number of output rows is even drastically higher when flattening 
a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons. 

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit an 
Out Of Memory Exception error. 2- The generated WholeStageCodeGen results in a 
nested loop that, for each row, will explode the parent array and then explode 
the inner array. This is prone to OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes in the same whole-stage 
codegen node. Doing so leads to potential memory issues.

WSCG method for first Generate node

!image-2023-08-10-09-22-29-949.png!

WSCG for second Generate node 

As you can see, the execution of generate_doConsume_1 and generate_doConsume_0 
triggers a nested loop.

!image-2023-08-10-09-24-02-755.png!

 


> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
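
To make the nested-loop concern above concrete, here is a rough hand-written 
sketch of the control flow the fused stage effectively compiles to. This is not 
the actual generated Java (see the attached 
wholestagecodegen_wc1_debug_wholecodegen_passed file for that), and 
buildOutputRow is a hypothetical stand-in for the projection the real generated 
code performs:

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the row projection done by the generated code.
def buildOutputRow(id: Int, value: Any): (Int, Any) = (id, value)

// One loop per fused Generate node: every input row fans out into
// (outer size) x (inner size) output rows, all appended to the writer
// buffer before anything downstream consumes them, with no check on row size.
def processRow(id: Int, outer: Seq[Seq[Any]], buffer: ArrayBuffer[(Int, Any)]): Unit =
  for (inner <- outer; value <- inner)
    buffer += buildOutputRow(id, value)
{code}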


[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
The Generate node used to flatten an array generally produces an amount of output 
rows that is significantly higher than the input rows to the node. 

The number of output rows is even drastically higher when flattening 
a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons. 

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit an 
Out Of Memory Exception error. 2- The generated WholeStageCodeGen results in a 
nested loop that, for each row, will explode the parent array and then explode 
the inner array. This is prone to OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes in the same whole-stage 
codegen node. Doing so leads to potential memory issues.

WSCG method for first Generate node

!image-2023-08-10-09-22-29-949.png!

WSCG for second Generate node 

As you can see, the execution of generate_doConsume_1 and generate_doConsume_0 
triggers a nested loop.

!image-2023-08-10-09-24-02-755.png!

 

  was:
The Generate node used to flatten an array generally produces an amount of output 
rows that is significantly higher than the input rows to the node. 

The number of output rows is even drastically higher when flattening 
a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons. 

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit an 
Out Of Memory Exception error. 2- The generated WholeStageCodeGen results in a 
nested loop that, for each row, will explode the parent array and then explode 
the inner array. This is prone to OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes in the same whole-stage 
codegen node. Doing so leads to potential memory issues.

!image-2023-08-10-02-18-24-758.png!

!image-2023-08-10-09-19-23-973.png!

!image-2023-08-10-09-21-32-371.png!

WSCG method for first Generate node

!image-2023-08-10-09-22-29-949.png!

WSCG for second Generate node 

As you can see, the execution of generate_doConsume_1 and generate_doConsume_0 
triggers a nested loop.

!image-2023-08-10-09-24-02-755.png!

 


> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>

[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752870#comment-17752870
 ] 

Franck Tago commented on SPARK-44759:
-

Spark DAG for the use case. The failure is from the execution of 
WholeStageCodeGen(2).

!image-2023-08-10-09-33-47-788.png!

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
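
To dump the same generated code locally instead of reading it off the attached 
screenshots, Spark's debug helper can be used; flattened below is a placeholder 
for the failing double-explode DataFrame:

{code:scala}
// Prints the Java source produced by whole-stage codegen for each stage,
// including the generate_doConsume_0 / generate_doConsume_1 methods
// discussed above.
import org.apache.spark.sql.execution.debug._

flattened.debugCodegen()
{code}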



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: image-2023-08-10-09-33-47-788.png

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> image-2023-08-10-09-33-47-788.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752868#comment-17752868
 ] 

Franck Tago commented on SPARK-44759:
-

WSCG generated code for the second Generate node 

!image-2023-08-10-09-32-46-163.png!

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: image-2023-08-10-09-32-46-163.png

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: image-2023-08-10-09-29-24-804.png

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752866#comment-17752866
 ] 

Franck Tago commented on SPARK-44759:
-

WSCG generated code for the first Generate node 

!image-2023-08-10-09-29-24-804.png!

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> image-2023-08-10-09-29-24-804.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752864#comment-17752864
 ] 

Franck Tago edited comment on SPARK-44759 at 8/10/23 4:28 PM:
--

WSCG generated code that calls generate_doConsume_0

!image-2023-08-10-09-27-24-124.png!


was (Author: tafra...@gmail.com):
!image-2023-08-10-09-27-24-124.png!

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: image-2023-08-10-09-27-24-124.png

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752864#comment-17752864
 ] 

Franck Tago commented on SPARK-44759:
-

!image-2023-08-10-09-27-24-124.png!

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2023-08-10-09-27-24-124.png, 
> wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Summary: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen node because it can easily cause OOM failures if arrays are 
relatively large  (was: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen because it can easily cause OOM failures)

> Do not combine multiple Generate nodes in the same WholeStageCodeGen 
> node because it can easily cause OOM failures if arrays are relatively large
> -
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
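
Until the optimizer stops fusing multiple Generate nodes, one pragmatic 
workaround sketch is to fall back to interpreted execution for the affected 
query, trading CPU for memory; flattened is again a placeholder for the 
double-explode DataFrame:

{code:scala}
// Disabling whole-stage codegen keeps the two Generate nodes from being
// compiled into one nested loop. The setting is session-wide, so scope it
// narrowly around the affected action.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
val rowCount = flattened.count()  // or whatever action the job runs
spark.conf.set("spark.sql.codegen.wholeStage", "true")
{code}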



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Description: 
The Generate node used to flatten an array generally produces an amount of output 
rows that is significantly higher than the input rows to the node. 

The number of output rows is even drastically higher when flattening 
a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons. 

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit an 
Out Of Memory Exception error. 2- The generated WholeStageCodeGen results in a 
nested loop that, for each row, will explode the parent array and then explode 
the inner array. This is prone to OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes in the same whole-stage 
codegen node. Doing so leads to potential memory issues.

!image-2023-08-10-02-18-24-758.png!

!image-2023-08-10-09-19-23-973.png!

!image-2023-08-10-09-21-32-371.png!

WSCG method for first Generate node

!image-2023-08-10-09-22-29-949.png!

WSCG for second Generate node 

As you can see, the execution of generate_doConsume_1 and generate_doConsume_0 
triggers a nested loop.

!image-2023-08-10-09-24-02-755.png!

 

  was:
The Generate node used to flatten an array generally produces an amount of output 
rows that is significantly higher than the input rows to the node. 

The number of output rows is even drastically higher when flattening 
a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons. 

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit an 
Out Of Memory Exception error. 2- The generated WholeStageCodeGen results in a 
nested loop that, for each row, will explode the parent array and then explode 
the inner array. This is prone to OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes in the same whole-stage 
codegen node. Doing so leads to potential memory issues.

!image-2023-08-10-02-18-24-758.png!

 


> Do not combine multiple Generate nodes in the same WholeStageCodeGen because 
> it can easily cause OOM failures
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: (was: spark-verbosewithcodegenenabled)

> Do not combine multiple Generate nodes in the same WholeStageCodeGen because 
> it can easily cause OOM failures
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: wholestagecodegen_wc1_debug_wholecodegen_passed

> Do not combine multiple Generate nodes in the same WholeStageCodeGen because 
> it can easily cause OOM failures
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: wholestagecodegen_wc1_debug_wholecodegen_passed
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-44759:

Attachment: spark-verbosewithcodegenenabled

> Do not combine multiple Generate nodes in the same WholeStageCodeGen because 
> it can easily cause OOM failures
> --
>
> Key: SPARK-44759
> URL: https://issues.apache.org/jira/browse/SPARK-44759
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: spark-verbosewithcodegenenabled
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures

2023-08-10 Thread Franck Tago (Jira)
Franck Tago created SPARK-44759:
---

 Summary: Do not combine multiple Generate nodes in the same 
WholeStageCodeGen because it can easily cause OOM failures
 Key: SPARK-44759
 URL: https://issues.apache.org/jira/browse/SPARK-44759
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.3.1, 3.3.0
Reporter: Franck Tago


The Generate node used to flatten an array generally produces an amount of output 
rows that is significantly higher than the input rows to the node. 

The number of output rows is even drastically higher when flattening 
a nested array.

When we combine more than one Generate node in the same WholeStageCodeGen node, 
we run a high risk of running out of memory, for multiple reasons. 

1- As you can see from the attachment, the rows created in the nested loop 
are saved in the writer buffer. In my case, because the rows were big, I hit an 
Out Of Memory Exception error. 2- The generated WholeStageCodeGen results in a 
nested loop that, for each row, will explode the parent array and then explode 
the inner array. This is prone to OutOfMemory errors.

Please view the attached Spark GUI and Spark DAG.

In my case the WholeStageCodeGen includes 2 explode nodes.

Because the array elements are large, we end up with an Out Of Memory error.

I recommend that we do not merge multiple explode nodes in the same whole-stage 
codegen node. Doing so leads to potential memory issues.

 

!image-2023-08-10-02-18-24-758.png!

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15214) Implement code generation for Generate

2022-05-02 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531008#comment-17531008
 ] 

Franck Tago commented on SPARK-15214:
-

It appears that code generation is not supported for nested Generate cases? 

That is, if I call the explode function in order to flatten a nested array. 

> Implement code generation for Generate
> --
>
> Key: SPARK-15214
> URL: https://issues.apache.org/jira/browse/SPARK-15214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 2.2.0
>
>
> {{Generate}} currently does not support code generation. Let's add CG support 
> for it and its most important generators: {{explode}} and {{json_tuple}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
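
One way to check this empirically is to ask Spark for the per-stage generated 
code of a nested explode; a small sketch (column names and data are made up):

{code:scala}
import spark.implicits._   // already in scope in spark-shell
import org.apache.spark.sql.functions.explode

val nested = Seq((1, Seq(Seq(1, 2), Seq(3)))).toDF("id", "arr")

// "codegen" mode dumps the generated code for each WholeStageCodegen stage;
// a Generate operator left outside every stage is running in interpreted mode.
nested
  .select($"id", explode($"arr").as("inner"))
  .select($"id", explode($"inner").as("value"))
  .explain("codegen")
{code}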



[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-26 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916334#comment-16916334
 ] 

Franck Tago edited comment on SPARK-23519 at 8/27/19 4:10 AM:
--

[~viirya]

My mistake, I tested it with Oracle and MySQL. I then assumed that Hive would 
honor the same. 

You are correct. Hive does not appear to support this after all. 

I need to test this on newer versions of Hive. 


was (Author: tafra...@gmail.com):
[~viirya]

My mistake, I tested it with Oracle and MySQL. I then assumed that Hive would 
honor the same. 

You are correct. Hive does not appear to support this after all. 

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- Create and populate a Hive table. I did this in a Hive CLI session. [Not 
> that this matters.]
> create table atable (col1 int);
> insert into atable values (10), (100);
> 2- Create a view from the table. 
> [These actions were performed from a spark shell.]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
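
Since the assertion fires on the duplicate output names of the underlying query 
rather than on the view-level column list, one workaround sketch is to alias 
the duplicated columns inside the SELECT so that the query output names are 
unique before the (int1, int2) renaming is applied:

{code:scala}
// Fails: the query output is (col1, col1), which trips the assertion.
// spark.sql("create view default.aview (int1, int2) as select col1, col1 from atable")

// Workaround sketch: alias the duplicates to distinct names first.
spark.sql(
  "create view default.aview (int1, int2) as select col1 as c1, col1 as c2 from atable")
{code}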



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-26 Thread Franck Tago (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916334#comment-16916334
 ] 

Franck Tago commented on SPARK-23519:
-

[~viirya]

My mistake, I tested it with Oracle and MySQL. I then assumed that Hive would 
honor the same. 

You are correct. Hive does not appear to support this after all. 

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-14 Thread Franck Tago (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago reopened SPARK-23519:
-

OK Spark Community, 

I am sorry for being a pest about this, but I am re-opening this Jira because I 
really believe that it should be addressed. 

Right now I do not have any way of satisfying my customer's requirement. 

My current use case is the following. 

My customer can provide any custom Hive query. I am oblivious to the actual 
content of the query, and parsing the query is not an option.

All I know is the number of fields projected from the custom query and the 
types of those fields. 

I do not know the names of the fields projected from the custom query.

What I currently do with Spark SQL is run a query of the form: 

Create view view_name 

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-05-10 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470843#comment-16470843
 ] 

Franck Tago commented on SPARK-23519:
-

I do not agree with the 'typical database' claim.

MySQL, Oracle, and Hive support this syntax.

Example:

!image-2018-05-10-10-48-57-259.png!

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-05-10 Thread Franck Tago (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-23519:

Attachment: image-2018-05-10-10-48-57-259.png

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-04-18 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443154#comment-16443154
 ] 

Franck Tago commented on SPARK-23519:
-

Thanks for the suggestion [~shahid].

The issue with your suggestion is that I dynamically generate the create view statement; moreover, the select statement is opaque to me because it is provided by the customer.

It would be nice if Spark could fix such a simple case.
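
For the record, a rough DataFrame-side workaround (an untested sketch; note that a temp view is session-scoped, unlike a persisted Hive view):

{code}
// Untested sketch: rename the opaque query's columns positionally so the
// output names are unique, then expose the result as a session-scoped view.
val df = spark.sql("select col1, col1 from atable")   // opaque customer query
df.toDF("int1", "int2").createOrReplaceTempView("aview")
{code}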

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Critical
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-03-20 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389150#comment-16389150
 ] 

Franck Tago edited comment on SPARK-23519 at 3/21/18 3:29 AM:
--

Any updates on this?

Could someone assist with this?


was (Author: tafra...@gmail.com):
Any updates on this ?

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Critical
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-03-20 Thread Franck Tago (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-23519:

Description: 
1- create and populate a hive table  . I did this in a hive cli session .[ not 
that this matters ]

create table  atable (col1 int) ;

insert  into atable values (10 ) , (100)  ;

2. create a view from the table.  

[These actions were performed from a spark shell ]

spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
from atable ")
 java.lang.AssertionError: assertion failed: The view output (col1,col1) 
contains duplicate column name.
 at scala.Predef$.assert(Predef.scala:170)
 at 
org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
 at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
 at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
 at org.apache.spark.sql.Dataset.(Dataset.scala:183)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)

  was:
1- create and populate a hive table  . I did this in a hive cli session .[ not 
that this matters ]

create table  atable (col1 int) ;

insert  into atable values (10 ) , (100)  ;

2. create a view form the table.   [ I did this from a spark shell ]

spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
from atable ")
java.lang.AssertionError: assertion failed: The view output (col1,col1) 
contains duplicate column name.
 at scala.Predef$.assert(Predef.scala:170)
 at 
org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
 at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
 at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
 at org.apache.spark.sql.Dataset.(Dataset.scala:183)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)


> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Critical
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-03-20 Thread Franck Tago (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-23519:

Component/s: SQL

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Critical
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view form the table.   [ I did this from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
> java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-03-06 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389150#comment-16389150
 ] 

Franck Tago commented on SPARK-23519:
-

Any updates on this ?

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Critical
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view form the table.   [ I did this from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
> java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file

2018-02-26 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378046#comment-16378046
 ] 

Franck Tago commented on SPARK-19809:
-

Just to confirm, your earlier comment referred to Spark 2.1.1. You meant Spark 2.2.1, right?

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: image-2018-02-26-20-29-49-410.png, 
> spark.sql.hive.convertMetastoreOrc.txt
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at 

[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file

2018-02-26 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378028#comment-16378028
 ] 

Franck Tago commented on SPARK-19809:
-

1- I am kind of constrained to Spark 2.2.1 at the moment.

2- My understanding is that the only difference in Spark 2.3.0 is that spark.sql.hive.convertMetastoreOrc defaults to true.

I looked at [https://github.com/apache/spark/pull/19948] and [https://github.com/apache/spark/pull/19960]. Am I missing anything?
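
For anyone reproducing this, a minimal sketch of flipping the flag by hand on 2.2.x:

{code}
// Minimal sketch: enable the converted ORC read path on Spark 2.2.x;
// per the discussion above, 2.3.0 flips this default to true.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
{code}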

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: image-2018-02-26-20-29-49-410.png, 
> spark.sql.hive.convertMetastoreOrc.txt
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> 

[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file

2018-02-26 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378018#comment-16378018
 ] 

Franck Tago commented on SPARK-19809:
-

I need a pointer on the following.

Env: Spark 2.2.1

1- I set the property spark.sql.hive.convertMetastoreOrc to true.

2- My Hive table has the following schema:

CREATE TABLE `ft_orc`(
 `int` int,
 `double` double,
 `big+int` bigint,
 `$tring` string,
 `(decimal)` decimal(15,8),
 `flo@t` float,
 `datetime` date,
 `timestamp` timestamp,
 `01` int)
 CLUSTERED BY (
 `int`)
 INTO 20 BUCKETS
 ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
 WITH SERDEPROPERTIES (
 'field.delim'=',',
 'serialization.format'=',')
 STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
 OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' ;

 

I loaded the table with 1 row of data:

!image-2018-02-26-20-29-49-410.png!

I tried to run the following simple statement:

scala> var res =spark.sql(" SELECT alias.`int` as a0, alias.`double` as a1, 
alias.`big+int` as a2, alias.`$tring` as a3, CAST(alias.`(decimal)` AS DOUBLE) 
as a4, CAST(alias.`flo@t` AS DOUBLE) as a5, CAST(alias.`datetime` AS TIMESTAMP) 
as a6, alias.`timestamp` as a7, alias.`01` as a8 FROM default.ft_orc alias" )
18/02/27 04:30:57 WARN HiveConf: HiveConf of name hive.conf.hidden.list does 
not exist
18/02/27 04:30:57 WARN HiveConf: HiveConf of name hive.conf.hidden.list does 
not exist
java.lang.IndexOutOfBoundsException
 at java.nio.Buffer.checkIndex(Buffer.java:540)
 at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
 at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:374)
 at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:316)
 at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:187)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$getFileReader$2.apply(OrcFileOperator.scala:68)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$getFileReader$2.apply(OrcFileOperator.scala:67)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at 
scala.collection.TraversableOnce$class.collectFirst(TraversableOnce.scala:145)
 at scala.collection.AbstractIterator.collectFirst(Iterator.scala:1336)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:69)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at scala.collection.immutable.List.foreach(List.scala:381)
 at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
 at scala.collection.immutable.List.flatMap(List.scala:344)

 

Any pointers?

Should I file a separate Jira?
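
(For context, a quick untested sketch I use to check whether a table location contains the zero-byte files this issue is about; the warehouse path is a hypothetical placeholder, and only top-level files are listed:)

{code}
// Untested sketch: list zero-byte files under the table location, since the
// parent issue traces the failure back to 0-byte ORC files.
import org.apache.hadoop.fs.{FileSystem, Path}
val location = new Path("/apps/hive/warehouse/ft_orc")   // hypothetical path
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(location).filter(_.getLen == 0).foreach(f => println(f.getPath))
{code}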

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: image-2018-02-26-20-29-49-410.png
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at 

[jira] [Updated] (SPARK-19809) NullPointerException on zero-size ORC file

2018-02-26 Thread Franck Tago (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-19809:

Attachment: spark.sql.hive.convertMetastoreOrc.txt

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: image-2018-02-26-20-29-49-410.png, 
> spark.sql.hive.convertMetastoreOrc.txt
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at 

[jira] [Updated] (SPARK-19809) NullPointerException on zero-size ORC file

2018-02-26 Thread Franck Tago (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-19809:

Attachment: image-2018-02-26-20-29-49-410.png

> NullPointerException on zero-size ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1
>Reporter: Michał Dawid
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: image-2018-02-26-20-29-49-410.png
>
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> 

[jira] [Created] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-02-26 Thread Franck Tago (JIRA)
Franck Tago created SPARK-23519:
---

 Summary: Create View Commands Fails with  The view output 
(col1,col1) contains duplicate column name
 Key: SPARK-23519
 URL: https://issues.apache.org/jira/browse/SPARK-23519
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Franck Tago


1- create and populate a hive table  . I did this in a hive cli session .[ not 
that this matters ]

create table  atable (col1 int) ;

insert  into atable values (10 ) , (100)  ;

2. create a view form the table.   [ I did this from a spark shell ]

spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
from atable ")
java.lang.AssertionError: assertion failed: The view output (col1,col1) 
contains duplicate column name.
 at scala.Predef$.assert(Predef.scala:170)
 at 
org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
 at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
 at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
 at org.apache.spark.sql.Dataset.(Dataset.scala:183)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20235) Hive on S3 s3:sse and non S3:sse buckets

2017-05-16 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013290#comment-16013290
 ] 

Franck Tago commented on SPARK-20235:
-

Was this comment meant for me?
What does that mean?

> Hive on S3  s3:sse and non S3:sse buckets
> -
>
> Key: SPARK-20235
> URL: https://issues.apache.org/jira/browse/SPARK-20235
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Franck Tago
>Priority: Minor
>
> my spark application writes into 2 hive tables .
> both tables are external with data residing on S3
> I want to encrypt the data when writing into hive table1 ,  but I do not want 
> to encrypt the data when writing into hive table 2.
> given that the parameter fs.s3a.server-side-encryption-algorithm is set 
> globally , I do not see how these use cases are supported in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20235) Hive on S3 s3:sse and non S3:sse buckets

2017-04-06 Thread Franck Tago (JIRA)
Franck Tago created SPARK-20235:
---

 Summary: Hive on S3  s3:sse and non S3:sse buckets
 Key: SPARK-20235
 URL: https://issues.apache.org/jira/browse/SPARK-20235
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Franck Tago
Priority: Minor


My Spark application writes into 2 Hive tables.
Both tables are external, with data residing on S3.

I want to encrypt the data when writing into Hive table 1, but I do not want to encrypt the data when writing into Hive table 2.

Given that the parameter fs.s3a.server-side-encryption-algorithm is set globally, I do not see how these use cases are supported in Spark. One possible direction is sketched below.
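
(A sketch of one possible direction, assuming a Hadoop version with S3A per-bucket configuration, i.e. HADOOP-13336 in Hadoop 2.8+; the bucket name is hypothetical:)

{code}
// Sketch, assuming Hadoop 2.8+ per-bucket S3A configuration (HADOOP-13336):
// apply SSE only to the bucket backing table 1; the bucket backing table 2
// inherits no encryption setting. The bucket name is a hypothetical placeholder.
sc.hadoopConfiguration.set(
  "fs.s3a.bucket.table1-bucket.server-side-encryption-algorithm", "AES256")
{code}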



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application

2017-04-04 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956346#comment-15956346
 ] 

Franck Tago commented on SPARK-20153:
-

Ok, thanks for the tips. It appears that EMR 5.4.0 also supports the use of the s3a scheme within a Spark application.

http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html

This was a painful restriction prior to this resolution.

> Support Multiple aws credentials in order to access multiple Hive on S3 table 
> in spark application 
> ---
>
> Key: SPARK-20153
> URL: https://issues.apache.org/jira/browse/SPARK-20153
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Franck Tago
>Priority: Minor
>
> I need to access multiple hive tables in my spark application where each hive 
> table is 
> 1- an external table with data sitting on S3
> 2- each table is own by a different AWS user so I need to provide different 
> AWS credentials. 
> I am familiar with setting the aws credentials in the hadoop configuration 
> object but that does not really help me because I can only set one pair of 
> (fs.s3a.awsAccessKeyId , fs.s3a.awsSecretAccessKey )
> From my research , there is no easy or elegant way to do this in spark .
> Why is that ?  
> How do I address this use case?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application

2017-04-04 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955695#comment-15955695
 ] 

Franck Tago commented on SPARK-20153:
-

Oh, I would definitely not consider including the key in the URL. It is a gigantic security hole in my opinion. Moreover, consider that I am dealing with Hive on S3, where the URI is part of the table metadata. How would that work in this case?

Is there a way to encode the accessId and secret key before calling

Does Spark provide any way of masking or hiding the accessId and secretKey?

> Support Multiple aws credentials in order to access multiple Hive on S3 table 
> in spark application 
> ---
>
> Key: SPARK-20153
> URL: https://issues.apache.org/jira/browse/SPARK-20153
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Franck Tago
>Priority: Minor
>
> I need to access multiple hive tables in my spark application where each hive 
> table is 
> 1- an external table with data sitting on S3
> 2- each table is own by a different AWS user so I need to provide different 
> AWS credentials. 
> I am familiar with setting the aws credentials in the hadoop configuration 
> object but that does not really help me because I can only set one pair of 
> (fs.s3a.awsAccessKeyId , fs.s3a.awsSecretAccessKey )
> From my research , there is no easy or elegant way to do this in spark .
> Why is that ?  
> How do I address this use case?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application

2017-03-30 Thread Franck Tago (JIRA)
Franck Tago created SPARK-20153:
---

 Summary: Support Multiple aws credentials in order to access 
multiple Hive on S3 table in spark application 
 Key: SPARK-20153
 URL: https://issues.apache.org/jira/browse/SPARK-20153
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.0.1
Reporter: Franck Tago


I need to access multiple Hive tables in my Spark application, where each Hive table is
1- an external table with data sitting on S3, and
2- owned by a different AWS user, so I need to provide different AWS credentials.

I am familiar with setting the AWS credentials in the Hadoop configuration object, but that does not really help me because I can only set one pair of (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey).

From my research, there is no easy or elegant way to do this in Spark; one possible direction is sketched below.

Why is that?

How do I address this use case?
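
(One possible direction, assuming a Hadoop version with S3A per-bucket configuration, HADOOP-13336 in Hadoop 2.8+, which also uses the newer fs.s3a.access.key/secret.key option names; an untested sketch with hypothetical bucket names and environment variables:)

{code}
// Untested sketch: per-bucket credentials via Hadoop 2.8+ S3A configuration
// (HADOOP-13336). Bucket names and env vars are hypothetical placeholders.
val hconf = sc.hadoopConfiguration
hconf.set("fs.s3a.bucket.owner-a-bucket.access.key", sys.env("OWNER_A_ACCESS_KEY"))
hconf.set("fs.s3a.bucket.owner-a-bucket.secret.key", sys.env("OWNER_A_SECRET_KEY"))
hconf.set("fs.s3a.bucket.owner-b-bucket.access.key", sys.env("OWNER_B_ACCESS_KEY"))
hconf.set("fs.s3a.bucket.owner-b-bucket.secret.key", sys.env("OWNER_B_SECRET_KEY"))
{code}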



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-11-01 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627406#comment-15627406
 ] 

Franck Tago commented on SPARK-17982:
-

I wanted to mention that I was able to successfully verify my use cases with the changes made under this request.

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>Priority: Blocker
>
> The following statement fails in the spark shell . 
> {noformat}
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> {noformat}
> This appears to be a limitation of the create view statement .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-30 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620750#comment-15620750
 ] 

Franck Tago commented on SPARK-15616:
-

So I was not able to use the changes, for the following reasons:
1- I forgot to mention that I am working off the Spark 2.0.1 branch.
2- I get the following error:
[info] Compiling 30 Scala sources and 2 Java sources to 
/export/home/devbld/spark_world/Mercury/pvt/ftago/spark-2.0.1/sql/hive/target/scala-2.11/classes...
[error] 
/export/home/devbld/spark_world/Mercury/pvt/ftago/spark-2.0.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala:295:
 type mismatch;
[error]  found   : Seq[org.apache.spark.sql.catalyst.expressions.Expression]
[error]  required: Option[String]
[error] MetastoreRelation(databaseName, tableName, 
partitionPruningPred)(catalogTable, client, sparkSession)
[error]^
[error] one error found
[error] Compile failed 

Can you please  build a version of this fix off spark 2.0.1? I tried 
incorporating your changes but as pointed to the error message shown above , I 
was not able to .
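To make the type mismatch concrete, here is a minimal, self-contained sketch with stand-in types (nothing below is Spark's actual code; MetastoreRelationV201 and the Expression trait are hypothetical scaffolding that only reproduces the compiler behaviour):

{code}
object SignatureMismatchSketch {
  // Stand-ins for Spark internals, only to reproduce the compile error.
  trait Expression
  case class EqualTo(col: String, value: Int) extends Expression

  // 2.0.1-era shape: the third parameter is an alias, not a predicate list.
  case class MetastoreRelationV201(db: String, table: String, alias: Option[String])

  def main(args: Array[String]): Unit = {
    val partitionPruningPred: Seq[Expression] = Seq(EqualTo("col3", 5))
    // MetastoreRelationV201("default", "ft_p", partitionPruningPred)
    //   ^ fails exactly as above: found Seq[Expression], required Option[String]
    println(MetastoreRelationV201("default", "ft_p", None)) // the 2.0.1-style call compiles
  }
}
{code}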

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax

2016-10-28 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616542#comment-15616542
 ] 

Franck Tago commented on SPARK-17612:
-

Hi,
Thanks for the quick reply. It was only a long shot; I wanted to know whether the join issue that I described was somehow related to this one. I know it is not probable, but I just wanted to check.

> Support `DESCRIBE table PARTITION` SQL syntax
> -
>
> Key: SPARK-17612
> URL: https://issues.apache.org/jira/browse/SPARK-17612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.2, 2.1.0
>
>
> This issue implements `DESC PARTITION` SQL Syntax again. It was dropped since 
> Spark 2.0.0.
> h4. Spark 2.0.0
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> DESC partitioned_table PARTITION (c='Us', d=1)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 48 elided
> {code}
> h4. Spark 1.6.2
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res2: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the 
> row! Ignoring similar problems.
> ++
> |result  |
> ++
> |a  string|
> |b  int   |
> |c  string|
> |d  string|
> ||
> |# Partition Information  
> |
> |# col_name data_type   comment |
> ||
> |c  string|
> |d  string|
> ++
> {code}






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-28 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616385#comment-15616385
 ] 

Franck Tago commented on SPARK-15616:
-

So I tried the changes that were made for this issue (SPARK-15616), but I hit a runtime error when trying to test them:

scala> spark.conf.set("spark.sql.statistics.fallBackToHdfs" , "true")

scala> spark.conf.set("spark.sql.statistics.partitionPruner" ,"true")

scala> val df1= spark.table("ft_p")
df1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]

scala> val df2=spark.table("ft_p_no")
df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]

scala> df2.join((df1.select($"col1".as("acol1"), 
$"col3".as("acol3")).filter($"acol3"===5)) , $"col1"===$"acol1" ).explain
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: col3#43



> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax

2016-10-28 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616376#comment-15616376
 ] 

Franck Tago commented on SPARK-17612:
-

Hi,
Basically I have an issue where I am performing the following operations:

Partitioned Large Hive Table (hive table 1) --> filter --> join
                                                            /
Non-Partitioned Large Hive Table ---------------------------

Basically I am joining 2 large tables. Both tables' raw sizes exceed the broadcast join threshold.
The filter selects a specific partition. This partition is small enough that its size is below the broadcast join threshold.

With Spark 2.0 and Spark 2.0.1, I do not see a broadcast join; I see a sort merge join. That is really surprising to me, given that this could be a really common case: imagine a user who has a large log table partitioned by date and filters on a specific date. We should be able to do a broadcast join in that case.

The question now is the following.

I do not think this Spark issue addresses the cited problem, but I could be wrong. I tried incorporating the change in the Spark 2.0 PR, but I see the same behavior, that is, no broadcast join.

Question: Is this Spark issue supposed to address the problem that I mentioned?

- If not, which I think is the case, do you know if Spark currently has a fix for the cited issue? (I also tried the fix under SPARK-15616, but I hit a runtime failure.)

There has got to be a solution to this problem somewhere; a repro sketch follows below.
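A minimal spark-shell sketch of the scenario (table and column names are placeholders, and the tables are assumed to already exist in the metastore):

{code}
import spark.implicits._

val big  = spark.table("large_hive_table")         // non-partitioned, large
val part = spark.table("partitioned_log_table")    // partitioned by log_date
  .filter($"log_date" === "2016-10-01")            // prunes to one small partition

part.join(big, Seq("user_id")).explain()
// Observed on 2.0.0 / 2.0.1: SortMergeJoin, even though the surviving
// partition is well below spark.sql.autoBroadcastJoinThreshold.
{code}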





> Support `DESCRIBE table PARTITION` SQL syntax
> -
>
> Key: SPARK-17612
> URL: https://issues.apache.org/jira/browse/SPARK-17612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.2, 2.1.0
>
>
> This issue implements `DESC PARTITION` SQL Syntax again. It was dropped since 
> Spark 2.0.0.
> h4. Spark 2.0.0
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> DESC partitioned_table PARTITION (c='Us', d=1)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 48 elided
> {code}
> h4. Spark 1.6.2
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res2: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the 
> row! Ignoring similar problems.
> ++
> |result  |
> ++
> |a  string|
> |b  int   |
> |c  string|
> |d  string|
> ||
> |# Partition Information  
> |
> |# col_name data_type   comment |
> ||
> |c  string|
> |d  string|
> ++
> {code}




[jira] [Comment Edited] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-27 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612937#comment-15612937
 ] 

Franck Tago edited comment on SPARK-15616 at 10/27/16 7:35 PM:
---

Hi,

In my case the filter is on a partition key, so would I need your changes in order to see a broadcast join?


was (Author: tafra...@gmail.com):
Hi 

In my case  the filter is on a partition key , so  i would  need these changes 
in order to see a  broadcast join ?

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-27 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612937#comment-15612937
 ] 

Franck Tago commented on SPARK-15616:
-

Hi,

In my case the filter is on a partition key, so would I need these changes in order to see a broadcast join?

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-26 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15610024#comment-15610024
 ] 

Franck Tago commented on SPARK-15616:
-

Does Spark currently support using pruned-partition statistics for Hive tables?

The question: consider a plan that consists of

hive table 1 --> filter --> join
                             /
large hive table 2 ----------

where hive table 1 and hive table 2 both exceed spark.sql.autoBroadcastJoinThreshold.

However, the filter keeps only a single partition, which is small enough to fit under spark.sql.autoBroadcastJoinThreshold.

The question is: will Spark perform a broadcast join in this case?

Using Spark 2.0.0 and Spark 2.0.1, my observation is that Spark will not; a workaround sketch follows below.

Any ideas?
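In the meantime, a workaround sketch that sidesteps the statistics problem by hinting the broadcast explicitly (broadcast comes from org.apache.spark.sql.functions; table, column, and key names are placeholders):

{code}
import spark.implicits._
import org.apache.spark.sql.functions.broadcast

val pruned = spark.table("hive_table_1").filter($"dt" === "2016-10-26")
val large  = spark.table("large_hive_table2")

// Bypass the metastore size estimate and broadcast the pruned side directly.
large.join(broadcast(pruned), Seq("key")).explain()
// Should show BroadcastHashJoin even when the table-level statistics
// exceed spark.sql.autoBroadcastJoinThreshold.
{code}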

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Comment Edited] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-26 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607872#comment-15607872
 ] 

Franck Tago edited comment on SPARK-15616 at 10/26/16 11:20 PM:


Hi, wondering if there are any updates on this issue.


was (Author: tafra...@gmail.com):
Hi   Wondering if there are any  updates on this issue

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-26 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607872#comment-15607872
 ] 

Franck Tago commented on SPARK-15616:
-

Hi, wondering if there are any updates on this issue.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently if some partitions of a partitioned table are used in join 
> operation we rely on Metastore returned size of table to calculate if we can 
> convert the operation to Broadcast join. 
> if Filter can prune some partitions, Hive can prune partition before 
> determining to use broadcast joins according to HDFS size of partitions that 
> are involved in Query.So sparkSQL needs it that can improve join's 
> performance for partitioned table.






[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586918#comment-15586918
 ] 

Franck Tago commented on SPARK-17982:
-

Updated the title of the Jira.

I looked at the views.scala file, and I want to know whether setting the flag spark.sql.nativeView.canonical to false is an acceptable workaround.

I tested it and it works, but the question is whether that is an acceptable workaround. A sketch of what I ran is below.
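For reference, the workaround as I tested it in the shell (same view definition as in the description; I make no claim about whether this flag is meant to be user-facing or supported long-term):

{code}
spark.conf.set("spark.sql.nativeView.canonical", "false")

spark.sql("""CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where
  (WHERE_ID, WHERE_NAME) AS
  SELECT `where`.id, `where`.name FROM DEFAULT.`where` LIMIT 2""")
{code}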

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the create view statement .






[jira] [Updated] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.

2016-10-18 Thread Franck Tago (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-17982:

Summary: Spark 2.0.0  CREATE VIEW statement fails :: 
java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is 
possible there is a bug in Spark.  (was: Spark 2.0.0  CREATE VIEW statement 
fails when select statement contains limit clause)

> Spark 2.0.0  CREATE VIEW statement fails :: java.lang.RuntimeException: 
> Failed to analyze the canonicalized SQL. It is possible there is a bug in 
> Spark.
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the create view statement .






[jira] [Comment Edited] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause

2016-10-18 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586309#comment-15586309
 ] 

Franck Tago edited comment on SPARK-17982 at 10/18/16 6:50 PM:
---

Hi,
a couple of things that I want to point out here.

First,
your workaround will not be acceptable for me, because I need to control the names of the view columns. This is due to the fact that I programmatically generate the Scala code for the Spark application, and there are some invariants that are required in my framework.

Second,
you claim that Spark does not support column names in CREATE VIEW statements at all?

How come it does work in some cases?

Is that a limitation that is documented somewhere?
  



was (Author: tafra...@gmail.com):
Hi , 
couple of things that I want to point out here. 

First ,
your workaround will not be acceptable for me because I need to control the 
name of the view columns . This is due to the fact that I programatically 
generate the scala code for the spark application . There are some invariants 
that are required in my framework . 


Second ,
you claim that spark does not support column names in create view statements at 
all ? 

How come it does work in some cases ?

  


> Spark 2.0.0  CREATE VIEW statement fails when select statement contains limit 
> clause
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the create view statement .






[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause

2016-10-18 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586309#comment-15586309
 ] 

Franck Tago commented on SPARK-17982:
-

Hi,
a couple of things that I want to point out here.

First,
your workaround will not be acceptable for me, because I need to control the names of the view columns. This is due to the fact that I programmatically generate the Scala code for the Spark application, and there are some invariants that are required in my framework.

Second,
you claim that Spark does not support column names in CREATE VIEW statements at all?

How come it does work in some cases?

  


> Spark 2.0.0  CREATE VIEW statement fails when select statement contains limit 
> clause
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the create view statement .






[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause

2016-10-17 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584178#comment-15584178
 ] 

Franck Tago commented on SPARK-17982:
-

== SQL ==
SELECT `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
`gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
`gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
`gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
gen_subquery_1
^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:189)
  ... 64 more


> Spark 2.0.0  CREATE VIEW statement fails when select statement contains limit 
> clause
> 
>
> Key: SPARK-17982
> URL: https://issues.apache.org/jira/browse/SPARK-17982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark 2.0.0
>Reporter: Franck Tago
>
> The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> scala> spark.sql("CREATE VIEW 
> DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS 
> SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2")
> java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
> `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
> `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
> `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
> `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
> gen_subquery_1
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
> This appears to be a limitation of the create view statement .






[jira] [Created] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause

2016-10-17 Thread Franck Tago (JIRA)
Franck Tago created SPARK-17982:
---

 Summary: Spark 2.0.0  CREATE VIEW statement fails when select 
statement contains limit clause
 Key: SPARK-17982
 URL: https://issues.apache.org/jira/browse/SPARK-17982
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
 Environment: spark 2.0.0
Reporter: Franck Tago


The following statement fails in the spark shell:

scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where 
(WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` 
limit 2")

scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where 
(WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` 
limit 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT 
`gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT 
`gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT 
`gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS 
`gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS 
gen_subquery_1
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)


This appears to be a limitation of the CREATE VIEW statement.
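A possible workaround sketch, moving the column names into the SELECT list instead of the view's column list (hypothetical view name; untested beyond this shape, so treat it as a suggestion rather than a confirmed fix):

{code}
spark.sql("""CREATE VIEW DEFAULT.sparkshell_2_view_aliased AS
  SELECT `where`.id AS WHERE_ID, `where`.name AS WHERE_NAME
  FROM DEFAULT.`where` LIMIT 2""")
{code}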







[jira] [Created] (SPARK-17859) persist should not impede with spark's ability to perform a broadcast join.

2016-10-10 Thread Franck Tago (JIRA)
Franck Tago created SPARK-17859:
---

 Summary: persist should not impede with spark's ability to perform 
a broadcast join.
 Key: SPARK-17859
 URL: https://issues.apache.org/jira/browse/SPARK-17859
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.0.0
 Environment: spark 2.0.0 , Linux RedHat
Reporter: Franck Tago


I am using Spark 2.0.0 

My investigation leads me to conclude that calling persist can prevent a broadcast join from happening.

Example
Case 1: no persist call
var  df1 =spark.range(100).select($"id".as("id1"))
df1: org.apache.spark.sql.DataFrame = [id1: bigint]

 var df2 =spark.range(1000).select($"id".as("id2"))
df2: org.apache.spark.sql.DataFrame = [id2: bigint]

 df1.join(df2 , $"id1" === $"id2" ).explain 
== Physical Plan ==
*BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight
:- *Project [id#114L AS id1#117L]
:  +- *Range (0, 100, splits=2)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Project [id#120L AS id2#123L]
  +- *Range (0, 1000, splits=2)

Case 2:  persist call 

 df1.persist.join(df2 , $"id1" === $"id2" ).explain 
16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data.
== Physical Plan ==
*SortMergeJoin [id1#3L], [id2#9L], Inner
:- *Sort [id1#3L ASC], false, 0
:  +- Exchange hashpartitioning(id1#3L, 10)
: +- InMemoryTableScan [id1#3L]
::  +- InMemoryRelation [id1#3L], true, 1, StorageLevel(disk, 
memory, deserialized, 1 replicas)
:: :  +- *Project [id#0L AS id1#3L]
:: : +- *Range (0, 100, splits=2)
+- *Sort [id2#9L ASC], false, 0
   +- Exchange hashpartitioning(id2#9L, 10)
  +- InMemoryTableScan [id2#9L]
 :  +- InMemoryRelation [id2#9L], true, 1, StorageLevel(disk, 
memory, deserialized, 1 replicas)
 : :  +- *Project [id#6L AS id2#9L]
 : : +- *Range (0, 1000, splits=2)


Why does the persist call prevent the broadcast join?

My opinion is that it should not.

I was made aware that the persist call is lazy and that might have something to do with it, but I still contend that it should not.

Losing broadcast joins is really costly. A workaround sketch follows below.
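A sketch of the workaround I would expect to help in the meantime, hinting the persisted side explicitly (broadcast comes from org.apache.spark.sql.functions and forces the build side regardless of the cached relation's statistics):

{code}
import spark.implicits._
import org.apache.spark.sql.functions.broadcast

val df1 = spark.range(100).select($"id".as("id1")).persist
val df2 = spark.range(1000).select($"id".as("id2"))

// Hint the small, persisted side; the plan should show
// BroadcastHashJoin again instead of SortMergeJoin.
df2.join(broadcast(df1), $"id1" === $"id2").explain()
{code}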








[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-04 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547292#comment-15547292
 ] 

Franck Tago commented on SPARK-17758:
-

My first impression after taking a look at First.scala is that the issue depicted here should not happen for the FIRST aggregate function call.

Is that correct?

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>  Labels: correctness
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}






[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544003#comment-15544003
 ] 

Franck Tago commented on SPARK-17758:
-

Thanks for the pointer.
Is it possible to write a UDAF which supports combiner-like functionality? I am referring to partial_last, partial_min, etc. A sketch of what I mean follows below.
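As far as I understand the UDAF API, merge() plays the combiner role: partial buffers from each partition are merged into one result. A minimal sketch of a "last seen value" UDAF whose merge step skips buffers coming from empty partitions (the schema and semantics here are my own assumptions, not Spark's built-in LAST):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class LastSeen extends UserDefinedAggregateFunction {
  def inputSchema: StructType  = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("last", StringType) :: StructField("seen", BooleanType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = false      // "last" depends on row order

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = null
    buffer(1) = false
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = input.getString(0)
    buffer(1) = true
  }
  // Combiner step: a partial buffer from an empty partition still has
  // seen == false and is ignored, so it cannot overwrite a real value.
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    if (b2.getBoolean(1)) {
      b1(0) = b2.getString(0)
      b1(1) = true
    }
  }
  def evaluate(buffer: Row): Any = buffer.getString(0)
}
{code}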

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}






[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543926#comment-15543926
 ] 

Franck Tago commented on SPARK-17758:
-

Is there any workaround that you could think of?

One simple solution could be to call repartition on the data frame in order to remove the empty partitions, but performance-wise that is just terrible. A sketch is below.

I am not sure that I comprehend why the concept of an empty partition is even allowed in the Spark ecosystem. Any documentation on why empty partitions are allowed would be greatly appreciated.
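The workaround sketch I have in mind (table and column names are placeholders taken from the plan above; coalesce avoids a full shuffle but can only reduce the partition count):

{code}
import spark.implicits._
import org.apache.spark.sql.functions.{last, max}

val df = spark.table("bdm_3449_src")

// Collapse away the empty partitions before aggregating, so no
// PARTIAL_LAST ever runs over an empty partition.
df.coalesce(1).agg(max($"field1"), last($"field")).show()
{code}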

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}






[jira] [Comment Edited] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543488#comment-15543488
 ] 

Franck Tago edited comment on SPARK-17758 at 10/3/16 11:04 PM:
---

[~hvanhovell] It should return a non-null value in this case, because the null value is a false representation of my data.

My initial input source had 3 rows:

a,1
b,2
c,10

From my observations:

partition 1 contains 2 rows:
a,1
b,2

partition 2 contains 1 row:
c,10

partition 3 contains 0 rows

Why should last return null in this case?

I understand that last would not be deterministic, but why should it return null in this case?



was (Author: tafra...@gmail.com):
[~hvanhovell] It should return non null in this case because the null  value  
is a false representation of my data . 

My initial  input source  had 3 rows . 

a,1
b,2
c,10

From  observations show 

partition 1  contains 2 rows  
a,1
b,2 

partition 2 contains 1 row  
c,10  

partition  3 contains  0 rows 

why should  last return  null in this case  . 

I understand that last would not be deterministic but why should should it 
return null in this case ?


> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) 
> AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], 
> output=[max(C3_0)#50,last(C3_1)#51])
>   +- SortAggregate(key=[], 
> functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], 
> output=[max#91,last#92])
>  +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 
> AS C3_1#41]
> +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as 
> bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS 
> DOUBLE)#27,last(C1_1)#28])
>+- Exchange SinglePartition
>   +- SortAggregate(key=[], 
> functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, 
> false)], output=[sum#95L,last#96])
>  +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
> +- HiveTableScan [field1#7, field#6], 
> MetastoreRelation default, bdm_3449_src, alias
> {noformat}






[jira] [Comment Edited] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543488#comment-15543488
 ] 

Franck Tago edited comment on SPARK-17758 at 10/3/16 11:04 PM:
---

[~hvanhovell] It should return non-null in this case, because the null value is a false representation of my data.

My initial input source had 3 rows:

a,1
b,2
c,10

From my observations:

partition 1 contains 2 rows:
a,1
b,2

partition 2 contains 1 row:
c,10

partition 3 contains 0 rows

Why should last return null in this case?

I understand that last would not be deterministic, but why should it return null in this case?



was (Author: tafra...@gmail.com):
[~hvanhovell] It should return on null in this case because the null  value  is 
a false representation of my data . 

My initial  input source  had 3 rows . 

a,1
b,2
c,10

From  observations show 

partition 1  contains 2 rows  
a,1
b,2 

partition 2 contains 1 row  
c,10  

partition  3 contains  0 rows 

why should  last return  null in this case  . 

I understand that last would not be deterministic but why should should it 
return null in this case ?


> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The result from  a query that uses the LAST function are incorrect. 
> The output obtained for the column that corresponds to the last function is 
> null .  
> My input data contain 3 rows . 
> The application resulted in  2 stages 
> The first stage consisted of 3 tasks . 
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contain  0 rows
> The result from the query executed for the LAST column call is NULL which I 
> believe is due to the  PARTIAL_LAST on the last partition . 
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty 
> partition should not return null .
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>    +- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], output=[max(C3_0)#50,last(C3_1)#51])
>       +- SortAggregate(key=[], functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], output=[max#91,last#92])
>          +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 AS C3_1#41]
>             +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS DOUBLE)#27,last(C1_1)#28])
>                +- Exchange SinglePartition
>                   +- SortAggregate(key=[], functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, false)], output=[sum#95L,last#96])
>                      +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
>                         +- HiveTableScan [field1#7, field#6], MetastoreRelation default, bdm_3449_src, alias
> {noformat}
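To see how an empty partition can poison the final value, consider a conceptual sketch of a partial/final LAST merge. This is illustrative only, not Spark's actual implementation: if the merge blindly prefers the later partial buffer, the unset buffer from the empty partition wins.

{noformat}
// One partial buffer per partition: the last value seen, plus a flag that
// records whether the partition saw any row at all.
case class LastBuffer(value: Any, valueSet: Boolean)

// Buggy merge: always trusts the right-hand (later) buffer, even when its
// partition was empty (valueSet == false).
def mergeBuggy(l: LastBuffer, r: LastBuffer): LastBuffer = r

// Safer merge: only take the right buffer if it actually saw a row.
def mergeSafe(l: LastBuffer, r: LastBuffer): LastBuffer =
  if (r.valueSet) r else l

val partials = Seq(
  LastBuffer("b", valueSet = true),   // partition 1: rows a,1 and b,2
  LastBuffer("c", valueSet = true),   // partition 2: row c,10
  LastBuffer(null, valueSet = false)  // partition 3: empty
)

println(partials.reduce(mergeBuggy).value) // null -- the reported behavior
println(partials.reduce(mergeSafe).value)  // c -- the expected behavior
{noformat}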






[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543488#comment-15543488
 ] 

Franck Tago commented on SPARK-17758:
-

[~hvanhovell] It should return a non-null value in this case, because the null
value is a false representation of my data.

My initial input source had 3 rows:

a,1
b,2
c,10

From my observations:

partition 1 contains 2 rows:
a,1
b,2

partition 2 contains 1 row:
c,10

partition 3 contains 0 rows

Why should LAST return null in this case?

I understand that LAST would not be deterministic, but why should it return
null in this case?


> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The results from a query that uses the LAST function are incorrect.
> The output obtained for the column that corresponds to the LAST function is
> null.
> My input data contains 3 rows.
> The application resulted in 2 stages.
> The first stage consisted of 3 tasks:
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contains 0 rows
> The result of the query for the LAST column is NULL, which I believe is due
> to the PARTIAL_LAST on the empty last partition.
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty
> partition should not return null.
> {noformat}
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>    +- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], output=[max(C3_0)#50,last(C3_1)#51])
>       +- SortAggregate(key=[], functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], output=[max#91,last#92])
>          +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 AS C3_1#41]
>             +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS DOUBLE)#27,last(C1_1)#28])
>                +- Exchange SinglePartition
>                   +- SortAggregate(key=[], functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, false)], output=[sum#95L,last#96])
>                      +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
>                         +- HiveTableScan [field1#7, field#6], MetastoreRelation default, bdm_3449_src, alias
> {noformat}






[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-03 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543097#comment-15543097
 ] 

Franck Tago commented on SPARK-17758:
-

I tested the behavior of the MIN and MAX functions with sort aggregate and an
empty partition, and the results were correct.

I would also note that this issue is not reproducible in Spark 1.6.
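For contrast, here is a sketch of why a MAX-style merge is naturally immune to empty partitions (illustrative only, not Spark's actual code): combining "no value" with x yields x, so an empty partition contributes nothing.

{noformat}
// An empty partition produces None; merging None with any value keeps the value.
def mergeMax(l: Option[Int], r: Option[Int]): Option[Int] = (l, r) match {
  case (Some(a), Some(b)) => Some(math.max(a, b))
  case (a, None)          => a  // empty partition contributes nothing
  case (None, b)          => b
}

val partials = Seq(Some(2), Some(10), Option.empty[Int]) // third partition empty
println(partials.reduce(mergeMax)) // Some(10)
{noformat}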

> Spark Aggregate function  LAST returns null on an empty partition 
> --
>
> Key: SPARK-17758
> URL: https://issues.apache.org/jira/browse/SPARK-17758
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
>Reporter: Franck Tago
>
> My Environment 
> Spark 2.0.0  
> I have included the physical plan of my application below.
> Issue description
> The results from a query that uses the LAST function are incorrect.
> The output obtained for the column that corresponds to the LAST function is
> null.
> My input data contains 3 rows.
> The application resulted in 2 stages.
> The first stage consisted of 3 tasks:
> The first task/partition contains 2 rows
> The second task/partition contains 1 row
> The last task/partition contains 0 rows
> The result of the query for the LAST column is NULL, which I believe is due
> to the PARTIAL_LAST on the empty last partition.
> I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty
> partition should not return null.
> == Physical Plan ==
> InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
> +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
>    +- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], output=[max(C3_0)#50,last(C3_1)#51])
>       +- SortAggregate(key=[], functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], output=[max#91,last#92])
>          +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 AS C3_1#41]
>             +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS DOUBLE)#27,last(C1_1)#28])
>                +- Exchange SinglePartition
>                   +- SortAggregate(key=[], functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, false)], output=[sum#95L,last#96])
>                      +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
>                         +- HiveTableScan [field1#7, field#6], MetastoreRelation default, bdm_3449_src, alias






[jira] [Created] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition

2016-10-01 Thread Franck Tago (JIRA)
Franck Tago created SPARK-17758:
---

 Summary: Spark Aggregate function  LAST returns null on an empty 
partition 
 Key: SPARK-17758
 URL: https://issues.apache.org/jira/browse/SPARK-17758
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.0
 Environment: Spark 2.0.0
Reporter: Franck Tago


My Environment 
Spark 2.0.0  

I have included the physical plan of my application below.
Issue description
The results from a query that uses the LAST function are incorrect.

The output obtained for the column that corresponds to the LAST function is
null.

My input data contains 3 rows.

The application resulted in 2 stages.

The first stage consisted of 3 tasks:

The first task/partition contains 2 rows
The second task/partition contains 1 row
The last task/partition contains 0 rows

The result of the query for the LAST column is NULL, which I believe is due to
the PARTIAL_LAST on the empty last partition.

I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty
partition should not return null.


== Physical Plan ==
InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false
+- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104]
   +- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], output=[max(C3_0)#50,last(C3_1)#51])
      +- SortAggregate(key=[], functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], output=[max#91,last#92])
         +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 AS C3_1#41]
            +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS DOUBLE)#27,last(C1_1)#28])
               +- Exchange SinglePartition
                  +- SortAggregate(key=[], functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, false)], output=[sum#95L,last#96])
                     +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18]
                        +- HiveTableScan [field1#7, field#6], MetastoreRelation default, bdm_3449_src, alias
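A possible workaround sketch (hypothetical and untested on 2.0.0): last(col, ignoreNulls = true) tells LAST to skip null inputs, which may also avoid the null partial result coming from an empty partition. Whether it actually does on the affected version should be verified.

{noformat}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.last

val spark = SparkSession.builder().appName("last-workaround").getOrCreate()
import spark.implicits._

// Same layout as the repro: 3 rows in 4 slices leaves one partition empty.
val df = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("c", 10)), numSlices = 4)
  .toDF("key", "value")

// Hypothesis to verify: with ignoreNulls = true, the empty partition's null
// partial result may no longer override the real last value.
df.agg(last($"key", ignoreNulls = true)).show()
{noformat}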



