[jira] [Updated] (SPARK-44759) Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Franck Tago updated SPARK-44759:
--------------------------------
    Component/s: Deploy
                 Spark Core

> Do not combine multiple Generate operators in the same WholeStageCodeGen node
> because it can easily cause OOM failures if arrays are relatively large
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-44759
>                 URL: https://issues.apache.org/jira/browse/SPARK-44759
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Optimizer, Spark Core
>    Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
>            Reporter: Franck Tago
>            Priority: Major
>         Attachments: image-2023-08-10-09-27-24-124.png, image-2023-08-10-09-29-24-804.png, image-2023-08-10-09-32-46-163.png, image-2023-08-10-09-33-47-788.png, wholestagecodegen_wc1_debug_wholecodegen_passed
>
> This has been an issue since the WSCG implementation of the Generate node.
> Because WSCG computes rows in batches, the combination of WSCG and the explode operation consumes a lot of the dedicated executor memory. This is even more true when the WSCG node contains multiple explode nodes, as is the case when flattening a nested array.
> A Generate node used to flatten an array generally produces significantly more output rows than input rows, and the number of output rows is drastically higher still when flattening a nested array.
> When more than one Generate node is combined into the same WholeStageCodeGen node, we run a high risk of running out of memory, for two reasons:
> 1. As the snapshots added in the comments show, the rows created in the nested loop are saved in a writer buffer. In this case, because the rows were big, the job failed with an Out Of Memory error.
> 2. The generated WholeStageCodeGen code results in a nested loop that, for each input row, explodes the parent array and then explodes the inner array.
> The rows are accumulated in the writer buffer without accounting for the row size. Please see the attached Spark GUI and Spark DAG.
> In my case the WholeStageCodeGen node includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error.
>
> I recommend that we do not merge multiple explode nodes into the same WholeStageCodeGen node; doing so leads to potential memory issues.
> In our case, the job failed with an OOM error because the WSCG code executed a nested for loop.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
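The row-count blowup the ticket describes can be sketched outside Spark with plain Scala collections (a minimal illustration of the arithmetic only, not the generated WSCG code): each explode multiplies the row count by the array length at that nesting level, so a fused nested loop over two explodes produces length1 x length2 rows per input row into the writer buffer.

```scala
// Minimal sketch (plain Scala, no Spark): two explode steps over a nested
// array multiply the row count once per nesting level.
object ExplodeBlowup {
  def main(args: Array[String]): Unit = {
    // One input row holding a 100 x 100 nested array.
    val nested: Seq[Seq[Int]] = Seq.fill(100)(Seq.fill(100)(1))

    // First explode: one output row per inner array.
    val afterFirst = nested.size          // 100 rows
    // Second explode: one output row per element. In a single fused
    // WholeStageCodeGen loop all of these land in the writer buffer.
    val afterSecond = nested.flatten.size // 10000 rows

    println(s"input: 1 row, after 1st explode: $afterFirst, after 2nd: $afterSecond")
  }
}
```

With realistic row widths (large strings or structs per element), ten thousand buffered copies of a wide row per input row is what pushes an executor past its memory budget.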
[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.
[ https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Franck Tago updated SPARK-44768:
--------------------------------
    Description:

The code sample below showcases the whole-stage code generated when exploding nested arrays. The data sample in the dataframe is quite small, so it will not trigger the Out Of Memory error; however, if the arrays are larger and the rows are large, you will definitely end up with an OOM error.

Consider a scenario where you flatten a nested array. For example, you can use the following steps to create the dataframe:

//create a partClass case class
case class partClass (PARTNAME: String, PartNumber: String, PartPrice: Double)

//create a nested array case class
case class array_array_class (
  col_int: Int,
  arr_arr_string: Seq[Seq[String]],
  arr_arr_bigint: Seq[Seq[Long]],
  col_string: String,
  parts_s: Seq[Seq[partClass]]
)

//create a dataframe
var df_array_array = sc.parallelize(
  Seq(
    (1, Seq(Seq("a","b","c","d"), Seq("aa","bb","cc","dd")),
       Seq(Seq(1000,2), Seq(3,-1)), "ItemPart1",
       Seq(Seq(partClass("PNAME1","P1",20.75), partClass("PNAME1_1","P1_1",30.75)))),
    (2, Seq(Seq("ab","bc","cd","de"), Seq("aab","bbc","ccd","dde"), Seq("aab")),
       Seq(Seq(-1000,-2,-1,-2), Seq(0,3,-1)), "ItemPart2",
       Seq(Seq(partClass("PNAME2","P2",170.75), partClass("PNAME2_1","P2_1",33.75), partClass("PNAME2_2","P2_2",73.75))))
  )
).toDF("c1","c2","c3","c4","c5")

//explode a nested array
var result = df_array_array.select(col("c1"), explode(col("c2"))).select('c1, explode('col))
result.explain

The physical plan for this query is shown below.
- Physical plan:

== Physical Plan ==
*(1) Generate explode(col#27), [c1#17], false, [col#30]
+- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
   +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
      +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
         +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
            +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._1 AS _1#6, mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), StringType, ObjectType(class java.lang.String)), true, false, true), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(StringType,true), ObjectType(interface scala.collection.Seq)), None), knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, None) AS _2#7, mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -4), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -4), IntegerType, IntegerType)), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), ArrayType(IntegerType,false), ObjectType(interface scala.collection.Seq)), None), knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, None) AS _3#8, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, true, false, true) AS _4#9, mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -5), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -6), if (isnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), StructField(PartNumber,StringType,true), StructField(PartPrice,DoubleType,false), ObjectType(class $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass)))) null else named_struct(PARTNAME, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), StructField(PartNumber,StringType,true), StructField(PartPrice,DoubleType,false), ObjectType(class $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass))).PARTNAME, true, false, true), PartNumber, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), StructField(PartNumber,StringType,true), StructField(PartPrice,DoubleType,false), ObjectType(class
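Note that both Generate nodes above carry the same "*(1)" codegen-stage marker, i.e. both explodes run in one fused loop. One workaround worth noting (my suggestion, not a fix proposed in this ticket) is to turn off whole-stage code generation for the affected query via the existing `spark.sql.codegen.wholeStage` conf, which makes each Generate run as a separate iterator-style operator instead of one fused nested loop. A sketch, assuming a running `spark` session and the `df_array_array` dataframe from the description:

```scala
// Workaround sketch (assumes a live SparkSession `spark` and the
// df_array_array dataframe built in the description). Disabling
// whole-stage codegen keeps the two Generate operators out of a
// single fused nested loop.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

val result = df_array_array
  .select(col("c1"), explode(col("c2")))
  .select('c1, explode('col))

// The plan nodes should no longer carry the "*(1)" codegen-stage marker.
result.explain()
```

The trade-off is losing codegen speedups for the rest of the stage, so this is a mitigation for memory-bound explode-heavy queries, not a general recommendation.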
[jira] [Commented] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.
[ https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753035#comment-17753035 ]

Franck Tago commented on SPARK-44768:
-------------------------------------
!image-2023-08-10-20-32-55-684.png!
[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.
[ https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Franck Tago updated SPARK-44768:
--------------------------------
    Attachment: image-2023-08-10-20-32-55-684.png
[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.
[ https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Franck Tago updated SPARK-44768:
--------------------------------
    Description: (updated)
[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.
[ https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Franck Tago updated SPARK-44768:
--------------------------------
    Attachment: spark-jira_wscg_code.txt
[jira] [Updated] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.
[ https://issues.apache.org/jira/browse/SPARK-44768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Franck Tago updated SPARK-44768:
--------------------------------
    Summary: Improve WSCG handling of row buffer by accounting for executor memory . Exploding nested arrays can easily lead to out of memory errors.  (was: Improve WSCG handling of row buffer by accounting for executor memory)
[jira] [Created] (SPARK-44768) Improve WSCG handling of row buffer by accounting for executor memory
Franck Tago created SPARK-44768: --- Summary: Improve WSCG handling of row buffer by accounting for executor memory Key: SPARK-44768 URL: https://issues.apache.org/jira/browse/SPARK-44768 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.4.1, 3.4.0, 3.3.2 Reporter: Franck Tago consider a scenario where you flatten a nested array // e.g you can use the following steps to create the dataframe /create a partClass case class partClass (PARTNAME: String , PartNumber: String , PartPrice : Double ) //create a nested array array class case class array_array_class ( col_int: Int, arr_arr_string : Seq[Seq[String]], arr_arr_bigint : Seq[Seq[Long]], col_string : String, parts_s : Seq[Seq[partClass]] ) //create a dataframe var df_array_array = sc.parallelize( Seq( (1,Seq(Seq("a","b" ,"c" ,"d") ,Seq("aa","bb" ,"cc","dd")) , Seq(Seq(1000,2), Seq(3,-1)),"ItemPart1", Seq(Seq(partClass("PNAME1","P1",20.75),partClass("PNAME1_1","P1_1",30.75))) ) , (2,Seq(Seq("ab","bc" ,"cd" ,"de") ,Seq("aab","bbc" ,"ccd","dde"),Seq("aab")) , Seq(Seq(-1000,-2,-1,-2), Seq(0,3,-1)),"ItemPart2", Seq(Seq(partClass("PNAME2","P2",170.75),partClass("PNAME2_1","P2_1",33.75),partClass("PNAME2_2","P2_2",73.75))) ) ) ).toDF("c1" ,"c2" ,"c3" ,"c4" ,"c5") //explode a nested array var result = df_array_array.select( col("c1"), explode(col("c2"))).select('c1 , explode('col)) result.explain The physical for this operator is seen below. 
== Physical Plan ==
*(1) Generate explode(col#27), [c1#17], false, [col#30]
+- *(1) Filter ((size(col#27, true) > 0) AND isnotnull(col#27))
   +- *(1) Generate explode(c2#18), [c1#17], false, [col#27]
      +- *(1) Project [_1#6 AS c1#17, _2#7 AS c2#18]
         +- *(1) Filter ((size(_2#7, true) > 0) AND isnotnull(_2#7))
            +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._1 AS _1#6,
               mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), StringType, ObjectType(class java.lang.String)), true, false, true), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(StringType,true), ObjectType(interface scala.collection.Seq)), None), knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._2, None) AS _2#7,
               mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -4), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -4), IntegerType, IntegerType)), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), ArrayType(IntegerType,false), ObjectType(interface scala.collection.Seq)), None), knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._3, None) AS _3#8,
               staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple5, true]))._4, true, false, true) AS _4#9,
               mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -5), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -6), if (isnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), StructField(PartNumber,StringType,true), StructField(PartPrice,DoubleType,false), ObjectType(class $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass null else named_struct(PARTNAME, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), StructField(PartNumber,StringType,true), StructField(PartPrice,DoubleType,false), ObjectType(class $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass))).PARTNAME, true, false, true), PartNumber, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -6), StructField(PARTNAME,StringType,true), StructField(PartNumber,StringType,true), StructField(PartPrice,DoubleType,false), ObjectType(class $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$partClass))).PartNumber, true, false, true), PartPrice,
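The two chained Generate (explode) operators in this plan multiply row counts: the first explode emits one row per inner array, the second one row per element of each inner array. The following standalone Python sketch (illustrative only, not Spark code) uses the same `c2` values as the dataframe above to show the blow-up:

```python
# Standalone sketch (not Spark): nested explode multiplies row counts.
# The rows mirror column c2 of the dataframe in the report above.
rows = [
    (1, [["a", "b", "c", "d"], ["aa", "bb", "cc", "dd"]]),
    (2, [["ab", "bc", "cd", "de"], ["aab", "bbc", "ccd", "dde"], ["aab"]]),
]

# First explode: one output row per inner array.
step1 = [(c1, inner) for c1, c2 in rows for inner in c2]

# Second explode: one output row per element of each inner array.
step2 = [(c1, elem) for c1, inner in step1 for elem in inner]

print(len(rows), len(step1), len(step2))  # 2 input rows -> 5 -> 17
```

Already on two small input rows the output grows from 2 to 17; with large arrays this growth, combined into one code-generated stage, is what puts pressure on executor memory.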
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Description: This has been an issue since the WSCG implementation of the generate node. Because WSCG computes rows in batches, the combination of WSCG and the explode operation consumes a lot of the dedicated executor memory. This is even more true when the WSCG node contains multiple explode nodes, which is the case when flattening a nested array. The generate node used to flatten an array generally produces significantly more output rows than it receives as input. The number of output rows is drastically higher still when flattening a nested array. When we combine more than one generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory, for multiple reasons. 1- As you can see from the snapshots added in the comments, the rows created in the nested loop are saved in a writer buffer. In this case, because the rows were big, the job failed with an Out Of Memory error. 2- The generated WholeStageCodeGen results in a nested loop that, for each row, will explode the parent array and then explode the inner array. The rows are accumulated in the writer buffer without accounting for the row size. Please view the attached Spark GUI and Spark DAG. In my case the WholeStageCodeGen includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whole stage code gen node; doing so leads to potential memory issues. In our case, the job execution failed with an OOM error because the WSCG executed a nested for loop. was: This has been an issue since the WSCG implementation of the generate node. The generate node used to flatten an array generally produces significantly more output rows than it receives as input.
The number of output rows is drastically higher still when flattening a nested array. When we combine more than one generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory, for multiple reasons. 1- As you can see from the attachment, the rows created in the nested loop are saved in a writer buffer. In my case, because the rows were big, I hit an Out Of Memory error. 2- The generated WholeStageCodeGen results in a nested loop that, for each row, will explode the parent array and then explode the inner array. This is prone to OutOfMemory errors. Please view the attached Spark GUI and Spark DAG. In my case the WholeStageCodeGen includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whole stage code gen node; doing so leads to potential memory issues. In our case, the job execution failed with an OOM error because the WSCG executed a nested for loop.
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Summary: Do not combine multiple Generate operators in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large (was: Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large)
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
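The nested-loop shape described in point 2 of the report can be sketched as follows. This is an illustrative Python model, not the actual generated Java (the function and buffer names are made up); it shows how combining two Generate operators in one stage buffers every produced row with no accounting for accumulated size:

```python
# Illustrative model (not the real generated code): combining two Generate
# operators in one WholeStageCodeGen stage yields a nested loop, and every
# produced row is appended to a writer buffer regardless of its size.
def combined_wscg(input_rows):
    writer_buffer = []                    # stand-in for the WSCG row buffer
    for key, nested_array in input_rows:
        for inner_array in nested_array:  # outer Generate: explode(c2)
            for element in inner_array:   # inner Generate: explode(col)
                # No check on row size or total buffer memory here -- with
                # large array elements this is where the OOM risk comes from.
                writer_buffer.append((key, element))
    return writer_buffer

buffered = combined_wscg([(1, [["a", "b"], ["c"]])])
print(buffered)  # [(1, 'a'), (1, 'b'), (1, 'c')]
```

If each Generate were kept in its own stage, rows could be handed downstream per outer-loop iteration instead of being accumulated across the full nested product, which is the motivation for not merging multiple explode nodes.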
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Summary: Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large (was: Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large)
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Description: This has been an issue since the WSCG implementation of the generate node. The generate node used to flatten an array generally produces significantly more output rows than it receives as input. The number of output rows is drastically higher still when flattening a nested array. When we combine more than one generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory, for multiple reasons. 1- As you can see from the attachment, the rows created in the nested loop are saved in a writer buffer. In my case, because the rows were big, I hit an Out Of Memory error. 2- The generated WholeStageCodeGen results in a nested loop that, for each row, will explode the parent array and then explode the inner array. This is prone to OutOfMemory errors. Please view the attached Spark GUI and Spark DAG. In my case the WholeStageCodeGen includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whole stage code gen node; doing so leads to potential memory issues. In our case, the job execution failed with an OOM error because the WSCG executed a nested for loop. was: The generate node used to flatten an array generally produces significantly more output rows than it receives as input. The number of output rows is drastically higher still when flattening a nested array. When we combine more than one generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory, for multiple reasons. 1- As you can see from the attachment, the rows created in the nested loop are saved in a writer buffer.
In my case, because the rows were big, I hit an Out Of Memory error. 2- The generated WholeStageCodeGen results in a nested loop that, for each row, will explode the parent array and then explode the inner array. This is prone to OutOfMemory errors. Please view the attached Spark GUI and Spark DAG. In my case the WholeStageCodeGen includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whole stage code gen node; doing so leads to potential memory issues. In our case, the job execution failed with an OOM error because the WSCG executed a nested for loop.
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Affects Version/s: 3.4.1 3.4.0 3.3.2 3.2.4 3.2.3 3.2.2 3.2.1 3.1.3 3.2.0 3.1.2 3.1.1 3.1.0 3.0.3 3.0.2 3.0.1 3.0.0
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Description: The generate node used to flatten an array generally produces significantly more output rows than it receives as input. The number of output rows is drastically higher still when flattening a nested array. When we combine more than one generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory, for multiple reasons. 1- As you can see from the attachment, the rows created in the nested loop are saved in a writer buffer. In my case, because the rows were big, I hit an Out Of Memory error. 2- The generated WholeStageCodeGen results in a nested loop that, for each row, will explode the parent array and then explode the inner array. This is prone to OutOfMemory errors. Please view the attached Spark GUI and Spark DAG. In my case the WholeStageCodeGen includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whole stage code gen node; doing so leads to potential memory issues. In our case, the job execution failed with an OOM error because the WSCG executed a nested for loop. was: The generate node used to flatten an array generally produces significantly more output rows than it receives as input. The number of output rows is drastically higher still when flattening a nested array. When we combine more than one generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory, for multiple reasons. 1- As you can see from the attachment, the rows created in the nested loop are saved in a writer buffer.
In my case, because the rows were big, I hit an Out Of Memory error. 2- The generated WholeStageCodeGen results in a nested loop that, for each row, will explode the parent array and then explode the inner array. This is prone to OutOfMemory errors. Please view the attached Spark GUI and Spark DAG. In my case the WholeStageCodeGen includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whole stage code gen node; doing so leads to potential memory issues. WSCG method for the first Generate node: !image-2023-08-10-09-22-29-949.png! WSCG method for the second Generate node: as you can see, the execution of generate_doConsume_1 and generate_doConsume_0 triggers a nested loop. !image-2023-08-10-09-24-02-755.png!
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Description: The generate node used to flatten an array generally produces significantly more output rows than it receives as input. The number of output rows is drastically higher still when flattening a nested array. When we combine more than one generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory, for multiple reasons. 1- As you can see from the attachment, the rows created in the nested loop are saved in a writer buffer. In my case, because the rows were big, I hit an Out Of Memory error. 2- The generated WholeStageCodeGen results in a nested loop that, for each row, will explode the parent array and then explode the inner array. This is prone to OutOfMemory errors. Please view the attached Spark GUI and Spark DAG. In my case the WholeStageCodeGen includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whole stage code gen node; doing so leads to potential memory issues. WSCG method for the first Generate node: !image-2023-08-10-09-22-29-949.png! WSCG method for the second Generate node: as you can see, the execution of generate_doConsume_1 and generate_doConsume_0 triggers a nested loop. !image-2023-08-10-09-24-02-755.png! was: The generate node used to flatten an array generally produces significantly more output rows than it receives as input. The number of output rows is drastically higher still when flattening a nested array. When we combine more than one generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory, for multiple reasons. 1- As you can see from the attachment, the rows created in the nested loop are saved in a writer buffer.
In my case, because the rows were big, I hit an Out Of Memory error. 2- The generated WholeStageCodeGen results in a nested loop that, for each row, will explode the parent array and then explode the inner array. This is prone to OutOfMemory errors. Please view the attached Spark GUI and Spark DAG. In my case the WholeStageCodeGen includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whole stage code gen node; doing so leads to potential memory issues. !image-2023-08-10-02-18-24-758.png! !image-2023-08-10-09-19-23-973.png! !image-2023-08-10-09-21-32-371.png! WSCG method for the first Generate node: !image-2023-08-10-09-22-29-949.png! WSCG method for the second Generate node: as you can see, the execution of generate_doConsume_1 and generate_doConsume_0 triggers a nested loop. !image-2023-08-10-09-24-02-755.png!
[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen nodebecause it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752870#comment-17752870 ] Franck Tago commented on SPARK-44759: - Spark DAG for the use case. The failure is from the execution of WholeStageCodeGen(2). !image-2023-08-10-09-33-47-788.png!
> > !image-2023-08-10-02-18-24-758.png! > > !image-2023-08-10-09-19-23-973.png! > > !image-2023-08-10-09-21-32-371.png! > > WSCG method for first Generate node > !image-2023-08-10-09-22-29-949.png! > > WSCG for second Generate node > As you can see, the execution of generate_doConsume_1 and generate_doConsume_0 > triggers a nested loop. > !image-2023-08-10-09-24-02-755.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
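The nested-loop behavior described in the report can be illustrated with a minimal pure-Python sketch. This is an illustration only, not the actual generated Java code; the function and variable names are hypothetical, loosely echoing the generate_doConsume_0/generate_doConsume_1 methods shown in the attached codegen output:

```python
# Pure-Python sketch (not Spark code) of the nested loop that whole-stage
# codegen produces when two Generate (explode) operators are fused: for each
# input row, the outer array is exploded, and for each outer element the
# inner array is exploded. The output size is multiplicative in the array
# sizes, and every produced row is appended to a buffer before being consumed,
# which mirrors the writer buffer the reporter saw fill up.

def fused_explodes(rows):
    """Each row carries a nested array; returns the fully flattened rows."""
    buffer = []                      # stand-in for the WSCG writer buffer
    for key, nested in rows:         # loop over input rows
        for inner in nested:         # ~generate_doConsume_0: explode outer array
            for value in inner:      # ~generate_doConsume_1: explode inner array
                buffer.append((key, value))
    return buffer

# One input row with a 3 x 4 nested array yields 12 output rows; with large
# arrays and wide rows, this multiplicative buffering is what exhausts memory.
rows = [("r1", [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])]
flat = fused_explodes(rows)
print(len(flat))  # 12
```

Splitting the two Generate operators into separate stages would bound each stage's output by a single explode, instead of materializing the full cross product of both arrays per input row in one generated loop.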
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Attachment: image-2023-08-10-09-33-47-788.png
[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752868#comment-17752868 ] Franck Tago commented on SPARK-44759: - WSCG generated code for second Generate node !image-2023-08-10-09-32-46-163.png!
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Attachment: image-2023-08-10-09-32-46-163.png
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Attachment: image-2023-08-10-09-29-24-804.png
[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752866#comment-17752866 ] Franck Tago commented on SPARK-44759: - WSCG generated code for first Generate node !image-2023-08-10-09-29-24-804.png!
[jira] [Comment Edited] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752864#comment-17752864 ] Franck Tago edited comment on SPARK-44759 at 8/10/23 4:28 PM: -- WSCG generated code that calls generate_doConsume_0 !image-2023-08-10-09-27-24-124.png! was (Author: tafra...@gmail.com): !image-2023-08-10-09-27-24-124.png!
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Attachment: image-2023-08-10-09-27-24-124.png
[jira] [Commented] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752864#comment-17752864 ] Franck Tago commented on SPARK-44759: - !image-2023-08-10-09-27-24-124.png!
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Summary: Do not combine multiple Generate nodes in the same WholeStageCodeGen node because it can easily cause OOM failures if arrays are relatively large (was: Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures)
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Description: (see the issue description above)
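Until Generate operators are kept in separate codegen stages, one possible mitigation is to disable whole-stage code generation for the affected job. This is a workaround sketch, not a fix for the underlying issue, and it assumes a live `SparkSession` named `spark` is in scope and that the workload can tolerate slower interpreted execution:

```python
# Possible workaround sketch (assumes an existing SparkSession `spark`; not a
# fix for the underlying issue). `spark.sql.codegen.wholeStage` is an existing
# Spark SQL conf; setting it to false makes the plan fall back to interpreted
# iterator execution, so the two Generate operators are no longer fused into
# a single generated nested loop with a shared writer buffer.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
```

The trade-off is per-row virtual-call overhead across the whole stage, so this is best scoped to the session or job that flattens large nested arrays rather than set cluster-wide.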
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Attachment: (was: spark-verbosewithcodegenenabled)
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Attachment: wholestagecodegen_wc1_debug_wholecodegen_passed
[jira] [Updated] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures
[ https://issues.apache.org/jira/browse/SPARK-44759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-44759: Attachment: spark-verbosewithcodegenenabled > Do not combine multiple Generate nodes in the same WholeStageCodeGen because > it can easily cause OOM failures > -- > > Key: SPARK-44759 > URL: https://issues.apache.org/jira/browse/SPARK-44759 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.3.0, 3.3.1 >Reporter: Franck Tago >Priority: Major > Attachments: spark-verbosewithcodegenenabled > > > The Generate node used to flatten an array generally produces a number of > output rows that is significantly higher than the number of input rows to > the node. The number of output rows is drastically higher still when > flattening a nested array. > When we combine more than one Generate node in the same WholeStageCodeGen > node, we run a high risk of running out of memory, for multiple reasons. > 1- As you can see from the attachment, the rows created in the nested loop > are saved in a writer buffer. In my case, because the rows were big, I hit > an Out Of Memory error. > 2- The generated WholeStageCodeGen results in a nested loop that, for each > row, will explode the parent array and then explode the inner array. This > is prone to OutOfMemory errors. > > Please view the attached Spark GUI and Spark DAG. > In my case the WholeStageCodeGen includes 2 explode nodes. > Because the array elements are large, we end up with an Out Of Memory error. > > I recommend that we do not merge multiple explode nodes in the same > whole-stage code gen node. Doing so leads to potential memory issues. > > !image-2023-08-10-02-18-24-758.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44759) Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures
Franck Tago created SPARK-44759: --- Summary: Do not combine multiple Generate nodes in the same WholeStageCodeGen because it can easily cause OOM failures Key: SPARK-44759 URL: https://issues.apache.org/jira/browse/SPARK-44759 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 3.3.1, 3.3.0 Reporter: Franck Tago The Generate node used to flatten an array generally produces a number of output rows that is significantly higher than the number of input rows to the node. The number of output rows is drastically higher still when flattening a nested array. When we combine more than one Generate node in the same WholeStageCodeGen node, we run a high risk of running out of memory, for multiple reasons. 1- As you can see from the attachment, the rows created in the nested loop are saved in a writer buffer. In my case, because the rows were big, I hit an Out Of Memory error. 2- The generated WholeStageCodeGen results in a nested loop that, for each row, will explode the parent array and then explode the inner array. This is prone to OutOfMemory errors. Please view the attached Spark GUI and Spark DAG. In my case the WholeStageCodeGen includes 2 explode nodes. Because the array elements are large, we end up with an Out Of Memory error. I recommend that we do not merge multiple explode nodes in the same whole-stage code gen node. Doing so leads to potential memory issues. !image-2023-08-10-02-18-24-758.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
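The memory amplification described above can be sketched outside Spark. Flattening a nested array with two chained explode operations multiplies the row count by the product of the array sizes, and a whole-stage-codegen-style nested loop materializes every one of those rows into a buffer before anything downstream consumes them. The following is plain Python for illustration only, not Spark's actual generated code:

```python
# Sketch (NOT Spark's generated code): two chained "explode" operators
# fused into a single nested loop, as whole-stage codegen would fuse them.
def flatten_nested(rows):
    buffer = []  # stands in for the codegen writer buffer that held the rows
    for row in rows:               # each input row
        for inner in row:          # first explode: the outer array
            for value in inner:    # second explode: the inner array
                buffer.append(value)  # every combination is materialized
    return buffer

# A single input row holding a 1000 x 1000 nested array ...
rows = [[[0] * 1000 for _ in range(1000)]]
out = flatten_nested(rows)
# ... expands to 1,000,000 buffered output rows from one input row,
# which is why large array elements push the executor out of memory.
print(len(out))
```

Splitting the two explodes into separate stages would let the intermediate rows be consumed incrementally instead of being buffered all at once, which is the intent behind not fusing multiple Generate nodes.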
[jira] [Commented] (SPARK-15214) Implement code generation for Generate
[ https://issues.apache.org/jira/browse/SPARK-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531008#comment-17531008 ] Franck Tago commented on SPARK-15214: - It appears that code generation is not supported for nested Generate cases, i.e. when I call the explode function to flatten a nested array. Is that correct? > Implement code generation for Generate > -- > > Key: SPARK-15214 > URL: https://issues.apache.org/jira/browse/SPARK-15214 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 2.2.0 > > > {{Generate}} currently does not support code generation. Let's add support > for CG for it and its most important generators: {{explode}} and > {{json_tuple}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916334#comment-16916334 ] Franck Tago edited comment on SPARK-23519 at 8/27/19 4:10 AM: -- [~viirya] My mistake; I tested it with Oracle and MySQL and then assumed that Hive would honor the same. You are correct: Hive does not appear to support this after all. I need to test this on newer versions of Hive. was (Author: tafra...@gmail.com): [~viirya] My mistake , i tested it with Oracle and MySql . I then assume that hive would honor the same . You are correct. Hive does not appear to support this after all . > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Major > Labels: bulk-closed > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916334#comment-16916334 ] Franck Tago commented on SPARK-23519: - [~viirya] My mistake; I tested it with Oracle and MySQL and then assumed that Hive would honor the same. You are correct: Hive does not appear to support this after all. > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Major > Labels: bulk-closed > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago reopened SPARK-23519: - Ok Spark Community, I am sorry for being a pest about this, but I am re-opening this Jira because I really believe that this should be addressed. Right now I do not have any way of satisfying my customer's requirement. My current use case is the following. My customer can provide any custom Hive query. I am oblivious to the actual content of the query, and parsing the query is not an option. All I know is the number of fields projected from the custom query and the types of those fields. I do not know the names of the fields projected from the custom query. What I currently do with Spark SQL is run a query of the form: Create view view_name > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Major > Labels: bulk-closed > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470843#comment-16470843 ] Franck Tago commented on SPARK-23519: - I do not agree with the 'typical database' claim. MySQL, Oracle, and Hive support this syntax. Example: !image-2018-05-10-10-48-57-259.png! > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Major > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
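The syntax in question, and the aliasing workaround suggested elsewhere in this thread, can both be demonstrated with SQLite, used here purely as a stand-in for the databases named above (this is not Spark code; SQLite accepts an explicit column list on CREATE VIEW):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atable (col1 INTEGER)")
conn.executemany("INSERT INTO atable VALUES (?)", [(10,), (100,)])

# Explicit view column list, same shape as the failing Spark statement:
# the duplicate projection is renamed by the (int1, int2) list.
conn.execute("CREATE VIEW aview (int1, int2) AS SELECT col1, col1 FROM atable")
print(conn.execute("SELECT int1, int2 FROM aview").fetchall())

# Equivalent workaround that Spark does accept: alias each projection so
# the view output no longer contains duplicate column names.
conn.execute(
    "CREATE VIEW aview2 AS SELECT col1 AS int1, col1 AS int2 FROM atable"
)
print(conn.execute("SELECT int1, int2 FROM aview2").fetchall())
```

Both views return the same rows; the difference is only where the renaming happens, which is why the reporter argues Spark could apply the view's column list before checking for duplicates.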
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-23519: Attachment: image-2018-05-10-10-48-57-259.png > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Major > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443154#comment-16443154 ] Franck Tago commented on SPARK-23519: - Thanks for the suggestion, [~shahid]. The issue with your suggestion is that I dynamically generate the create view statement; moreover, the select statement is opaque to me because it is provided by the customer. It would be nice if Spark could fix such a simple case. > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389150#comment-16389150 ] Franck Tago edited comment on SPARK-23519 at 3/21/18 3:29 AM: -- Any updates on this ? Could someone assist with this? was (Author: tafra...@gmail.com): Any updates on this ? > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-23519: Description: 1- create and populate a hive table . I did this in a hive cli session .[ not that this matters ] create table atable (col1 int) ; insert into atable values (10 ) , (100) ; 2. create a view from the table. [These actions were performed from a spark shell ] spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 from atable ") java.lang.AssertionError: assertion failed: The view output (col1,col1) contains duplicate column name. at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) was: 1- create and populate a hive table . I did this in a hive cli session .[ not that this matters ] create table atable (col1 int) ; insert into atable values (10 ) , (100) ; 2. create a view form the table. [ I did this from a spark shell ] spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 from atable ") java.lang.AssertionError: assertion failed: The view output (col1,col1) contains duplicate column name. 
at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-23519: Component/s: SQL > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view form the table. [ I did this from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389150#comment-16389150 ] Franck Tago commented on SPARK-23519: - Any updates on this ? > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view form the table. [ I did this from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378046#comment-16378046 ] Franck Tago commented on SPARK-19809: - Just to confirm , your earlier comment referred to spark 2.1.1. You meant spark 2.2.1 right ? > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at
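The NPE above comes from ORC split generation failing on empty files. On versions before the 2.3.0 fix, a workaround sometimes used is to locate and remove the zero-byte files under the table location before reading. A minimal spark-shell sketch, assuming a running session where `spark` is the SparkSession (the warehouse path below is a placeholder, not from the ticket):

```scala
// Sketch of a pre-read cleanup for affected versions; the table location is
// hypothetical. Run from spark-shell so the Hadoop configuration is populated.
import org.apache.hadoop.fs.{FileSystem, Path}

val tableLocation = new Path("/user/hive/warehouse/my_orc_table")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// List files directly under the table location and keep only zero-byte ones.
val emptyFiles = fs.listStatus(tableLocation).filter(s => s.isFile && s.getLen == 0)
emptyFiles.foreach { s =>
  println(s"empty ORC file: ${s.getPath}")
  // fs.delete(s.getPath, false)  // uncomment to actually remove the file
}
```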
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378028#comment-16378028 ] Franck Tago commented on SPARK-19809: - 1- I am constrained to Spark 2.2.1 at the moment. 2- My understanding is that the only thing different in Spark 2.3.0 is that spark.sql.hive.convertMetastoreOrc defaults to true. I looked at [https://github.com/apache/spark/pull/19948] and [https://github.com/apache/spark/pull/19960]. Am I missing anything? > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378018#comment-16378018 ] Franck Tago commented on SPARK-19809: - Need a pointer on the following. Env: Spark 2.2.1.
1- I set the property spark.sql.hive.convertMetastoreOrc to true.
2- My Hive table has the following schema:
{code}
CREATE TABLE `ft_orc`(
  `int` int,
  `double` double,
  `big+int` bigint,
  `$tring` string,
  `(decimal)` decimal(15,8),
  `flo@t` float,
  `datetime` date,
  `timestamp` timestamp,
  `01` int)
CLUSTERED BY (`int`) INTO 20 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
{code}
I loaded the table with 1 row of data !image-2018-02-26-20-29-49-410.png! I tried to run the following simple statement:
{code}
scala> var res = spark.sql(" SELECT alias.`int` as a0, alias.`double` as a1, alias.`big+int` as a2, alias.`$tring` as a3, CAST(alias.`(decimal)` AS DOUBLE) as a4, CAST(alias.`flo@t` AS DOUBLE) as a5, CAST(alias.`datetime` AS TIMESTAMP) as a6, alias.`timestamp` as a7, alias.`01` as a8 FROM default.ft_orc alias" )
{code}
which fails with:
{code}
18/02/27 04:30:57 WARN HiveConf: HiveConf of name hive.conf.hidden.list does not exist
java.lang.IndexOutOfBoundsException
 at java.nio.Buffer.checkIndex(Buffer.java:540)
 at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
 at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:374)
 at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:316)
 at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:187)
 at org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$getFileReader$2.apply(OrcFileOperator.scala:68)
 at org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$getFileReader$2.apply(OrcFileOperator.scala:67)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at scala.collection.TraversableOnce$class.collectFirst(TraversableOnce.scala:145)
 at scala.collection.AbstractIterator.collectFirst(Iterator.scala:1336)
 at org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:69)
 at org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
 at org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
 at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at scala.collection.immutable.List.foreach(List.scala:381)
 at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
 at scala.collection.immutable.List.flatMap(List.scala:344)
{code}
Any pointer? Should I file a separate Jira?
> NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png
[jira] [Updated] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-19809: Attachment: spark.sql.hive.convertMetastoreOrc.txt > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt
[jira] [Updated] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-19809: Attachment: image-2018-02-26-20-29-49-410.png > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png
[jira] [Created] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
Franck Tago created SPARK-23519: --- Summary: Create View Commands Fails with The view output (col1,col1) contains duplicate column name Key: SPARK-23519 URL: https://issues.apache.org/jira/browse/SPARK-23519 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: Franck Tago
1- Create and populate a Hive table. I did this in a Hive CLI session (not that this matters):
{code}
create table atable (col1 int) ;
insert into atable values (10) , (100) ;
{code}
2- Create a view from the table (I did this from a spark-shell):
{code}
spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 from atable ")
java.lang.AssertionError: assertion failed: The view output (col1,col1) contains duplicate column name.
 at scala.Predef$.assert(Predef.scala:170)
 at org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
 at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
 at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
 at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
 at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
 at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
 at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
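The assertion fires because the view's underlying SELECT list carries the same output name (col1) twice. A workaround sometimes used, sketched below and not confirmed by this ticket, is to alias each occurrence of the duplicated column so the plan's output names are distinct; the column list on the view still renames them:

```scala
// Workaround sketch, not a fix: alias the duplicated column in the SELECT
// list so the view output names are distinct. Behavior may vary by version.
spark.sql("""
  CREATE VIEW default.aview (int1, int2) AS
  SELECT col1 AS c1, col1 AS c2 FROM atable
""")
```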
[jira] [Commented] (SPARK-20235) Hive on S3 s3:sse and non S3:sse buckets
[ https://issues.apache.org/jira/browse/SPARK-20235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013290#comment-16013290 ] Franck Tago commented on SPARK-20235: - was this comment meant for me? what does that mean ? > Hive on S3 s3:sse and non S3:sse buckets > - > > Key: SPARK-20235 > URL: https://issues.apache.org/jira/browse/SPARK-20235 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Franck Tago >Priority: Minor > > my spark application writes into 2 hive tables . > both tables are external with data residing on S3 > I want to encrypt the data when writing into hive table1 , but I do not want > to encrypt the data when writing into hive table 2. > given that the parameter fs.s3a.server-side-encryption-algorithm is set > globally , I do not see how these use cases are supported in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20235) Hive on S3 s3:sse and non S3:sse buckets
Franck Tago created SPARK-20235: --- Summary: Hive on S3 s3:sse and non S3:sse buckets Key: SPARK-20235 URL: https://issues.apache.org/jira/browse/SPARK-20235 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Franck Tago Priority: Minor My Spark application writes into 2 Hive tables. Both tables are external, with data residing on S3. I want to encrypt the data when writing into Hive table 1, but I do not want to encrypt the data when writing into Hive table 2. Given that the parameter fs.s3a.server-side-encryption-algorithm is set globally, I do not see how these use cases are supported in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
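One possible avenue, an assumption rather than something this ticket confirms: Hadoop 2.8+ supports per-bucket S3A configuration, where any fs.s3a.* option, including the encryption algorithm, can be scoped to a single bucket. A sketch with hypothetical bucket names, run from spark-shell:

```scala
// Hypothetical bucket names; assumes a Hadoop version with per-bucket S3A
// configuration (2.8+). Only writes to "encrypted-bucket" request SSE.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.bucket.encrypted-bucket.server-side-encryption-algorithm", "AES256")
// No per-bucket override for the second bucket, so its writes stay unencrypted.
```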
[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956346#comment-15956346 ] Franck Tago commented on SPARK-20153: - ok thanks for the tips. It appears that EMR 5.4.0 also supports the use of the s3a within a spark application. http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html This was a painful restriction prior to this resolution. > Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > > I need to access multiple hive tables in my spark application where each hive > table is > 1- an external table with data sitting on S3 > 2- each table is own by a different AWS user so I need to provide different > AWS credentials. > I am familiar with setting the aws credentials in the hadoop configuration > object but that does not really help me because I can only set one pair of > (fs.s3a.awsAccessKeyId , fs.s3a.awsSecretAccessKey ) > From my research , there is no easy or elegant way to do this in spark . > Why is that ? > How do I address this use case? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955695#comment-15955695 ] Franck Tago commented on SPARK-20153: - oh I would definitely not consider including the key into the URL . It is a gigantic security hole in my opinion. Moreover consider that I am dealing with Hive on S3 where the uri is part of the table metadata. How would that work in this case. Is there a way to encode the accesId and secret key before calling Does spark provide anyway of masking or hiding the accessId and secretKey? > Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > > I need to access multiple hive tables in my spark application where each hive > table is > 1- an external table with data sitting on S3 > 2- each table is own by a different AWS user so I need to provide different > AWS credentials. > I am familiar with setting the aws credentials in the hadoop configuration > object but that does not really help me because I can only set one pair of > (fs.s3a.awsAccessKeyId , fs.s3a.awsSecretAccessKey ) > From my research , there is no easy or elegant way to do this in spark . > Why is that ? > How do I address this use case? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
Franck Tago created SPARK-20153: --- Summary: Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application Key: SPARK-20153 URL: https://issues.apache.org/jira/browse/SPARK-20153 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0, 2.0.1 Reporter: Franck Tago I need to access multiple Hive tables in my Spark application, where each Hive table is 1- an external table with data sitting on S3, and 2- owned by a different AWS user, so I need to provide different AWS credentials. I am familiar with setting the AWS credentials in the Hadoop configuration object, but that does not really help me because I can only set one pair of (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey). From my research, there is no easy or elegant way to do this in Spark. Why is that? How do I address this use case? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
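For later readers: Hadoop 2.8+ added per-bucket S3A configuration, which addresses exactly this case by keying credentials to a bucket name instead of one global pair. A sketch from spark-shell, with placeholder bucket names and keys (the option names are the per-bucket forms of the S3A credential settings; availability depends on the Hadoop version on the classpath):

```scala
// Placeholder buckets and keys; assumes Hadoop 2.8+ per-bucket S3A config.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.bucket.bucket-a.access.key", "ACCESS_KEY_A")  // used for s3a://bucket-a
hc.set("fs.s3a.bucket.bucket-a.secret.key", "SECRET_KEY_A")
hc.set("fs.s3a.bucket.bucket-b.access.key", "ACCESS_KEY_B")  // used for s3a://bucket-b
hc.set("fs.s3a.bucket.bucket-b.secret.key", "SECRET_KEY_B")
```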
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627406#comment-15627406 ] Franck Tago commented on SPARK-17982: - Wanted to mention that I was able to successfully verify my cases with the changes made under this request. > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago >Priority: Blocker > > The following statement fails in the spark shell . > {noformat} > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > {noformat} > This appears to be a limitation of the create view statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620750#comment-15620750 ] Franck Tago commented on SPARK-15616: - So I was not able to use the changes, for the following reasons: 1- I forgot to mention that I am working off the spark 2.0.1 branch. 2- I get the following error: [info] Compiling 30 Scala sources and 2 Java sources to /export/home/devbld/spark_world/Mercury/pvt/ftago/spark-2.0.1/sql/hive/target/scala-2.11/classes... [error] /export/home/devbld/spark_world/Mercury/pvt/ftago/spark-2.0.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala:295: type mismatch; [error] found : Seq[org.apache.spark.sql.catalyst.expressions.Expression] [error] required: Option[String] [error] MetastoreRelation(databaseName, tableName, partitionPruningPred)(catalogTable, client, sparkSession) [error]^ [error] one error found [error] Compile failed Can you please build a version of this fix off spark 2.0.1? I tried incorporating your changes but, as the error message shown above indicates, I was not able to. > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. 
[jira] [Commented] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax
[ https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616542#comment-15616542 ] Franck Tago commented on SPARK-17612: - Hi Thanks for the quick reply. So I was only trying a long shot , because I wanted to know if the join issue that i described was related somehow to this issue . I know it is not probable but just wanted to check . > Support `DESCRIBE table PARTITION` SQL syntax > - > > Key: SPARK-17612 > URL: https://issues.apache.org/jira/browse/SPARK-17612 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.2, 2.1.0 > > > This issue implements `DESC PARTITION` SQL Syntax again. It was dropped since > Spark 2.0.0. > h4. Spark 2.0.0 > {code} > scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY > (c STRING, d STRING)") > res0: org.apache.spark.sql.DataFrame = [] > scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)") > res1: org.apache.spark.sql.DataFrame = [] > scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false) > org.apache.spark.sql.catalyst.parser.ParseException: > Unsupported SQL statement > == SQL == > DESC partitioned_table PARTITION (c='Us', d=1) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > ... 48 elided > {code} > h4. 
Spark 1.6.2 > {code} > scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY > (c STRING, d STRING)") > res1: org.apache.spark.sql.DataFrame = [result: string] > scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)") > res2: org.apache.spark.sql.DataFrame = [result: string] > scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false) > 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the > row! Ignoring similar problems. > ++ > |result | > ++ > |a string| > |b int | > |c string| > |d string| > || > |# Partition Information > | > |# col_name data_type comment | > || > |c string| > |d string| > ++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616385#comment-15616385 ] Franck Tago commented on SPARK-15616: - So I tried the changes that were made for this issue (SPARK-15616), but I hit a runtime issue when trying to test them. scala> spark.conf.set("spark.sql.statistics.fallBackToHdfs" , "true") scala> spark.conf.set("spark.sql.statistics.partitionPruner" ,"true") scala> val df1= spark.table("ft_p") df1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field] scala> val df2=spark.table("ft_p_no") df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field] scala> df2.join((df1.select($"col1".as("acol1"), $"col3".as("acol3")).filter($"acol3"===5)) , $"col1"===$"acol1" ).explain == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: col3#43 > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
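The failed session above can be restated as a self-contained spark-shell sketch. The `ft_p` / `ft_p_no` tables and `col1`/`col3` columns come from the comment; note that `spark.sql.statistics.partitionPruner` is a setting introduced by the proposed patch, not a released Spark configuration, so this only makes sense against a patched 2.0.x build:

```scala
// spark-shell sketch of the repro above. Assumes the patched build and the
// ft_p (partitioned) / ft_p_no (non-partitioned) tables from the comment.
import org.apache.spark.sql.functions.col

// Released config: estimate table size from HDFS when metastore stats are missing.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")
// Config added by the proposed patch only; not part of released Spark.
spark.conf.set("spark.sql.statistics.partitionPruner", "true")

val pruned = spark.table("ft_p")
  .select(col("col1").as("acol1"), col("col3").as("acol3"))
  .filter(col("acol3") === 5)

// explain() shows whether the planner picked BroadcastHashJoin or SortMergeJoin;
// with the patch applied, it instead fails while binding col3, as reported above.
spark.table("ft_p_no").join(pruned, col("col1") === col("acol1")).explain()
```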
[jira] [Commented] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax
[ https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616376#comment-15616376 ] Franck Tago commented on SPARK-17612: - Hi Basically I have an issue where I am performing the following operations. Partitioned Large Hive Table (hive table 1) -- filter ---join / Non Partitioned Large Hive Table Basically I am joining 2 large tables. Both tables' raw sizes exceed the broadcast join threshold. The filter selects a specific partition. This partition is small enough that its size is smaller than the broadcast join threshold. With Spark 2.0 and Spark 2.0.1, I do not see a broadcast join; I see a sort merge join, which is really surprising to me given that this could be a really common case. You can imagine a user who has a large log table partitioned by date and filters on a specific date. We should be able to do a broadcast join in that case. The question now is the following. I do not think this Spark issue addresses the cited problem, but I could be wrong. I tried incorporating the change in the spark 2.0 PR, but I see the same behavior, that is, no broadcast join. Question: Is this spark issue supposed to address the problem that I mentioned? - If not, which I think is the case, do you know if spark currently has a fix for the cited issue? I also tried the fix under SPARK-15616, but I hit a runtime failure. There has got to be a solution to this problem somewhere. > Support `DESCRIBE table PARTITION` SQL syntax > - > > Key: SPARK-17612 > URL: https://issues.apache.org/jira/browse/SPARK-17612 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.2, 2.1.0 > > > This issue implements `DESC PARTITION` SQL Syntax again. It was dropped since > Spark 2.0.0. > h4. 
Spark 2.0.0 > {code} > scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY > (c STRING, d STRING)") > res0: org.apache.spark.sql.DataFrame = [] > scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)") > res1: org.apache.spark.sql.DataFrame = [] > scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false) > org.apache.spark.sql.catalyst.parser.ParseException: > Unsupported SQL statement > == SQL == > DESC partitioned_table PARTITION (c='Us', d=1) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > ... 48 elided > {code} > h4. Spark 1.6.2 > {code} > scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY > (c STRING, d STRING)") > res1: org.apache.spark.sql.DataFrame = [result: string] > scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)") > res2: org.apache.spark.sql.DataFrame = [result: string] > scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false) > 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the > row! Ignoring similar problems. > ++ > |result | > ++ > |a string| > |b int | > |c string| > |d string| > || > |# Partition Information > | > |# col_name data_type comment | > || > |c string| > |d string| > ++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To
[jira] [Comment Edited] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612937#comment-15612937 ] Franck Tago edited comment on SPARK-15616 at 10/27/16 7:35 PM: --- Hi In my case the filter is on a partition key , so i would need your changes in order to see a broadcast join ? was (Author: tafra...@gmail.com): Hi In my case the filter is on a partition key , so i would need these changes in order to see a broadcast join ? > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612937#comment-15612937 ] Franck Tago commented on SPARK-15616: - Hi In my case the filter is on a partition key , so i would need these changes in order to see a broadcast join ? > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15610024#comment-15610024 ] Franck Tago commented on SPARK-15616: - Does spark currently support pruning statistics for hive table partitions? Consider an rdd that consists of hive table 1 --- filter -- join / large hive table2 --- where hive table1 and hive table 2 both exceed the spark.sql.autoBroadcastJoinThreshold. However, the filter only keeps the data of a single partition, which is small enough to fit under spark.sql.autoBroadcastJoinThreshold. The question is: will spark perform a broadcast join in this case? Using spark 2.0.0 and spark 2.0.1, my observation is that spark will not. Any ideas? > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
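Until a statistics fallback like the one proposed here is available, one way to get the broadcast plan described above is the explicit `broadcast()` hint, which bypasses the metastore size estimate entirely. A minimal spark-shell sketch, with hypothetical table and column names:

```scala
// spark-shell sketch: force the filtered (small) side to be broadcast instead of
// relying on metastore size statistics. Table/column names are hypothetical.
import org.apache.spark.sql.functions.{broadcast, col}

val small = spark.table("hive_table_1").filter(col("part_key") === "2016-10-01")
val large = spark.table("hive_table_2")

// broadcast() marks `small` for a broadcast join regardless of
// spark.sql.autoBroadcastJoinThreshold; explain() should show BroadcastHashJoin.
large.join(broadcast(small), large("join_key") === small("join_key")).explain()
```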
[jira] [Comment Edited] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607872#comment-15607872 ] Franck Tago edited comment on SPARK-15616 at 10/26/16 11:20 PM: Hi Wondering if there are any updates on this issue was (Author: tafra...@gmail.com): Hi Wondering if there are any updates on this issue > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607872#comment-15607872 ] Franck Tago commented on SPARK-15616: - Hi Wondering if there are any updates on this issue > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586918#comment-15586918 ] Franck Tago commented on SPARK-17982: - Updated the title of the Jira. I looked at the views.scala file and I want to know if setting the flag spark.sql.nativeView.canonical to false is an acceptable workaround. I tested it and it works, but the question is whether that is an acceptable workaround. > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > 
at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
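The workaround discussed in the comment above can be sketched as follows. `spark.sql.nativeView.canonical` is an internal 2.0.x flag (removed in later releases), so this is a stopgap rather than a supported setting:

```scala
// spark-shell sketch: disable canonicalized view SQL so CREATE VIEW stores the
// original statement text instead of the canonicalized form that fails to re-parse.
// Internal flag, Spark 2.0.x only.
spark.conf.set("spark.sql.nativeView.canonical", "false")

spark.sql("""CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where
             (WHERE_ID, WHERE_NAME)
             AS SELECT `where`.id, `where`.name FROM DEFAULT.`where` LIMIT 2""")
```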
[jira] [Updated] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark.
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-17982: Summary: Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark. (was: Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause) > Spark 2.0.0 CREATE VIEW statement fails :: java.lang.RuntimeException: > Failed to analyze the canonicalized SQL. It is possible there is a bug in > Spark. > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586309#comment-15586309 ] Franck Tago edited comment on SPARK-17982 at 10/18/16 6:50 PM: --- Hi , couple of things that I want to point out here. First , your workaround will not be acceptable for me because I need to control the name of the view columns . This is due to the fact that I programatically generate the scala code for the spark application . There are some invariants that are required in my framework . Second , you claim that spark does not support column names in create view statements at all ? How come it does work in some cases ? Is that a limitation that is documented somewhere? was (Author: tafra...@gmail.com): Hi , couple of things that I want to point out here. First , your workaround will not be acceptable for me because I need to control the name of the view columns . This is due to the fact that I programatically generate the scala code for the spark application . There are some invariants that are required in my framework . Second , you claim that spark does not support column names in create view statements at all ? How come it does work in some cases ? > Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit > clause > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at 
org.apache.spark.sql.Dataset.(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the create view statement . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586309#comment-15586309 ] Franck Tago commented on SPARK-17982: - Hi, a couple of things that I want to point out here. First, your workaround will not be acceptable for me, because I need to control the names of the view columns. This is due to the fact that I programmatically generate the scala code for the spark application, and there are some invariants that are required in my framework. Second, you claim that spark does not support column names in create view statements at all? How come it does work in some cases? > Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit > clause > > > Key: SPARK-17982 > URL: https://issues.apache.org/jira/browse/SPARK-17982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: spark 2.0.0 >Reporter: Franck Tago > > The following statement fails in the spark shell . 
> scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > scala> spark.sql("CREATE VIEW > DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS > SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") > java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT > `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT > `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT > `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS > `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS > gen_subquery_1 > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at org.apache.spark.sql.Dataset.(Dataset.scala:186) > at 
org.apache.spark.sql.Dataset.<init>(Dataset.scala:167) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) > This appears to be a limitation of the CREATE VIEW statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
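For readers hitting the same analyzer failure on Spark 2.0.x, one commonly suggested shape is to drop the CREATE VIEW column list, alias the columns inside the SELECT, and push the LIMIT into a parenthesized subquery so the canonicalized SQL stays parseable. This is an untested sketch, not an official fix, and the commenter above notes that workarounds which give up the view column list may not fit every use case (here the aliases still control the column names):

```scala
// Sketch only: assumes a spark-shell SparkSession named `spark` and the
// `default.where` table from the report. Aliasing inside the SELECT
// replaces the CREATE VIEW column list, and the LIMIT lives in a
// subquery instead of the view's top-level query.
spark.sql("""
  CREATE VIEW DEFAULT.sparkshell_2_view_workaround AS
  SELECT t.id AS WHERE_ID, t.name AS WHERE_NAME
  FROM (SELECT `where`.id, `where`.name FROM DEFAULT.`where` LIMIT 2) t
""")
```

If this shape works, the canonicalized SQL no longer places the LIMIT directly under the top-level projection, which is where the reported parse failure occurs.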
[jira] [Commented] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause
[ https://issues.apache.org/jira/browse/SPARK-17982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584178#comment-15584178 ] Franck Tago commented on SPARK-17982: - == SQL == SELECT `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS gen_subquery_1 ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:189) ... 64 more
[jira] [Created] (SPARK-17982) Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause
Franck Tago created SPARK-17982: --- Summary: Spark 2.0.0 CREATE VIEW statement fails when select statement contains limit clause Key: SPARK-17982 URL: https://issues.apache.org/jira/browse/SPARK-17982 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1, 2.0.0 Environment: spark 2.0.0 Reporter: Franck Tago The following statement fails in the spark shell . scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") scala> spark.sql("CREATE VIEW DEFAULT.sparkshell_2_VIEW__hive_quoted_with_where (WHERE_ID , WHERE_NAME ) AS SELECT `where`.id,`where`.name FROM DEFAULT.`where` limit 2") java.lang.RuntimeException: Failed to analyze the canonicalized SQL: SELECT `gen_attr_0` AS `WHERE_ID`, `gen_attr_2` AS `WHERE_NAME` FROM (SELECT `gen_attr_1` AS `gen_attr_0`, `gen_attr_3` AS `gen_attr_2` FROM SELECT `gen_attr_1`, `gen_attr_3` FROM (SELECT `id` AS `gen_attr_1`, `name` AS `gen_attr_3` FROM `default`.`where`) AS gen_subquery_0 LIMIT 2) AS gen_subquery_1 at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:192) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:122) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) This appears to be a limitation of the CREATE VIEW statement.
[jira] [Created] (SPARK-17859) persist should not impede Spark's ability to perform a broadcast join.
Franck Tago created SPARK-17859: --- Summary: persist should not impede with spark's ability to perform a broadcast join. Key: SPARK-17859 URL: https://issues.apache.org/jira/browse/SPARK-17859 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.0.0 Environment: spark 2.0.0 , Linux RedHat Reporter: Franck Tago I am using Spark 2.0.0 My investigation leads me to conclude that calling persist could prevent broadcast join from happening . Example Case1: No persist call var df1 =spark.range(100).select($"id".as("id1")) df1: org.apache.spark.sql.DataFrame = [id1: bigint] var df2 =spark.range(1000).select($"id".as("id2")) df2: org.apache.spark.sql.DataFrame = [id2: bigint] df1.join(df2 , $"id1" === $"id2" ).explain == Physical Plan == *BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight :- *Project [id#114L AS id1#117L] : +- *Range (0, 100, splits=2) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *Project [id#120L AS id2#123L] +- *Range (0, 1000, splits=2) Case 2: persist call df1.persist.join(df2 , $"id1" === $"id2" ).explain 16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data. == Physical Plan == *SortMergeJoin [id1#3L], [id2#9L], Inner :- *Sort [id1#3L ASC], false, 0 : +- Exchange hashpartitioning(id1#3L, 10) : +- InMemoryTableScan [id1#3L] :: +- InMemoryRelation [id1#3L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) :: : +- *Project [id#0L AS id1#3L] :: : +- *Range (0, 100, splits=2) +- *Sort [id2#9L ASC], false, 0 +- Exchange hashpartitioning(id2#9L, 10) +- InMemoryTableScan [id2#9L] : +- InMemoryRelation [id2#9L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) : : +- *Project [id#6L AS id2#9L] : : +- *Range (0, 1000, splits=2) Why does the persist call prevent the broadcast join . My opinion is that it should not . I was made aware that the persist call is lazy and that might have something to do with it , but I still contend that it should not . 
Losing broadcast joins is really costly.
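For cases like the report above, where persisting one side demotes a BroadcastHashJoin to a SortMergeJoin, an explicit broadcast hint usually restores the broadcast plan. This is a hedged spark-shell sketch for Spark 2.x, not an explanation of why the planner behaves this way:

```scala
// Assumes a spark-shell session where `spark` and `spark.implicits._`
// are in scope. org.apache.spark.sql.functions.broadcast marks a
// DataFrame for broadcast regardless of the planner's size estimates.
import org.apache.spark.sql.functions.broadcast

var df1 = spark.range(100).select($"id".as("id1"))
var df2 = spark.range(1000).select($"id".as("id2"))

df1.persist()
// Hint df2 (the side that was broadcast in the unpersisted plan) so the
// join stays a BroadcastHashJoin even though df1 is now an
// InMemoryRelation with different statistics.
df1.join(broadcast(df2), $"id1" === $"id2").explain()
```

With the hint applied, the physical plan should again show a BroadcastExchange under the join instead of the Exchange hashpartitioning / Sort pair.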
[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition
[ https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547292#comment-15547292 ] Franck Tago commented on SPARK-17758: - My first impression after taking a look at First.scala leads me to conclude that the issue depicted here should not happen for the First aggregate function call . Is that correct ? > Spark Aggregate function LAST returns null on an empty partition > -- > > Key: SPARK-17758 > URL: https://issues.apache.org/jira/browse/SPARK-17758 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Spark 2.0.0 >Reporter: Franck Tago > Labels: correctness > > My Environment > Spark 2.0.0 > I have included the physical plan of my application below. > Issue description > The result from a query that uses the LAST function are incorrect. > The output obtained for the column that corresponds to the last function is > null . > My input data contain 3 rows . > The application resulted in 2 stages > The first stage consisted of 3 tasks . > The first task/partition contains 2 rows > The second task/partition contains 1 row > The last task/partition contain 0 rows > The result from the query executed for the LAST column call is NULL which I > believe is due to the PARTIAL_LAST on the last partition . > I believe that this behavior is incorrect. The PARTIAL_LAST call on an empty > partition should not return null . 
> {noformat} > == Physical Plan == > InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false > +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) > AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104] >+- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], > output=[max(C3_0)#50,last(C3_1)#51]) > +- SortAggregate(key=[], > functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], > output=[max#91,last#92]) > +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 > AS C3_1#41] > +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as > bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS > DOUBLE)#27,last(C1_1)#28]) >+- Exchange SinglePartition > +- SortAggregate(key=[], > functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, > false)], output=[sum#95L,last#96]) > +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18] > +- HiveTableScan [field1#7, field#6], > MetastoreRelation default, bdm_3449_src, alias > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
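The PARTIAL_LAST behavior described above can be modeled in a few lines of plain Scala. This is a simplified model of partial aggregation, not Spark's actual Last implementation: each partition contributes a partial buffer, and the final aggregate merges the buffers left to right. If the merge blindly prefers the right-hand buffer, an empty trailing partition wipes out the real last value:

```scala
// Simplified model: one Partial per partition; None means the partition
// produced no rows (the empty third partition from the report).
case class Partial(value: Option[String])

// Bug model: always take the right-hand (later partition's) buffer,
// even when that partition was empty.
def naiveMerge(left: Partial, right: Partial): Partial = right

// Fix direction: only take the right buffer if that partition saw a row.
def correctMerge(left: Partial, right: Partial): Partial =
  if (right.value.isDefined) right else left

val partials = Seq(Partial(Some("b")), Partial(Some("c")), Partial(None))
partials.reduce(naiveMerge)    // Partial(None)      -> surfaces as NULL
partials.reduce(correctMerge)  // Partial(Some("c")) -> the true last value
```

The fix direction is the correctMerge shape: a buffer from a partition that never saw a row must not overwrite a populated one.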
[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition
[ https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544003#comment-15544003 ] Franck Tago commented on SPARK-17758: - Thanks for the pointer. Is it possible to write a UDAF that supports combiner-like functionality? I am referring to partial_last, partial_min, etc.
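Regarding the UDAF question above: in Spark 2.x the UserDefinedAggregateFunction API does expose combiner-like behavior, since update() builds a map-side partial buffer and merge() combines partial buffers (the analogue of partial_last). Below is a hedged sketch of a "last non-null" aggregate; it is illustrative only, not Spark's built-in Last:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Sketch of a "last non-null string" UDAF for Spark 2.x. update() acts as
// the map-side combiner; merge() combines partial buffers, taking the
// right-hand value only if that partition actually saw a row.
class LastNonNull extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("last", StringType) :: StructField("seen", BooleanType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = false // "last" depends on row order

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = null
    buffer(1) = false
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) { buffer(0) = input.getString(0); buffer(1) = true }
  }
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    // An empty partition leaves seen=false, so its buffer never overwrites b1.
    if (b2.getBoolean(1)) { b1(0) = b2.getString(0); b1(1) = true }
  }
  def evaluate(buffer: Row): Any = buffer.getString(0)
}
```

It could then be registered with something like spark.udf.register("last_nn", new LastNonNull) and used from SQL or the DataFrame API.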
[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition
[ https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543926#comment-15543926 ] Franck Tago commented on SPARK-17758: - Is there any workaround that you can think of? One simple solution could be to call repartition on the data frame in order to remove the empty partition, but performance-wise that is just terrible. I am not sure that I comprehend why the concept of an empty partition is even allowed in the Spark ecosystem. Any documentation on why empty partitions are allowed would be greatly appreciated.
[jira] [Comment Edited] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition
[ https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543488#comment-15543488 ] Franck Tago edited comment on SPARK-17758 at 10/3/16 11:04 PM: --- [~hvanhovell] It should return a non-null value in this case because the null value is a false representation of my data. My initial input source had 3 rows: a,1 b,2 c,10. From my observations, partition 1 contains 2 rows (a,1 and b,2), partition 2 contains 1 row (c,10), and partition 3 contains 0 rows. Why should last return null in this case? I understand that last would not be deterministic, but why should it return null in this case? was (Author: tafra...@gmail.com): [~hvanhovell] It should return non null in this case because the null value is a false representation of my data . My initial input source had 3 rows . a,1 b,2 c,10 >From observations show partition 1 contains 2 rows a,1 b,2 partition 2 contains 1 row c,10 partition 3 contains 0 rows why should last return null in this case . I understand that last would not be deterministic but why should should it return null in this case ?
[jira] [Comment Edited] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition
[ https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543488#comment-15543488 ] Franck Tago edited comment on SPARK-17758 at 10/3/16 11:04 PM: --- [~hvanhovell] It should return non null in this case because the null value is a false representation of my data . My initial input source had 3 rows . a,1 b,2 c,10 >From observations show partition 1 contains 2 rows a,1 b,2 partition 2 contains 1 row c,10 partition 3 contains 0 rows why should last return null in this case . I understand that last would not be deterministic but why should should it return null in this case ? was (Author: tafra...@gmail.com): [~hvanhovell] It should return on null in this case because the null value is a false representation of my data . My initial input source had 3 rows . a,1 b,2 c,10 >From observations show partition 1 contains 2 rows a,1 b,2 partition 2 contains 1 row c,10 partition 3 contains 0 rows why should last return null in this case . I understand that last would not be deterministic but why should should it return null in this case ?
[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition
[ https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543488#comment-15543488 ] Franck Tago commented on SPARK-17758: - [~hvanhovell] It should return a non-null value in this case because the null value is a false representation of my data. My initial input source had 3 rows: a,1 b,2 c,10. From my observations, partition 1 contains 2 rows (a,1 and b,2), partition 2 contains 1 row (c,10), and partition 3 contains 0 rows. Why should last return null in this case? I understand that last would not be deterministic, but why should it return null in this case?
[jira] [Commented] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition
[ https://issues.apache.org/jira/browse/SPARK-17758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543097#comment-15543097 ] Franck Tago commented on SPARK-17758: - I tested the behavior of the min and max functions with sort aggregate and an empty partition, and the results were correct. I would also note that this issue is not reproducible in Spark version 1.6.
[jira] [Created] (SPARK-17758) Spark Aggregate function LAST returns null on an empty partition
Franck Tago created SPARK-17758: --- Summary: Spark Aggregate function LAST returns null on an empty partition Key: SPARK-17758 URL: https://issues.apache.org/jira/browse/SPARK-17758 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.0.0 Environment: Spark 2.0.0 Reporter: Franck Tago My Environment: Spark 2.0.0. I have included the physical plan of my application below. Issue description: The results from a query that uses the LAST function are incorrect; the output obtained for the column that corresponds to the LAST function is null. My input data contains 3 rows. The application resulted in 2 stages, and the first stage consisted of 3 tasks. The first task/partition contains 2 rows, the second task/partition contains 1 row, and the last task/partition contains 0 rows. The result from the query executed for the LAST column is NULL, which I believe is due to the PARTIAL_LAST on the last partition. I believe that this behavior is incorrect: the PARTIAL_LAST call on an empty partition should not return null.
== Physical Plan == InsertIntoHiveTable MetastoreRelation default, bdm_3449_tgt20, true, false +- *Project [last(C3_1)#51 AS field#102, cast(round(max(C3_0)#50, 0) as int) AS field1#103, cast(round(max(C3_0)#50, 0) as int) AS field2#104] +- SortAggregate(key=[], functions=[max(C3_0#40),last(C3_1#41, false)], output=[max(C3_0)#50,last(C3_1)#51]) +- SortAggregate(key=[], functions=[partial_max(C3_0#40),partial_last(C3_1#41, false)], output=[max#91,last#92]) +- *Project [CAST(sum(C1_0) AS DOUBLE)#27 AS C3_0#40, last(C1_1)#28 AS C3_1#41] +- SortAggregate(key=[], functions=[sum(cast(C1_0#17 as bigint)),last(C1_1#18, false)], output=[CAST(sum(C1_0) AS DOUBLE)#27,last(C1_1)#28]) +- Exchange SinglePartition +- SortAggregate(key=[], functions=[partial_sum(cast(C1_0#17 as bigint)),partial_last(C1_1#18, false)], output=[sum#95L,last#96]) +- *Project [field1#7 AS C1_0#17, field#6 AS C1_1#18] +- HiveTableScan [field1#7, field#6], MetastoreRelation default, bdm_3449_src, alias -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org