[jira] [Comment Edited] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-10-27 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16222312#comment-16222312
 ] 

Ohad Raviv edited comment on SPARK-21657 at 10/27/17 12:53 PM:
---

Hi,
Just ran a profiler for this code:
{code:scala}
import org.apache.spark.sql.functions._

val BASE = 1
val N = 10
val df = sc.parallelize(List(("1234567890",
  (BASE to (BASE + N)).map(x =>
    (x.toString, (x + 1).toString, (x + 2).toString, (x + 3).toString)).toList
))).toDF("c1", "c_arr")
val df_exploded = df.select(expr("c1"), explode($"c_arr").as("c2"))
df_exploded.write.mode("overwrite").format("json").save("/tmp/blah_explode")
{code}

and it looks like [~srowen] is right: most of the time (72.1%) is spent in 
scala.collection.immutable.List.apply(), called from 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext() (100%).
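
For reference, one way to dump the code that whole-stage codegen produces for this plan is the debug package (a sketch, assuming a Spark 2.x spark-shell session and the df_exploded DataFrame from the snippet above):

{code:scala}
// Prints the Java source generated for each WholeStageCodegen subtree,
// which is where the List.apply call shows up.
import org.apache.spark.sql.execution.debug._
df_exploded.debugCodegen()
{code}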

I logged the generated code and found the problematic code:
{code:java}
if (serializefromobject_funcResult1 != null) {
  serializefromobject_value5 = (scala.collection.immutable.List) serializefromobject_funcResult1;
} else {
  serializefromobject_isNull5 = true;
}
...
while (serializefromobject_loopIndex < serializefromobject_dataLength) {
  MapObjects_loopValue0 = (scala.Tuple4)
    (serializefromobject_value5.apply(serializefromobject_loopIndex));
{code}

so that is what causes the quadratic time complexity.
However, I'm not sure where the code that produces this List (instead of an array) 
for the exploded column is generated.
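
As a rough illustration of why that pattern is quadratic (a sketch only, not the generated code): indexing a linked List inside a positional loop costs O(i) per access, while an indexed collection such as Vector is effectively constant-time.

{code:scala}
// Illustration only: List.apply(i) walks i cells, so the loop below is O(n^2) for a List;
// the same loop over a Vector (or an Array) stays roughly linear.
val asList: List[Int]     = (1 to 100000).toList
val asVector: Vector[Int] = (1 to 100000).toVector

def sumByIndex(xs: Seq[Int]): Long = {
  val dataLength = xs.length   // computed once, as in the generated code
  var i = 0
  var sum = 0L
  while (i < dataLength) {     // mimics the generated while-loop
    sum += xs.apply(i)         // O(i) for List, effectively O(1) for Vector
    i += 1
  }
  sum
}

sumByIndex(asList)    // slows down quadratically as the size grows
sumByIndex(asVector)  // stays roughly linear
{code}

Under that assumption, building the nested column as a Vector or Array (constant-time positional access) instead of a List might sidestep the cost, though where that conversion would have to happen is exactly the open question above.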



> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5M), even on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up how many nested elements are in each record, but 
> scales down the number of records in the table by the same factor, so the total 
> number of records stays the same.
> Time grows exponentially (notice log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> After 1,000 elements per nested collection, time grows exponentially.
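
For orientation, a minimal Scala sketch of the scaling scheme described above: the total number of nested records is held constant while the nesting factor grows (the attached pyspark script is not reproduced here; names, sizes, and the generated data are purely illustrative, and a spark-shell session providing `spark` is assumed).

{code:scala}
// Constant total number of nested records; only the nesting per row changes.
import org.apache.spark.sql.functions._
import spark.implicits._

val total = 500000                                // total nested records, kept constant
for (scaling <- Seq(10, 100, 1000)) {
  val numRows = total / scaling                   // fewer rows as each row nests more
  val df = spark.range(numRows)
    .withColumn("amft", array((0 until scaling).map(i => struct(lit(i).as("v"))): _*))
  val t0 = System.nanoTime()
  df.select($"id", explode($"amft")).count()
  println(s"scaling=$scaling: ${(System.nanoTime() - t0) / 1e9} s")
}
{code}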





[jira] [Comment Edited] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-15 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127511#comment-16127511
 ] 

Ruslan Dautkhanov edited comment on SPARK-21657 at 8/15/17 4:59 PM:


Thank you [~maropu] and [~viirya]. That commit is for Spark 2.2, so this problem 
might be worse in 2.2, but I don't think it's the root cause: we see the same 
exponential time complexity when exploding a nested array in Spark 2.0 and 2.1.








[jira] [Comment Edited] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-11 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123050#comment-16123050
 ] 

Liang-Chi Hsieh edited comment on SPARK-21657 at 8/11/17 8:55 AM:
--

Maybe not directly related to this issue, but I'm exploring the Generate-related code 
to get a hint about it. I'm curious why we still don't enable codegen for Generate. 
[~hvanhovell], maybe you know why it is disabled? Thanks.





