[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2018-01-31, Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21657:

Component/s: (was: Spark Core)

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>Assignee: Ohad Raviv
>Priority: Major
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Fix For: 2.3.0
>
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5M records) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up the number of nested elements in 
> each record, but scales down the number of records in the table by the 
> same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours 
> to explode the nested collections (!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.
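
The same reproduction can also be written with the DataFrame API instead of 
raw SQL. This is a sketch only: `table_name` and the column names are taken 
from the snippet above, `sqlc` is assumed to be the same SQLContext, and 
'amft_item' is just an illustrative alias.

{code}
# DataFrame-API equivalent of the SQL reproduction above (Spark 2.x, Python 2).
# Column and table names come from the reporter's snippet.
from pyspark.sql.functions import explode

cached_df = sqlc.table(table_name) \
                .select('individ', 'hholdid',
                        explode('amft').alias('amft_item')) \
                .cache()
print cached_df.count()   # materializes the cache and triggers the slow explode
{code}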






[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-10-21, Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-21657:
--
Affects Version/s: 2.3.0
   Issue Type: Bug  (was: Improvement)

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5M records) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up the number of nested elements in 
> each record, but scales down the number of records in the table by the 
> same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours 
> to explode the nested collections (!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.
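
One way to see what the optimizer is doing at each scaling step is to print 
the query plan before triggering execution. A sketch, assuming the same sqlc 
and table_name as in the snippet above:

{code}
# Print the parsed, analyzed, optimized, and physical plans for the explode
# query (explain(True) shows all phases), then run it. Comparing plans across
# scaling factors can show how the plan grows with the collection size.
df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name)
df.explain(True)
print df.cache().count()
{code}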






[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-13, Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-21657:
--
Description: 
It can take up to half a day to explode a modest-sized nested collection 
(0.5M records) on recent Xeon processors.

See the attached pyspark script that reproduces this problem.

{code}
cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
table_name).cache()
print cached_df.count()
{code}

This script generates a number of tables with the same total number of records 
across all nested collections (see the `scaling` variable in the loops). The 
`scaling` variable scales up the number of nested elements in each record, but 
scales down the number of records in the table by the same factor, so the 
total number of records stays the same.

Time grows exponentially (notice the log-10 vertical axis scale):
!ExponentialTimeGrowth.PNG!

At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to 
explode the nested collections (!) of 8k records.

Beyond 1,000 elements per nested collection, time grows exponentially.


  was:
It can take up to half a day to explode a modest-sized nested collection 
(0.5M records) on recent Xeon processors.

See the attached pyspark script that reproduces this problem.

{code}
cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
table_name).cache()
print cached_df.count()
{code}

This script generates a number of tables with the same total number of records 
across all nested collections (see the `scaling` variable in the loops). The 
`scaling` variable scales up the number of nested elements in each record, but 
scales down the number of records in the table by the same factor, so the 
total number of records stays the same.

Time grows exponentially (notice the log-10 vertical axis scale):
!ExponentialTimeGrowth.PNG!

At a scaling of 50,000, it took 7 hours to explode the nested collections (!) 
of 8k records.

Beyond 1,000 elements per nested collection, time grows exponentially.



> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5M records) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up the number of nested elements in 
> each record, but scales down the number of records in the table by the 
> same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours 
> to explode the nested collections (!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.
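
Since the attached nested-data-generator-and-test.py is not inlined here, the 
loop the description refers to would look roughly like the sketch below. 
Everything outside the sqlc.sql(...) line is an assumption: 
make_nested_table() is a hypothetical placeholder for the script's data 
generator, and TOTAL_RECORDS stands in for whatever constant the script 
holds fixed.

{code}
# Hypothetical reconstruction of the benchmark loop described above.
# make_nested_table() is a placeholder, not a Spark API; only the
# sqlc.sql(...) line is taken from the report itself.
import time

TOTAL_RECORDS = 400000          # assumed constant across all runs

for scaling in [10, 100, 1000, 10000, 50000]:
    table_name = 'nested_%d' % scaling
    # `scaling` elements per nested collection, TOTAL_RECORDS/scaling rows,
    # so the total record count is the same for every table:
    make_nested_table(sqlc, table_name,
                      rows=TOTAL_RECORDS // scaling,
                      elements_per_row=scaling)
    start = time.time()
    cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' +
                         table_name).cache()
    print cached_df.count()
    print 'scaling=%d took %.1fs' % (scaling, time.time() - start)
{code}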






[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-07, Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-21657:
--
Labels: cache caching collections nested_types performance pyspark sparksql 
sql  (was: )

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5M records) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up the number of nested elements in 
> each record, but scales down the number of records in the table by the 
> same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000, it took 7 hours to explode the nested collections 
> (!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.






[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-07, Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21657:
--
  Priority: Major  (was: Critical)
Issue Type: Improvement  (was: Bug)

(Not a bug.)
I doubt this is meant to be efficient at the scale at which you're using it. 
Is this a real use case?
What change are you proposing?

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5M records) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up the number of nested elements in 
> each record, but scales down the number of records in the table by the 
> same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000, it took 7 hours to explode the nested collections 
> (!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.






[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-07, Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-21657:
--
Description: 
It can take up to half a day to explode a modest-sized nested collection 
(0.5M records) on recent Xeon processors.

See the attached pyspark script that reproduces this problem.

{code}
cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
table_name).cache()
print cached_df.count()
{code}

This script generates a number of tables with the same total number of records 
across all nested collections (see the `scaling` variable in the loops). The 
`scaling` variable scales up the number of nested elements in each record, but 
scales down the number of records in the table by the same factor, so the 
total number of records stays the same.

Time grows exponentially (notice the log-10 vertical axis scale):
!ExponentialTimeGrowth.PNG!

At a scaling of 50,000, it took 7 hours to explode the nested collections (!) 
of 8k records.

Beyond 1,000 elements per nested collection, time grows exponentially.


  was:
It can take up to half a day to explode a modest-sized nested collection 
(0.5M records) on recent Xeon processors.

See the attached pyspark script that reproduces this problem.

{code}
cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
table_name).cache()
print cached_df.count()
{code}

This script generates a number of tables with the same total number of records 
across all nested collections (see the `scaling` variable in the loops). The 
`scaling` variable scales up the number of nested elements in each record, but 
scales down the number of records in the table by the same factor, so the 
total number of records stays the same.

Time grows exponentially (notice the log-10 vertical axis scale):
!ExponentialTimeGrowth.PNG!

At a scaling of 50,000, it took 7 hours to explode the nested collections (!) 
of 8k records.



> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Critical
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5M records) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up the number of nested elements in 
> each record, but scales down the number of records in the table by the 
> same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000, it took 7 hours to explode the nested collections 
> (!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.






[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-07, Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-21657:
--
Description: 
It can take up to half a day to explode a modest-sized nested collection 
(0.5M records) on recent Xeon processors.

See the attached pyspark script that reproduces this problem.

{code}
cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
table_name).cache()
print cached_df.count()
{code}

This script generates a number of tables with the same total number of records 
across all nested collections (see the `scaling` variable in the loops). The 
`scaling` variable scales up the number of nested elements in each record, but 
scales down the number of records in the table by the same factor, so the 
total number of records stays the same.

Time grows exponentially (notice the log-10 vertical axis scale):
!ExponentialTimeGrowth.PNG!

At a scaling of 50,000, it took 7 hours to explode the nested collections (!) 
of 8k records.


  was:
It can take up to half a day to explode a modest-sized nested collection 
(0.5M records) on recent Xeon processors.

See the attached pyspark script that reproduces this problem.

{code}
cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
table_name).cache()
print cached_df.count()
{code}

This script generates a number of tables with the same total number of records 
across all nested collections (see the `scaling` variable in the loops). The 
`scaling` variable scales up the number of nested elements in each record, but 
scales down the number of records in the table by the same factor, so the 
total number of records stays the same.

Time grows exponentially (notice the log-10 vertical axis scale).



> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Critical
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5M records) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up the number of nested elements in 
> each record, but scales down the number of records in the table by the 
> same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000, it took 7 hours to explode the nested collections 
> (!) of 8k records.






[jira] [Updated] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-07, Ruslan Dautkhanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-21657:
--
Attachment: ExponentialTimeGrowth.PNG
nested-data-generator-and-test.py

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Critical
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5M records) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up the number of nested elements in 
> each record, but scales down the number of records in the table by the 
> same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale).


