[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448854#comment-16448854
 ] 

Apache Spark commented on SPARK-24013:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21133

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Juliusz Sompolski
> Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt: it processes the first million rows quickly, and then slows 
> down to a few thousand rows per second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448365#comment-16448365
 ] 

Marco Gaido commented on SPARK-24013:
-

[~juliuszsompolski] I have been able to reproduce with 1000. SPARK-17439 is 
probably related. The problem is that the compress method is called too many 
times in this condition. The fix is easy and I'll submit a patch soon, but I 
am not very familiar with this algorithm or with the real root cause of the 
problem, so I need to study it a bit to check whether other problems also 
contribute to the performance issue.
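
To make the "compress is called too many times" point concrete, here is a rough toy cost model in Python. It is only a sketch under stated assumptions and is not Spark's QuantileSummaries code: the function `compress_cost` and its parameters `compress_every` and `keep` are hypothetical names, and the model simply assumes that each compress call does one linear pass over the current summary and shrinks it to at most `keep` entries. It does not explain why sorted input triggers the extra calls.

```python
def compress_cost(n_rows, compress_every, keep):
    """Toy cost model (hypothetical, not Spark code).

    n_rows:         number of rows inserted into the summary
    compress_every: number of inserts between two compress() calls
    keep:           entries retained after each compress()

    Returns (number of compress calls, total entries scanned by them).
    """
    size = 0     # current number of entries held by the summary
    scanned = 0  # entries visited across all compress() calls
    calls = 0    # how many times compress() ran
    for i in range(1, n_rows + 1):
        size += 1                    # insert one row
        if i % compress_every == 0:
            scanned += size          # assumed: one linear pass per compress
            size = min(size, keep)   # assumed: summary shrinks back to `keep`
            calls += 1
    return calls, scanned


# Compress after every insert versus after every 1000 inserts.
per_insert = compress_cost(10_000, compress_every=1, keep=100)
buffered = compress_cost(10_000, compress_every=1000, keep=100)
print(per_insert, buffered)
```

In this model, compressing on every insert scans 1,004,950 entries over 10,000 calls, while compressing once per 1000 inserts scans 10,900 entries over 10 calls. The real numbers depend on the actual algorithm, but it shows how the call frequency alone can dominate the running time.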




[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448093#comment-16448093
 ] 

Juliusz Sompolski commented on SPARK-24013:
---

Hi [~mgaido]
I tested again on current master (afbdf427302aba858f95205ecef7667f412b2a6a) and 
I can still reproduce it:
 !screenshot-1.png! 

Maybe you need to bump up 100 to something higher when running on a bigger 
cluster that splits the range into more tasks?
For me it grinds to a halt after about 250 per task.




[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448058#comment-16448058
 ] 

Marco Gaido commented on SPARK-24013:
-

I cannot reproduce on current master. For me, the second query was also very 
fast.




[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-18 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442477#comment-16442477
 ] 

Juliusz Sompolski commented on SPARK-24013:
---

This hits when trying to create histogram statistics (SPARK-21975) on columns 
such as a monotonically increasing id: the histograms cannot be created in a 
reasonable time.
