[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.
[ https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448854#comment-16448854 ]

Apache Spark commented on SPARK-24013:
--------------------------------------

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21133

> ApproximatePercentile grinds to a halt on sorted input.
> -------------------------------------------------------
>
>                 Key: SPARK-24013
>                 URL: https://issues.apache.org/jira/browse/SPARK-24013
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Juliusz Sompolski
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - processes the first million rows quickly, and then slows
> down to a few thousand rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.
[ https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448365#comment-16448365 ]

Marco Gaido commented on SPARK-24013:
-------------------------------------

[~juliuszsompolski] I have been able to reproduce it with 1000. SPARK-17439 is probably related. The problem is that the compress method is called too many times in this condition. The fix is easy and I'll submit a patch soon, but I am not very familiar with this algorithm or with the real root cause of the problem, so I need to study it a bit to check whether other problems are contributing to the performance issue.
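To make "compress is called too many times" concrete, here is a toy cost model in Python. It is only a sketch and not Spark's code: the function `count_compress_work`, its parameters, and the `shrink_to` callback are all hypothetical names introduced for illustration. The model assumes a summary that runs one full compress pass whenever its length reaches a threshold, and it compares a compress that halves the summary with one that cannot remove anything. Whether sorted input really leaves the summary uncompressible in Spark is an assumption here, not something this thread establishes.

```python
def count_compress_work(n_inserts, threshold, shrink_to):
    """Toy model: return (compress_calls, entries_scanned).

    The summary grows by one entry per insert. Whenever its size reaches
    `threshold`, one compress pass scans every entry, and `shrink_to(size)`
    gives the number of entries that survive the pass (a hypothetical
    stand-in for how well compress works on the data).
    """
    size = 0
    calls = 0
    scanned = 0
    for _ in range(n_inserts):
        size += 1
        if size >= threshold:
            calls += 1
            scanned += size          # one full pass over the summary
            size = shrink_to(size)   # entries left after compressing
    return calls, scanned


if __name__ == "__main__":
    n, threshold = 10_000, 1_000
    # Compress halves the summary: passes are rare and the cost stays small.
    print(count_compress_work(n, threshold, lambda s: s // 2))  # (19, 19000)
    # Compress removes nothing: every later insert pays for a full pass.
    print(count_compress_work(n, threshold, lambda s: s))  # (9001, 49505500)
```

In this model, once compress stops shrinking the summary below the threshold, each insert triggers another full pass and the total work grows quadratically with the number of rows. That shape would be consistent with a query that starts fast and then slows to a crawl, but the model says nothing about the actual root cause.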
[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.
[ https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448093#comment-16448093 ]

Juliusz Sompolski commented on SPARK-24013:
-------------------------------------------

Hi [~mgaido], I tested again on current master (afbdf427302aba858f95205ecef7667f412b2a6a) and I can reproduce it:

!screenshot-1.png!

Maybe you need to bump up 100 to something higher when running on a bigger cluster that splits the range into more tasks? For me it grinds to a halt after about 250 per task.
[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.
[ https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448058#comment-16448058 ]

Marco Gaido commented on SPARK-24013:
-------------------------------------

I cannot reproduce this on current master. For me, the second query was very fast too.
[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.
[ https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442477#comment-16442477 ]

Juliusz Sompolski commented on SPARK-24013:
-------------------------------------------

This hits when trying to create histogram statistics (SPARK-21975) on columns such as a monotonically increasing id: the histograms cannot be created in a reasonable time.