[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge
[ https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324337#comment-16324337 ] ASF GitHub Bot commented on DRILL-6030: --- Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/1075 > Managed sort should minimize number of batches in a k-way merge > --- > > Key: DRILL-6030 > URL: https://issues.apache.org/jira/browse/DRILL-6030 > Project: Apache Drill > Issue Type: Improvement >Reporter: Vlad Rozov >Assignee: Vlad Rozov > Labels: ready-to-commit > > The time complexity of the algorithm is O(n*k*log(k)) where k is a number of > batches to merge and n is a number of records in each batch (assuming equal > size batches). As n*k is the total number of record to merge and it can be > quite large, minimizing k should give better results. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge
[ https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300269#comment-16300269 ] ASF GitHub Bot commented on DRILL-6030: --- Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1075 @paul-rogers Please review > Managed sort should minimize number of batches in a k-way merge > --- > > Key: DRILL-6030 > URL: https://issues.apache.org/jira/browse/DRILL-6030 > Project: Apache Drill > Issue Type: Improvement >Reporter: Vlad Rozov >Assignee: Vlad Rozov > > The time complexity of the algorithm is O(n*k*log(k)) where k is a number of > batches to merge and n is a number of records in each batch (assuming equal > size batches). As n*k is the total number of record to merge and it can be > quite large, minimizing k should give better results. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge
[ https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299174#comment-16299174 ] ASF GitHub Bot commented on DRILL-6030: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1075#discussion_r158150320 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortConfig.java --- @@ -84,7 +85,7 @@ public SortConfig(DrillConfig config) { if (limit > 0) { mergeLimit = Math.max(limit, MIN_MERGE_LIMIT); } else { - mergeLimit = Integer.MAX_VALUE; + mergeLimit = DEFAULT_MERGE_LIMIT; --- End diff -- There may be a misunderstanding of how config options work. We define the defaults in Drill's own source code: `drill-module.conf` in each module. (Here it is in `java-exec`.) To change the default option, we change the value in `drill-module.conf`. In the highly unlikely case that a user has overridden this value in `drill-override.conf`, their value will be used. But, the option is not documented in `drill-override-example.conf` so it is very, very unlikely that anyone created an override. (The property is meant to be internal, for use in tests.) So, rather than introducing yet another variable, we might as well use the existing config property. This has the added advantage that, if experience suggests that we need a smaller or larger limit for some scenarios, we can make the adjustment in the field via the config system. > Managed sort should minimize number of batches in a k-way merge > --- > > Key: DRILL-6030 > URL: https://issues.apache.org/jira/browse/DRILL-6030 > Project: Apache Drill > Issue Type: Improvement >Reporter: Vlad Rozov >Assignee: Vlad Rozov > > The time complexity of the algorithm is O(n*k*log(k)) where k is a number of > batches to merge and n is a number of records in each batch (assuming equal > size batches). As n*k is the total number of record to merge and it can be > quite large, minimizing k should give better results. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge
[ https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297509#comment-16297509 ] ASF GitHub Bot commented on DRILL-6030: --- Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1075 The scenario when all batches can be merged in memory is covered by 'if (canUseMemoryMerge())` check in `SortImpl.java:399`. The affected code path applies only to cases where merge between spilled and in-memory batches is necessary. Note that this is a short term fix to improve managed sort performance, in a long run, it is necessary to have an ability to merge all batches in memory (using SV4) without spilling and be able to merge it with the spilled data. > Managed sort should minimize number of batches in a k-way merge > --- > > Key: DRILL-6030 > URL: https://issues.apache.org/jira/browse/DRILL-6030 > Project: Apache Drill > Issue Type: Improvement >Reporter: Vlad Rozov >Assignee: Vlad Rozov > > The time complexity of the algorithm is O(n*k*log(k)) where k is a number of > batches to merge and n is a number of records in each batch (assuming equal > size batches). As n*k is the total number of record to merge and it can be > quite large, minimizing k should give better results. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge
[ https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297490#comment-16297490 ] ASF GitHub Bot commented on DRILL-6030: --- Github user vrozov commented on a diff in the pull request: https://github.com/apache/drill/pull/1075#discussion_r157885846 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortConfig.java --- @@ -84,7 +85,7 @@ public SortConfig(DrillConfig config) { if (limit > 0) { mergeLimit = Math.max(limit, MIN_MERGE_LIMIT); } else { - mergeLimit = Integer.MAX_VALUE; + mergeLimit = DEFAULT_MERGE_LIMIT; --- End diff -- IMO, it is better to change the default to avoid upgrade problems. In an upgrade scenario, users may simply overwrite `drill-override.conf` from their prior installations and forget to set the merge limit. Is there a reason not to change the default merge limit? > Managed sort should minimize number of batches in a k-way merge > --- > > Key: DRILL-6030 > URL: https://issues.apache.org/jira/browse/DRILL-6030 > Project: Apache Drill > Issue Type: Improvement >Reporter: Vlad Rozov >Assignee: Vlad Rozov > > The time complexity of the algorithm is O(n*k*log(k)) where k is a number of > batches to merge and n is a number of records in each batch (assuming equal > size batches). As n*k is the total number of record to merge and it can be > quite large, minimizing k should give better results. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge
[ https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297218#comment-16297218 ] ASF GitHub Bot commented on DRILL-6030: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1075 One additional thought. This bug was found when sorting 18 GB of data in 8 GB of memory. That is, a case in which the sort must spill. What happens in the case in which the 18 GB of data is sorted in, say, 20 GB of memory (an in-memory sort)? We don't want the merge limit to force a spill in this case; kind of defeats the purpose of an in-memory sort. So: 1. Does the limit affect in memory sort? If so, we need to revise the solution. 2. Does the in-memory sort suffer from a similar performance issue? If so, we need to revise the in memory sort. One possible solution is to: 1. Defer sorting of individual batches until necessary. 2. Sort batches just before spilling. 3. If all batches fit in memory, do a single, combined sort (using an SV4). > Managed sort should minimize number of batches in a k-way merge > --- > > Key: DRILL-6030 > URL: https://issues.apache.org/jira/browse/DRILL-6030 > Project: Apache Drill > Issue Type: Improvement >Reporter: Vlad Rozov >Assignee: Vlad Rozov > > The time complexity of the algorithm is O(n*k*log(k)) where k is a number of > batches to merge and n is a number of records in each batch (assuming equal > size batches). As n*k is the total number of record to merge and it can be > quite large, minimizing k should give better results. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge
[ https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297169#comment-16297169 ] ASF GitHub Bot commented on DRILL-6030: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1075#discussion_r157830252 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortConfig.java --- @@ -84,7 +85,7 @@ public SortConfig(DrillConfig config) { if (limit > 0) { mergeLimit = Math.max(limit, MIN_MERGE_LIMIT); } else { - mergeLimit = Integer.MAX_VALUE; + mergeLimit = DEFAULT_MERGE_LIMIT; --- End diff -- The merge limit is already a config option. (I'd forgotten about that.) The comment on the config option says "Limit on the number of spilled batches that can be merged in a single pass." So, let's just set that default (in `drill-override-conf`) to your new value of 128 and leave the code here unchanged. > Managed sort should minimize number of batches in a k-way merge > --- > > Key: DRILL-6030 > URL: https://issues.apache.org/jira/browse/DRILL-6030 > Project: Apache Drill > Issue Type: Improvement >Reporter: Vlad Rozov >Assignee: Vlad Rozov > > The time complexity of the algorithm is O(n*k*log(k)) where k is a number of > batches to merge and n is a number of records in each batch (assuming equal > size batches). As n*k is the total number of record to merge and it can be > quite large, minimizing k should give better results. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge
[ https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295312#comment-16295312 ] ASF GitHub Bot commented on DRILL-6030: --- GitHub user vrozov opened a pull request: https://github.com/apache/drill/pull/1075 DRILL-6030: Managed sort should minimize number of batches in a k-way merge @paul-rogers Please review You can merge this pull request into a Git repository by running: $ git pull https://github.com/vrozov/drill DRILL-6030 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1075.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1075 commit f40d962cf35574c0406937de1ee86ef853234b3e Author: Vlad Rozov Date: 2017-12-17T17:25:55Z DRILL-6030: Managed sort should minimize number of batches in a k-way merge > Managed sort should minimize number of batches in a k-way merge > --- > > Key: DRILL-6030 > URL: https://issues.apache.org/jira/browse/DRILL-6030 > Project: Apache Drill > Issue Type: Improvement >Reporter: Vlad Rozov >Assignee: Vlad Rozov > > The time complexity of the algorithm is O(n*k*log(k)) where k is a number of > batches to merge and n is a number of records in each batch (assuming equal > size batches). As n*k is the total number of record to merge and it can be > quite large, minimizing k should give better results. -- This message was sent by Atlassian JIRA (v6.4.14#64029)