[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge

2018-01-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324337#comment-16324337
 ] 

ASF GitHub Bot commented on DRILL-6030:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1075


> Managed sort should minimize number of batches in a k-way merge
> ---
>
> Key: DRILL-6030
> URL: https://issues.apache.org/jira/browse/DRILL-6030
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>  Labels: ready-to-commit
>
> The time complexity of the algorithm is O(n*k*log(k)) where k is a number of 
> batches to merge and n is a number of records in each batch (assuming equal 
> size batches). As n*k is the total number of record to merge and it can be 
> quite large, minimizing k should give better results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300269#comment-16300269
 ] 

ASF GitHub Bot commented on DRILL-6030:
---

Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1075
  
@paul-rogers Please review


> Managed sort should minimize number of batches in a k-way merge
> ---
>
> Key: DRILL-6030
> URL: https://issues.apache.org/jira/browse/DRILL-6030
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>
> The time complexity of the algorithm is O(n*k*log(k)) where k is a number of 
> batches to merge and n is a number of records in each batch (assuming equal 
> size batches). As n*k is the total number of record to merge and it can be 
> quite large, minimizing k should give better results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299174#comment-16299174
 ] 

ASF GitHub Bot commented on DRILL-6030:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1075#discussion_r158150320
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortConfig.java
 ---
@@ -84,7 +85,7 @@ public SortConfig(DrillConfig config) {
 if (limit > 0) {
   mergeLimit = Math.max(limit, MIN_MERGE_LIMIT);
 } else {
-  mergeLimit = Integer.MAX_VALUE;
+  mergeLimit = DEFAULT_MERGE_LIMIT;
--- End diff --

There may be a misunderstanding of how config options work. We define the 
defaults in Drill's own source code: `drill-module.conf` in each module. (Here 
it is in `java-exec`.)

To change the default option, we change the value in `drill-module.conf`. 
In the highly unlikely case that a user has overridden this value in 
`drill-override.conf`, their value will be used. But, the option is not 
documented in `drill-override-example.conf` so it is very, very unlikely that 
anyone created an override. (The property is meant to be internal, for use in 
tests.)

So, rather than introducing yet another variable, we might as well use the 
existing config property. This has the added advantage that, if experience 
suggests that we need a smaller or larger limit for some scenarios, we can make 
the adjustment in the field via the config system.


> Managed sort should minimize number of batches in a k-way merge
> ---
>
> Key: DRILL-6030
> URL: https://issues.apache.org/jira/browse/DRILL-6030
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>
> The time complexity of the algorithm is O(n*k*log(k)) where k is a number of 
> batches to merge and n is a number of records in each batch (assuming equal 
> size batches). As n*k is the total number of record to merge and it can be 
> quite large, minimizing k should give better results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge

2017-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297509#comment-16297509
 ] 

ASF GitHub Bot commented on DRILL-6030:
---

Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1075
  
The scenario when all batches can be merged in memory is covered by 'if 
(canUseMemoryMerge())` check in `SortImpl.java:399`. The affected code path 
applies only to cases where merge between spilled and in-memory batches is 
necessary. Note that this is a short term fix to improve managed sort 
performance, in a long run, it is necessary to have an ability to merge all 
batches in memory (using SV4) without spilling and be able to merge it with the 
spilled data.


> Managed sort should minimize number of batches in a k-way merge
> ---
>
> Key: DRILL-6030
> URL: https://issues.apache.org/jira/browse/DRILL-6030
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>
> The time complexity of the algorithm is O(n*k*log(k)) where k is a number of 
> batches to merge and n is a number of records in each batch (assuming equal 
> size batches). As n*k is the total number of record to merge and it can be 
> quite large, minimizing k should give better results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge

2017-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297490#comment-16297490
 ] 

ASF GitHub Bot commented on DRILL-6030:
---

Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1075#discussion_r157885846
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortConfig.java
 ---
@@ -84,7 +85,7 @@ public SortConfig(DrillConfig config) {
 if (limit > 0) {
   mergeLimit = Math.max(limit, MIN_MERGE_LIMIT);
 } else {
-  mergeLimit = Integer.MAX_VALUE;
+  mergeLimit = DEFAULT_MERGE_LIMIT;
--- End diff --

IMO, it is better to change the default to avoid upgrade problems. In an 
upgrade scenario,  users may simply overwrite `drill-override.conf` from their 
prior installations and forget to set the merge limit. Is there a reason not to 
change the default merge limit?


> Managed sort should minimize number of batches in a k-way merge
> ---
>
> Key: DRILL-6030
> URL: https://issues.apache.org/jira/browse/DRILL-6030
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>
> The time complexity of the algorithm is O(n*k*log(k)) where k is a number of 
> batches to merge and n is a number of records in each batch (assuming equal 
> size batches). As n*k is the total number of record to merge and it can be 
> quite large, minimizing k should give better results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge

2017-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297218#comment-16297218
 ] 

ASF GitHub Bot commented on DRILL-6030:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1075
  
One additional thought. This bug was found when sorting 18 GB of data in 8 
GB of memory. That is, a case in which the sort must spill.

What happens in the case in which the 18 GB of data is sorted in, say, 20 
GB of memory (an in-memory sort)? We don't want the merge limit to force a 
spill in this case; kind of defeats the purpose of an in-memory sort.

So:

1. Does the limit affect in memory sort? If so, we need to revise the 
solution.
2. Does the in-memory sort suffer from a similar performance issue? If so, 
we need to revise the in memory sort.

One possible solution is to:

1. Defer sorting of individual batches until necessary.
2. Sort batches just before spilling.
3. If all batches fit in memory, do a single, combined sort (using an SV4).


> Managed sort should minimize number of batches in a k-way merge
> ---
>
> Key: DRILL-6030
> URL: https://issues.apache.org/jira/browse/DRILL-6030
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>
> The time complexity of the algorithm is O(n*k*log(k)) where k is a number of 
> batches to merge and n is a number of records in each batch (assuming equal 
> size batches). As n*k is the total number of record to merge and it can be 
> quite large, minimizing k should give better results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge

2017-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297169#comment-16297169
 ] 

ASF GitHub Bot commented on DRILL-6030:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1075#discussion_r157830252
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortConfig.java
 ---
@@ -84,7 +85,7 @@ public SortConfig(DrillConfig config) {
 if (limit > 0) {
   mergeLimit = Math.max(limit, MIN_MERGE_LIMIT);
 } else {
-  mergeLimit = Integer.MAX_VALUE;
+  mergeLimit = DEFAULT_MERGE_LIMIT;
--- End diff --

The merge limit is already a config option. (I'd forgotten about that.) The 
comment on the config option says "Limit on the number of spilled batches that 
can be merged in a single pass." So, let's just set that default (in 
`drill-override-conf`) to your new value of 128 and leave the code here 
unchanged.


> Managed sort should minimize number of batches in a k-way merge
> ---
>
> Key: DRILL-6030
> URL: https://issues.apache.org/jira/browse/DRILL-6030
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>
> The time complexity of the algorithm is O(n*k*log(k)) where k is a number of 
> batches to merge and n is a number of records in each batch (assuming equal 
> size batches). As n*k is the total number of record to merge and it can be 
> quite large, minimizing k should give better results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-6030) Managed sort should minimize number of batches in a k-way merge

2017-12-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295312#comment-16295312
 ] 

ASF GitHub Bot commented on DRILL-6030:
---

GitHub user vrozov opened a pull request:

https://github.com/apache/drill/pull/1075

DRILL-6030: Managed sort should minimize number of batches in a k-way merge

@paul-rogers Please review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vrozov/drill DRILL-6030

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1075.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1075


commit f40d962cf35574c0406937de1ee86ef853234b3e
Author: Vlad Rozov 
Date:   2017-12-17T17:25:55Z

DRILL-6030: Managed sort should minimize number of batches in a k-way merge




> Managed sort should minimize number of batches in a k-way merge
> ---
>
> Key: DRILL-6030
> URL: https://issues.apache.org/jira/browse/DRILL-6030
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Vlad Rozov
>Assignee: Vlad Rozov
>
> The time complexity of the algorithm is O(n*k*log(k)) where k is a number of 
> batches to merge and n is a number of records in each batch (assuming equal 
> size batches). As n*k is the total number of record to merge and it can be 
> quite large, minimizing k should give better results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)