[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2019-06-28 Thread Julian Reschke (JIRA)


[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874915#comment-16874915
 ] 

Julian Reschke commented on OAK-4780:
-

trunk: (1.7.0) [r1792051|http://svn.apache.org/r1792051] 
[r1791681|http://svn.apache.org/r1791681] 
[r1790796|http://svn.apache.org/r1790796]

> VersionGarbageCollector should be able to run incrementally
> ---
>
> Key: OAK-4780
> URL: https://issues.apache.org/jira/browse/OAK-4780
> Project: Jackrabbit Oak
>  Issue Type: Task
>  Components: documentmk
>Reporter: Julian Reschke
>Priority: Major
> Fix For: 1.7.0, 1.8.0
>
> Attachments: OAK-4780-core-1.patch, OAK-4780-core-2.diff, 
> OAK-4780-core-4.diff, OAK-4780-core-5.diff, OAK-4780-core.diff, 
> OAK-4780-rdb.diff, leafnodes-v2.diff, leafnodes-v3.diff, leafnodes.diff
>
>
> Right now, the documentmk's version garbage collection runs in several phases.
> It first collects the paths of candidate nodes, and only once this has been 
> successfully finished, starts actually deleting nodes.
> This can be a problem when the regularly scheduled garbage collection is 
> interrupted during the path collection phase, maybe due to other maintenance 
> tasks. On the next run, the number of paths to be collected will be even 
> bigger, thus making it even more likely to fail.
> We should think about a change in the logic that would allow the GC to run in 
> chunks; maybe by partitioning the path space by top level directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-04-20 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976627#comment-15976627
 ] 

Marcel Reutegger commented on OAK-4780:
---

Fixed a potential division by zero: http://svn.apache.org/r1792051



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-04-20 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976459#comment-15976459
 ] 

Marcel Reutegger commented on OAK-4780:
---

Created OAK-6109 for the revisions command in oak-run proposed by 
[~stefan.eissing]. 



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-04-17 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970990#comment-15970990
 ] 

Julian Reschke commented on OAK-4780:
-

(moved the RDB extensions to OAK-6083 so we can close this one)



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-04-17 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970979#comment-15970979
 ] 

Julian Reschke commented on OAK-4780:
-

trunk: [r1791681|http://svn.apache.org/r1791681] 
[r1790796|http://svn.apache.org/r1790796]




[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-04-10 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962693#comment-15962693
 ] 

Marcel Reutegger commented on OAK-4780:
---

[~reschke], I am leaving this issue open because I did not apply the 
RDB-specific patch. Please apply it if you think it is ready, and resolve this issue.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-04-10 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962685#comment-15962685
 ] 

Marcel Reutegger commented on OAK-4780:
---

Looks good to me. I applied a slightly modified patch to trunk: 
http://svn.apache.org/r1790796

- Committed fixes for VersionGCDeletionTest separately
- Use MongoDB count() method with read preference to avoid usage of method only 
available with 3.4 driver
- Add serialVersionUID to LimitExceededException



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-29 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946925#comment-15946925
 ] 

Julian Reschke commented on OAK-4780:
-

[~stefan.eissing] - could you please bring your fork up-to-date with trunk, 
produce a patch, and attach it over here?



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-20 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932275#comment-15932275
 ] 

Marcel Reutegger commented on OAK-4780:
---

bq. You mean the class itself or special VersionGarbageCollector tests? I 
assume the former.

The former. Tests for the new TimeInterval class also help to better understand 
its contract. E.g. are bounds inclusive?
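To illustrate what such tests could pin down, here is a minimal sketch of an interval type with explicitly inclusive bounds. The class name {{ClosedInterval}} and its methods are hypothetical; the real TimeInterval in the patch may define its bounds differently.

```java
// Hypothetical value class, NOT the TimeInterval from the patch.
// It only shows one possible contract: both bounds are inclusive.
final class ClosedInterval {

    private final long fromMs;
    private final long toMs;

    ClosedInterval(long fromMs, long toMs) {
        if (fromMs > toMs) {
            throw new IllegalArgumentException("fromMs > toMs");
        }
        this.fromMs = fromMs;
        this.toMs = toMs;
    }

    // True if fromMs <= timeMs <= toMs (assumption: closed interval).
    boolean contains(long timeMs) {
        return fromMs <= timeMs && timeMs <= toMs;
    }

    long durationMs() {
        return toMs - fromMs;
    }

    public static void main(String[] args) {
        ClosedInterval i = new ClosedInterval(10, 20);
        System.out.println(i.contains(10)); // lower bound
        System.out.println(i.contains(20)); // upper bound
        System.out.println(i.contains(21)); // just outside
    }
}
```

A test for the lower and upper bound, plus one value just outside each, would document the contract either way.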



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-17 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930041#comment-15930041
 ] 

Julian Reschke commented on OAK-4780:
-

bq. At the end of all iterations, the VersionGarbageCollector returns a 
VersionGCStats that accumulates data from all iterations made. Only the 
stopwatches are currently not accumulated, as Stopwatch lacks methods to add 
time from other watches. And re-using the same instances in each iteration 
would show incorrect data for each iteration. 

The code sort of works:

{noformat}
package org.apache.jackrabbit.oak.plugins.document;

import java.util.concurrent.TimeUnit;

import com.google.common.base.Stopwatch;
import com.google.common.base.Ticker;

public class StopWatchTest {

    public static void main(String[] args) throws InterruptedException {

        final Stopwatch w1 = Stopwatch.createStarted();
        Thread.sleep(123);
        w1.stop();
        System.out.println(w1);

        final Stopwatch w2 = Stopwatch.createStarted();
        Thread.sleep(456);
        w2.stop();
        System.out.println(w2);

        // Fake ticker: the first read (on start) returns 0, every later read
        // returns the combined elapsed time of w1 and w2 in nanoseconds.
        Stopwatch sum = Stopwatch.createStarted(new Ticker() {

            private boolean calledOnce = false;

            @Override
            public long read() {
                if (calledOnce) {
                    return w1.elapsed(TimeUnit.NANOSECONDS)
                            + w2.elapsed(TimeUnit.NANOSECONDS);
                } else {
                    calledOnce = true;
                    return 0;
                }
            }
        });
        sum.stop();
        System.out.println(sum);
    }

}
{noformat}

output:

{noformat}
123,2 ms
456,5 ms
579,7 ms
{noformat}
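If only the totals are needed, an alternative to the Ticker trick would be to accumulate the elapsed nanoseconds in a plain counter. The following is a minimal sketch using only the JDK; {{ElapsedAccumulator}} is a hypothetical helper and not part of Oak or Guava.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Oak or Guava: sums elapsed time
// that was measured separately in each iteration.
final class ElapsedAccumulator {

    private long totalNanos;

    // Adds a duration measured elsewhere, for example the value of
    // Stopwatch.elapsed(TimeUnit.NANOSECONDS) of one iteration.
    void add(long duration, TimeUnit unit) {
        totalNanos += unit.toNanos(duration);
    }

    // Merges the total of another accumulator into this one.
    void add(ElapsedAccumulator other) {
        totalNanos += other.totalNanos;
    }

    long elapsed(TimeUnit unit) {
        return unit.convert(totalNanos, TimeUnit.NANOSECONDS);
    }

    public static void main(String[] args) {
        ElapsedAccumulator total = new ElapsedAccumulator();
        total.add(123, TimeUnit.MILLISECONDS);
        total.add(456, TimeUnit.MILLISECONDS);
        System.out.println(total.elapsed(TimeUnit.MILLISECONDS) + " ms");
    }
}
```

The per-iteration stopwatches stay untouched in this variant; only their final readings are added up.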




[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-17 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930035#comment-15930035
 ] 

Julian Reschke commented on OAK-4780:
-

bq. At the end of all iterations, the VersionGarbageCollector returns a 
VersionGCStats that accumulates data from all iterations made. Only the 
stopwatches are currently not accumulated, as Stopwatch lacks methods to add 
time from other watches. And re-using the same instances in each iteration 
would show incorrect data for each iteration. 

The code below works (for some value of "works"):

{noformat}
public static void main(String[] args) throws InterruptedException {

    final Stopwatch w1 = Stopwatch.createStarted();
    Thread.sleep(123);
    w1.stop();
    System.out.println(w1);

    final Stopwatch w2 = Stopwatch.createStarted();
    Thread.sleep(456);
    w2.stop();
    System.out.println(w2);

    // Fake ticker: the first read (on start) returns 0, every later read
    // returns the combined elapsed time of w1 and w2.
    Stopwatch sum = Stopwatch.createStarted(new Ticker() {

        private boolean calledOnce = false;

        @Override
        public long read() {
            if (calledOnce) {
                // Ticker.read() is expected to return nanoseconds
                return w1.elapsed(TimeUnit.NANOSECONDS)
                        + w2.elapsed(TimeUnit.NANOSECONDS);
            } else {
                calledOnce = true;
                return 0;
            }
        }
    });
    sum.stop();
    System.out.println(sum);
}
{noformat}



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-17 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929996#comment-15929996
 ] 

Stefan Eissing commented on OAK-4780:
-

And I will make two patches, one for oak-core and one for oak-run on top of 
that. The second one will get its own ticket.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-17 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929993#comment-15929993
 ] 

Stefan Eissing commented on OAK-4780:
-

bq. VersionGarbageCollector.reset() can be simplified with just the remove() 
call. It will be a noop if the document doesn't exist.
ack.

bq. Can you please add tests for TimeInterval?
You mean the class itself or special VersionGarbageCollector tests? I assume 
the former.

bq. Did you consider moving the new set methods on VersionGarbageCollector to a 
new class (e.g. VersionGCOptions) and pass it as an argument to gc()? I think 
with the current patch it is possible to influence a running GC by calling one 
of those set methods.
That is true. I will add an immutable options object of that kind that is 
bound to an active run.

bq. What is the TODO about in VersionGCStats.addRun()?
At the end of all iterations, the VersionGarbageCollector returns a 
VersionGCStats that accumulates data from all iterations made. Only the 
stopwatches are currently not accumulated, as Stopwatch lacks methods to add 
time from other watches. And re-using the same instances in each iteration 
would show incorrect data for each iteration. 

bq. Usage of LimitExceededException from javax.naming is a bit funky but I 
guess you didn't want to invent yet another exception class
;-) Will add a new internal exception class.

bq. VersionGarbageCollector.delayOnModification() should use Clock.waitUntil(). 
This allows to write efficient tests with a virtual clock.
ack.

bq. Only minor: the diff for VersionGarbageCollector also contains a couple of 
indentation changes for anonymous inner classes, which are unrelated to this 
improvement.
I will clean up the diffs before attaching them here. IDEA is sometimes 
overeager to reformat things.

bq. In MongoVersionGCSupport.getDeletedOnceCount(): 
ReadPreference.nearest().secondaryPreferred(). You cannot have both nearest and 
secondaryPreferred. The class will always give you a secondaryPreferred 
ReadPreference.
ack. will change.

bq. Minor: some unused imports in VersionGCSupport
ack.
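Regarding the {{ReadPreference.nearest().secondaryPreferred()}} point above: the chain compiles, presumably because both are static factory methods, and Java allows calling a static method through an instance expression. The sketch below reproduces the effect with a hypothetical stand-in class {{Pref}}; it does not use the MongoDB driver.

```java
// Hypothetical stand-in for a class with static factory methods.
// It only mimics the shape of the call chain discussed above.
final class Pref {

    private final String name;

    private Pref(String name) {
        this.name = name;
    }

    static Pref nearest() {
        return new Pref("nearest");
    }

    static Pref secondaryPreferred() {
        return new Pref("secondaryPreferred");
    }

    String name() {
        return name;
    }

    public static void main(String[] args) {
        // Compiles, but the static method ignores the instance it is
        // invoked on, so the "nearest" part has no effect.
        Pref p = Pref.nearest().secondaryPreferred();
        System.out.println(p.name());
    }
}
```

Most compilers and IDEs can warn about static access through an instance, which would have flagged the original line.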




[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-16 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928304#comment-15928304
 ] 

Marcel Reutegger commented on OAK-4780:
---

This looks very promising. I'd like to include those changes step by step. That 
is, first the VersionGC part in oak-core and in a second step the new run mode 
for oak-run. I would even prefer if the second part goes into a separate issue.

Regarding your GitHub branch: it contains a 'patches' directory with two diffs. 
What are those changes?

Some more comments:

- VersionGarbageCollector.reset() can be simplified with just the remove() 
call. It will be a noop if the document doesn't exist. 
- Can you please add tests for TimeInterval?
- Did you consider moving the new set methods on VersionGarbageCollector to a 
new class (e.g. VersionGCOptions) and pass it as an argument to gc()? I think 
with the current patch it is possible to influence a running GC by calling one 
of those set methods.
- What is the TODO about in VersionGCStats.addRun()?
- Usage of LimitExceededException from javax.naming is a bit funky ;) but I 
guess you didn't want to invent yet another exception class
- VersionGarbageCollector.delayOnModification() should use Clock.waitUntil(). 
This allows writing efficient tests with a virtual clock.
- Only minor: the diff for VersionGarbageCollector also contains a couple of 
indentation changes for anonymous inner classes, which are unrelated to this 
improvement.
- In MongoVersionGCSupport.getDeletedOnceCount(): 
{{ReadPreference.nearest().secondaryPreferred()}}. You cannot have both nearest 
and secondaryPreferred. The class will always give you a secondaryPreferred 
ReadPreference.
- Minor: some unused imports in VersionGCSupport
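
The options-object suggestion in the list above could look roughly like the following sketch. The class name {{GcOptions}} and its two fields are illustrative assumptions, not taken from the patch; the point is only that a run receives an immutable snapshot that later setter-style calls cannot change.

```java
// Hypothetical sketch of an immutable options object; the field names
// are illustrative assumptions and not taken from the actual patch.
final class GcOptions {

    private final int collectLimit;
    private final int maxIterations;

    GcOptions(int collectLimit, int maxIterations) {
        this.collectLimit = collectLimit;
        this.maxIterations = maxIterations;
    }

    // "with" methods return a new instance instead of mutating this one.
    GcOptions withCollectLimit(int limit) {
        return new GcOptions(limit, maxIterations);
    }

    GcOptions withMaxIterations(int max) {
        return new GcOptions(collectLimit, max);
    }

    int collectLimit() {
        return collectLimit;
    }

    int maxIterations() {
        return maxIterations;
    }

    public static void main(String[] args) {
        GcOptions running = new GcOptions(1000, 10); // handed to a GC run
        GcOptions changed = running.withCollectLimit(5);
        // The instance used by the running GC is unaffected.
        System.out.println(running.collectLimit() + " " + changed.collectLimit());
    }
}
```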



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-10 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904841#comment-15904841
 ] 

Stefan Eissing commented on OAK-4780:
-

Result from run of current version against a large customer database:

h4. Run Timings
{code}
 Started:  2017-03-09 18:39:22.787
   Ended:  2017-03-10 06:56:01.721
Duration:  12.28 hours
   Stats:  VersionGCStats{ignoredGCDueToCheckPoint=true, 
canceled=true, deletedDocGCCount=2011642 (of which leaf: 687880), 
updateResurrectedGCCount=25229600, splitDocGCCount=92, 
intermediateSplitDocGCCount=0, iterationCount=60, timeActive=12,28 h, 
timeToCollectDeletedDocs=0,000 ns, timeToCheckDeletedDocs=0,000 ns, 
timeToSortDocIds=0,000 ns, timeTakenToUpdateResurrectedDocs=0,000 ns, 
timeTakenToDeleteDeletedDocs=0,000 ns, 
timeTakenToCollectAndDeleteSplitDocs=0,000 ns}
{code}
_(stopwatches are not cumulative over runs at the moment, so they show 0 for the total)_

The RGC run was initiated by:
{code}
java -jar target/oak-run-1.8-SNAPSHOT.jar revisions 
mongodb://localhost/ collect
{code}
It performed 60 iterations in a little over 12 hours (the database has ~250 million 
nodes of which 11% had {{_deletedOnce}} set), starting with:
{code}
18:39:22.789 INFO start 1. run (avg duration 0.0 sec)
18:39:22.792 DEBUG No lastOldestTimestamp found, querying for the oldest 
deletedOnce candidate
18:39:37.861 DEBUG lastOldestTimestamp found: 2016-10-20 17:47:49.999
18:39:50.491 DEBUG deletedOnce candidates: 27241631 found, 95000 preferred, 
scope now [2016-10-20 17:47:49.999, 2016-10-21 05:31:15.835]
...
18:39:54.469 DEBUG successful run using 0.0% of limit, raising recommended 
interval to 63308 seconds
18:39:54.471 INFO  Revision garbage collection finished in 3,973 s. 
VersionGCStats{ignoredGCDueToCheckPoint=false, canceled=false, 
deletedDocGCCount=0 (of which leaf: 0), updateResurrectedGCCount=9, 
splitDocGCCount=0, intermediateSplitDocGCCount=0, iterationCount=0, 
timeActive=31,68 s, timeToCollectDeletedDocs=3,935 s, 
timeToCheckDeletedDocs=2,953 ms, timeToSortDocIds=0,000 ns, 
timeTakenToUpdateResurrectedDocs=19,46 ms, timeTakenToDeleteDeletedDocs=0,000 
ns, timeTakenToCollectAndDeleteSplitDocs=13,27 ms}
{code}
until the last one:
{code}
06:56:01.717 INFO start 60. run (avg duration 749.118 sec)
06:56:01.717 DEBUG previous runs recommend a 68783 sec duration, scope now 
[2017-01-29 21:09:00.734, 2017-01-30 16:15:24.416]
06:56:01.717 WARN Ignoring RGC run because a valid checkpoint [revision: 
"r159ebd85640-0-4", clusterId: 4, time: "2017-01-29 21:09:00.736"] exists 
inside minimal scope [2017-01-29 21:09:00.734, 2017-01-29 21:10:00.734].
{code}

h4. Time Interval Handling

The runs enforced the default collection limit (10), so that no disk file 
sorting was necessary. Because the delete candidates were not evenly 
distributed, it aborted 15 runs and halved the time interval used (those runs 
still deleted leaf nodes and reset _deletedOnce for visible documents 
encountered, so they are not "lost").

The time interval was raised again by a factor of 1.5 whenever the collected 
nodes (non-leaf) stayed below 66% of the limit. This happened initially on 
every run, as they encountered only nodes where _deletedOnce needed to be 
reset. Intervals grew from ~9 hours to 13 days, then needed to shrink to 
clean up the lump of really deleted non-leaf docs, using 13.25 minutes as the 
shortest scope for a run.
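
The adaptation rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the patch: the class name {{AdaptiveInterval}}, the lower bound parameter and the whole-second arithmetic are hypothetical, so it will not reproduce the logged values exactly.

```java
// Hypothetical sketch of the interval adaptation described above;
// names and structure are assumptions, not the Oak implementation.
final class AdaptiveInterval {

    static final double GROW_FACTOR = 1.5;     // "raised by a factor of 1.5"
    static final double GROW_THRESHOLD = 0.66; // "below 66% of the limit"

    private final long minSeconds; // assumed lower bound for the scope
    private long seconds;

    AdaptiveInterval(long initialSeconds, long minSeconds) {
        this.seconds = initialSeconds;
        this.minSeconds = minSeconds;
    }

    long seconds() {
        return seconds;
    }

    // Called after a run that stayed within the collection limit.
    void onSuccess(long collected, long limit) {
        if (collected < GROW_THRESHOLD * limit) {
            seconds = (long) (seconds * GROW_FACTOR);
        }
    }

    // Called after a run that was aborted because the limit was exceeded.
    void onLimitExceeded() {
        seconds = Math.max(minSeconds, seconds / 2);
    }

    public static void main(String[] args) {
        AdaptiveInterval interval = new AdaptiveInterval(1000, 60);
        interval.onSuccess(0, 100);  // far below the limit: grow
        System.out.println(interval.seconds());
        interval.onLimitExceeded();  // too many candidates: halve
        System.out.println(interval.seconds());
    }
}
```

Growing slowly and shrinking fast keeps the number of aborted runs small when the delete candidates are clustered in time, at the cost of a few short extra runs.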

In detail:
{code}
18:39:54.469 used 0.0% of limit, rec. interval of 63308 seconds
18:39:54.485 used 0.0% of limit, rec. interval of 94963 seconds
18:39:54.497 used 0.0% of limit, rec. interval of 142444 seconds
18:39:54.509 used 0.0% of limit, rec. interval of 213667 seconds
18:44:01.693 used 0.0% of limit, rec. interval of 320500 seconds
18:44:01.924 used 0.0% of limit, rec. interval of 480750 seconds
18:52:36.998 used 0.0% of limit, rec. interval of 721126 seconds
19:11:03.765 used 0.0% of limit, rec. interval of 1081689 seconds
00:27:51.562 used 0.0% of limit, rec. interval of 1622534 seconds
01:08:25.855 used 0.0% of limit, rec. interval of 2433801 seconds
03:41:52.568 used 0.0% of limit, rec. interval of 3650701 seconds
06:37:03.949 limit exceeded, reducing interval to 762539 seconds
06:41:30.360 used 0.0% of limit, rec. interval of 1143809 seconds
06:42:27.120 limit exceeded, reducing interval to 381269 seconds
06:42:48.380 used 0.0% of limit, rec. interval of 571904 seconds
06:42:51.860 limit exceeded, reducing interval to 190634 seconds
06:43:06.386 limit exceeded, reducing interval to 95317 seconds
06:43:21.085 used 0.0% of limit, rec. interval of 142976 seconds
06:43:23.816 limit exceeded, reducing interval to 71488 seconds
06:43:40.364 used 7.8% of limit, rec. interval of 107232 seconds
06:43:43.166 limit exceeded, reducing interval to 53616 seconds
06:43:59.157 limit exceeded, reducing interval to 13404 seconds
06:43:55.608 limit exceeded, 

[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-09 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903456#comment-15903456
 ] 

Julian Reschke commented on OAK-4780:
-

I'm tracking Stefan's changes in 
https://github.com/reschke/jackrabbit-oak/tree/revision-garbage-collector - 
where I include the RDB specific changes taken from OAK-5855.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-08 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901533#comment-15901533
 ] 

Stefan Eissing commented on OAK-4780:
-

Updated 
https://github.com/apache/jackrabbit-oak/compare/trunk...icing:revision-garbage-collector?expand=1
 with some of the promised changes. Most relevant, this patch now adds the 
command 'revisions' to the oak-run suite.

{{java -jar target/oak-run-1.8-SNAPSHOT.jar revisions mongodb://host/dbname}}

Without further arguments, it will just print information about the last run, 
recommended parameters and the number of delete candidates found. This is the 
command {{info}}, which is the default.

{{collect}} will run the real revision garbage collection. There are some 
options, for example {{--once}} to only run a single iteration. The other 
command currently supported is {{reset}} which clears all persisted meta 
information from rgc.

On a sample customer base with ~250 million nodes and ~25 million delete 
candidates overall, a query for candidates which does not select any node runs 
for about 10 minutes on my developer machine on this database. My initial 
algorithm to find the oldest time did finish, but took over an hour. I made a 
mongo specific query implementation which takes about 3 minutes to find the 
same result. Since this is normally only run once, this seems fine.

It now runs here with unlimited iterations. I will report back tomorrow how it 
went.





[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-07 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899710#comment-15899710
 ] 

Stefan Eissing commented on OAK-4780:
-

 * I will merge current trunk when you have added your OAK-3070 changes.
 * Revision.getCurrentTimestamp(): will change that to store.getClock().getTime()
 * {{maxIterations}} is only good for testing, {{maxDuration}} is good when 
people want to run it in a maintenance window. I know the goal could be to have 
it run all the time, but {{maxDuration}} is for the fallback.
 * {{batchDelay}} as factor - interesting. Will do.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-07 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899438#comment-15899438
 ] 

Marcel Reutegger commented on OAK-4780:
---

bq. Shall it repeat itself when it has not caught up to "now"

I'd say, yes. If needed, the GC can be canceled already.

bq. What is the best value for "precisionMs", the minimal time interval for 
queries?

I don't think a one minute resolution is needed. Maybe it's easier if we define 
how many iterations are done to find the 'oldest' _deletedOnce? But then a time 
is more specific than a rather abstract number of iterations.

Other comments on your patch:

- We should first resolve OAK-3070 and remove that part from your patch. 
- VersionGCSupport.getOldestDeletedOnceTimestamp(long) uses 
System.currentTimeMillis(). Might be useful to use the Clock abstraction 
instead, which allows usage of a virtual clock for tests.
- Similar for VersionGarbageCollector.gc(long, TimeUnit): 
Revision.getCurrentTimestamp() does give you the current time of a Clock, but I 
think it would be better to use the clock from the DocumentNodeStore passed in 
the constructor.
- {{maxIterations}} and {{maxDuration}}: are those really necessary? I think it 
would be easier to use if those are implementation details and all you need to 
do is trigger gc() with a maxRevisionAge. The GC would stop iterations when it 
reaches currentTime - maxRevisionAge or when it is canceled.
- {{batchDelay}}, I like the feature, but would prefer a more adaptive 
approach. That is, have a value that defines the delay multiplier which is 
applied to the time it took for some operation. Let's say it took 500 ms to 
remove a batch of documents and the delay multiplier is 0.5, then the VGC would 
wait 250 ms until it proceeds to the next batch.
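
The adaptive delay suggested in the last point could look roughly like this. It is only a sketch in Python under stated assumptions: {{adaptive_delay_ms}}, {{delete_in_batches}} and the {{delete_batch}} callback are hypothetical names, not part of the patch or of Oak.

```python
import time


def adaptive_delay_ms(elapsed_ms, delay_multiplier):
    """Delay proportional to how long the previous operation took."""
    return elapsed_ms * delay_multiplier


def delete_in_batches(batches, delete_batch, delay_multiplier=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Delete each batch, then pause for a share of the time it took.

    delete_batch is a hypothetical callback that removes one batch of
    documents; sleep and clock are injectable so tests can use fakes.
    """
    for batch in batches:
        started = clock()
        delete_batch(batch)
        elapsed_ms = (clock() - started) * 1000.0
        # E.g. 500 ms of work with a multiplier of 0.5 gives a 250 ms pause.
        sleep(adaptive_delay_ms(elapsed_ms, delay_multiplier) / 1000.0)
```

With a multiplier instead of a fixed delay, a slow (busy) database automatically gets longer pauses, while a fast one is barely held back.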




[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-03 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894387#comment-15894387
 ] 

Stefan Eissing commented on OAK-4780:
-

Updated my github clone with the following:
 * configure {{maxIterations}} that a gc run is allowed to make (default 0 == 
no limit)
 * configure {{maxDuration}} that a gc run might take (default 0 == no limit)
 * configure {{batchDelay}} that gc shall sleep between modification batches 
(default == 0, no delay)

Added a test case for cleanup in iterations.

The idea of how to use these configuration parameters is:
 * use {{maxIterations}} only in test setups where one wants to check the 
immediate results
 * use {{maxDuration}} when the gc runs in a daily (weekly?) maintenance 
window, e.g. during the night, and shall stop iterating when working hours 
resume.
 * use {{batchDelay}} when gc shall run during busy times or all the time, 
e.g. on 24/7 systems. A small delay should prevent the gc from taking over the 
write locks (on db/table/index), depending on the database used.
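
How the three settings could interact in a driver loop is sketched below. This is an illustration in Python, not the code from the github clone: {{run_gc}} and the {{run_iteration}} callback are hypothetical, and for simplicity the sketch treats each iteration as a single modification batch when applying the delay.

```python
import time


def run_gc(run_iteration, max_iterations=0, max_duration_s=0, batch_delay_s=0,
           clock=time.monotonic, sleep=time.sleep):
    """Repeat GC iterations until done or a configured limit is hit.

    run_iteration() is a hypothetical callable that performs one iteration
    and returns True while more work remains. A value of 0 means "no limit"
    (or "no delay"), as in the list above. Returns the iteration count.
    """
    started = clock()
    iterations = 0
    while True:
        more_work = run_iteration()
        iterations += 1
        if not more_work:
            return iterations  # caught up
        if max_iterations and iterations >= max_iterations:
            return iterations  # test setups: stop after a fixed number
        if max_duration_s and clock() - started >= max_duration_s:
            return iterations  # maintenance window is over
        if batch_delay_s:
            sleep(batch_delay_s)  # leave room for other database users
```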





[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-03-01 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890290#comment-15890290
 ] 

Stefan Eissing commented on OAK-4780:
-

Here

https://github.com/apache/jackrabbit-oak/compare/trunk...icing:revision-garbage-collector?expand=1

is a version 0.1 of an incremental, adapting VersionGarbageCollector strategy.

Key points:
- it tracks the lastOldestTimestamp, just like OAK-3070
- it queries only against the newly calculated timestamp (just like trunk now)
- it uses a precisionMs for the minimal time interval to use, configurable
- it enforces a maximum number of collected documents (to be sorted), 
configurable

The algorithm tries to find the best time interval, calculated from the last 
successful run or the oldest revision to be found, so that Garbage Collection 
may run without hitting the limit on the amount of documents to be sorted.

If collection limit is reached, the run is stopped and the time interval for 
the next run is reduced by 50%. So, this shrinks very fast if it encounters a 
time with many positive delete candidates. It grows a bit slower when the 
number of collected docs stayed significantly below the limit.

It has no preferred maintenance interval. It can be called repeatedly. 

Questions:
1. Shall it repeat itself when it has not caught up to "now" (minus 
MaxRevisionAge)?
2. What is the best value for "precisionMs", the minimal time interval for 
queries?




[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-02-21 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875914#comment-15875914
 ] 

Julian Reschke commented on OAK-4780:
-

bq. Julian Reschke any estimation how long that "eventually" will take on a 
never cleaned up 140TB cluster? And what would the GC do when this strategy 
does not lead to success during a maintenance interval?

Too many variables in that question.

bq. Measurements from large clusters showed that the collect phase of a GC can 
take as long as 4 hours - only processing changes from the last day. How would 
this work? Let's say there are 10 million node candidates daily. How would you 
configure the limit to operate in this environment? 

The proposal is about cases where the VGC hasn't been run regularly. If it 
*does* run regularly but still can't keep up, we have a different problem, 
right?

bq. How would the same setting work in a cluster never cleaned up (worst case, 
I know)?

There seem to be two choices: restrict ourselves to a maintenance window, which 
may mean that we'll never recover, or allow the GC to run beyond the maintenance 
window.

(FWIW, we currently do not have a defined window, just an interval and the hope 
that one run has finished before the next is supposed to start)






[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-02-21 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875897#comment-15875897
 ] 

Stefan Eissing commented on OAK-4780:
-

[~reschke] any estimation how long that "eventually" will take on a never 
cleaned up 140TB cluster? And what would the GC do when this strategy does not 
lead to success during a maintenance interval?

Measurements from large clusters showed that the collect phase of a GC can take 
as long as 4 hours - only processing changes from the last day. How would this 
work? Let's say there are 10 million node candidates daily. How would you 
configure the limit to operate in this environment? How would the same setting 
work in a cluster never cleaned up (worst case, I know)?



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-02-21 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875870#comment-15875870
 ] 

Julian Reschke commented on OAK-4780:
-

Here's an approach that might be simpler but in the end achieves the same goal:

- set a limit for the collection phase, both for elapsed time and # of documents
- when the limit is reached, sort the collected IDs by modified date, and compute a 
new upper limit so that half of the documents fall out of range; throw these 
entries away as well
- continue the collection with the smaller time window (this just needs an 
internal API that allows specifying the _id to start with)
- compute new limit for elapsed time (half of the original?)

Eventually, we should have a set of documents that we *can* garbage collect.

Finally, if the maintenance window is still open, just run the GC again.
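
The "drop half of the candidates" step could be sketched as below. This is a rough Python illustration only: the function name and the tuple layout {{(doc_id, modified)}} are assumptions made for the example, not an existing API.

```python
def halve_candidates(candidates):
    """Keep the older half of the candidates collected so far.

    candidates: list of (doc_id, modified) tuples gathered before the
    limit was hit (hypothetical layout). Returns (kept, new_upper_limit),
    where new_upper_limit is an exclusive bound on the modified date to
    use when the scan continues, or None if nothing can be dropped.
    """
    ordered = sorted(candidates, key=lambda c: c[1])
    half = len(ordered) // 2
    if half == 0:
        return ordered, None
    new_upper_limit = ordered[half][1]
    # Everything at or above the new bound is out of range and dropped.
    kept = [c for c in ordered if c[1] < new_upper_limit]
    return kept, new_upper_limit
```

Note that with many identical modified dates the kept part can be smaller than half; the sketch accepts that rather than splitting entries with the same timestamp.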



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-02-20 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874699#comment-15874699
 ] 

Stefan Eissing commented on OAK-4780:
-

Assuming there is the following in place:
1. Time/Revision of last successful run (as patch proposed in OAK-3070)
2. Reset of _deletedOnce flags for nodes that have become visible/valid again 
(not ticket)
3. remove leaf nodes early (implemented via OAK-5571)

An incremental strategy needs to adapt to various situations:

Case A: the RGC starts and finds the Timestamp(TS)/Revision from the last 
successful run
 * A.1: the TS is from last night which should be the 99% case
 * A.2: the TS is from several days/weeks/months before. Possible causes:
   - a) the cluster was down for that time period
   - b) the cluster was down every time the RGC was supposed to run (e.g. other 
system maintenance colliding with daily maintenance window)
   - c) the cluster is in a state where the RGC never finishes successfully
   - d) a backup of the database was restored  

Case B: the RGC starts and does not find any data from a previous run. That 
means
 * B.1: it's a new installation and it runs for the first time
 * B.2: it's an updated installation where an earlier version of the RGC
   - a) has run successfully before
   - b) never did run (successfully)

The question is when to apply which strategy to make RGC reach the stable case 
A.1 again.

Assume a common customer is one who has his system running 24/7 (more or 
less) and has configured a "maintenance window" periodically (every day/night 
or every weekend) where tasks such as RGC are supposed to run. During this time 
interval, load on the system by RGC is acceptable. At other times it is not.

The degenerate case is where the system produces so much to clean up that the 
maintenance time interval is insufficient to perform the necessary tasks. The 
installation needs either a larger maintenance window or more CPU/database 
power or another RGC strategy than the one discussed here. --> TODO

Let's focus on how to do the task under the assumption that it is possible to 
do A.1 in the time given, plus a safety margin of XX% for fluctuations. And 
that XX% needs to be large enough to allow the system to recover from the other 
scenarios.

Example: the maintenance window has a safety margin of 100%, meaning that its 
duration is twice as large as the average work that needs to be done by the RGC 
after a maintenance period (let's assume daily).

100% margin means that the RGC can do 2 days of work during one maintenance run. 
Looking at the scenarios above, this means in
 * A.1: RGC is finished in half the time on average
 * A.2.a: RGC will catch up all work, next run will find A.1 case
 * A.2.b: RGC is able to catch up one day per run. If it did not run for N 
days, it will take N-1 days to reach A.1
 * A.2.c: panic mode, red alert
 * A.2.d: will reach A.1 on the next run
 * B.1: will reach A.1 on the next run
 * B.2.a: could reach A.1, question is how
 * B.2.b: will need several maintenance runs to catch up to A.1, depending on 
the amount of nodes to clean up, but how exactly?
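
The N-1 figure for A.2.b can be checked with a small simulation. This is a sketch in Python under simplifying assumptions (one maintenance run per day, a constant amount of new garbage per day); the function and parameter names are made up for the example.

```python
def runs_to_catch_up(backlog_days, capacity_days_per_run=2.0,
                     new_work_per_day=1.0):
    """Count daily runs until a run starts with one day of work (case A.1).

    capacity_days_per_run=2.0 models the 100% safety margin: each run can
    process two days' worth of garbage.
    """
    if capacity_days_per_run <= new_work_per_day:
        # Corresponds to A.2.c: the RGC can never catch up.
        raise ValueError("capacity must exceed the daily amount of new work")
    runs = 0
    while backlog_days > new_work_per_day:
        # Work done during this maintenance run.
        backlog_days = max(0.0, backlog_days - capacity_days_per_run)
        # Garbage produced until the next run starts.
        backlog_days += new_work_per_day
        runs += 1
    return runs
```

For a backlog of 5 days this yields 4 runs: each run removes two days of work while one new day accumulates, so the backlog shrinks by one day per run.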

And then there is the case where the 100% margin on maintenance time is only 
valid on *average* and that there can be more cleanup work after a busy day. It 
could be more work than what is possible to do in one maintenance run.


Folding A and B cases:

If the TS information is missing, the RGC should select the oldest (_modified) 
node with _deletedOnce and use that as TS (maybe an hour before that).

TODO: find out the cost of such a query on our large scale test clusters

With A/B folded, RGC starts with a last TS and a MaxRevisionAge setting that 
gives a time duration TD. In the normal case A.1, TD < 24 hours (or whatever 
the maintenance interval is). In the abnormal cases, this can be several times 
as long.

Assuming the maintenance time is configured correctly, the chunks that can be 
safely tackled by RGC are 1 maintenance interval. For ease of description, let's 
assume this to be a day. So, we have
{code}
   MP := Maintenance Period (e.g. 24 hours)
   MD := Allowed Maintenance Duration (e.g. 4 hours)
   MStart := time RGC started 
   MEnd := MStart + MD
   LMax := max duration of run (0 at start)

   while Now() < MEnd - LMax:
       TS := as persisted or oldest _deletedOnce node
       TD := MStart - MaxRevisionAge - TS
       TM := Min( TD, MP )
       collectAndInspectNodes({ _deletedOnce: 1, TS < _modified < ( TS + TM ) })
       save TS := TS + TM
       LMax := Max( LMax, duration of this run)
{code}
This tries to clean up nodes in the time interval of 1 maintenance period, 
repeatedly, until the time for maintenance runs out (by taking the max time of 
each iteration into account).
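
For experimenting with the loop, here is one possible runnable rendering in Python. It is a sketch, not a proposal for the real implementation: {{now}} and {{collect}} are hypothetical callables, all times are plain seconds, and the early exit when TD reaches zero is an addition of the sketch (the pseudocode does not spell out what happens once the RGC has caught up).

```python
def run_maintenance(ts, max_revision_age, mp, md, now, collect):
    """Run RGC slices until the maintenance window is used up.

    ts:      persisted timestamp of the last run (or of the oldest
             _deletedOnce node if nothing was persisted)
    mp, md:  maintenance period and allowed maintenance duration
    now:     hypothetical clock function returning the current time
    collect: hypothetical function (lower, upper) that cleans up nodes
             with lower < _modified < upper
    Returns the new TS to persist.
    """
    m_start = now()
    m_end = m_start + md
    l_max = 0  # longest iteration seen so far
    while now() < m_end - l_max:
        td = m_start - max_revision_age - ts
        if td <= 0:
            break  # caught up (assumption of this sketch, see above)
        tm = min(td, mp)
        began = now()
        collect(ts, ts + tm)
        ts += tm  # would be persisted after each successful slice
        l_max = max(l_max, now() - began)
    return ts
```

Subtracting the longest iteration seen so far from the end of the window means a new slice is only started if a slice of that size still fits.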

Example: the daily maintenance has 2 hours, the last run was 3 days ago. In the 
first
iteration, the RGC would collect nodes deleted and 

[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-02-17 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871632#comment-15871632
 ] 

Julian Reschke commented on OAK-4780:
-

To cover the case of a VGC that has never run, or one that hasn't run for a 
long time, we'll need a way to constrain the scope of the scan to a smaller 
time window.

One method would be, as previously proposed, to time-box the collection phase, 
to track the lastmod range and, upon reaching the limit, to give up and retry 
with a smaller time window.

An alternative might be to extend the {{VersionGCSupport}} even more and to 
allow just returning the *count* of candidate nodes. With that information, we 
might be able to shrink the time window of the candidate scan *before* actually 
trying to collect the nodes.
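
The count-based alternative might be used like this. The sketch is in Python and purely illustrative: {{count_candidates}} stands for the proposed count-only query and does not exist in {{VersionGCSupport}} today; the halving strategy and the {{min_window}} floor are assumptions.

```python
def pick_window(count_candidates, lower, upper, limit, min_window):
    """Shrink (lower, upper) until the candidate count fits the limit.

    count_candidates(lower, upper) is a hypothetical count-only query;
    no documents are collected while the window is being narrowed.
    """
    while (upper - lower > min_window
           and count_candidates(lower, upper) > limit):
        # Too many candidates: halve the window, keeping the older part.
        upper = lower + (upper - lower) // 2
    return lower, upper
```

Whether this pays off depends on how cheap the count query is compared to an aborted collection attempt.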



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-02-17 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871594#comment-15871594
 ] 

Julian Reschke commented on OAK-4780:
-

[~stefan.eissing] - OAK-5571 improves the situation, but IMHO doesn't 
completely reach the goal. Optimally, the version GC will "eventually" be able 
to complete garbage collection, no matter for how long it did not run before.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-02-17 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871558#comment-15871558
 ] 

Stefan Eissing commented on OAK-4780:
-

[~reschke] With OAK-5571 committed in trunk, this one can be closed, right?



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-02-02 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849819#comment-15849819
 ] 

Stefan Eissing commented on OAK-4780:
-

+1 for OAK-5571

With regard to resilience, the VGC needs improvements when it
# encounters and needs to cope with huge amounts of delete candidates
# needs to work "during the day" where other vital tasks also need capacity

One such improvement was already discussed, namely tagging the repository with 
the timestamp of the last successful run and only collecting delete candidates 
in that time interval.

Another idea is to adjust the collected time interval so that the amount of 
collected nodes can be kept in check. This can be done either:
# statically, by setting a max time interval to clean. Example: max is 1.5 
days, last clean was 5 days ago, so collect only nodes modified between -5 days 
and -3.5 days.
# dynamically, store the proposed time interval for collection in the 
repository. Collect nodes in that interval since the last cleanup. On reaching 
a threshold, e.g. 100,000, abort collecting, shrink the time interval and try 
again. When it runs through, process the next interval. Compare the amount 
collected each time and grow the time interval if there is room.
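
The static variant is simple enough to sketch. The Python below is only an illustration: the function name is made up, times are in days relative to now (0 = now, negative = past), and MaxRevisionAge is ignored for brevity.

```python
def static_windows(last_clean_days_ago, max_interval_days):
    """Yield (start, end) slices from the last cleanup up to now."""
    if max_interval_days <= 0:
        raise ValueError("max_interval_days must be positive")
    start = -float(last_clean_days_ago)
    while start < 0:
        # Each slice covers at most max_interval_days and never passes now.
        end = min(0.0, start + max_interval_days)
        yield start, end
        start = end
```

With the numbers from the example (last clean 5 days ago, max 1.5 days), the first slice is from -5 to -3.5 days, followed by -3.5 to -2, -2 to -0.5 and -0.5 to 0.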

Another idea would be to sleep() in between delete batches, in order to avoid 
spamming the database.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-02-02 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849614#comment-15849614
 ] 

Julian Reschke commented on OAK-4780:
-

As the proposed change really is not about making things run incrementally, I 
have opened a separate ticket for *this* specific change: OAK-5571.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828487#comment-15828487
 ] 

Julian Reschke commented on OAK-4780:
-

OK, so here's a question: why can't we delete leaf nodes that have previous 
documents eagerly (as long as we make sure we remove the previous documents 
later)?



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828482#comment-15828482
 ] 

Julian Reschke commented on OAK-4780:
-

Yep, mine. As long as it's not committed...



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828405#comment-15828405
 ] 

Stefan Eissing commented on OAK-4780:
-

[~mreutegg], thanks for the review. I clearly lay the blame for the 
System.err.println() on Julian's patch. Will look into the other things 
tomorrow.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828372#comment-15828372
 ] 

Marcel Reutegger commented on OAK-4780:
---

[~stefan.eissing], thanks for the patch. A couple of comments.

- There's a System.err.println() in GCJob.gc()
- I would prefer Guava's collection factory methods over the diamond 
operator. That makes it easier to backport changes to branches that are still 
on Java 6. I'm not suggesting we should backport this change to 1.2. This is 
just a general note.
- The check whether a document references previous documents can be simplified 
to: {{doc.getPreviousRanges().isEmpty()}}.
- Please add tests for the new functionality



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828227#comment-15828227
 ] 

Stefan Eissing commented on OAK-4780:
-

Ah, yes. I worked in the change so that split leaf nodes do not get deleted 
right away.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827799#comment-15827799
 ] 

Marcel Reutegger commented on OAK-4780:
---

The VGC will abort when an operation fails with an exception, yes. Orphaned 
previous documents will get collected in a next run when the VGC collects old 
previous documents. See {{VersionGCSupport.deleteSplitDocuments()}}.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827723#comment-15827723
 ] 

Julian Reschke commented on OAK-4780:
-

[~mreutegg] - another question - am I right that if any of the delete 
operations fails, the whole VGC will abort? What would happen with "orphaned" 
prev docs then?



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Stefan Eissing (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827716#comment-15827716
 ] 

Stefan Eissing commented on OAK-4780:
-

Since Julian is currently busy with other things, I will give it a shot, take 
his patch and continue along those lines. I plan to keep the whole thing 
single-threaded, as several people said that parallel deletes against DBs like 
MongoDB do not benefit performance.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827717#comment-15827717
 ] 

Julian Reschke commented on OAK-4780:
-

Yes, work in progress - something else came up and thus I posted what I 
currently have.

The tricky part is to actually take advantage of the leaf node characteristics 
without making the code change too intrusive.



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-18 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827707#comment-15827707
 ] 

Marcel Reutegger commented on OAK-4780:
---

I like the idea of deleting leaf documents eagerly. AFAICS, the patch still 
deletes leaf documents in the last phase of the garbage collection. Is this 
just work in progress? The garbage collector could remove a batch of leaf 
documents as soon as it reaches a threshold and it could collect those paths in 
an in-memory set. To simplify things, it is probably best to only consider leaf 
documents without links to previous documents. The garbage collector must only 
remove previous documents when the associated main document was successfully 
deleted (i.e. no concurrent modification of the document).
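
The batching idea described here might look roughly like the sketch below. It is not the attached patch: {{EagerLeafCollector}}, {{Candidate}} and the delete callback are hypothetical names, and the actual deletion (including the check for concurrent modification) is left to the callback.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch, not part of Oak: deletes simple leaf documents in batches during collection. */
final class EagerLeafCollector {

    /** Hypothetical, simplified view of a delete candidate. */
    static final class Candidate {
        final String path;
        final boolean hasChildren;
        final boolean hasPreviousDocs;

        Candidate(String path, boolean hasChildren, boolean hasPreviousDocs) {
            this.path = path;
            this.hasChildren = hasChildren;
            this.hasPreviousDocs = hasPreviousDocs;
        }
    }

    private final int threshold;
    private final Consumer<Set<String>> deleteLeaves;
    private final Set<String> pendingLeaves = new LinkedHashSet<>();
    private final List<String> deferred = new ArrayList<>();

    EagerLeafCollector(int threshold, Consumer<Set<String>> deleteLeaves) {
        this.threshold = threshold;
        this.deleteLeaves = deleteLeaves;
    }

    /** Leaf documents without previous documents are batched; everything else waits for the last phase. */
    void add(Candidate c) {
        if (!c.hasChildren && !c.hasPreviousDocs) {
            pendingLeaves.add(c.path);
            if (pendingLeaves.size() >= threshold) {
                flush();
            }
        } else {
            deferred.add(c.path);
        }
    }

    /** Deletes whatever is pending; call once more at the end of the collection phase. */
    void flush() {
        if (!pendingLeaves.isEmpty()) {
            deleteLeaves.accept(new LinkedHashSet<>(pendingLeaves));
            pendingLeaves.clear();
        }
    }

    /** Paths that still need the regular, ordered deletion. */
    List<String> deferred() {
        return deferred;
    }
}
```

Because the pending set is emptied after every batch, memory use for leaf paths stays bounded by the threshold, and an interrupted run has already removed all batches flushed so far.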



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-13 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821886#comment-15821886
 ] 

Julian Reschke commented on OAK-4780:
-

Maybe we can shortcut certain operations for nodes where _children != true? In 
those cases, deletion order really doesn't matter, right?

(We just counted nodes on a large instance, and approximately 1/3 of the nodes 
with _deletedOnce == true were leaf nodes)
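
As a rough illustration of that check, assuming a document is available as a plain map of its fields (the {{LeafCheck}} class and the map representation are hypothetical; only the field names {{_children}} and {{_deletedOnce}} come from the discussion):

```java
import java.util.Map;

/** Hypothetical helper, not part of Oak: identifies candidates whose deletion order does not matter. */
final class LeafCheck {
    private LeafCheck() {}

    /**
     * True for documents that were deleted at least once and are not flagged
     * as having children. A missing _children field counts as "no children",
     * matching the "_children != true" condition.
     */
    static boolean isLeafCandidate(Map<String, Object> doc) {
        return Boolean.TRUE.equals(doc.get("_deletedOnce"))
                && !Boolean.TRUE.equals(doc.get("_children"));
    }
}
```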



[jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally

2017-01-11 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818348#comment-15818348
 ] 

Julian Reschke commented on OAK-4780:
-

...or by adding logic that pushes the lastmod boundary if the collection phase 
finds too many deletion candidates...
