Re: [MongoMK] BlobStore garbage collection

2012-11-06 Thread Thomas Mueller
Hi,

1- What's considered an old node or commit? Technically, anything other
than the head revision is old but can we remove them right away or do we
need to retain a number of revisions? If the latter, then how far back do
we need to retain?

we discussed this a while back, no good solution back then[1]

Yes. Somebody has to decide which revisions are no longer needed. Luckily
it doesn't need to be us :-) We might set a default value (10 minutes or
so), and then give the user the ability to change that, depending on
whether he cares more about disk space or the ability to read old data /
roll back to an old state.

To free up disk space, BlobStore garbage collection is actually more
important, because usually 90% of the disk space is used by the BlobStore.
So it would be nice if items (files) in the BlobStore are deleted as soon
as possible after deleting old revisions. In Jackrabbit 2.x we have seen
that node and data store garbage collection that has to traverse the whole
repository is problematic if the repository is large. So garbage
collection can be a scalability issue: if we have to traverse all
revisions of all nodes in order to delete unused data, we basically tie
garbage collection speed to repository size, unless we find a way to
run it in parallel. But running mark & sweep garbage collection completely
in parallel is not easy (is it even possible? If so, I would have guessed
modern JVMs would have had it for a long time). So I think if we don't need
to traverse the repository to delete old nodes, but just traverse the
journal, this would be much less of a problem.
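The journal-scan idea could be sketched roughly as follows. This is a hypothetical illustration only (none of these classes exist in the MicroKernel API): instead of traversing all nodes, we walk the journal of commits, retain the revisions within the configured window, and treat blobs referenced only by expired revisions as garbage.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: collect garbage by scanning only the journal
// (the ordered list of commits), not the whole node tree. The cost is
// proportional to journal length, not repository size.
public class JournalGcSketch {

    static class Commit {
        final long timestamp;       // commit time in ms
        final Set<String> blobRefs; // blob ids referenced by this revision
        Commit(long timestamp, Set<String> blobRefs) {
            this.timestamp = timestamp;
            this.blobRefs = blobRefs;
        }
    }

    /** Returns blob ids no longer referenced by any retained revision. */
    static Set<String> collectGarbageBlobs(List<Commit> journal,
                                           long now, long retentionMillis) {
        Set<String> retained = new HashSet<>();
        Set<String> candidates = new HashSet<>();
        for (Commit c : journal) {
            if (now - c.timestamp <= retentionMillis) {
                retained.addAll(c.blobRefs);   // still within the window
            } else {
                candidates.addAll(c.blobRefs); // garbage unless retained elsewhere
            }
        }
        candidates.removeAll(retained);
        return candidates;
    }

    public static void main(String[] args) {
        List<Commit> journal = new ArrayList<>();
        journal.add(new Commit(0, Set.of("a", "b")));         // old revision
        journal.add(new Commit(1_000_000, Set.of("b", "c"))); // recent revision
        // 10 minute retention window, "now" = 1_000_000
        Set<String> garbage = collectGarbageBlobs(journal, 1_000_000, 600_000);
        System.out.println(garbage); // only "a": referenced solely by the old revision
    }
}
```

Note that "b" survives even though the old revision referenced it, because a retained revision still does; that is exactly why the retained set must be computed before sweeping.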

Regards,
Thomas



Re: [MongoMK] BlobStore garbage collection

2012-11-06 Thread Mete Atamel
Hi,

On 11/6/12 9:24 AM, Thomas Mueller muel...@adobe.com wrote:

Yes. Somebody has to decide which revisions are no longer needed. Luckily
it doesn't need to be us :-) We might set a default value (10 minutes or
so), and then give the user the ability to change that, depending on
whether he cares more about disk space or the ability to read old data /
roll back to an old state.

If we go down this path for node GC, doesn't MicroKernel interface have to
change to account for this? Where would you change this default 10 minutes
value as far as MicroKernel is concerned?

-Mete



Re: [MongoMK] BlobStore garbage collection

2012-11-06 Thread Thomas Mueller
Hi,

If we go down this path for node GC

With this path of node GC, do you mean the ability to configure the
lifetime of a revision?

, doesn't MicroKernel interface have to
change to account for this? Where would you change this default 10 minutes
value as far as MicroKernel is concerned?

I think it would be nice to have a configuration API, but I'm not sure what
it should look like exactly. Possibly it's simpler to call this an
implementation detail.
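As an implementation detail, such a setting might look like the following. This is purely a hypothetical sketch; no such class exists in the MicroKernel API:

```java
// Hypothetical sketch of an implementation-specific (rather than
// MicroKernel-level) retention setting. Names are illustrative only.
public class GcConfigSketch {

    static class RevisionGcOptions {
        private long retentionMillis = 10 * 60 * 1000; // default: 10 minutes

        RevisionGcOptions retention(long millis) {
            this.retentionMillis = millis;
            return this;
        }

        long retentionMillis() {
            return retentionMillis;
        }
    }

    public static void main(String[] args) {
        // A user who cares more about reading old data / rolling back
        // lengthens the window; one who cares about disk space shortens it.
        RevisionGcOptions opts = new RevisionGcOptions().retention(60 * 60 * 1000);
        System.out.println(opts.retentionMillis()); // 3600000
    }
}
```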

Regards,
Thomas



Re: [MongoMK] BlobStore garbage collection

2012-11-06 Thread Stefan Guggisberg
On Tue, Nov 6, 2012 at 9:45 AM, Mete Atamel mata...@adobe.com wrote:
 Hi,

 On 11/6/12 9:24 AM, Thomas Mueller muel...@adobe.com wrote:

Yes. Somebody has to decide which revisions are no longer needed. Luckily
it doesn't need to be us :-) We might set a default value (10 minutes or
so), and then give the user the ability to change that, depending on
whether he cares more about disk space or the ability to read old data /
roll back to an old state.

 If we go down this path for node GC, doesn't MicroKernel interface have to
 change to account for this? Where would you change this default 10 minutes
 value as far as MicroKernel is concerned?

there's a jira issue [0]. so far we've not been able to resolve this issue.

there's no single 'right' retention policy as different use cases imply
different strategies.

personally i tend to not specify a retention policy on the API level
but rather leave it implementation specific (configurable).
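Leaving the policy out of the API could look like this: the interface stays policy-free, and each implementation plugs in its own rule. A hypothetical sketch (these types are illustrative, not part of the MicroKernel contract):

```java
// Hypothetical sketch: the retention decision lives behind an
// implementation-specific interface, so the public API never has to
// commit to one policy. Names are illustrative only.
public class RetentionPolicySketch {

    /** Implementation-internal hook; not part of any public API. */
    interface RetentionPolicy {
        boolean isExpired(long revisionTimestamp, long now);
    }

    /** One possible policy: keep everything newer than a fixed window. */
    static RetentionPolicy timeWindow(long windowMillis) {
        return (ts, now) -> now - ts > windowMillis;
    }

    /** Another: keep everything, e.g. for audit-style use cases. */
    static RetentionPolicy keepForever() {
        return (ts, now) -> false;
    }

    public static void main(String[] args) {
        RetentionPolicy tenMinutes = timeWindow(600_000);
        System.out.println(tenMinutes.isExpired(0, 1_000_000));   // true
        System.out.println(keepForever().isExpired(0, 1_000_000)); // false
    }
}
```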

cheers
stefan

[0] https://issues.apache.org/jira/browse/OAK-114


 -Mete



Re: [MongoMK] BlobStore garbage collection

2012-11-06 Thread Michael Dürig



On 5.11.12 13:04, Thomas Mueller wrote:

If possible, I would try to avoid having to traverse over the whole
repository.


Isn't node and revision store GC in the current MicrokernelImpl doing 
exactly that? AFAIR it implements a mark and sweep algorithm which 
periodically traverses the whole repository.
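The mark and sweep scheme in question can be sketched as below. The classes are hypothetical (not the actual MicrokernelImpl code), but they show why the mark phase scales with repository size: it must visit every node reachable from the retained revisions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of mark and sweep over a node store: mark every
// node reachable from the retained revision roots, then sweep (delete)
// everything stored but unmarked. The mark phase touches the whole
// reachable tree, which is the scalability concern raised in the thread.
public class MarkSweepSketch {

    static class Node {
        final String id;
        final List<Node> children;
        Node(String id, List<Node> children) {
            this.id = id;
            this.children = children;
        }
    }

    /** Mark phase: traverse from the roots of all retained revisions. */
    static Set<String> mark(List<Node> retainedRoots) {
        Set<String> marked = new HashSet<>();
        Deque<Node> stack = new ArrayDeque<>(retainedRoots);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (marked.add(n.id)) {
                stack.addAll(n.children);
            }
        }
        return marked;
    }

    /** Sweep phase: everything stored but not marked is garbage. */
    static Set<String> sweep(Set<String> allStoredIds, Set<String> marked) {
        Set<String> garbage = new HashSet<>(allStoredIds);
        garbage.removeAll(marked);
        return garbage;
    }

    public static void main(String[] args) {
        Node leaf = new Node("leaf", List.of());
        Node root = new Node("root", List.of(leaf));
        Set<String> marked = mark(List.of(root));
        System.out.println(sweep(Set.of("root", "leaf", "orphan"), marked));
        // only the node unreachable from the retained root is swept
    }
}
```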


Michael


RE: svn commit: r1406080 - in /jackrabbit/oak/trunk: oak-core/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/nodetype/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/p

2012-11-06 Thread Marcel Reutegger
Hi,

I separated the two because I think it's the responsibility of the query
engine to use multiple indexes based on cost when there is more than
one restriction on a filter. This is not implemented right now, but I think
we will have to do that anyway to efficiently execute this query:

//element(*, nt:resource)[jcr:contains(., 'foo')]

the node type index can take care of 'element(*, nt:resource)' but for
the jcr:contains(., 'foo') we'd probably want to leverage the lucene
index implementation. neither of the two should know about the
other.
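The cost-based selection Marcel describes could be sketched like this. The interfaces here are illustrative stand-ins, not the actual Oak query index API: the engine asks each index what a restriction would cost and picks the cheapest, so the node type index and the lucene index never need to know about each other.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of cost-based index selection: for a query with
// several restrictions, the engine asks each index for a cost estimate
// per restriction and routes each restriction to the cheapest index.
public class CostBasedSelectionSketch {

    interface Index {
        String name();
        double cost(String restriction); // estimated cost; infinity = cannot handle
    }

    static Index cheapest(List<Index> indexes, String restriction) {
        return indexes.stream()
                .min(Comparator.comparingDouble((Index i) -> i.cost(restriction)))
                .orElseThrow();
    }

    /** Two toy indexes, each able to answer only one kind of restriction. */
    static List<Index> exampleIndexes() {
        Index nodeType = new Index() {
            public String name() { return "nodetype"; }
            public double cost(String r) {
                return r.startsWith("element(") ? 100 : Double.POSITIVE_INFINITY;
            }
        };
        Index lucene = new Index() {
            public String name() { return "lucene"; }
            public double cost(String r) {
                return r.startsWith("jcr:contains") ? 50 : Double.POSITIVE_INFINITY;
            }
        };
        return List.of(nodeType, lucene);
    }

    public static void main(String[] args) {
        List<Index> all = exampleIndexes();
        // each restriction of the example query goes to a different index
        System.out.println(cheapest(all, "element(*, nt:resource)").name()); // nodetype
        System.out.println(cheapest(all, "jcr:contains(., 'foo')").name());  // lucene
    }
}
```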

Regards
 Marcel

 -Original Message-
 From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
 Sent: Dienstag, 6. November 2012 11:18
 To: Oak devs
 Subject: Re: svn commit: r1406080 - in /jackrabbit/oak/trunk: oak-core/ oak-
 core/src/main/java/org/apache/jackrabbit/oak/plugins/index/nodetype/
 oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/property/
 oak-core/src/main/java/org/apache/jackrabbi...
 
 Hi,
 
 On Tue, Nov 6, 2012 at 11:04 AM,  mreut...@apache.org wrote:
  Added:
  jackrabbit/oak/trunk/oak-
 core/src/main/java/org/apache/jackrabbit/oak/plugins/index/nodetype/
 
 Do we need a separate index implementation for this? I'd rather simply
 have this functionality as a part of PropertyIndex. That way a query
 that combines node type and property restrictions could still be
 efficiently executed.
 
 BR,
 
 Jukka Zitting


Re: [MongoMK] BlobStore garbage collection

2012-11-06 Thread Michael Marth
this might be a weird question from left field, but are we actually sure 
that the existing data store concept is worth the trouble? afaiu it saves us 
from storing the same binary twice, but leads into the DSGC topic. would it be 
possible to make it optional to store/address binaries by hash (and thus not 
need DSGC for these configurations)?

In any case we should definitely avoid requiring repo traversal for DSGC. This 
would operationally limit the repo sizes Oak can support.
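The "store by hash" idea can be sketched in a few lines. This is a hypothetical illustration (class and method names are invented): identical binaries hash to the same id, so the second store is a no-op and the content is kept only once, which is exactly the deduplication that makes deletion (and hence DSGC) tricky, since dropping one reference does not prove no other revision still points at the blob.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a content-addressed blob store: the blob id is
// the SHA-256 of the content, so identical binaries are stored once.
public class HashBlobStoreSketch {

    private final Map<String, byte[]> blobs = new HashMap<>();

    /** Stores the blob under its content hash; duplicates are no-ops. */
    String put(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(String.format("%02x", b));
        }
        String id = sb.toString();
        blobs.putIfAbsent(id, data); // deduplication: same content, same id
        return id;
    }

    int size() {
        return blobs.size();
    }

    public static void main(String[] args) throws Exception {
        HashBlobStoreSketch store = new HashBlobStoreSketch();
        String id1 = store.put("same content".getBytes(StandardCharsets.UTF_8));
        String id2 = store.put("same content".getBytes(StandardCharsets.UTF_8));
        System.out.println(id1.equals(id2)); // true: stored once, not twice
        System.out.println(store.size());    // 1
    }
}
```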


--
Michael Marth | Engineering Manager
+41 61 226 55 22 | mma...@adobe.com
Barfüsserplatz 6, CH-4001 Basel, Switzerland
