Re: [MongoMK] BlobStore garbage collection
Hi,

> 1- What's considered an old node or commit? Technically, anything other than the head revision is old, but can we remove them right away, or do we need to retain a number of revisions? If the latter, then how far back do we need to retain?

> we discussed this a while back, no good solution back then [1]

Yes. Somebody has to decide which revisions are no longer needed. Luckily it doesn't need to be us :-) We might set a default value (10 minutes or so), and then give the user the ability to change that, depending on whether he cares more about disk space or the ability to read old data / roll back to an old state.

To free up disk space, BlobStore garbage collection is actually more important, because usually 90% of the disk space is used by the BlobStore. So it would be nice if items (files) in the BlobStore were deleted as soon as possible after old revisions are deleted.

In Jackrabbit 2.x we have seen that node and data store garbage collection that has to traverse the whole repository is problematic if the repository is large. So garbage collection can be a scalability issue: if we have to traverse all revisions of all nodes in order to delete unused data, we basically tie garbage collection speed to repository size, unless we find a way to run it in parallel. But running mark-sweep garbage collection completely in parallel is not easy (is it even possible? If so, I would have expected modern JVMs to have had it for a long time). So I think if we don't need to traverse the repository to delete old nodes, but can just traverse the journal, this would be much less of a problem.

Regards,
Thomas
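To make the journal-based idea concrete, here is a minimal mark-sweep sketch driven by the journal rather than a full repository traversal. All names here (`JournalEntry`, `sweepableBlobs`, the retention parameter) are hypothetical illustrations, not the actual MicroKernel or BlobStore API:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class JournalBlobGc {

    // Hypothetical journal entry: the commit timestamp plus the blob ids
    // that revision references. A real journal would hold diffs/commits.
    record JournalEntry(long timestampMillis, Set<String> blobIds) {}

    /**
     * Mark phase driven by the journal: every blob referenced by a revision
     * younger than the retention window is considered live; everything else
     * in the store is a sweep candidate. Cost is proportional to the journal
     * within the retention window, not to repository size.
     */
    static Set<String> sweepableBlobs(List<JournalEntry> journal,
                                      Set<String> allBlobIds,
                                      long now, long retentionMillis) {
        Set<String> live = new HashSet<>();
        for (JournalEntry e : journal) {
            if (now - e.timestampMillis() < retentionMillis) {
                live.addAll(e.blobIds());          // mark
            }
        }
        Set<String> sweepable = new HashSet<>(allBlobIds);
        sweepable.removeAll(live);                 // sweep candidates
        return sweepable;
    }
}
```

Note this sketch assumes head-revision references also appear in recent journal entries; a real implementation would additionally mark everything reachable from the head.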
Re: [MongoMK] BlobStore garbage collection
Hi,

On 11/6/12 9:24 AM, Thomas Mueller muel...@adobe.com wrote:
> Yes. Somebody has to decide which revisions are no longer needed. Luckily it doesn't need to be us :-) We might set a default value (10 minutes or so), and then give the user the ability to change that, depending on whether he cares more about disk space or the ability to read old data / roll back to an old state.

If we go down this path for node GC, doesn't the MicroKernel interface have to change to account for this? Where would you change this default 10-minute value as far as the MicroKernel is concerned?

-Mete
Re: [MongoMK] BlobStore garbage collection
Hi,

> If we go down this path for node GC

With this path of node GC, do you mean the ability to configure the lifetime of a revision?

> doesn't the MicroKernel interface have to change to account for this? Where would you change this default 10-minute value as far as the MicroKernel is concerned?

I think it would be nice to have a configuration API, but I'm not sure what it should look like exactly. Possibly it's simpler to call this an implementation detail.

Regards,
Thomas
Re: [MongoMK] BlobStore garbage collection
On Tue, Nov 6, 2012 at 9:45 AM, Mete Atamel mata...@adobe.com wrote:
> Hi,
>
> On 11/6/12 9:24 AM, Thomas Mueller muel...@adobe.com wrote:
>> Yes. Somebody has to decide which revisions are no longer needed. Luckily it doesn't need to be us :-) We might set a default value (10 minutes or so), and then give the user the ability to change that, depending on whether he cares more about disk space or the ability to read old data / roll back to an old state.
>
> If we go down this path for node GC, doesn't the MicroKernel interface have to change to account for this? Where would you change this default 10 minutes value as far as the MicroKernel is concerned?
>
> -Mete

there's a jira issue [0]. so far we've not been able to resolve it: there's no single 'right' retention policy, as different use cases imply different strategies. personally i tend to not specify a retention policy on the API level but rather leave it implementation-specific (configurable).

cheers
stefan

[0] https://issues.apache.org/jira/browse/OAK-114
Re: [MongoMK] BlobStore garbage collection
On 5.11.12 13:04, Thomas Mueller wrote:
> If possible, I would try to avoid having to traverse over the whole repository.

Isn't node and revision store GC in the current MicroKernelImpl doing exactly that? AFAIR it implements a mark-and-sweep algorithm which periodically traverses the whole repository.

Michael
RE: svn commit: r1406080 - in /jackrabbit/oak/trunk: oak-core/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/nodetype/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/p
Hi,

I separated the two because I think it's the responsibility of the query engine to use multiple indexes based on cost when there is more than one restriction on a filter. This is not implemented right now, but I think we will have to do that anyway to efficiently execute this query:

//element(*, nt:resource)[jcr:contains(., 'foo')]

The node type index can take care of element(*, nt:resource), but for jcr:contains(., 'foo') we'd probably want to leverage the Lucene index implementation. Neither of the two should know about the other.

Regards
Marcel

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
Sent: Dienstag, 6. November 2012 11:18
To: Oak devs
Subject: Re: svn commit: r1406080 - in /jackrabbit/oak/trunk: oak-core/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/nodetype/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/property/ oak-core/src/main/java/org/apache/jackrabbi...

Hi,

On Tue, Nov 6, 2012 at 11:04 AM, mreut...@apache.org wrote:
> Added: jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/nodetype/

Do we need a separate index implementation for this? I'd rather simply have this functionality as a part of PropertyIndex. That way a query that combines node type and property restrictions could still be efficiently executed.

BR,
Jukka Zitting
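The cost-based dispatch described above could look roughly like the following sketch. The interface is deliberately simplified and is not Oak's actual query-index API; the index names and the filter representation are made up for illustration:

```java
import java.util.List;
import java.util.Map;

public class CostBasedPlanner {

    // Hypothetical, simplified index interface: each index reports an
    // estimated cost for a filter; it knows nothing about other indexes.
    interface QueryIndex {
        String name();
        double cost(Map<String, String> filter);  // lower is cheaper
    }

    // The query engine picks the cheapest index for the given filter.
    static QueryIndex pickIndex(List<QueryIndex> indexes, Map<String, String> filter) {
        QueryIndex best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (QueryIndex idx : indexes) {
            double c = idx.cost(filter);
            if (c < bestCost) {
                bestCost = c;
                best = idx;
            }
        }
        return best;
    }
}
```

With this shape, a fulltext restriction would make the Lucene index report a low cost and the node type index a high one, so the engine delegates to Lucene without either index knowing about the other.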
Re: [MongoMK] BlobStore garbage collection
this might be a weird question from leftfield, but are we actually sure that the existing data store concept is worth the trouble? afaiu it saves us from storing the same binary twice, but leads into the DSGC topic. would it be possible to make it optional to store/address binaries by hash (and thus not need DSGC for those configurations)?

In any case we should definitely avoid requiring repo traversal for DSGC. That would operationally limit the repo sizes Oak can support.

--
Michael Marth | Engineering Manager
+41 61 226 55 22 | mma...@adobe.com
Barfüsserplatz 6, CH-4001 Basel, Switzerland

On Nov 6, 2012, at 9:24 AM, Thomas Mueller wrote:
> Hi,
>
> > 1- What's considered an old node or commit? Technically, anything other than the head revision is old but can we remove them right away or do we need to retain a number of revisions? If the latter, then how far back do we need to retain?
>
> > we discussed this a while back, no good solution back then [1]
>
> Yes. Somebody has to decide which revisions are no longer needed. Luckily it doesn't need to be us :-) We might set a default value (10 minutes or so), and then give the user the ability to change that, depending on whether he cares more about disk space or the ability to read old data / roll back to an old state.
>
> To free up disk space, BlobStore garbage collection is actually more important, because usually 90% of the disk space is used by the BlobStore. So it would be nice if items (files) in the BlobStore are deleted as soon as possible after deleting old revisions.
>
> In Jackrabbit 2.x we have seen that node and data store garbage collection that has to traverse the whole repository is problematic if the repository is large. So garbage collection can be a scalability issue: if we have to traverse all revisions of all nodes in order to delete unused data, we basically tie garbage collection speed to repository size, unless we find a way to run it in parallel. But running mark-sweep garbage collection completely in parallel is not easy (is it even possible? if yes I would have guessed modern JVMs should have had it for a long time). So I think if we don't need to traverse the repository to delete old nodes, but just traverse the journal, this would be much less of a problem.
>
> Regards,
> Thomas
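The "address binaries by hash" idea mentioned above can be illustrated with a minimal in-memory sketch, where the blob id is the SHA-256 of the content, so writing the same binary twice stores it once. Class and method names are made up for illustration; a real BlobStore would persist to disk or MongoDB rather than a map:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class HashAddressedStore {

    private final Map<String, byte[]> blobs = new HashMap<>();

    /** Stores content under its SHA-256 hex digest; duplicate writes are no-ops. */
    public String put(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            String id = sb.toString();
            blobs.putIfAbsent(id, content);   // same content -> same id -> stored once
            return id;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required by the JRE spec
        }
    }

    public byte[] get(String id) {
        return blobs.get(id);
    }

    public int size() {
        return blobs.size();
    }
}
```

The trade-off the thread discusses remains: deduplication by hash means a blob may be shared by many revisions, so deciding when a blob is truly unreferenced (DSGC) still needs some form of reference tracking or marking.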