Re: fixVersion
Hi Julian, Thx for that. That was indeed an unlucky typo from my side. Cheers, Stefan On 30.07.20 10:00, Julian Reschke wrote: Hi, please be careful when setting fixVersion in Jira. I just fixed a few recently resolved tickets where a change in trunk was advertised to fix 1.2.32, not 1.34.0 as it should have. Best regards, Julian
Re: [DISCUSS] Branching and release: version numbers
+1 Cheers, Stefan On 27.09.19 11:40, Julian Reschke wrote: On 04.03.2019 14:29, Davide Giannella wrote: ... Picking up an old thread... So we've released 1.12.0, 1.14.0, 1.16.0, and will release 1.18.0 next week. What we apparently did not discuss was what the project version for trunk should be in the meantime. So far, we've been using 1.12-SNAPSHOT, etc, and we are on 1.20-SNAPSHOT right now. This however seems incorrect to me; shouldn't it be 1.19-SNAPSHOT? For this release I'd like to avoid any changes, but for future releases I'd like to document that we're using an odd-numbered version. Feedback appreciated, Julian
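The even/odd convention described above can be sketched as a small helper (hypothetical Python, not part of Oak's build tooling):

```python
def next_dev_version(released: str) -> str:
    """Given an even-numbered release such as '1.18.0', return the
    odd-numbered development version trunk should carry next, per the
    convention discussed above (hypothetical helper, not Oak code)."""
    major, minor, _patch = released.split(".")
    assert int(minor) % 2 == 0, "releases use even minor versions"
    return f"{major}.{int(minor) + 1}-SNAPSHOT"
```

So after releasing 1.18.0, trunk would move to 1.19-SNAPSHOT rather than 1.20-SNAPSHOT.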
Intent to backport OAK-8351
Hi, I'd like to backport OAK-8351 [0] to the 1.8 and 1.10 branches unless someone objects. OAK-8351 changes a MongoDB query that was introduced in this form in 1.8. Cheers, Stefan -- [0] https://issues.apache.org/jira/browse/OAK-8351
Re: Intent to backport OAK-6953
+1 Cheers, Stefan On 20.11.17, 09:24, "Marcel Reutegger" wrote: >Hi, > >I'd like to backport OAK-6953 to the maintenance branches. In some cases, >it is desirable to disable a cache, which is not possible with the >current CacheLIRS implementation in Oak. Instead of changing the >CacheLIRS implementation, OAK-6953 uses the Guava Cache implementation >when the cache size is set to zero, which immediately evicts entries when >loaded. > >Regards > Marcel >
Re: single node cluster
Hi Mostafa, I'd suggest narrowing down why that lease update failed, esp if you have it reproducible. By default a lease is updated every 10 seconds and is valid for 2min (both can in theory be changed, but that's not necessarily recommended). Besides the mentioned DB issues, other cases where lease updates failed were JVMs running low on memory and thus doing overly long stop-the-world GCs. If you can rule out both, then here are some more ideas to investigate: a) check for warnings in the form of: "BackgroundLeaseUpdate.execute: time since last renewClusterIdLease() call longer than expected" to see if the lease update became slow already before it finally expired. Perhaps that gives some clues already. b) enable trace logging for 'org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo' to see all details about lease updates happening (or not). c) analyse thread dumps to rule out a blocked lease update thread Cheers, Stefan On 01/08/17 15:45, "Mostafa Mahdieh" wrote: >Hi, > >I'm using jackrabbit oak as the content repository of a document >management >system product. Currently there is no need to scale out, therefore I'm >using jackrabbit oak in a single node environment. However, I'm >experiencing issues related to clustering and lease time, such as the >following exception which is appearing all over my tomcat logs: > >WARN: Background operation failed: >org.apache.jackrabbit.oak.plugins.document.DocumentStoreException: This >oak >instance failed to update the lease in time and can therefore no longer >access this DocumentNodeStore. > >After some research, it seems that there is no way to use jackrabbit oak >forcing it to use a single node and not having any concerns related to >clustering. > >Am I using the right tool? I thought maybe jackrabbit 2 might be better >for >my current use case, however oak seemed as the future of jackrabbit, and >attracted me (adding scalability is also in my future vision). Do you >suggest oak for my usecase or jackrabbit 2? 
How can I adapt oak for a >single node environment without getting issues regarding lease time and >clustering? > >Best Regards >-- >Mostafa Mahdieh
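The lease mechanics Stefan describes (renewal every 10 seconds, validity of 2 minutes) can be sketched like this; the constants mirror the defaults mentioned in the reply, but the code is an illustration, not the actual ClusterNodeInfo implementation:

```python
LEASE_DURATION_MS = 2 * 60 * 1000  # lease is valid for 2 minutes (default)
RENEW_INTERVAL_MS = 10 * 1000      # background thread renews every 10 seconds

def lease_expired(last_renewal_ms: int, now_ms: int) -> bool:
    # If renewals stall (DB outage, long stop-the-world GC pause) for
    # longer than the lease duration, the node loses access and logs the
    # DocumentStoreException quoted above.
    return now_ms >= last_renewal_ms + LEASE_DURATION_MS
```

With these defaults the JVM has to stall for nearly two minutes before the lease lapses, which is why long GC pauses are a prime suspect.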
Re: [discuss] expose way to detect "eventual consistency delay"
On 30/05/17 14:51, "Stefan Egli" <stefane...@apache.org> wrote: >on how Oak could "expose a way to detect the eventual delay". ... "to detect the eventual consistency delay" ... of course ...
[discuss] expose way to detect "eventual consistency delay"
Hi all, I'd like to invite those interested to join a discussion in https://issues.apache.org/jira/browse/OAK-6276 on how Oak could "expose a way to detect the eventual delay". This is a requirement coming from the integration with an external messaging system in an Oak-based application. One way suggested so far is that this could simply be done by exposing a "normalized head revision vector" via a repository descriptor. But let's discuss over in OAK-6276. Thanks, Cheers, Stefan
Re: MongoMK failover behaviour.
Hi, On 04/05/17 16:56, "Justin Edelson" wrote: >>Hmm, depending on the Oak version, this may also be caused by OAK-5528. >> The current fix versions are 1.4.15 and 1.6.0. >> > >Would this show up in thread dumps? Based on the description, it seems >like >it should. Not necessarily. In OAK-5528 the lease update thread goes into performLeaseCheck which will do a 5x1sec retry loop. So if the thread dump is taken during that time one would see it; if taken afterwards, not. Cheers, Stefan
Re: ObservationTest with Thread.sleep()
Hi Marcel, IIUC then the sleeps are used to check for expected *and* unexpected events. The expected part could be easily replaced with a busy-check loop. The unexpected part is a bit more tricky though, but the test could be rewritten to be more of a white-box test where not only both ends are tested but also the middle (observation queue) part; that would work. So I guess yes, the sleeps could be avoided - with a bit of effort though. Cheers, Stefan On 25/04/17 10:56, "Marcel Reutegger" wrote: >Hi, > >there is a test in oak-jcr >(org.apache.jackrabbit.oak.jcr.observation.ObservationTest) with many >Thread.sleep() calls. This means, the test mostly sleeps and slows down >the build. What's the reason for those sleeps and can we somehow remove >them? > >Regards > Marcel
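The busy-check loop suggested for the expected-events half could look like this (generic Python sketch of the pattern, not ObservationTest code):

```python
import time

def wait_until(condition, timeout_sec=10.0, poll_sec=0.05):
    """Poll until the condition holds instead of sleeping a fixed
    amount: the test then completes as soon as the expected events
    have arrived, and only pathological runs pay the full timeout."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(poll_sec)
    return condition()
```

The unexpected-events half cannot be solved this way (absence of an event has no arrival moment to poll for), which is why the white-box rewrite is suggested above.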
Re: [Observation] Should listeners require constant inflow of commits to get all events?
>> >>But agreed, this is a bug and we should fix it. >> >Actually, I'm not too sure as long as we concretely document the >behavior and potentially have a sample abstract >commit-creator/listener which does the job well (may be similar to the >hack I used) I've created OAK-5740 and attached a test case that reproduces this. We can follow up there if/when/how we want to fix this. Cheers, Stefan
Re: ChangeProcessor potentially warns only once for queue being full during its lifetime (without CommitRateLimiter)
+1, looks like a bug to me. Cheers, Stefan On 09/02/17 23:17, "Vikas Saurabh" wrote: >Hi, > >_Disclaimer_ : I get confused with change processor code, so not sure >if this is an issue or PEBKAC > >ChangeProcessor#queueSizeChanged sets blocking flag to true if queue >size is hit (or beyond). The warning "Revision queue is full. Further >revisions will be compacted." is logged only when it *wasn't* >blocking. > >BUT, when queue empties, blocking flag is reset inside if block for >commitRateLimiter!=null. That, to me, seems like >qFull->log->qEmpties->qFull won't log another warn. This sounds wrong >to me. > >Thanks, >Vikas
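A minimal model of the reported behaviour (not the real ChangeProcessor; names simplified) shows why the warning fires only once without a CommitRateLimiter:

```python
class ChangeProcessorSketch:
    """Sketch of the logic described in the mail: the warning is only
    logged when 'blocking' was false, but 'blocking' is only reset
    when a CommitRateLimiter is present."""

    def __init__(self, commit_rate_limiter=None):
        self.commit_rate_limiter = commit_rate_limiter
        self.blocking = False
        self.warnings = 0

    def queue_size_changed(self, queue_full: bool):
        if queue_full:
            if not self.blocking:
                self.warnings += 1  # "Revision queue is full. ..."
            self.blocking = True
        elif self.commit_rate_limiter is not None:
            self.blocking = False  # reset happens only in this branch
```

Without a limiter, the sequence qFull -> qEmpties -> qFull produces a single warning, matching Vikas' observation.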
Re: incomplete diffManyChildren during a persisted branch merge
On 31/01/17 18:07, "Stefan Egli" <stefane...@apache.org> wrote: >I'm following up on a failure case in oak 1.2.14 where as part of a >persisted >branch merge commit hooks do not propagate through all affected changes, >resulting in an inconsistent state. >https://issues.apache.org/jira/browse/OAK-5557 I believe the problem is indeed related to a rebase that happens before merging a persisted branch. The diffManyChildren subsequently takes the rebased revision timestamp as the minValue, instead of taking the branch's previous purges into account. This seems to (only) occur when between the last purge and the actual merge another session does a merge. One possible fix I see is to detect such a situation (in diffManyChildren) - i.e. check if one of the revisions is a branch revision - and fall back to not using the _modified index then. This will definitely find all potential child nodes - but it has the downside that it becomes slow/doesn't scale well with a very large list of child nodes. Other ideas? Cheers, Stefan
Re: incomplete diffManyChildren during a persisted branch merge
On 01/02/17 09:16, "Marcel Reutegger" wrote: >I think in trunk the code path is also a bit different because of >OAK-4528. It may be possible that the issue still exists in trunk, but >does not call diffManyChildren() anymore. > >What happens when you disable the journal diff mechanism in trunk with >-Doak.disableJournalDiff=true ? Good idea, however that alone doesn't let the test fail yet, as both the local diff cache as well as the node children cache avoid diffManyChildren from being used. But if I use brute force and bypass those two caches explicitly - and at the same time increase the test size by increasing # of nodes - then the test fails on trunk too. So indeed trunk seems to avoid this problem as it doesn't go into diffManyChildren for the cases triggered by the test. Cheers, Stefan
incomplete diffManyChildren during a persisted branch merge
Hi, I'm following up on a failure case in oak 1.2.14 where as part of a persisted branch merge commit hooks do not propagate through all affected changes, resulting in an inconsistent state. It's unclear how realistic this scenario is and/or if it's relevant, but I was able to produce such a scenario in a test case. The interesting thing is that it's quite easily reproducible in 1.2.14, whereas later in the 1.2 branch it takes longer for the test (which loops until it fails) to fail. Also, it doesn't fail in trunk even after eg 500 iterations. Does this ring a bell with anyone - diffManyChildren / wrong _modified calculation / branch - perhaps this was fixed in trunk a while ago and not backported? Cheers, Stefan -- https://issues.apache.org/jira/browse/OAK-5557
Re: Detecting if setup is a cluster or a single node via repository Descriptors
Hi Chetan, I think the discoverylite and the new 'clustered' property options have different characteristics. The former describes the current status of the cluster, irrespective of whether it can be clustered at all, while the latter is about a capability: whether the node store supports clustering or not. And assuming that you're after the capability 'cluster support' alone, then I think handling this separately is indeed more appropriate. Cheers, Stefan On 15/11/16 10:37, "Chetan Mehrotra" wrote: >Hi Team, > >For OAK-2108 Killing a cluster node may stop async index update for up to >30 minutes. > >One possible fix can be that AsyncIndexUpdate can determine if the >repository is part of a cluster or it's a single instance. In case it's a >single instance we can reduce the timeout as it's known that there are >no other processes involved. > >Currently for SegmentNodeStore a Descriptor with name >'oak.discoverylite.clusterview' is registered whose value is as below > >--- >{"seq":1,"final":true,"me":1,"id":"80a1fb91-83bc-4eac-b855-53d7b8a04092"," >active":[1],"deactivating":[],"inactive":[]} >--- > >AsyncIndexerService can get access to 'Descriptors' and look for that >key and check if 'active' is 1. > >However there should be a better way to detect this. Can we have an >explicit descriptor defined say OAK_CLUSTERED having boolean value. A >false means it's not a cluster while true means it "might" be part of a >cluster. > >Thoughts? > >Chetan Mehrotra
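The detection Chetan sketches — read the 'oak.discoverylite.clusterview' descriptor value and check the 'active' list — could look like this (illustrative Python based on the JSON quoted in the mail; the proposed explicit 'clustered' descriptor would make this heuristic unnecessary):

```python
import json

# Example descriptor value quoted in the mail above
DESCRIPTOR = ('{"seq":1,"final":true,"me":1,'
              '"id":"80a1fb91-83bc-4eac-b855-53d7b8a04092",'
              '"active":[1],"deactivating":[],"inactive":[]}')

def looks_like_single_node(descriptor_json: str) -> bool:
    # Treat the repository as single-node when exactly one cluster id
    # is currently active.
    view = json.loads(descriptor_json)
    return len(view.get("active", [])) == 1
```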
Re: [REVIEW] OAK-4908 in 1.5.13: prefiltering (enabled by default)
FYI: Assuming lazy consensus I've now committed this one to unblock 1.5.13. We can do a post-review if needed. Cheers, Stefan On 04/11/16 15:59, "Stefan Egli" <stefane...@apache.org> wrote: >Hi, > >I'd like to commit OAK-4908 which would introduce prefiltering for >observation listeners. This is based on OAK-4907 (population of a >ChangeSet >during the commit) and OAK-4916 (FilteringObserver-wrapper for the >BackgroundObserver) - and it works fine with the new filters (OAK-5019-23) >too. > >The reason I raise this on the list is that this is quite a change and it >would thus be good if there was an agreement that we want this in for >1.5.13 >(Monday). I know it's a bit of a tight schedule, but I think it would be good >to have that in to allow for more testing in real life scenarios. I've >thus >marked it a blocker for 1.5.13. If you disagree, pls let me know. > >Wdyt? > >Cheers, >Stefan > >
Re: [REVIEW][API] Additions to JackrabbitEventFilter
(+oak-dev as it meanwhile has moved to oak features alone) I've committed OAK-5013 which introduces the OakEventFilter extension and a number of such extensions (OAK-5019, OAK-5020, OAK-5021, OAK-5022, OAK-5023). While they should all in principle work, I don't consider them done yet, as the test coverage is minimal and there's room for code(-style) improvement. But the point of this heads-up is about the API of OakEventFilter, which should ideally not have to change anymore, so if you're interested pls have a look. Cheers, Stefan On 26/10/16 19:09, "Stefan Egli" <stefane...@apache.org> wrote: >On 26/10/16 16:48, "Michael Dürig" <mdue...@apache.org> wrote: > >>... Just ensure we expose the required >>functionality on the Oak side as a proper API. That is, interface and >>utility only and proper package versioning... > >Opened OAK-5013 for that which is just about the API. >(it's a beautified version of the previous patches) > >Cheers, >Stefan > >
Re: globbing: oak style vs sling style
I've created https://issues.apache.org/jira/browse/OAK-5039 to follow up Cheers, Stefan On 31/10/16 14:18, "Stefan Egli" <stefane...@apache.org> wrote: >Hi, > >As being discussed in [0] in OAK-5021 there are 2 different ways how >globbing is currently defined in Oak vs in Sling. In Oak globbing is >restricted to ** being 0-n path elements and * being 1 path element, while >in Sling it is more generic in that * means 0-n characters excluding path >boundaries. > >IIUC then the GlobbingPathFilter is basically where Oak implements this >and >it looks like this is not yet exposed, as that's internal to observation >filtering only. > >So my suggestion would be to simply extend the GlobbingPathFilter's >globbing >definition to match that of Sling. > >Any objections? > >Cheers, >Stefan >-- >[0] - >https://issues.apache.org/jira/browse/OAK-5021?focusedCommentId=15622005&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15622005 >[1] - >https://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/plugins/observation/filter/GlobbingPathFilter.html > > >
globbing: oak style vs sling style
Hi, As being discussed in [0] in OAK-5021 there are 2 different ways how globbing is currently defined in Oak vs in Sling. In Oak globbing is restricted to ** being 0-n path elements and * being 1 path element, while in Sling it is more generic in that * means 0-n characters excluding path boundaries. IIUC then the GlobbingPathFilter is basically where Oak implements this and it looks like this is not yet exposed, as that's internal to observation filtering only. So my suggestion would be to simply extend the GlobbingPathFilter's globbing definition to match that of Sling. Any objections? Cheers, Stefan -- [0] - https://issues.apache.org/jira/browse/OAK-5021?focusedCommentId=15622005&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15622005 [1] - https://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/plugins/observation/filter/GlobbingPathFilter.html
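The difference in '*' semantics can be illustrated by translating both dialects to regexes — a sketch based purely on the definitions in this mail (Oak's '*' matches exactly one path element, Sling's matches 0-n characters within an element); '**' handling is omitted:

```python
import re

def oak_star_regex(glob: str) -> str:
    # Oak semantics per the mail: '*' is exactly one path element.
    return "^" + "/".join(
        "[^/]+" if seg == "*" else re.escape(seg)
        for seg in glob.split("/")
    ) + "$"

def sling_star_regex(glob: str) -> str:
    # Sling semantics per the mail: '*' is 0-n characters, never
    # crossing a path boundary ('/').
    return "^" + re.escape(glob).replace(r"\*", "[^/]*") + "$"
```

So 'a/*' matches 'a/' under Sling semantics but not under Oak semantics, and Sling additionally allows partial-element patterns such as 'a/*.txt'.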
Re: [observation] more options in JackrabbitEventFilter
On 13/09/16 15:27, "Davide Giannella" <dav...@apache.org> wrote: >On 12/09/2016 09:48, Stefan Egli wrote: >> IIUC then EventListeners are registered via either JCR's >> ObservationManager or Jackrabbit's extension at [0]. If you want to do >> this in Oak (ie not in Jackrabbit) then would you extend Oak's >> ObservationManager ([1]) directly? >Didn't look at the code and didn't think all the implications. > >Would it be an option to expose, >javax.jcr.oak.observation.OakObservationManager that extends >javax.jcr.observation.ObservationManager in which we expose what need? Right, there's probably two options: # (oak) add another variant of addEventListener to OakObservationManager ([0]) # (jackrabbit) integrate that into the JackrabbitEventFilter ([1]) I guess it comes down to API design and cleanliness, I don't have any preference. Cheers, Stefan -- [0] - https://github.com/apache/jackrabbit-oak/blob/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/observation/ObservationManagerImpl.java#L179 [1] - https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-api/src/main/java/org/apache/jackrabbit/api/observation/JackrabbitEventFilter.java
Re: [observation] more options in JackrabbitEventFilter
Hi Davide, On 08/09/16 14:24, "Davide Giannella" wrote: >On 07/09/2016 14:04, Michael Dürig wrote: >> No not open them. But make their functionality available through an >> API. Since JCR is dead (hint hint) we probably have to come up with an >> ad-hoc API here. >FWIW, I'm for exposing this aspect as Oak API. Would be fine for me, however, how would you do that? IIUC then EventListeners are registered via either JCR's ObservationManager or Jackrabbit's extension at [0]. If you want to do this in Oak (ie not in Jackrabbit) then would you extend Oak's ObservationManager ([1]) directly? Cheers, Stefan -- [0] - https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-api/src/main/java/org/apache/jackrabbit/api/observation/JackrabbitObservationManager.java#L26 [1] - https://github.com/apache/jackrabbit-oak/blob/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/observation/ObservationManagerImpl.java > >Then in Oak we implement few Filters for the already existing mechanism, >so that the jcr layer can map the JCR API as Oak api. > >An application that needs to have complex filtering, will have to >leverage the Oak API. > >Don't know whether it will be possible for an application to leverage >*both* JCR and Oak APIs but I'm sure there are ways around it. > >Cheers >Davide > >
Re: [observation] pure internal or external listeners
On 02/09/16 13:41, "Stefan Egli" <stefane...@apache.org> wrote: >On 02/09/16 13:26, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote: > >>Listener for local Change >>-- >> >>Such a listener is more particular about type of change and is doing >>some persisted state change i.e. like registering a job, invoking some >>third party service to update the value. This listener is only >>interested in local as it knows the same listener is also active on other >>cluster nodes (homogeneous cluster setup) so if a node gets added it >>only needs to react on the cluster node where it got added. > >One thing this reminds me of is a use-case where you have say 3 cluster >nodes, each one handling mainly local events lets say. All fine. Then 1 >node crashes while likely its (local) observation queue wasn't entirely >empty. Those events would then probably not get handled by anyone (and >that node wouldn't necessarily be restarted as the cluster continues >normally, it could be restarted as a new clusterNodeId..). So maybe >there's an issue there. I think this should be handled the same as today with (non-journaled) listeners losing events on any crash: either upon restart or when an instance leaves the cluster (which can be noticed eg via Sling's Discovery API) someone (preferably the leader) should handle this and do a repository scan of whatever interesting the crashing instance might have stored. Lacking journaled observation, that's probably the way to go. Cheers, Stefan
Re: [observation] pure internal or external listeners
Hi Chetan, (see below) On 02/09/16 13:26, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote: >On Fri, Sep 2, 2016 at 4:00 PM, Stefan Egli <stefane...@apache.org> wrote: >> If we >> separate listeners into purely internal vs external, then a queue as a >>whole >> is either purely internal or external and we no longer have this issue. > >Not sure here on how this would work. The observation queue is made up >of ContentChange which is a tuple of [root NodeState , CommitInfo >(null for external)] > >--- NS1-L---NS2-L--NS3---NS4-L---NS5-L ---NS6-L > >--- a /a/b - /a/c --- /a/c > /a/b /a/b >/a/d > >So if we dedicate a queue for local changes only what would happen. > >If we drop NS3 then while diffing [NS2-L, NS4-L] /a/c would be >reported as "added" and "local". Now we have a listener which listens >for locally added nt:file nodes such that it can start some processing job >for it. Such a listener would then think it's a locally added node and >would start a duplicate job. Good point. We could probably fix this though by not only storing 1 root NodeState per ContentChange, but store 2: a 'from' and a 'to' (the 'from' is currently implicit, as that's taken from the previous entry, but if we skip entries, then it needs to be re-added). So with that, we could safely drop external changes as 'uninterested' and diffing would still report the correct thing. > >In general I believe > >Listener for external Change >-- >listeners which are listening for external changes are maintaining some >state and purge/refresh it upon detecting change in interested paths. >They would work fine if multiple content change occurrences are merged > >[NS4-L, NS5-L] + [NS5-L,NS6-L] = [NS4, NS6] (external) as they would >still detect the change > >An example of this is LuceneIndexObserver which sets queue size to 5 >and does not care whether it's local or not. 
It is just interested in whether the index >node is updated > >Listener for local Change >-- > >Such a listener is more particular about type of change and is doing >some persisted state change i.e. like registering a job, invoking some >third party service to update the value. This listener is only >interested in local as it knows the same listener is also active on other >cluster nodes (homogeneous cluster setup) so if a node gets added it >only needs to react on the cluster node where it got added. One thing this reminds me of is a use-case where you have say 3 cluster nodes, each one handling mainly local events lets say. All fine. Then 1 node crashes while likely its (local) observation queue wasn't entirely empty. Those events would then probably not get handled by anyone (and that node wouldn't necessarily be restarted as the cluster continues normally, it could be restarted as a new clusterNodeId..). So maybe there's an issue there. > >So for such it needs to be ensured that mixed content changes are not >compacted. So it's fine to > >[NS4-L, NS5-L] + [NS5-L,NS6-L] = [NS4, NS6] (can be treated as >local with loss of user identity which caused the change) >[NS2-L, NS3]+ [NS3, NS4-L] = [NS2-L, NS4-L] (cannot be treated as >local) I think keeping the 'from/to' tuple instead of just 1 root NodeState would make the above picture more simple. Cheers, Stefan > >Just thinking out loud here to understand the problem space better :) > >Chetan Mehrotra
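The 'from/to' proposal can be sketched as follows (hypothetical names, not Oak's actual ContentChange class), encoding the merge rules from the exchange above:

```python
class ChangePair:
    """A change with an explicit 'from' and 'to' root state, instead of
    deriving 'from' implicitly from the previous queue entry - so
    entries can be skipped or merged safely (hypothetical sketch)."""
    def __init__(self, frm, to, local):
        self.frm, self.to, self.local = frm, to, local

def merge(a, b):
    assert a.to == b.frm, "changes must be adjacent"
    # Two local changes merge into a local one (user identity is lost);
    # any merge involving an external change cannot be treated as local.
    return ChangePair(a.frm, b.to, a.local and b.local)
```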
Re: [observation] pure internal or external listeners
Perhaps for backwards compatibility we could auto-create 2 listeners for the case where a listener is registered without ExcludeInternal or ExcludeExternal - and issue a corresponding, loud, WARN. On 02/09/16 12:30, "Stefan Egli" <stefane...@apache.org> wrote: >Hi, > >As you're probably aware there are currently several different issues >being >worked upon related to the observation queue limit problem ([0], epic >[1]). >I wanted to discuss yet another improvement and first ask what the list >thinks. > >What about requiring observation listeners to either consume only internal >or only external events, but never both together; we wouldn't support that >anymore. (And if you're in a cluster you want to be very careful with >consuming external events in the first place - but that's another topic) > >The root problem of the 'queue hitting the limit' as of today is that it >throws away the CommitInfo, thus doesn't know anymore if it's an internal >or >an external event (besides actually losing the CommitInfo details). If we >separate listeners into purely internal vs external, then a queue as a >whole >is either purely internal or external and we no longer have this issue. We >could continue to throw away the CommitInfo (or avoid that using a >persisted >obs queue ([2])), but we could then still say with certainty if it's an >internal or an external event. > >A user that would want to receive both internal and external events could >simply create two listeners. Those would both receive events as expected. >The only difference would be that the two streams of events would not be in >sync - but I doubt that this would be a big loss. > >We could thus introduce 'ExcludeInternal' and demand in >ObservationManager.addEventListener that the listener is flagged with one >of >ExcludeInternal or ExcludeExternal. > >Wdyt? 
> >Cheers, >Stefan >-- >[0] - https://issues.apache.org/jira/browse/OAK-2683 >[1] - https://issues.apache.org/jira/browse/OAK-4614 >[2] - https://issues.apache.org/jira/browse/OAK-4581 > > >
[wip][review] persistent observation queue - OAK-4581
Hi, As an FYI: I'm working on persisting the observation queue - OAK-4581 - and have attached a patch and a comment [0] to the ticket indicating current progress. Would welcome some early feedback/review. The main idea is that it would introduce a 'PersistedBlockingQueue' that would be plugged (as the 'queue') into the BackgroundObserver, which can then remain largely unchanged. The whole logic of persisting is thus hidden in the PersistedBlockingQueue. PS1: The stored data would all be discarded on restart - this is just to work around the 'limit' aspect of the current in-memory queue at runtime. Nothing related to journaled observation. PS2: The current v0 implementation is a bit dumb, early progress - no tests, no generational-gc, not much batching/caching, but already uses a secondary (or is that thirdary?) SegmentNodeStore just for storing ContentChange objs. Thanks! Cheers, Stefan -- [0] https://issues.apache.org/jira/browse/OAK-4581?focusedCommentId=15452460&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15452460
Re: [suggestion] introduce oak compatibility levels
Hi Michael, On 28/07/16 10:54, "Michael Marth" wrote: >I think we should simply stick to SemVer of the released artefacts to >signal those changes to upstream. IIUC the difference would be that one version (eg oak 1.6) could contain multiple compatibility versions (eg 1.2/1.4) - some perhaps marked as deprecated - whereas using SemVer you'd have to have multiple versions of oak concurrently in an OSGi stack (which is likely not going to work) to achieve the same. Compatibility levels would be more flexible than SemVer. >On the more specific topic of session behaviour: could we use session >attributes to let the app specify session behaviour? [1] Yes, that would work too. >[1] >https://docs.adobe.com/docs/en/spec/javax.jcr/javadocs/jcr-2.0/javax/jcr/Session.html#getAttribute(java.lang.String) Cheers, Stefan
Re: Requirements for multiple Oak clients on the same backend (was: [suggestion] introduce oak compatibility levels)
Don't have an answer, but there was a similar question recently on this list: "Does Oak core check the repository version ?" http://markmail.org/thread/sbvjydwdu3g2eze5 Cheers, Stefan On 28/07/16 10:45, "Bertrand Delacretaz" <bdelacre...@apache.org> wrote: >Hi, > >On Thu, Jul 28, 2016 at 10:23 AM, Stefan Egli <stefane...@apache.org> >wrote: >>...we could introduce a concept of >> 'compatibility levels' which are a set of features/behaviours that a >> particular oak version has and that application code relies upon > >Good timing, I have a related question about multiple client apps >connecting to the same Oak backend. > >Say I have two Java apps A and B which use the same Oak/Mongo/BlobStore >configuration, are there defined requirements as to the Oak library >versions or other settings that A and B use? > >Do they need to use the exact same versions of the Oak bundles, and >are violations to that or to other compatibility requirements >detected? > >-Bertrand
Re: [suggestion] introduce oak compatibility levels
(typo) On 28/07/16 10:23, "Stefan Egli" <stefane...@apache.org> wrote: >One concrete case where this could have been useful is the >backwards-compatible behaviour where a session is auto-refreshed when >changes are done in another session. .. in the same thread, that is ..
Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)
I'd go for #A to limit cross-effects between oak and other layers. The reason one would want to use the default pool for #4 is probably the idea that you'd want to avoid "wasting" a thread in the oak-thread-pool and rather rely on a shared one. But arguably, that should be an optimization of the thread pool provider itself: that provider could be more intelligent and allocate threads from an under-used pool elsewhere - if that were more performant. But from a logical point of view, I'd argue it's better to have an oak-dedicated thread-pool. Cheers, Stefan On 19/07/16 10:06, "Chetan Mehrotra" wrote: >On Tue, Jul 19, 2016 at 1:21 PM, Michael Dürig wrote: >> Not sure as I'm confused by your description of that option. I don't >> understand which of 1, 2, 3 and 4 would run in the "default pool" and >>which >> should run in its own dedicated pool. > >#1, #2 and #3 would run in dedicated pool and each using same pool. >Pool name would be 'oak'. Also see OAK-4563 for the patch > >While for #4 default pool would be used as those are non blocking and >short tasks > >Chetan Mehrotra
Re: Requirement to support multiple NodeStore instance in same setup (OAK-4490)
On 22/06/16 12:21, "Chetan Mehrotra" wrote: >On Tue, Jun 21, 2016 at 4:52 PM, Julian Sedding >wrote: >> Not exposing the secondary NodeStore in the service registry would be >> backwards compatible. Introducing the "type" property potentially >> breaks existing consumers, i.e. is not backwards compatible. > >I had similar concern so proposed a new interface as part of OAK-4369. >However later with further discussion realized that we might have >similar requirement going forward i.e. presence of multiple NodeStore >impl so might be better to make setup handle such case. > >So at this stage we have 2 options > >1. Use a new interface to expose such "secondary" NodeStore >2. OR Use a new service property to distinguish between different roles > >Not sure which one to go for. Maybe we go for a merged approach, i.e. have a new >interface as in #1 but also mandate that it provides its "role/type" >as a service property to allow client to select correct one > >Thoughts? If the 'SecondaryNodeStoreProvider' is a non-public interface which can later 'easily' be replaced with another mechanism, then for me this would sound more straightforward at this stage as it would not break any existing consumers (as mentioned by Julian). Perhaps once those 'other use cases going forward' of multiple NodeStores become more clear, then it might be more obvious as to how the generalization into perhaps a type property should look like. my 2cents, Cheers, Stefan
Re: [VOTE] Please vote for the final name of oak-segment-next
Hi, On 26/04/16 14:00, "Thomas Mueller" wrote: >I would keep the "oak-segment-*" name, so that it's clear what it is based >on. So: > >-1 oak-local-store >-1 oak-embedded-store > >+1 oak-segment-* > >Within the oak-segment-* options, I don't have a preference. +1 (I do like 'oak-segment-v2' though, so +1 to that too) Cheers, Stefan
Re: [discuss][scalability] oak:asyncConflictResolution
Hi, On 21/03/16 21:23, "Michael Dürig" wrote: > There is org.apache.jackrabbit.oak.spi.commit.PartialConflictHandler and > a couple of its implementations already. Maybe this could be leveraged > here by somehow connecting it to the mix-ins you propose. Yes, I think it should be something like a PartialConflictHandler that is either configurable or customizable. On 22/03/16 11:35, "Davide Giannella" wrote: > I'd go for the mixin, with a default chain/order of conflict resolution > and allow to define such in a multivalue property. So that in case > needed the user can define its own chain of conflict resolution, or even > custom one if needed. Right, sounds like a mixin rather than (just) a property would be more appropriate. Cheers, Stefan
Re: [discuss][scalability] oak:asyncConflictResolution
On 21/03/16 21:03, "Stefan Egli" <stefane...@apache.org> wrote: >...a third one could again be 'strict' (which would correspond to JCR >semantics >as are the default today) .. actually that would not be possible asynchronously, scratch that.. Cheers, Stefan
[discuss][scalability] oak:asyncConflictResolution
Hi oak-devs, tl;dr: the suggestion is to introduce a new property (or mixin) that enables async merge for a subtree in a cluster case and at the same time pre-defines the conflict resolution, since conflicts currently prevent trouble-free async merging. In case this has been discussed/suggested before, please point me to the discussion; in case not, here's the suggestion: When it comes to handling conflicts we either deal with them in a synchronous way (we throw a CommitFailedException right away) or have no feasible/implemented solution to handle them asynchronously (we'd have the possibility of leaving :conflict markers persisted, which would in theory allow asynchronous merges, but so far we don't have anything built on top of that). In any case, for cluster scalability it's critical that we avoid 'synchronous' checks and instead switch to asynchronous merging wherever possible: while for some parts of the content (eg '/var') it is always necessary to have synchronous checks, the assumption is that other areas (eg '/content') might well live with something asynchronous - as normally no conflicts occur, and if they do, a predefined resolution scheme that then kicks in is fine. One way to tackle this would be to mark nodes (and thus implicitly their subtrees) in a way that says "from here on below it's ok to do asynchronous conflict resolution of type X". This could be solved by introducing an explicit marker in the form of eg a mixin or a property 'oak:asyncConflictResolution' (that could either refer to a globally defined resolution or further detail 'how' that resolution should look). If a transaction involved both normal as well as async conflict resolution, then not much is gained, as you'd still have to do conflict checks at least for the 'normal/sync' part. But if the expectation is that there are transactions that touch only such async-marked areas, then you can avoid the synchronous checks.
Examples of these pre-defined resolutions are: 'delete-wins, then latest-change-wins' (which might be the easiest), or 'latest-change-wins' (which might be more tricky, as that would mean the 'changeDeleted' cases would resurrect deleted data magically - possible but perhaps too magic); a third one could again be 'strict' (which would correspond to JCR semantics, as they are the default today) - or again 'no-resolution-but-persist-conflict-marker' etc... Having such a pre-defined conflict resolution while at the same time clearly indicating that doing conflict-checking asynchronously is OK would allow truly parallel writes into the NodeStore from different instances' pov. Wdyt? Cheers, Stefan
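A minimal sketch of the first resolution chain above ('delete-wins, then latest-change-wins'). All types here are simplified stand-ins, not the actual Oak conflict-handler API (PartialConflictHandler and friends), and serve only to show the decision order:

```java
public class AsyncResolutionSketch {

    enum Resolution { OURS, THEIRS }

    // Stand-in for one side of a conflict: either side may be a deletion,
    // and each side carries the timestamp of its last change.
    static final class Change {
        final boolean isDelete;
        final long timestamp;
        Change(boolean isDelete, long timestamp) {
            this.isDelete = isDelete;
            this.timestamp = timestamp;
        }
    }

    // delete-wins first; if neither (or both) sides delete, latest-change-wins.
    static Resolution resolve(Change ours, Change theirs) {
        if (ours.isDelete != theirs.isDelete) {
            return ours.isDelete ? Resolution.OURS : Resolution.THEIRS;
        }
        return ours.timestamp >= theirs.timestamp ? Resolution.OURS : Resolution.THEIRS;
    }

    public static void main(String[] args) {
        // our delete beats their later edit
        System.out.println(resolve(new Change(true, 100), new Change(false, 200)));
        // neither side deletes: the later change wins
        System.out.println(resolve(new Change(false, 100), new Change(false, 200)));
    }
}
```

The point of such a chain is that it is deterministic: every cluster node applying it asynchronously arrives at the same winner without coordination.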
Re: oak-resilience
Hi Tomek, Would also be interesting to see the effect on the leases and thus discovery-lite under high memory load and network problems. Cheers, Stefan On 04/03/16 11:13, "Tomek Rekawek" wrote: >Hello, > >For some time I've worked on a little project called oak-resilience. It >aims to be a resilience testing framework for the Oak. It uses >virtualisation to run Java code in a controlled environment, that can be >spoilt in different ways, by: > >* resetting the machine, >* filling the JVM memory, >* filling the disk, >* breaking or deteriorating the network. > >I described currently supported features in the README file [1]. > >Now, once I have a hammer I'm looking for a nail. Could you share your >thoughts on areas/features in Oak which may benefit from being >systematically tested for the resilience in the way described above? > >Best regards, >Tomek > >[1] >https://github.com/trekawek/jackrabbit-oak/tree/resilience/oak-resilience > >-- >Tomek Rękawek | Adobe Research | www.adobe.com >reka...@adobe.com >
Re: OAK-4006 : Enable cloning of repo for shared data store and discovery-lite
Thanks for the various comments and review on OAK-4006. I've attached a final version of the patch and will push that later this afternoon (together with OAK-4007) unless I hear fresh concern. Cheers, Stefan On 11/02/16 20:16, "Stefan Egli" <stefane...@apache.org> wrote: >Hi all, > >The recent clusterId-discussions around OAK-3935 together with the cloning >problem it shares with discovery.oak made me rethink the current >two-clusterId-approach. After some offline discussions with Thomas and >Marcel I've created OAK-4006 which suggests reusing the SharedDataStore >way >of a hidden :clusterId property, providing a dedicated 'after clone' >offline >reset tool in oak-run and using that same clusterId also in discovery-lite >(thus discovery.oak). This should leave us with only 1 clusterId in the >stack. > >Since 1.4 will be the first to support discovery.oak, and to allow for >enough testing, it would be important to have this in 1.3.16. I will >therefore work on a patch tomorrow and would highly appreciate comments on >the approach and patch. If +1-ed It should delay 1.3.16 a few hours or a >day. > >https://issues.apache.org/jira/browse/OAK-4006 > >Cheers, >Stefan > >
Re: Oak 1.3.16 release plan
Hi Davide, As mentioned on the list, OAK-4006 is in discussion and in the works. So, depending on the outcome it might require a small delay. Cheers, Stefan On 11/02/16 11:45, "Davide Giannella" wrote: >Hello team, > >I'm planning to cut Oak 1.3.16 on Monday 15th February more or less 10am >GMT. > >If there are any objections please let me know. Otherwise I will >re-schedule any non-resolved issue for the next iteration. > >Thanks >Davide > >
Re: OAK-4006 : Enable cloning of repo for shared data store and discovery-lite
On 11/02/16 20:29, "Vikas Saurabh" wrote: >we'd really have to shout in the >documentation that after this, clone use-case requires >oak-run->reset_id Agreed. (Side note: but we'd otherwise have had to do that for OAK-3935, right?) > (I'm assuming that the approach obviates the need to >delete sling id file) Not sure about this one. Deleting sling.id.file is still required and likely a separate task, as that's on the Sling level and you can't combine it into the oak-run tool from a separation-of-concerns pov. Cheers, Stefan
OAK-4006 : Enable cloning of repo for shared data store and discovery-lite
Hi all, The recent clusterId-discussions around OAK-3935 together with the cloning problem it shares with discovery.oak made me rethink the current two-clusterId-approach. After some offline discussions with Thomas and Marcel I've created OAK-4006 which suggests reusing the SharedDataStore way of a hidden :clusterId property, providing a dedicated 'after clone' offline reset tool in oak-run and using that same clusterId also in discovery-lite (thus discovery.oak). This should leave us with only 1 clusterId in the stack. Since 1.4 will be the first to support discovery.oak, and to allow for enough testing, it would be important to have this in 1.3.16. I will therefore work on a patch tomorrow and would highly appreciate comments on the approach and patch. If +1-ed It should delay 1.3.16 a few hours or a day. https://issues.apache.org/jira/browse/OAK-4006 Cheers, Stefan
Re: OAK-4006 : Enable cloning of repo for shared data store and discovery-lite
On 11/02/16 20:42, "Vikas Saurabh" wrote: >probably I mis-understood sling id file as >cluster id... while I think that's persistent instance id, right? Correct. Cheers, Stefan
Re: travis needs more memory
On 10/02/16 14:59, "Davide Giannella" <dav...@apache.org> wrote: >On 10/02/2016 10:22, Stefan Egli wrote: >> Re NonLocalObservationIT, that one creates like 160'000 nodes in-memory >> and that seems not to fit the default VM settings. > >Shall we move this to a SegmentFixture? Or DocumentFixture because the test needs to simulate a cluster. It looks like OAK-3803 removed 'cluster support' from the NodeStoreFixtures - they all return null now in createNodeStore(clusterNodeId) - which I originally worked around by creating a new in-memory fixture. But yes, actually the test should run against mongo. Was there a particular reason for removing cluster support in OAK-3803? Cheers, Stefan
Re: travis needs more memory
Re NonLocalObservationIT, that one creates like 160'000 nodes in-memory and that seems not to fit the default VM settings. Re the other test (ConcurrentAddIT) I don't know. Cheers, Stefan On 10/02/16 09:04, "Marcel Reutegger" <mreut...@adobe.com> wrote: >Hi, > >this may solve the immediate issue with the test >failure, but it probably also hides an memory problem >with our tests. in the past I tried to first identify >and fix memory leaks and only then increase the heap >if really necessary. do you know what is holding on >to the memory? > >Regards > Marcel > >On 09/02/16 19:17, "Stefan Egli" wrote: > >>Hi, >> >>Looks like we need to give our travis run [0] more memory. OAK-3986 was >>likely partly slowing down due to memory becoming low. Now it looks like >>ConcurrentAddIT is failing [1] for the same reason too (can reproduce >>this >>locally: default memory settings result in OOME). I'm guessing adding >>this >>to the .travis.yml would do the trick? >> >>env: >> >>global: >> >>- JAVA_OPTS="-Xmx1G" >> >> >>Cheers, >>Stefan >>-- >>[0] - https://travis-ci.org/apache/jackrabbit-oak/builds >>[1] - >>Running org.apache.jackrabbit.oak.jcr.ConcurrentAddIT >>No output has been received in the last 10 minutes, this potentially >>indicates a stalled build or something wrong with the build itself. >> >>The build has been terminated >> >> >
travis needs more memory
Hi, Looks like we need to give our travis run [0] more memory. OAK-3986 was likely partly slowing down due to memory becoming low. Now it looks like ConcurrentAddIT is failing [1] for the same reason too (can reproduce this locally: default memory settings result in OOME). I'm guessing adding this to the .travis.yml would do the trick?

env:
  global:
    - JAVA_OPTS="-Xmx1G"

Cheers, Stefan -- [0] - https://travis-ci.org/apache/jackrabbit-oak/builds [1] - Running org.apache.jackrabbit.oak.jcr.ConcurrentAddIT No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself. The build has been terminated
Re: [discuss] persisting cluster (view) id for discovery-lite-descriptor
Having thought about and discussed this some more.. an even simpler solution is: d) the discovery-lite descriptor *can* contain an id, in which case it should be used. But *neither tarMk nor mongoMk set this*. + The advantage is that tarMk and mongoMk then behave the same, and even similar to discovery.impl: discovery.oak stores a 'clusterId' property under /var/discovery/oak, thus being easily visible/manageable in all cases. - The disadvantages are in the same area that led to choosing c) originally: conceptually, defining the id and who is a member etc. are all aspects of the same concern and should not be separated, as otherwise you open the door for possible inconsistencies between these aspects. So if this is separated, it needs to be seen as a trade-off with what is gained, namely easier visibility and manageability of this id. One known place where this separation, and thus loss of synchronization, can be a problem is the first time the id is defined. That should however be handled by mongoMk's conflict handling. Another potential place is when this id is redefined (eg deleted). That must be managed separately and is one consequence of d) versus c). At this stage I'm not seeing any other negative consequences, so overall d) still sounds better than c). Unless I hear vetoes, I'd implement this change before tomorrow's 1.3.15 release (also in OAK-3672, which I'll then rename). Cheers, Stefan On 27/01/16 10:45, "Stefan Egli" <stefane...@apache.org> wrote: >Hi, > >Following up on the OAK-3672 discussion again, and taking a step back, I >see three possible classes of solutions: > >a) the (cluster)id is always defined by discovery-lite, be it cluster or >singlevm >b) the (cluster)id is entirely removed and it is up to discovery.oak (in >sling) to define it >c) the (cluster)id is only set by discovery-lite when feasible, eg only >for the cluster case > >I'm in favour of c) with the following arguments: >* a) requires tarMk (!) to store this id somewhere.
It can either store it >in the filesystem (which makes failover support harder), store it as a >hidden property in the node store (which is not manageable as it's hidden) >or store it as a normal property in the repository (which sounds hacky, as >discovery-lite is in the NodeStore layer while this would require it to >simulate writing a JCR property) >* removing the id altogether (b) would be going too far imv: the logical >unit that defines the cluster view (its members) is the best place to also >define an id for that unit. And that logical unit is discovery-lite in >this case. >* what speaks for returning null for the singleVm case (c) is the fact >that it is a special case (it is not a cluster). So treating the special >case separately doesn't break the separation of concern rule in my view. >(c) would imply that the id is set when we're in a cluster case, and not >otherwise (but that would not be a hard requirement, the specification >would just be that the id *can* be null). > >So long story short: I suggest to change the definition of this id so that >it *can* be null - in which case upper layers must define their own id. >Which means Sling's discovery.oak would then store a clusterId under >/var/discovery/oak. That would automatically support cold-standby/failover >- fix the original bug - and simplify cleaning this property up for the >clone case (as that would correspond to how this case was dealt with in >discovery.impl times already). > >WDYT? > >Cheers, >Stefan > >On 26/11/15 11:32, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote: > >>On Thu, Nov 26, 2015 at 3:56 PM, Stefan Egli <e...@adobe.com> wrote: >>> which would >>> then be on the Sling level thus could more simply use the slingId. >> >>That also sounds good. While we are at it also have a look at OAK-3529 >>where system needs to know a clusterId. Looks like some overlap so >>keep that usecase also in mind >> >> >>Chetan Mehrotra > >
Re: [DISCUSS] avoid bad commits with mis-behaving clock
On 14/01/16 18:34, "Julian Reschke" wrote: >On 2016-01-14 17:36, Vikas Saurabh wrote: >>@Julian, if I understand correctly, OAK-2682 currently is about >> warning, right? It mentions a self-destruct option but I think it >> wasn't implemented. > >It is implemented in trunk, see r1695671 (might be only on startup, >though). The current model is that at startup this is enforced, but at runtime this is not enforced at the oak level. What we currently have is a JMX method which should be hooked into some sort of runtime monitoring (be that external or internal, eg via health checks). Such runtime monitoring would not be enough though, as it would certainly not react fast enough. So if we're saying we need clocks to be in sync at any given time, we probably have to combine checking clocks upon every lease update with restricting valid revisions to be within the lease window. Cheers, Stefan
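The second half of that combination - restricting valid revisions to the lease window - could be sketched as follows. This is a standalone illustration, not Oak code; the 120s lease duration is the default suggested in OAK-3398 elsewhere on this list:

```java
public class LeaseWindowCheck {

    // Assumed default from the OAK-3398 discussion: 2 minutes.
    static final long LEASE_DURATION_MS = 120_000;

    // A revision timestamp is only acceptable if it falls within the lease
    // window this instance currently holds; anything outside it hints at a
    // mis-behaving clock (or an expired lease).
    static boolean isRevisionTimestampValid(long revisionTimeMs, long leaseEndTimeMs) {
        long leaseStartMs = leaseEndTimeMs - LEASE_DURATION_MS;
        return revisionTimeMs >= leaseStartMs && revisionTimeMs <= leaseEndTimeMs;
    }

    public static void main(String[] args) {
        long leaseEnd = 1_000_000;
        // inside the window: fine
        System.out.println(isRevisionTimestampValid(950_000, leaseEnd));
        // ahead of the lease end: clock jumped forward, reject
        System.out.println(isRevisionTimestampValid(1_050_000, leaseEnd));
    }
}
```

Such a per-commit guard reacts immediately, unlike the JMX-based monitoring mentioned above, which is exactly why the two would have to be combined.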
Re: [discuss] persisting cluster (view) id for discovery-lite-descriptor
Hi, Following up on the OAK-3672 discussion again, and taking a step back, I see three possible classes of solutions: a) the (cluster)id is always defined by discovery-lite, be it cluster or singlevm b) the (cluster)id is entirely removed and it is up to discovery.oak (in sling) to define it c) the (cluster)id is only set by discovery-lite when feasible, eg only for the cluster case I'm in favour of c) with the following arguments: * a) requires tarMk (!) to store this id somewhere. It can either store it in the filesystem (which makes failover support harder), store it as a hidden property in the node store (which is not manageable as it's hidden) or store it as a normal property in the repository (which sounds hacky, as discovery-lite is in the NodeStore layer while this would require it to simulate writing a JCR property) * removing the id altogether (b) would be going too far imv: the logical unit that defines the cluster view (its members) is the best place to also define an id for that unit. And that logical unit is discovery-lite in this case. * what speaks for returning null for the singleVm case (c) is the fact that it is a special case (it is not a cluster). So treating the special case separately doesn't break the separation of concern rule in my view. (c) would imply that the id is set when we're in a cluster case, and not otherwise (but that would not be a hard requirement, the specification would just be that the id *can* be null). So long story short: I suggest to change the definition of this id so that it *can* be null - in which case upper layers must define their own id. Which means Sling's discovery.oak would then store a clusterId under /var/discovery/oak. That would automatically support cold-standby/failover - fix the original bug - and simplify cleaning this property up for the clone case (as that would correspond to how this case was dealt with in discovery.impl times already). WDYT? 
Cheers, Stefan On 26/11/15 11:32, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote: >On Thu, Nov 26, 2015 at 3:56 PM, Stefan Egli <e...@adobe.com> wrote: >> which would >> then be on the Sling level thus could more simply use the slingId. > >That also sounds good. While we are at it also have a look at OAK-3529 >where system needs to know a clusterId. Looks like some overlap so >keep that usecase also in mind > > >Chetan Mehrotra
Re: [discuss] persisting cluster (view) id for discovery-lite-descriptor
I'm not sure how feasible kung fu or voodoo would be but one alternative could be that discovery-lite would 'signal' that this is a standalone instance (either by just setting id=null or by something a bit more explicit) and discovery.oak could then react accordingly - which would then be on the Sling level thus could more simply use the slingId. Not sure about making the "discovery-lite API" weaker re this point though... Cheers, Stefan On 26/11/15 04:37, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote: >There is another option to avoid extra effort when running within >Sling. Have an optional implementation which makes use of >SlingSettingsService to get fetch SlingId. With little bit of OSGi >kung fu you can have an implementation which uses SlingId when running >in Sling otherwise maintains its own id using File based approach. > >This would reduce operational complexity >Chetan Mehrotra > > >On Wed, Nov 25, 2015 at 6:23 PM, Stefan Egli <stefane...@apache.org> >wrote: >> Right, I'm not sure it is indeed a requirement. But without automatic >> support it might get forgotten and thus the cluster id would change upon >> failover. >> >> Cheers, >> Stefan >> >> On 25/11/15 13:40, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote: >> >>>On Wed, Nov 25, 2015 at 6:00 PM, Stefan Egli <stefane...@apache.org> >>>wrote: >>>>> * disadvantage: cold standby would require an explicit copying of >>>>>this >>>>>file >>>>> (during initial hand-shake?) >>> >>>Why is that a requirement? Cold standby is just a backup and currently >>>there is no automatic failover support. >>> >>>For such cases we can allow passing the id as a system/framework >>>property >>>also >>> >>>Chetan Mehrotra >> >>
[discuss] persisting cluster (view) id for discovery-lite-descriptor
Hi, Noticed that for TarMK the discovery-lite-descriptor does currently not persist the cluster-view-id [0]. It should do this however, as otherwise this causes upper-level discovery.oak to break the discovery API, as it demands a persisted cluster id. (Note that this id is not to be confused with the 'cluster node id' that identifies an instance within a document node store cluster) I wanted to get some ideas from the list as to how this should be implemented. Current options are: 1. storing a 'cluster.id.file' (or 'discovery.cluster.id.file') similar to the 'sling.id.file' (via BundleContext.getDataFile). > * cloning a repository would therefore require to delete both sling.id.file > and this new file > * disadvantage: cold standby would require an explicit copying of this file > (during initial hand-shake?) 2. storing the id as a property somewhere in the repository. > * disadvantage: cloning a repository would clone this id as well and there > might not be an easy enough way for a user to reset it Opinions? Alternatives? Cheers, Stefan -- [0] https://issues.apache.org/jira/browse/OAK-3672
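Option 1 above (a 'cluster.id.file' analogous to sling.id.file) boils down to the following sketch. The file location is illustrative; in OSGi it would come from BundleContext.getDataFile, and this is not actual Oak code:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

public class ClusterIdFile {

    // Return the persisted cluster id, generating and storing one on first use.
    static String loadOrCreateClusterId(Path idFile) throws IOException {
        if (Files.exists(idFile)) {
            return new String(Files.readAllBytes(idFile), StandardCharsets.UTF_8).trim();
        }
        String id = UUID.randomUUID().toString();
        Files.write(idFile, id.getBytes(StandardCharsets.UTF_8));
        return id;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempDirectory("oak").resolve("cluster.id.file");
        String first = loadOrCreateClusterId(f);
        String second = loadOrCreateClusterId(f);
        // The id survives 'restarts' of the instance, but note that a cloned
        // repository directory would carry the file along too, which is the
        // reset problem raised for option 1 (and cold standby would need to
        // copy it explicitly).
        System.out.println(first.equals(second));
    }
}
```

The trade-off between the two options is visible here: a file keeps the id out of the (cloneable) repository content, at the cost of manual handling for standby and clone scenarios.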
Re: [discuss] persisting cluster (view) id for discovery-lite-descriptor
Right, I'm not sure it is indeed a requirement. But without automatic support it might get forgotten and thus the cluster id would change upon failover. Cheers, Stefan On 25/11/15 13:40, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote: >On Wed, Nov 25, 2015 at 6:00 PM, Stefan Egli <stefane...@apache.org> >wrote: >>> * disadvantage: cold standby would require an explicit copying of this >>>file >>> (during initial hand-shake?) > >Why is that a requirement? Cold standby is just a backup and currently >there is no automatic failover support. > >For such cases we can allow passing the id as a system/framework property >also > >Chetan Mehrotra
Re: Observation: External vs local - Load distribution
Hi Carsten, For external events the commit info is indeed not provided, yup. For internal ones it is - except for those 'overflow' ones which collapse into a pseudo-external one. Cheers, Stefan On 13/10/15 15:17, "Carsten Ziegeler" wrote: >Am 17.06.15 um 10:35 schrieb Carsten Ziegeler: >> Ok, just to recap. In Sling we can implement the Observer interface (and >> not use the BackgroundObserver base class). This will give us reliably >> user id for all local events. >> >> Does anyone see a problem with this approach? >> >Getting back to this problem, it seems the above does not work, as the >DocumentNodeStore is not passing on the commit info to the observer in >the case of external events. >So no matter how I implement my observer, I don't get the info passed in. > >Can someone please confirm this? > >Thanks >Carsten >-- >Carsten Ziegeler >Adobe Research Switzerland >cziege...@apache.org
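In other words, under the contract discussed in this thread an observer can only rely on per-commit data such as the user id for local commits: a missing commit info means external (or a local 'overflow' collapsed into a pseudo-external one). A sketch with simplified stand-in types, not the actual Oak Observer/CommitInfo interfaces:

```java
public class ObserverSketch {

    // Stand-in for per-commit metadata that only local commits carry.
    static final class CommitInfo {
        final String userId;
        CommitInfo(String userId) { this.userId = userId; }
    }

    // What an observer can conclude from the (possibly absent) commit info.
    static String describe(CommitInfo info) {
        if (info == null) {
            return "external (no user id available)";
        }
        return "local, user=" + info.userId;
    }

    public static void main(String[] args) {
        System.out.println(describe(new CommitInfo("admin")));
        System.out.println(describe(null));
    }
}
```

This is why implementing Observer directly (rather than via BackgroundObserver) helps only for local events: for external ones there is simply no commit info to pass on.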
Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j
On 10/09/15 18:43, "Stefan Egli" <stefane...@apache.org> wrote: >additionally/independently: > >[...] > >* also, we should probably increase the lease thread's priority to reduce >the likelihood of the lease timing out (same would be true for >discovery.impl's heartbeat thread) > >* plus increasing the lease time from 1min to perhaps 5min as the default >would also reduce the number of cases that hit problems dramatically FYI: Put these suggested improvements into: https://issues.apache.org/jira/browse/OAK-3398 Most noteworthy: I suggest increasing the default lease timeout to 120sec (not 5min, I think that's too much). Cheers, Stefan
Re: Oak 1.3.6 release plan
As the 1.3.6 is already in the voting phase, it would mean -1 for that release - not sure if it's enough of an issue for that though? (mind you, the issue was already there in 1.3.5..) Cheers, Stefan On 14/09/15 12:29, "Julian Reschke" <julian.resc...@gmx.de> wrote: >On 2015-09-14 10:17, Julian Reschke wrote: >> On 2015-09-14 10:03, Stefan Egli wrote: >>> On 14/09/15 09:51, "Marcel Reutegger" <mreut...@adobe.com> wrote: >>> >>>> ...would it >>>> make sense to just disable the lease check for the diagnostics >>>> in oak-run? ... >>> >>> +1 as a short-term fix >>> >>> Cheers, >>> Stefan >> >> I agree that this would have been broken by the other wrappers, and the >> approach in itself wasn't smart in the first place. My point being: can >> we please come up with a proper solution that will address all the uses >> cases? >> >> Best regards, Julian > >...essentially we are introducing a new feature (improving resilience) >that breaks existing code assumptions, potentially causing performance >degradations. I believe the right thing to do *now* is to disable the >new feature, make the 1.3.6 release, then fix things properly and turn >it in in 1.3.7. > > >Best regards, Julian > > > > > > > >
Re: Oak 1.3.6 release plan
On 14/09/15 09:51, "Marcel Reutegger" wrote: >...would it >make sense to just disable the lease check for the diagnostics >in oak-run? ... +1 as a short-term fix Cheers, Stefan
Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j
My vote would also be (b) for the short-term. If we figure out a way to properly restart the nodestore (c) we can still come back to that at a later time. Hence I've created https://issues.apache.org/jira/browse/OAK-3397 and unless the list vetoes I'll follow up on that next. Cheers, Stefan On 11/09/15 11:38, "Julian Sedding" <jsedd...@gmail.com> wrote: >My preference is (b), even though I think stopping the NodeStore >service should be sufficient (it may not currently be sufficient, I >don't know). > >Particularly, I believe that "trying harder" is detrimental to the >overall stability of a cluster/topology. We are dealing with a >possibly faulty instance, so who can decide that it is ok again after >trying harder? The faulty instance itself? > >"Read-only" doesn't sound too useful either, because that may fool >clients into thinking they are dealing with a "healthy" instance for >longer than necessary and thus can lead to bigger issues downstream. > >I believe that "fail early and fail often" is the path to a stable >cluster. > >Regards >Julian > >On Thu, Sep 10, 2015 at 6:43 PM, Stefan Egli <stefane...@apache.org> >wrote: >> On 09/09/15 18:11, "Stefan Egli" <stefane...@apache.org> wrote: >> >>>On 09/09/15 18:01, "Stefan Egli" <stefane...@apache.org> wrote: >>> >>>>I think if the observers would all be 'OSGi-ified' then this could be >>>>achieved. But currently eg the BackgroundObserver is just a pojo and >>>>not >>>>an osgi component (thus doesn't support any activate/deactivate method >>>>hooks). >>> >>>.. I take that back - going via OsgiWhiteboard should work as desired - >>>so >>>perhaps implementing deactivate/activate methods in the >>>(Background)Observer(s) would do the trick .. I'll give it a try .. >> >> ootb this wont work as the BackgroundObserver, as one example, is not an >> OSGi component, so wont get any deactivate/activate calls atm. 
so to >> achieve this, it would have to be properly OSGi-ified - something which >> sounds like a bigger task and not only limited to this one class - which >> means making DocumentNodeStore 'restart capable' sounds like a bigger >>task >> too and the question is indeed if it is worth while ('will it work?') or >> if there are alternatives.. >> >> which brings me back to the original question as to what should be done >>in >> case of a lease failure - to recap the options left (if System.exit is >>not >> one of them) are: >> >> a) 'go read-only': prevent writes by throwing exceptions from this >>moment >> until eternity >> >> b) 'stop oak': stop the oak-core bundle (prevent writes by throwing >> exceptions for those still reaching out for the nodeStore) >> >> c) 'try harder': try to reactivate the lease - continue allowing writes >>- >> and make sure the next backgroundWrite has correctly updated the >> 'unsavedLastRevisions' (cos others could have done a recover of this >>node, >> so unsavedLastRevisions contains superfluous stuff that must no longer >>be >> written). this would open the door for edge cases ('change of longer >>time >> window with multiple leaders') but perhaps is not entirely impossible... >> >> additionally/independently: >> >> * in all cases the discovery-lite descriptor should expose this lease >> failure/partitioning situation - so that anyone can react who would like >> to, esp should anyone no longer assume that the local instance is leader >> or part of the cluster - and to support that optional Sling Health Check >> which still does a System.exit :) >> >> * also, we should probably increase the lease thread's priority to >>reduce >> the likelihood of the lease timing out (same would be true for >> discovery.impl's heartbeat thread) >> >> >> * plus increasing the lease time from 1min to perhaps 5min as the >>default >> would also reduce the number of cases that hit problems dramatically >> >> wdyt? >> >> Cheers, >> Stefan >> >>
Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j
On 09/09/15 18:11, "Stefan Egli" <stefane...@apache.org> wrote: >On 09/09/15 18:01, "Stefan Egli" <stefane...@apache.org> wrote: > >>I think if the observers would all be 'OSGi-ified' then this could be >>achieved. But currently eg the BackgroundObserver is just a pojo and not >>an osgi component (thus doesn't support any activate/deactivate method >>hooks). > >.. I take that back - going via OsgiWhiteboard should work as desired - so >perhaps implementing deactivate/activate methods in the >(Background)Observer(s) would do the trick .. I'll give it a try .. Out of the box this won't work, as the BackgroundObserver, as one example, is not an OSGi component, so won't get any deactivate/activate calls atm. So to achieve this, it would have to be properly OSGi-ified - something which sounds like a bigger task and not limited to only this one class - which means making DocumentNodeStore 'restart capable' sounds like a bigger task too, and the question is indeed whether it is worthwhile ('will it work?') or if there are alternatives.. Which brings me back to the original question as to what should be done in case of a lease failure - to recap, the options left (if System.exit is not one of them) are:

a) 'go read-only': prevent writes by throwing exceptions from this moment until eternity

b) 'stop oak': stop the oak-core bundle (prevent writes by throwing exceptions for those still reaching out for the nodeStore)

c) 'try harder': try to reactivate the lease - continue allowing writes - and make sure the next backgroundWrite has correctly updated the 'unsavedLastRevisions' (because others could have done a recovery of this node, so unsavedLastRevisions contains superfluous stuff that must no longer be written). This would open the door for edge cases ('change of longer time window with multiple leaders') but perhaps is not entirely impossible...
additionally/independently:

* in all cases the discovery-lite descriptor should expose this lease failure/partitioning situation - so that anyone can react who would like to, esp should anyone no longer assume that the local instance is leader or part of the cluster - and to support that optional Sling Health Check which still does a System.exit :)

* also, we should probably increase the lease thread's priority to reduce the likelihood of the lease timing out (same would be true for discovery.impl's heartbeat thread)

* plus increasing the lease time from 1min to perhaps 5min as the default would also reduce the number of cases that hit problems dramatically

wdyt? Cheers, Stefan
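Option a) 'go read-only' from the recap above could be sketched like this: a flag flipped on lease failure, after which every write fails with an exception while reads keep working. A simplified stand-in, not the DocumentNodeStore implementation:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ReadOnlyOnLeaseFailure {

    private final AtomicBoolean leaseFailed = new AtomicBoolean(false);
    private String content = "initial";

    // Called when the lease update fails: from this moment on, writes are refused.
    void onLeaseFailure() {
        leaseFailed.set(true);
    }

    String read() {
        return content;
    }

    void write(String newContent) {
        if (leaseFailed.get()) {
            throw new IllegalStateException(
                    "lease failed: this instance no longer accepts writes");
        }
        content = newContent;
    }

    public static void main(String[] args) {
        ReadOnlyOnLeaseFailure store = new ReadOnlyOnLeaseFailure();
        store.write("updated");
        store.onLeaseFailure();
        try {
            store.write("after failure");
        } catch (IllegalStateException expected) {
            System.out.println("write rejected, read still works: " + store.read());
        }
    }
}
```

The downside noted later in the thread applies here too: clients may keep treating such a half-alive instance as healthy, which is one argument for option b) instead.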
Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j
Hi all, I'd like to follow up on the idea to restart DocumentNodeStore as a result of a lease failure [0]: I suggest we don't do that and instead just stop the oak-core bundle. After some prototyping and running into OAK-3373 [1] I'm no longer sure if restarting the DocumentNodeStore is a feasible path to go, esp in the short term. The problem encountered so far is that Observers cannot be easily switched from the old to the (restarted/)new store:

* as pointed out by MichaelD they could have a backlog yet to process towards the old store - which they cannot access anymore as that one would be forcibly closed

* there is not yet a proper way to switch from old to new ('reset') - esp there is a risk that there could be a gap (this part we might be able to fix though, not sure)

* both of the above carry the risk that Observers miss some changes - something which would be unacceptable I guess.

I think the more KISS approach would be to just forcibly close the DocumentNodeStore - or actually to stop the entire oak-core bundle - with appropriate errors logged so that the issue becomes clear. The instance would basically become unusable, mostly, but at least it would not be a System.exit. What do ppl think? Cheers, Stefan -- [0] https://issues.apache.org/jira/browse/OAK-3250 [1] https://issues.apache.org/jira/browse/OAK-3373 On 18/08/15 16:45, "Stefan Egli" <e...@adobe.com> wrote: >I've created OAK-3250 to follow up on the DocumentNodeStore-restart idea.
> >Cheers, >Stefan >-- >https://issues.apache.org/jira/browse/OAK-3250 > >On 18/08/15 15:59, "Marcel Reutegger" <mreut...@adobe.com> wrote: > >>On 18/08/15 15:38, "Stefan Egli" wrote: >>>On 18/08/15 13:43, "Marcel Reutegger" <mreut...@adobe.com> wrote: >>>>On 18/08/15 11:14, "Stefan Egli" wrote: >>>>>b) Oak does not do the System.exit but refuses to update anything >>>>>towards >>>>>the document store (thus just throws exceptions on each invocation) - >>>>>and >>>>>upper level code detects this situation (eg a Sling Health Check) and >>>>>would do a System.exit based on how it is configured >>>>> >>>>>c) same as b) but upper level code does not do a System.exit (I'm not >>>>>sure >>>>>if that makes sense - the instance is useless in such a situation) >>>> >>>>either b) or c) sounds reasonable to me. >>>> >>>>but if possible I'd like to avoid a System.exit(). would it be possible >>>>to detect this situation in the DocumentNodeStoreService and restart >>>>the DocumentNodeStore without the need to restart the JVM >>> >>>Good point. Perhaps restarting DocumentNodeStore is a valid alternative >>>indeed. Is that feasible from a DocumentNodeStore point of view? >> >>it probably requires some changes to the DocumentNodeStore, because >>we want it to tear down without doing any of the cleanup it >>may otherwise perform. it must not release the cluster node info >>nor update pending _lastRevs, etc. >> >>> What would be the consequences of a restarted DocumentNodeStore? >> >>to the DocumentNodeStore it will look like it was killed and it will >>perform recovery (e.g. for the pending _lastRevs). >> >>Regards >> Marcel >> >
Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j
Hi, On 09/09/15 17:39, "Marcel Reutegger" wrote: >>* as pointed out by MichaelD they could have a backlog yet to process >>towards the old store - which they cannot access anymore as that one >>would >>be forcibly closed > >in my view, those observers should be unregistered from the store before >it is shut down and any backlog cleared, i.e. it will be lost. yes they do get unregistered right away indeed - but atm there's no way to prevent eg the BackgroundObserver from still having entries in the queue and continuing to process them. so those queued entries will indeed fail as the store is closed. >>* there is not yet a proper way to switch from old to new ('reset') - esp >>is there a risk that there could be a gap (this part we might be able to >>fix though, not sure) > >I don't see a requirement for this. if you restart the entire stack you >will also have a gap. the difference is perhaps that if you restart the stack this is done as an explicit admin operation, knowingly. Whereas what we're trying to achieve here is something automated, 'under the hood', which has a different quality requirement imv. >>* both above carry the risk that Observers miss some changes - something >>which would be unacceptable I guess. > >same as above. I don't think observers must survive a node store restart. >I even think it is wrong. Every client of the node store should be >restarted in that case, including Observers. I think if the observers would all be 'OSGi-ified' then this could be achieved. But currently eg the BackgroundObserver is just a pojo and not an osgi component (thus doesn't support any activate/deactivate method hooks). Cheers, Stefan
Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j
On 09/09/15 18:01, "Stefan Egli" <stefane...@apache.org> wrote: >I think if the observers would all be 'OSGi-ified' then this could be >achieved. But currently eg the BackgroundObserver is just a pojo and not >an osgi component (thus doesn't support any activate/deactivate method >hooks). .. I take that back - going via OsgiWhiteboard should work as desired - so perhaps implementing deactivate/activate methods in the (Background)Observer(s) would do the trick .. I'll give it a try .. Cheers, Stefan
Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 381 - Failure
before it does the exit it issues a loud log.error - so we'd have to have access to the log output.. besides resolving OAK-3250 when we know a test fails because of it, the easiest is to disable the leaseCheck as eg done in [0] but now test results of '381' are deleted so we can't find out anymore Cheers, Stefan -- [0] http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-core/src/test/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollectorIT.java?r1=1700741=1700740=1700741 On 07/09/15 10:25, "Michael Dürig" <mdue...@apache.org> wrote: > > >On 7.9.15 10:03 , Stefan Egli wrote: >> so perhaps it's a lease timeout case.. > >Any way to confirm this on Jenkins? E.g. could we place a println in >front of it? Or replace it with a throws? > >Michael
Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 381 - Failure
'... System.exit called ...' what we currently have until OAK-3250 is fixed is a System.exit when the lease cannot be updated. so perhaps it's a lease timeout case.. Cheers, Stefan On 31/08/15 16:00, "Michael Dürig" wrote: > >"The forked VM terminated without saying properly goodbye. VM crash or >System.exit called ?" [2]. This happens quite often lately. See log >files with -X option [1]. Not much information though. Any ideas what >could be causing this? > >Michael > >[1] >https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/381/jdk=jdk1.8.0_11,label=Ubuntu,nsfixtures=SEGMENT_MK,profile=integrationTesting/console > >[2] >[ERROR] Failed to execute goal >org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test >(default) on project oak-core: Execution default of goal >org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test >failed: The forked VM terminated without saying properly goodbye. VM >crash or System.exit called ? -> [Help 1] >org.apache.maven.lifecycle.LifecycleExecutionException: Failed to >execute goal >org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test >(default) on project oak-core: Execution default of goal >org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test >failed: The forked VM terminated without saying properly goodbye. VM >crash or System.exit called ? 
> at >org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java >:224) > at >org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java >:153) > at >org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java >:145) > at >org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(Li >fecycleModuleBuilder.java:108) > at >org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(Li >fecycleModuleBuilder.java:76) > at >org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedB >uilder.build(SingleThreadedBuilder.java:51) > at >org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStar >ter.java:116) > at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:361) > at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155) > at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584) > at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:213) > at org.apache.maven.cli.MavenCli.main(MavenCli.java:157) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java: >62) > at >sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm >pl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at >org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher. >java:289) > at >org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229 >) > at >org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launche >r.java:415) > at >org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356) >Caused by: org.apache.maven.plugin.PluginExecutionException: Execution >default of goal >org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test >failed: The forked VM terminated without saying properly goodbye. VM >crash or System.exit called ? 
> at >org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuild >PluginManager.java:144) > at >org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java >:208) > ... 19 more >Caused by: java.lang.RuntimeException: The forked VM terminated without >saying properly goodbye. VM crash or System.exit called ? > at >org.apache.maven.plugin.surefire.booterclient.output.ForkClient.close(Fork >Client.java:257) > at >org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter >.java:301) > at >org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter. >java:116) > at >org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(Abst >ractSurefireMojo.java:740) > at >org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAllProviders( >AbstractSurefireMojo.java:682) > at >org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPrecondi >tionsChecked(AbstractSurefireMojo.java:648) > at >org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSure >fireMojo.java:586) > at >org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuild >PluginManager.java:133) > ... 20 more > > >On 31.8.15 3:56 , Apache Jenkins Server wrote: >> The Apache Jenkins build system has built Apache Jackrabbit Oak matrix >>(build #381) >> >> Status: Failure >> >> Check console output at
Re: System.out.println used in unit tests in oak-core
which you might have noticed since I disabled redirectTestOutputToFile [0] to debug OAK-3292 so we now have system.out during test runs. I intend to put that flag back once the OAK-3292 dust has settled.. Cheers, Stefan -- [0] - http://svn.apache.org/r1697676 On 27/08/15 13:51, Alex Parvulescu alex.parvule...@gmail.com wrote: Hi, I noticed there are quite a few tests using System.out.println to display various data. Please replace these calls by proper logging. Culprits: - org.apache.jackrabbit.oak.plugins.document.cache.SerializerTest [0] - org.apache.jackrabbit.oak.plugins.document.ClusterViewTest [1] - org.apache.jackrabbit.oak.plugins.document.HierarchyConflictTest [2] - org.apache.jackrabbit.oak.plugins.document.NodeStoreDiffTest [3] - org.apache.jackrabbit.oak.security.user.MembershipProviderTest [4] thanks, alex [0] Running org.apache.jackrabbit.oak.plugins.document.cache.SerializerTest Size 7 null Size 18 b1 Size 301 b1 Size 9 r14f6ef05471-1-5 Size 9 br14f6ef05472-1-5 [1] Running org.apache.jackrabbit.oak.plugins.document.ClusterViewTest {seq:10,final:true,id:a2b9d562-9536-436f-9b67-5efbb85fbed4,me:21 ,active:[21],deactivating:[],inactive:[]} {seq:10,final:true,id:b8e70adb-4b30-4319-aec8-b28fb1679a4c,me:2, active:[2],deactivating:[],inactive:[3]} {seq:10,final:true,id:341ca74f-a2cd-4d81-8a1b-6565f90e22a2,me:2, active:[2,5,6],deactivating:[],inactive:[3]} {seq:10,final:true,id:7f0cfb9e-27eb-47fc-934b-89194a18ac0c,me:2, active:[2],deactivating:[],inactive:[3,4,5,6]} {seq:10,final:true,id:07ac1a64-dd2c-4d02-add8-e345d808aa7a,me:2, active:[2,3],deactivating:[4],inactive:[5,6]} {seq:10,final:false,id:3cd780f0-16ff-4100-a842-c15fcf43e339,me:2 ,active:[2,3],deactivating:[4,5],inactive:[6]} [2] Running org.apache.jackrabbit.oak.plugins.document.HierarchyConflictTest expected: org.apache.jackrabbit.oak.api.CommitFailedException: OakOak: do not retry merge in this test expected: org.apache.jackrabbit.oak.api.CommitFailedException: OakOak: do not retry merge in this 
test [3] Running org.apache.jackrabbit.oak.plugins.document.NodeStoreDiffTest Root at r1-0-1 (r1-0-1) Root at r2-0-1 (r2-0-1) Root at r3-0-1 (r3-0-1) Root at r4-0-1 (r4-0-1) Root at r1-0-1 (r1-0-1) Root at r2-0-1 (r2-0-1) Root at r3-0-1 (r3-0-1) Root at r4-0-1 (r4-0-1) [4] Running org.apache.jackrabbit.oak.security.user.MembershipProviderTest created 1 groups, 99 users. created 1 groups, 199 users. created 1 groups, 299 users. created 1 groups, 399 users. created 1 groups, 499 users. created 1 groups, 599 users. created 1 groups, 699 users. created 1 groups, 799 users. created 1 groups, 899 users. created 1 groups, 999 users. created 99 groups, 1 users. created 199 groups, 1 users. created 299 groups, 1 users. created 399 groups, 1 users. created 499 groups, 1 users. created 599 groups, 1 users. created 699 groups, 1 users. created 799 groups, 1 users. created 899 groups, 1 users. created 999 groups, 1 users. created 11 groups, 89 users. created 21 groups, 179 users. created 31 groups, 269 users. created 41 groups, 359 users. created 51 groups, 449 users. created 61 groups, 539 users. created 71 groups, 629 users. created 81 groups, 719 users. created 91 groups, 809 users. created 100 groups, 900 users. created 110 groups, 990 users. created 99 groups, 1 users. created 1 groups, 99 users. created 1 groups, 199 users. created 1 groups, 299 users. created 1 groups, 399 users. created 1 groups, 499 users. created 1 groups, 599 users. created 1 groups, 699 users. created 1 groups, 799 users. created 1 groups, 899 users. created 1 groups, 999 users. created 1 groups, 99 users. created 1 groups, 199 users. created 1 groups, 299 users. created 1 groups, 399 users. created 1 groups, 499 users. created 1 groups, 599 users. created 1 groups, 699 users. created 1 groups, 799 users. created 1 groups, 899 users. created 1 groups, 999 users. created 1 groups, 99 users. created 1 groups, 199 users. created 1 groups, 299 users. created 1 groups, 399 users. 
created 1 groups, 499 users. created 1 groups, 599 users. created 1 groups, 699 users. created 1 groups, 799 users. created 1 groups, 899 users. created 1 groups, 999 users.
Re: Jenkins notifications
yep, very useful, thx! Cheers, Stefan On 26/08/15 11:47, Michael Dürig mdue...@apache.org wrote: Hi, As you might have seen, Jenkins notifications now contain the change list since the last build as well as the list of failed tests. This should make it easier for everyone to find out what caused a build to fail and to take appropriate actions. Michael
[travis] console output of failed tests
Hi, I'm chasing a test failure on travis ([0]) currently but it's virtually impossible to find the root cause without having the console (or file) output of the test in case it fails. Does anyone know if/how to get the surefire files on travis? or should we tweak the pom (redirectTestOutputToFile)? Cheers, Stefan -- [0] - https://travis-ci.org/apache/jackrabbit-oak/builds/77114814
Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j
On 18/08/15 13:43, Marcel Reutegger mreut...@adobe.com wrote: On 18/08/15 11:14, Stefan Egli wrote: b) Oak does not do the System.exit but refuses to update anything towards the document store (thus just throws exceptions on each invocation) - and upper level code detects this situation (eg a Sling Health Check) and would do a System.exit based on how it is configured c) same as b) but upper level code does not do a System.exit (I'm not sure if that makes sense - the instance is useless in such a situation) either b) or c) sounds reasonable to me. but if possible I'd like to avoid a System.exit(). would it be possible to detect this situation in the DocumentNodeStoreService and restart the DocumentNodeStore without the need to restart the JVM Good point. Perhaps restarting DocumentNodeStore is a valid alternative indeed. Is that feasible from a DocumentNodeStore point of view? What would be the consequences of a restarted DocumentNodeStore? or would this lead to an illegal state from a discovery POV? Have to think through the scenarios but perhaps this is fine (I was indeed initially under the assumption that it would not be fine, but that might have been wrong). The important bit is that any topology-related activity stops - and this can be achieved by sending TOPOLOGY_CHANGING (which in turn could be achieved by setting its own instance into 'deactivating' state in the discovery-lite-descriptor) and only coming back with TOPOLOGY_CHANGED once the restart has settled and the local instance is back in the cluster with a valid, new lease. Cheers, Stefan
Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j
I've created OAK-3250 to follow up on the DocumentNodeStore-restart idea. Cheers, Stefan -- https://issues.apache.org/jira/browse/OAK-3250 On 18/08/15 15:59, Marcel Reutegger mreut...@adobe.com wrote: On 18/08/15 15:38, Stefan Egli wrote: On 18/08/15 13:43, Marcel Reutegger mreut...@adobe.com wrote: On 18/08/15 11:14, Stefan Egli wrote: b) Oak does not do the System.exit but refuses to update anything towards the document store (thus just throws exceptions on each invocation) - and upper level code detects this situation (eg a Sling Health Check) and would do a System.exit based on how it is configured c) same as b) but upper level code does not do a System.exit (I'm not sure if that makes sense - the instance is useless in such a situation) either b) or c) sounds reasonable to me. but if possible I'd like to avoid a System.exit(). would it be possible to detect this situation in the DocumentNodeStoreService and restart the DocumentNodeStore without the need to restart the JVM Good point. Perhaps restarting DocumentNodeStore is a valid alternative indeed. Is that feasible from a DocumentNodeStore point of view? it probably requires some changes to the DocumentNodeStore, because we want it to tear down without doing any of the cleanup it may otherwise perform. it must not release the cluster node info nor update pending _lastRevs, etc. What would be the consequences of a restarted DocumentNodeStore? to the DocumentNodeStore it will look like it was killed and it will perform recovery (e.g. for the pending _lastRevs). Regards Marcel
Re: 1.3.4 blocked as failing tests
my fault, I'm looking into it now On 17/08/15 12:02, Davide Giannella dav...@apache.org wrote: Hello team, trying to release Oak 1.3.4 but it's constantly failing on my local. Details can be found here https://issues.apache.org/jira/secure/attachment/12750782/oak-1.3.4-failing-1439805620.log looking into it but if you know the answer ping me please. Davide
Re: [discovery] Introducing a simple mongo-based discovery-light service (to circumvent mongoMk's eventual consistency delays)
Hi all, I've attached a 'final final' version of discovery lite to OAK-2844 ready for a final review - depending on feedback I plan to push that to trunk once 1.3.4 is out. Cheers, Stefan https://issues.apache.org/jira/browse/OAK-2844 https://issues.apache.org/jira/secure/attachment/12750833/OAK-2844.v4.patch On 07/07/15 12:45, Stefan Egli stefane...@apache.org wrote: FYI: I've attached a suggested 'final draft' version of the discovery lite to OAK-2844 for review. Comments very welcome! Cheers, Stefan -- https://issues.apache.org/jira/browse/OAK-2844?focusedCommentId=14616496page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14616496 On 5/6/15 3:22 PM, Stefan Egli stefane...@apache.org wrote: Hi, Pls note a suggestion of a new 'discovery-light' API in OAK-2844. Would appreciate comments and reviews from this list. Thanks, Cheers, Stefan
[document] lease check activated (OAK-2739)
Hi all, Just a quick heads-up: I've activated a 'lease check' with OAK-2739 in trunk: this checks upon every invocation of DocumentStore if the local lease is still valid. If it is not, it means that the instance is misbehaving and that others have potentially seen it as inactive. Thus the local instance will automatically shut down and not do any further writes towards DocumentStore. Cheers, Stefan
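The mechanism can be modeled roughly like this. A simplified stand-in, not the actual OAK-2739 code: the class and method names are hypothetical, and Oak surfaces the failure as a DocumentStoreException rather than the IllegalStateException used here:

```java
// Minimal model of the lease check: every store invocation first verifies
// the local lease and fails fast once it has expired, so an instance that
// others may already consider dead cannot keep writing.
public class LeaseCheckedStore {
    private volatile long leaseEndTime;

    public LeaseCheckedStore(long initialLeaseEndTime) {
        this.leaseEndTime = initialLeaseEndTime;
    }

    public void renewLease(long newLeaseEndTime) {
        this.leaseEndTime = newLeaseEndTime;
    }

    private void leaseCheck() {
        if (System.currentTimeMillis() >= leaseEndTime) {
            // Oak throws a DocumentStoreException here
            throw new IllegalStateException(
                "this instance failed to update the lease in time "
                + "and can no longer access the store");
        }
    }

    public void update(String path, String value) {
        leaseCheck(); // guard every invocation
        // ... perform the actual write ...
    }
}
```

The point of the guard is that the check is cheap (a volatile read and a clock comparison), so doing it on every call adds negligible overhead.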
Re: Release dates
I'd find it more useful (for us) if it were the cut date. Cheers, Stefan On 13/08/15 10:08, Davide Giannella dav...@apache.org wrote: Hello team, a trivia question about release dates. Normally in jira I set the release date on a future release for when we plan to cut it. But we have the voting process of 72hrs that means the actual release date will be 3 days after the cut. Shall we put on jira then the release date as the actual announcement or stick it to the cut? Cheers Davide
Re: [discuss] handling of 'wish list' issues - introduce 'wish list' fix version?
perhaps 'unscheduled' and 'wish list' are very similar indeed - even though I'd have thought of 'unscheduled' more as 'it should be scheduled soon-ish', whereas 'wish list' would already have gone through the decision process of 'no, we don't do this anytime soon, but it's a good idea so let's not forget it'. Cheers, Stefan On 7/29/15 9:27 AM, Angela Schreiber anch...@adobe.com wrote: why not simply marking it as 'unscheduled'? IMO that pretty much expresses that this is not yet scheduled but still considered a valid improvement/bug that we want to address at some point. i only resolve issues 'later' or 'wontfix' that i am confident will never be fixed. adding a 'wish list' fix version will just be another huge container that we hardly ever look at and i would find it hard to understand the difference between 'unscheduled' and 'wish list'. if something is on your wishlist, i would suggest you assign the issue to yourself in order to keep track of it (compared to the whole bunch of other unscheduled issues). or flag it with a label that allows you to find all your wishes. so, rather -1 from my side. kind regards angela On 29/07/15 08:58, Stefan Egli stefane...@apache.org wrote: Hi, Just came across a ticket [0] that has no urgent priority to be fixed in 1.3 but would be a good candidate to be put into the general 'wish list pod'. Now currently we seem to handle such cases by just closing the ticket. This imv has the downside of it getting completely lost and forgotten. We could thus introduce a new 'wish list' fix version that can be set on those tickets instead of just closing them. Wdyt? Cheers, Stefan -- https://issues.apache.org/jira/browse/OAK-2613
Re: Do not add comments when bulk moves are performed in JIRA
+1 There's always the jira history to figure out when what was modified Cheers, Stefan On 7/29/15 8:17 AM, Chetan Mehrotra chetan.mehro...@gmail.com wrote: Hi Team, Currently most of the issues scheduled for 1.3.x release have comments like 'Bulk Move to xxx'. This creates unnecessary noise in the comment log. Would it be possible to move the issues to next version silently i.e. just get fix version changed and not add any comment Chetan Mehrotra
Re: [discovery] Introducing a simple mongo-based discovery-light service (to circumvent mongoMk's eventual consistency delays)
FYI: I've attached a suggested 'final draft' version of the discovery lite to OAK-2844 for review. Comments very welcome! Cheers, Stefan -- https://issues.apache.org/jira/browse/OAK-2844?focusedCommentId=14616496page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14616496 On 5/6/15 3:22 PM, Stefan Egli stefane...@apache.org wrote: Hi, Pls note a suggestion of a new 'discovery-light' API in OAK-2844. Would appreciate comments and reviews from this list. Thanks, Cheers, Stefan
Re: Error handling during AsyncIndexUpdate
+1 to report and continue. There was a similar issue earlier where the async indexing would fail with an OOME - in which case the 'rinse and repeat' even made it worse (as each time more and more data-to-be-indexed accumulates and the likelihood of an OOME would just increase) Cheers, Stefan On 6/22/15 10:54 AM, Julian Sedding jsedd...@gmail.com wrote: Hi all On a freshly migrated Oak setup (AEM 6.1), I recently observed that async indexing was running all the time. At first I did not worry, because there were ~14mio nodes to be indexed, but eventually I got the impression that there was an endless loop. Here's my take on what's happening, and please feel free to correct any wrong assumptions I make: - after a migration there is no checkpoint for async indexing to start at, so it indexes everything - a migration is a single commit, so async indexing is all or nothing (not sure the single commit is relevant, anyone?) - due to an oddity in the metadata of a PDF file, async indexing failed with an exception - async indexing recommences to see if the error persists on any subsequent run - rinse and repeat If my interpretation is correct, I would suggest to review the error handling. If an error is not recoverable, the current behaviour basically prevents any documents to be indexed and the AsyncIndexUpdate stops to make any progress. It may be a better trade off to report the paths of failing documents and continue despite the failure. What do others think? Regards Julian
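The 'report and continue' strategy discussed here can be sketched as follows. The names (TolerantIndexer, indexAll) are illustrative, not Oak's AsyncIndexUpdate API; the idea is simply to catch per-document failures, record the failing path, and keep going instead of aborting the whole run:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class TolerantIndexer {
    final List<String> failedPaths = new ArrayList<>();
    int indexed = 0;

    public void indexAll(Map<String, String> documents,
                         Consumer<String> indexFn) {
        for (Map.Entry<String, String> e : documents.entrySet()) {
            try {
                indexFn.accept(e.getValue());
                indexed++;
            } catch (RuntimeException ex) {
                // report the path and continue instead of failing the run;
                // the caller can log or retry failedPaths later
                failedPaths.add(e.getKey());
            }
        }
    }
}
```

With the current all-or-nothing behaviour a single bad PDF blocks all 14mio nodes; with this approach only the bad document is skipped and its path is reported.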
Re: [mongoNs] using bulk operation for backgroundupdate?
Ok, created a separate OAK-3018 for adapting backgroundWrite to use the batch-update (once available) Cheers, Stefan On 6/22/15 10:05 AM, Marcel Reutegger mreut...@adobe.com wrote: Hi, this is currently not possible because the DocumentStore API does not have such a method. There's an existing issue closely related to your request: https://issues.apache.org/jira/browse/OAK-2066 I think in general it makes sense to add such a method. As you can see in the issue, the background write is not the only application that would benefit from it. Regards Marcel On 18/06/15 17:24, Stefan Egli wrote: Hi, This might have been discussed before but just so I understand: The DocumentNodeStore.backgroundWrite goes through the heavy work of updating the lastRev for all pending changes and does so in a hierarchical-depth-first manner. Unfortunately, if the pending changes all come from separate commits (as does not sound so unlikely), the updates are sent in individual update calls to mongo (whenever the lastRev differs). Which, if there are many changes, results in many calls to mongo. What about replacing that mechanism using mongo's bulk functionality (eg initializeOrderedBulkOperation)? Is this for some reason not possible or already in the jira-queue (which ticket)? Cheers, Stefan -- http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#initializeOrderedBulkOperation--
[mongoNs] using bulk operation for backgroundupdate?
Hi, This might have been discussed before but just so I understand: The DocumentNodeStore.backgroundWrite goes through the heavy work of updating the lastRev for all pending changes and does so in a hierarchical-depth-first manner. Unfortunately, if the pending changes all come from separate commits (as does not sound so unlikely), the updates are sent in individual update calls to mongo (whenever the lastRev differs). Which, if there are many changes, results in many calls to mongo. What about replacing that mechanism using mongo's bulk functionality (eg initializeOrderedBulkOperation)? Is this for some reason not possible or already in the jira-queue (which ticket)? Cheers, Stefan -- http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#initializeOrderedBulkOperation--
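The effect of batching can be sketched without the MongoDB driver. This is a plain-Java illustration (BackgroundWriter, BATCH_SIZE and the round-trip counter are hypothetical); the real change would funnel each batch through the driver's ordered bulk API (DBCollection.initializeOrderedBulkOperation()):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BackgroundWriter {
    static final int BATCH_SIZE = 100; // illustrative chunk size

    // pending lastRev updates, keyed by path (later update wins)
    private final Map<String, String> pending = new LinkedHashMap<>();
    int roundTrips = 0; // how many calls went to the (simulated) store

    public void updateLastRev(String path, String rev) {
        pending.put(path, rev);
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // one bulk write for the whole batch instead of pending.size()
        // individual update calls to the store
        roundTrips++;
        pending.clear();
    }
}
```

For 250 pending lastRev updates this does 3 store round trips (100 + 100 + 50) instead of 250, which is the whole point of the proposal.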
Re: Observation: External vs local - Load distribution
On 6/15/15 2:40 PM, Carsten Ziegeler cziege...@apache.org wrote: Am 15.06.15 um 14:23 schrieb Marcel Reutegger: Hi, you can write a CommitEditor, which is called with every local commit. Is it easy to calculate the changed nodes/properties in this editor? As I understand yes, the Editor gets callback for all changed nodes and properties. I guess the question is how that is encapsulated towards upper layers as you probably do not want (too much) application code using commit editors. Cheers, Stefan
Re: Observation: External vs local - Load distribution
On 6/15/15 4:29 PM, Carsten Ziegeler cziege...@apache.org wrote: Am 15.06.15 um 16:21 schrieb Chetan Mehrotra: On Mon, Jun 15, 2015 at 1:13 PM, Carsten Ziegeler cziege...@apache.org wrote: Now, with Oak there is still this distinction, however if I remember correctly under heavy load it might happen that local events are reported as external events. And in that case the above pattern fails. Regardless of how rare this situation might be, if it can happen it will eventually happen. This is an implementation detail of BackgroundObserver (BO) which is used by OakResourceListener in Sling. BO keeps a queue of changed NodeState tuples and if it gets filled it is collapsed. If you want to avoid that at *any* cost that you can used a different impl which uses say LinkedBlockingQueue and does not enforce any limit. That would be similar to how JcrResourceListener works which uses an unbound in memory queue Indeed a good point! Ah, thanks Chetan, that's the first time I hear this - so basically if we implement our own observer, we can reliably get: a) all changes b) local/external info c) user id Is that correct? the way I understand it is: ;) * for local changes yes, you'd get all local changes incl user id * for external changes you'd get them all, but without user id and they would typically be collapsed (as external changes are only periodically written by the background updater) So given this, you could indeed have an Observer that throws away all external events (which are easily spottable as they have commitInfo==null) and only process internal ones. And for such a 'local-only' observer I think this could be a feasible approach. Speaking more generally however: I guess to support scaling to very large number of instances, the goal should be that external events are filtered as much as possible too. Providing fast processing alone (as is the goal eg with OAK-2829) would not suffice. I think for this we'd need 'oak level observation filtering'. 
Such a filter could be applied to the journal (filling only 'interested' paths into the diff caches). At which point I wonder if it would not be beneficial to do both 'local vs external' as well as 'path-filtering' on an oak level, rather than one or both on the sling level. Re the commit editor use case: I think that would still be the only option if you'd want 'local-guaranteed' events, ie local events that would not get lost even in case of a crash. At the moment there are no solutions for this - local events just get lost. I think we could have three different event types (local-filtered, local-guaranteed-filtered, external-filtered). Cheers, Stefan
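The 'local-only' observer described above can be sketched like this. As the thread notes, external changes are spottable because they arrive with commitInfo == null; the interfaces here are simplified stand-ins for illustration, not Oak's actual Observer/CommitInfo types:

```java
import java.util.ArrayList;
import java.util.List;

// simplified stand-in for Oak's CommitInfo
interface CommitInfo {
    String getUserId();
}

public class LocalOnlyObserver {
    final List<String> seenUserIds = new ArrayList<>();

    public void contentChanged(Object root, CommitInfo info) {
        if (info == null) {
            // external change (merged in by the background reader):
            // no user id available, and possibly collapsed - ignore it
            return;
        }
        // local commit: full commit info including user id is available
        seenUserIds.add(info.getUserId());
    }
}
```

An observer like this never pays for external events, which is exactly the property that matters when scaling to a large number of cluster instances.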
DocumentNodeStore background read/update operations synchronized?
Hi, Just realized that DocumentNodeStore background read and update operations are synchronized, which basically makes them execute sequentially and somewhat works against OAK-2624. @Marcel, @Chetan, wdyt, do they have to be synchronized? Could this not be a bottleneck concurrency-wise? Cheers, Stefan
[discovery] Introducing a simple mongo-based discovery-light service (to circumvent mongoMk's eventual consistency delays)
Hi, Pls note a suggestion of a new 'discovery-light' API in OAK-2844. Would appreciate comments and reviews from this list. Thanks, Cheers, Stefan
Re: Efficiently process observation event for local changes
Related to this, I've created https://issues.apache.org/jira/browse/OAK-2683 which is about an issue that happens when the observation queue limit is reached. Cheers, Stefan On 3/23/15 4:03 PM, Chetan Mehrotra chetan.mehro...@gmail.com wrote: After discussing this further with Marcel and Michael we came to conclusion that we can achieve similar performance by make use of persistent cache for storing the diff. This would require slight change in way we interpret the diff JSOP. This should not require any change in current logic related to observation event generation. Opened OAK-2669 to track that. One thing that we might still want to do is to use separate queue size for listeners interested in local events only and those which can work with external event. On a system like AEM there 180 listeners which listen for external changes and ~20 which only listen to local changes. So makes sense to have bigger queues for such listners Chetan Mehrotra On Mon, Mar 23, 2015 at 4:09 PM, Michael Dürig mdue...@apache.org wrote: On 23.3.15 11:03 , Stefan Egli wrote: Going one step further we could also discuss to completely moving the handling of the 'observation queues' to an actual messaging system. Whether this would be embedded to an oak instance or whether it would be shared between instances in an oak cluster might be a different question (the embedded variant would have less implication on the overall oak model, esp also timing-wise). But the observation model quite exactly matches the publish-subscribe semantics - it actually matches pub-sub more than it fits into the 'cache semantics' to me. Definitely something to try out, given someone find the time for it. ;-) Mind you that some time ago I implemented persisting events to Apache Kafka [1], which wasn't greeted with great enthusiasm though... OTOH the same concern regarding pushing the bottleneck to IO applies here. 
Furthermore, filtering the persisted events through access control is something we still need to figure out, as AC is a) session scoped and b) depends on the tree hierarchy. Michael [1] https://github.com/mduerig/oak-kafka .. just saying ..

On 3/23/15 10:47 AM, Michael Dürig mdue...@apache.org wrote: On 23.3.15 5:04, Chetan Mehrotra wrote: B - Proposed Changes --- 1. Move the notion of listening to local events to the Observer level - so upon any new change detected, we only push the change to a given queue if it is local and the bounded listener is only interested in local changes. Currently we push all changes, which later get filtered out; by avoiding that at the first level we keep the queue content limited to local changes only. I think there is no change needed in the Observer API itself, as you can already figure out from the passed CommitInfo whether a commit is external or not. BTW, please be careful with the term 'local', as there is also the concept of session-local commits. 2. Attach the calculated diff as part of the commit info which is attached to the given change. This would eliminate the chance of a cache miss altogether and ensure observation is not delayed due to slow processing of the diff. This can be done on a best-effort basis: if the diff is too large we do not attach it, and in that case we diff again. 3. For listeners which are only interested in local events we can use a different queue size limit, i.e. allow larger queues for such listeners. Later we can also look into using a journal (or persistent queue) for local event processing. Definitely something to try out. A few points to consider: * There doesn't seem to be much of a difference to me whether this is routed via a cache or directly attached to commits. Either way it adds additional memory requirements and churn, which need to be managed. * When introducing persisted queuing we need to be careful not to just move the bottleneck to IO.
* An eventual implementation should not break the fundamental design. Either hide it in the implementation or find a clean way to put this into the overall design. Michael
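The filtering idea in point 1 above can be sketched in plain Java. This is only an illustration with hypothetical names (`LocalEventQueue`, a stand-in `CommitInfo`), not the actual Oak Observer API: external commits are rejected before they ever reach the bounded per-listener queue, so the queue limit is spent on local changes only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a bounded observation queue that accepts only
// local commits when the listener is not interested in external ones.
// 'CommitInfo' here is a stand-in for Oak's commit metadata.
class LocalEventQueue {
    record CommitInfo(String path, boolean external) {}

    private final BlockingQueue<CommitInfo> queue;
    private final boolean localOnly;

    LocalEventQueue(int capacity, boolean localOnly) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.localOnly = localOnly;
    }

    /** Returns true if enqueued; false if filtered out or if the queue is full. */
    boolean offer(CommitInfo info) {
        if (localOnly && info.external()) {
            return false; // filtered before enqueueing, keeping the queue small
        }
        return queue.offer(info); // false on overflow = queue limit reached
    }

    int size() {
        return queue.size();
    }
}
```

With filtering done at enqueue time, a local-only listener's queue can stay small even on a cluster node receiving many external changes, which is the point of giving local-only listeners their own (or larger) queues.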
Re: Efficiently process observation event for local changes
Going one step further, we could also discuss completely moving the handling of the 'observation queues' to an actual messaging system. Whether this would be embedded in an oak instance or shared between instances in an oak cluster might be a different question (the embedded variant would have fewer implications on the overall oak model, esp. timing-wise). But the observation model quite exactly matches publish-subscribe semantics - to me it actually matches pub-sub better than it fits the 'cache semantics'. .. just saying ..

On 3/23/15 10:47 AM, Michael Dürig mdue...@apache.org wrote: On 23.3.15 5:04, Chetan Mehrotra wrote: B - Proposed Changes --- 1. Move the notion of listening to local events to the Observer level - so upon any new change detected, we only push the change to a given queue if it is local and the bounded listener is only interested in local changes. Currently we push all changes, which later get filtered out; by avoiding that at the first level we keep the queue content limited to local changes only. I think there is no change needed in the Observer API itself, as you can already figure out from the passed CommitInfo whether a commit is external or not. BTW, please be careful with the term 'local', as there is also the concept of session-local commits. 2. Attach the calculated diff as part of the commit info which is attached to the given change. This would eliminate the chance of a cache miss altogether and ensure observation is not delayed due to slow processing of the diff. This can be done on a best-effort basis: if the diff is too large we do not attach it, and in that case we diff again. 3. For listeners which are only interested in local events we can use a different queue size limit, i.e. allow larger queues for such listeners. Later we can also look into using a journal (or persistent queue) for local event processing. Definitely something to try out.
A few points to consider: * There doesn't seem to be much of a difference to me whether this is routed via a cache or directly attached to commits. Either way it adds additional memory requirements and churn, which need to be managed. * When introducing persisted queuing we need to be careful not to just move the bottleneck to IO. * An eventual implementation should not break the fundamental design. Either hide it in the implementation or find a clean way to put this into the overall design. Michael
Re: [segment] offline compaction broken?
Hi Alex, There's only 1 checkpoint, so that looks good. I still see the same: oak-run 1.0.8 compacts fine, but the latest trunk instead starts filling up tar file after tar file (tested with Java 1.7 against a segment store repo that was created with Oak 1.1.4). Cheers, Stefan

On 1/26/15 7:13 PM, Alex Parvulescu alex.parvule...@gmail.com wrote: Hi Stefan, Offline compaction should work properly. Can you quickly check the number of checkpoints? alex

On Mon, Jan 26, 2015 at 6:12 PM, Stefan Egli stefane...@apache.org wrote: Hi, Before I dig too deep: I built the latest trunk and tried to run offline compaction, but I see a weird behavior where oak-run starts filling one tar file after the other, seemingly growing endlessly. Is this known, or is it only me? Cheers, Stefan
Re: [segment] offline compaction broken?
It looks like no compaction strategy is set in oak-run. Created https://issues.apache.org/jira/browse/OAK-2449 Cheers, Stefan

On 1/27/15 9:58 AM, Stefan Egli e...@adobe.com wrote: Hi Alex, There's only 1 checkpoint, so that looks good. I still see the same: oak-run 1.0.8 compacts fine, but the latest trunk instead starts filling up tar file after tar file (tested with Java 1.7 against a segment store repo that was created with Oak 1.1.4). Cheers, Stefan

On 1/26/15 7:13 PM, Alex Parvulescu alex.parvule...@gmail.com wrote: Hi Stefan, Offline compaction should work properly. Can you quickly check the number of checkpoints? alex

On Mon, Jan 26, 2015 at 6:12 PM, Stefan Egli stefane...@apache.org wrote: Hi, Before I dig too deep: I built the latest trunk and tried to run offline compaction, but I see a weird behavior where oak-run starts filling one tar file after the other, seemingly growing endlessly. Is this known, or is it only me? Cheers, Stefan
[segment] offline compaction broken?
Hi, Before I dig too deep: I built the latest trunk and tried to run offline compaction, but I see a weird behavior where oak-run starts filling one tar file after the other, seemingly growing endlessly. Is this known, or is it only me? Cheers, Stefan
Re: Scalability of JCR observation
Hi,

On 4/16/13 4:26 PM, Dominik Süß dominik.su...@gmail.com wrote: I see some overlap with the latest work of Carsten in Sling regarding the Discovery API [0]. Since Sling typically should work upon JCR / Oak, it might be good not to follow different patterns. For a combined solution I do think it would be great to have one pluggable mediating system instead of two, which might have strange side effects for rejoin scenarios in a cluster.

+1. If there was a JMS/messaging client available in oak (pluggable) that an implementation of the discovery.api (at the sling level..) could reuse, that would definitely result in a more reliable 'cluster view' than having separate mechanisms. How the 'cross cluster' aspect of the discovery's topology would be implemented in that case is yet another question, but I suppose it could just as well use JMS cross-cluster... Cheers, Stefan

Just my 2 cents Dominik [0] http://markmail.org/thread/w3kgl7jxvhki3oqj

On Tue, Apr 16, 2013 at 11:51 AM, Michael Dürig mdue...@apache.org wrote: On 15.4.13 9:46, Julian Reschke wrote: On 2013-04-15 10:32, Bertrand Delacretaz wrote: So I'm wondering if using an existing distributed message queue service (ActiveMQ/RabbitMQ etc.) would help implement this. IIUC this is only a problem in very large Oak setups, so having to install additional components might not be an issue. Could that also help with implementing proper JCR Locking (or are we there already???). Probably. The idea of making external coordinators pluggable has come up before: https://issues.apache.org/jira/browse/OAK-150?focusedCommentId=13401328&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13401328 Michael
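The single 'pluggable mediating system' idea above can be sketched as a minimal in-memory mediator: one component owns the cluster view and both Oak-level and Sling-level (discovery) consumers subscribe to it. All names here are hypothetical illustrations, not the real Sling Discovery API or any Oak interface; a real backend (JMS, a message broker, etc.) would call `publishView`.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of a pluggable 'cluster view' mediator: a single
// source of truth for cluster membership that multiple layers subscribe
// to, instead of each layer running its own discovery mechanism.
class ClusterViewMediator {
    interface ViewListener {
        void viewChanged(Set<String> instanceIds);
    }

    private final List<ViewListener> listeners = new CopyOnWriteArrayList<>();
    private volatile Set<String> currentView = Set.of();

    void subscribe(ViewListener listener) {
        listeners.add(listener);
        listener.viewChanged(currentView); // deliver the current view on join
    }

    // Called by whatever backend provides membership (e.g. a message broker).
    void publishView(Set<String> instanceIds) {
        currentView = instanceIds;
        for (ViewListener l : listeners) {
            l.viewChanged(instanceIds);
        }
    }
}
```

Because every consumer sees the same published view, the rejoin inconsistencies that worried Dominik (two discovery mechanisms disagreeing about membership) cannot arise by construction.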