Re: fixVersion

2020-07-30 Thread Stefan Egli

Hi Julian,

Thx for that. That was indeed an unlucky typo on my side.

Cheers,
Stefan


On 30.07.20 10:00, Julian Reschke wrote:

Hi,

please be careful when setting fixVersion in Jira. I just fixed a few
recently resolved tickets where a change in trunk was advertised as fixing
1.2.32 instead of 1.34.0, as it should be.

Best regards, Julian



Re: [DISCUSS] Branching and release: version numbers

2019-09-27 Thread Stefan Egli

+1

Cheers,
Stefan

On 27.09.19 11:40, Julian Reschke wrote:

On 04.03.2019 14:29, Davide Giannella wrote:

...


Picking up an old thread...

So we've released 1.12.0, 1.14.0, 1.16.0, and will release 1.18.0 next 
week.


What we apparently did not discuss is what the project version for trunk
should be in the meantime.

So far, we've been using 1.12-SNAPSHOT, etc., and we are on 1.20-SNAPSHOT
right now.

This however seems incorrect to me; shouldn't it be 1.19-SNAPSHOT?

For this release I'd like to avoid any changes, but for future releases
I'd like to document that we're using an odd-numbered version.

Feedback appreciated,

Julian



Intent to backport OAK-8351

2019-06-25 Thread Stefan Egli

Hi,

I'd like to backport OAK-8351 [0] to the 1.8 and 1.10 branches unless 
someone objects. OAK-8351 changes a MongoDB query that was introduced in 
this form in 1.8.


Cheers,
Stefan
--
[0] https://issues.apache.org/jira/browse/OAK-8351


Re: Intent to backport OAK-6953

2017-11-20 Thread Stefan Egli
+1

Cheers,
Stefan

On 20.11.17, 09:24, "Marcel Reutegger"  wrote:

>Hi,
>
>I'd like to backport OAK-6953 to the maintenance branches. In some cases,
>it is desirable to disable a cache, which is not possible with the
>current CacheLIRS implementation in Oak. Instead of changing the
>CacheLIRS implementation, OAK-6953 uses the Guava Cache implementation
>when the cache size is set to zero, which immediately evicts entries when
>loaded.
>
>Regards
> Marcel
>
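For reference, a minimal sketch of the Guava behaviour Marcel describes
(key and value types are arbitrary; this assumes a recent Guava version):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class DisabledCacheExample {
    public static void main(String[] args) {
        // With maximumSize(0), Guava evicts each entry as soon as it is
        // loaded, effectively disabling caching while keeping the Cache
        // API available to callers.
        Cache<String, String> disabled = CacheBuilder.newBuilder()
                .maximumSize(0)
                .build();
        disabled.put("key", "value");
        System.out.println(disabled.getIfPresent("key")); // prints null
    }
}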




Re: single node cluster

2017-08-02 Thread Stefan Egli
Hi Mostafa,

I'd suggest narrowing down why that lease update failed, especially if you
can reproduce it. By default a lease is updated every 10 seconds and is
valid for 2 minutes (in theory this can be changed, but that's not
necessarily recommended).

Besides the DB issues mentioned, other cases where lease updates failed
involved JVMs running low on memory and thus doing overly long
stop-the-world GCs.

If you can rule out both, then here are some more ideas to investigate:

a) check for warnings in the form of: "BackgroundLeaseUpdate.execute: time
since last renewClusterIdLease() call longer than expected" to see if the
lease update became slow already before it finally expired. Perhaps that
gives some clues already.

b) enable trace logging for
'org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo' to see all
details about lease updates happening (or not) - a sketch of how to enable
this follows after this list.

c) analyse thread dumps to rule out blocked lease update thread
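
For b), a minimal sketch of enabling that logger programmatically, assuming
logback-classic is the slf4j binding (a plain logger entry in logback.xml
works just as well):

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class LeaseTraceLogging {
    public static void enable() {
        // The cast is only safe when logback-classic is the slf4j binding.
        Logger leaseLogger = (Logger) LoggerFactory.getLogger(
                "org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo");
        leaseLogger.setLevel(Level.TRACE);
    }
}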

Cheers,
Stefan

On 01/08/17 15:45, "Mostafa Mahdieh"  wrote:

>Hi,
>
>I'm using jackrabbit oak as the content repository of a document
>management
>system product. Currently there is no need to scale out, therefore I'm
>using jackrabbit oak in a single node environment. However, I'm
>experiencing issues related to clustering and lease time, such as the
>following exception which is appearing all over my tomcat logs:
>
>WARN: Background operation failed:
>org.apache.jackrabbit.oak.plugins.document.DocumentStoreException: This
>oak
>instance failed to update the lease in time and can therefore no longer
>access this DocumentNodeStore.
>
>After some research, it seems that there is no way to run jackrabbit oak
>as a single node without having any concerns related to
>clustering.
>
>Am I using the right tool? I thought maybe jackrabbit 2 might be better
>for
>my current use case, however oak seemed as the future of jackrabbit, and
>attracted me (adding scalability is also in my future vision). Do you
>suggest oak for my usecase or jackrabbit 2? How can I adapt oak for a
>single node environment without getting issues regarding lease time and
>clustering?
>
>Best Regards
>-- 
>Mostafa Mahdieh




Re: [discuss] expose way to detect "eventual consistency delay"

2017-05-30 Thread Stefan Egli
On 30/05/17 14:51, "Stefan Egli" <stefane...@apache.org> wrote:

>on how Oak could "expose a way to detect the eventual delay".

... "to detect the eventual consistency delay" ...

of course ...




[discuss] expose way to detect "eventual consistency delay"

2017-05-30 Thread Stefan Egli
Hi all,

I'd like to invite those interested to join a discussion in

https://issues.apache.org/jira/browse/OAK-6276

on how Oak could "expose a way to detect the eventual delay".

This is a requirement coming from the integration with an external messaging
system in an Oak-based application.

One way suggested so far is that this could simply be done by exposing a
"normalized head revision vector" via a repository descriptor.

But let's discuss over in OAK-6276.

Thanks,
Cheers,
Stefan




Re: MongoMK failover behaviour.

2017-05-05 Thread Stefan Egli
Hi,

On 04/05/17 16:56, "Justin Edelson"  wrote:

>>Hmm, depending on the Oak version, this may also be caused by OAK-5528.
>> The current fix versions are 1.4.15 and 1.6.0.
>>
>
>Would this show up in thread dumps? Based on the description, it seems
>like
>it should.

Not necessarily. In OAK-5528 the lease update thread goes into
performLeaseCheck, which does a 5x1sec retry loop. So if the thread dump
is taken during that time one would see it; if taken afterwards, not.

Cheers,
Stefan




Re: ObservationTest with Thread.sleep()

2017-04-25 Thread Stefan Egli
Hi Marcel,

IIUC the sleeps are used to check for expected *and* unexpected
events. The expected part could easily be replaced with a busy-check loop.
The unexpected part is a bit more tricky, but the test could be
rewritten as more of a white-box test where not only both ends are
tested but also the middle (observation queue) part; that would work.

So I guess yes, the sleeps could be avoided - with a bit of effort though.
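
A minimal sketch of the busy-check idea for the 'expected' part (the names
are illustrative; the real test would poll whatever condition it currently
sleeps for):

import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class BusyCheck {
    // Polls the condition until it holds or the timeout elapses, instead
    // of sleeping for a fixed, worst-case interval.
    public static void waitUntil(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeoutException(
                        "condition not met within " + timeoutMillis + "ms");
            }
            Thread.sleep(10); // short poll instead of one long sleep
        }
    }
}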

Cheers,
Stefan

On 25/04/17 10:56, "Marcel Reutegger"  wrote:

>Hi,
>
>there is a test in oak-jcr
>(org.apache.jackrabbit.oak.jcr.observation.ObservationTest) with many
>Thread.sleep() calls. This means, the test mostly sleeps and slows down
>the build. What's the reason for those sleeps and can we somehow remove
>them?
>
>Regards
>  Marcel




Re: [Observation] Should listeners require constant inflow of commits to get all events?

2017-02-21 Thread Stefan Egli
>>
>>But agreed, this is a bug and we should fix it.
>>
>Actually, I'm not too sure as long as we concretely document the
>behavior and potentially have a sample abstract
>commit-creator/listener which does the job well (may be similar to the
>hack I used)

I've created OAK-5740 and attached a test case that reproduces this. We can
follow up there if/when/how we want to fix this.

Cheers,
Stefan




Re: ChangeProcessor potentially warns only once for queue being full during its lifetime (without CommitRateLimiter)

2017-02-10 Thread Stefan Egli
+1, looks like a bug to me.

Cheers,
Stefan

On 09/02/17 23:17, "Vikas Saurabh"  wrote:

>Hi,
>
>_Disclaimer_ : I get confused with change processor code, so not sure
>if this is an issue or PEBKAC
>
>ChangeProcessor#queueSizeChanged sets the blocking flag to true if the
>queue size limit is hit (or exceeded). The warning "Revision queue is
>full. Further revisions will be compacted." is logged only when it
>*wasn't* already blocking.
>
>BUT, when the queue empties, the blocking flag is reset only inside the
>if block for commitRateLimiter != null. That, to me, seems like
>qFull->log->qEmpties->qFull won't log another warning. This sounds wrong
>to me.
>
>Thanks,
>Vikas
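
A simplified sketch of the flow described above (an illustration of the
reported logic, not the actual ChangeProcessor source):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class QueueFullWarningSketch {
    private static final Logger log =
            LoggerFactory.getLogger(QueueFullWarningSketch.class);

    private final int maxQueueSize = 1000;         // illustrative limit
    private final Object commitRateLimiter = null; // stand-in: null when none is set
    private boolean blocking;

    void queueSizeChanged(int queueSize) {
        if (queueSize >= maxQueueSize) {
            if (!blocking) {
                log.warn("Revision queue is full. Further revisions will be compacted.");
            }
            blocking = true;
        } else if (commitRateLimiter != null) {
            // The flag is reset only on this branch, i.e. never when
            // commitRateLimiter is null - so qFull -> log -> qEmpties ->
            // qFull logs no second warning, which is the suspected bug.
            blocking = false;
        }
    }
}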




Re: incomplete diffManyChildren during a persisted branch merge

2017-02-01 Thread Stefan Egli
On 31/01/17 18:07, "Stefan Egli" <stefane...@apache.org> wrote:

>I'm following up on a failure case in oak 1.2.14 where, as part of a
>persisted branch merge, commit hooks do not propagate through all affected
>changes, resulting in an inconsistent state.

>https://issues.apache.org/jira/browse/OAK-5557

I believe the problem is indeed related to a rebase that happens before
merging a persisted branch. The diffManyChildren subsequently takes the
rebased revision timestamp as the minValue, instead of taking the branch's
previous purges into account. This seems to (only) occur when another
session performs a merge between the last purge and the actual merge.

One possible fix I see is to detect such a situation (in diffManyChildren)
- i.e. check if one of the revisions is a branch revision - and then fall
back to not using the _modified index. This will definitely find all
potential child nodes, but it has the downside that it becomes slow and
doesn't scale well with a very large list of child nodes.

Other ideas?

Cheers,
Stefan




Re: incomplete diffManyChildren during a persisted branch merge

2017-02-01 Thread Stefan Egli
On 01/02/17 09:16, "Marcel Reutegger"  wrote:

>I think in trunk the code path is also a bit different because of
>OAK-4528. It may be possible that the issue still exists in trunk, but
>does not call diffManyChildren() anymore.
>
>What happens when you disable the journal diff mechanism in trunk with
>-Doak.disableJournalDiff=true ?

Good idea; however, that alone doesn't let the test fail yet, as both the
local diff cache and the node children cache prevent diffManyChildren
from being used.

But if I use brute force and bypass those two caches explicitly - and at
the same time increase the test size by increasing the number of nodes -
then the test fails on trunk too.

So indeed trunk seems to avoid this problem as it doesn't go into
diffManyChildren for the cases triggered by the test.

Cheers,
Stefan




incomplete diffManyChildren during a persisted branch merge

2017-01-31 Thread Stefan Egli
Hi,

I'm following up on a failure case in oak 1.2.14 where, as part of a persisted
branch merge, commit hooks do not propagate through all affected changes,
resulting in an inconsistent state. It's unclear how realistic this scenario
is and/or whether it's relevant, but I was able to produce it in a
test case.

The interesting thing is that it's quite easily reproducible in 1.2.14, whereas
later in the 1.2 branch it takes longer for the test (which loops until it
fails) to fail. Also, it doesn't fail in trunk even after eg 500 iterations.

Does this ring a bell with anyone - diffManyChildren / wrong _modified
calculation / branch - perhaps this was fixed in trunk a while ago and not
backported?

Cheers,
Stefan
--
https://issues.apache.org/jira/browse/OAK-5557




Re: Detecting if setup is a cluster or a single node via repository Descriptors

2016-11-15 Thread Stefan Egli
Hi Chetan,

I think the discoverylite and the new 'clustered' property options have
different characteristics. The former describes the current status of the
cluster, irrespective of whether it can be clustered at all, while the
latter is about a capability: whether the node store supports clustering or
not. And assuming that you're after the capability 'cluster support'
alone, I think handling this separately is indeed more appropriate.

Cheers,
Stefan

On 15/11/16 10:37, "Chetan Mehrotra"  wrote:

>Hi Team,
>
>For OAK-2108: killing a cluster node may stop async index updates for up
>to 30 minutes.
>
>One possible fix can be that AsyncIndexUpdate can determine if the
>repository is part of a cluster or is a single instance. In case it's a
>single instance we can reduce the timeout, as it's known that there are
>no other processes involved.
>
>Currently for SegmentNodeStore a Descriptor with name
>'oak.discoverylite.clusterview' is registered whose value is as below
>
>---
>{"seq":1,"final":true,"me":1,"id":"80a1fb91-83bc-4eac-b855-53d7b8a04092","
>active":[1],"deactivating":[],"inactive":[]}
>---
>
>AsyncIndexerService can get access to 'Descriptors' and look for that
>key and check if 'active' is 1.
>
>However there should be a better way to detect this. Can we have an
>explicit descriptor defined, say OAK_CLUSTERED, having a boolean value? A
>false means it's not a cluster, while true means it "might" be part of a
>cluster.
>
>Thoughts?
>
>Chetan Mehrotra
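
As an illustration of the workaround sketched above, a hedged example of
reading that descriptor through the JCR API and checking the 'active'
array (the javax.json parsing is an arbitrary choice for the sketch):

import java.io.StringReader;
import javax.jcr.Repository;
import javax.json.Json;
import javax.json.JsonObject;

public class ClusterViewProbe {
    // Returns true if the discovery-lite view reports exactly one active
    // instance; false when the descriptor is absent altogether.
    public static boolean isSingleActiveInstance(Repository repository) {
        String view = repository.getDescriptor("oak.discoverylite.clusterview");
        if (view == null) {
            return false;
        }
        JsonObject json = Json.createReader(new StringReader(view)).readObject();
        return json.getJsonArray("active").size() == 1;
    }
}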




Re: [REVIEW] OAK-4908 in 1.5.13: prefiltering (enabled by default)

2016-11-07 Thread Stefan Egli
FYI: Assuming lazy consensus I've now committed this one to unblock
1.5.13. We can do a post-review if needed.

Cheers,
Stefan

On 04/11/16 15:59, "Stefan Egli" <stefane...@apache.org> wrote:

>Hi,
>
>I'd like to commit OAK-4908 which would introduce prefiltering for
>observation listeners. This is based on OAK-4907 (population of a
>ChangeSet
>during the commit) and OAK-4916 (FilteringObserver-wrapper for the
>BackgroundObserver) - and it works fine with the new filters (OAK-5019-23)
>too.
>
>The reason I raise this on the list is that this is quite a change, and it
>would thus be good if there was agreement that we want this in for 1.5.13
>(Monday). I know it's a bit of a tight schedule, but I think it would be
>good to have it in to allow for more testing in real-life scenarios. I've
>thus marked it a blocker for 1.5.13. If you disagree, pls let me know.
>
>Wdyt?
>
>Cheers,
>Stefan
>
>




Re: [REVIEW][API] Additions to JackrabbitEventFilter

2016-10-31 Thread Stefan Egli
(+oak-dev as it has meanwhile moved to oak features alone)

I've committed OAK-5013 which introduces the OakEventFilter extension and
a number of such extensions (OAK-5019, OAK-5020, OAK-5021, OAK-5022,
OAK-5023).

While they should all work in principle, I don't consider them done yet,
as the test coverage is minimal and there's room for code(-style)
improvement.

But the point of this heads-up is the API of OakEventFilter, which
should ideally not have to change anymore - so if you're interested, pls
have a look.

Cheers,
Stefan


On 26/10/16 19:09, "Stefan Egli" <stefane...@apache.org> wrote:

>On 26/10/16 16:48, "Michael Dürig" <mdue...@apache.org> wrote:
>
>>... Just ensure we expose the required
>>functionality on the Oak side as a proper API. That is, interface and
>>utility only and proper package versioning...
>
>Opened OAK-5013 for that which is just about the API.
>(it's a beautified version of the previous patches)
>
>Cheers,
>Stefan
>
>




Re: globbing: oak style vs sling style

2016-10-31 Thread Stefan Egli
I've created 

https://issues.apache.org/jira/browse/OAK-5039

to follow up

Cheers,
Stefan

On 31/10/16 14:18, "Stefan Egli" <stefane...@apache.org> wrote:

>Hi,
>
>As discussed in [0] in OAK-5021, globbing is currently defined in 2
>different ways in Oak vs in Sling. In Oak, globbing is restricted to **
>matching 0-n path elements and * matching exactly 1 path element, while
>in Sling it is more generic in that * means 0-n characters excluding path
>boundaries.
>
>IIUC then the GlobbingPathFilter is basically where Oak implements this
>and
>it looks like this is not yet exposed, as that's internal to observation
>filtering only.
>
>So my suggestion would be to simply extend the GlobbingPathFilter's
>globbing
>definition to match that of Sling.
>
>Any objections?
>
>Cheers,
>Stefan
>--
>[0] - 
>https://issues.apache.org/jira/browse/OAK-5021?focusedCommentId=15622005&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15622005
>[1] - 
>https://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/plugins/observation/filter/GlobbingPathFilter.html
>
>
>




globbing: oak style vs sling style

2016-10-31 Thread Stefan Egli
Hi,

As discussed in [0] in OAK-5021, globbing is currently defined in 2 different
ways in Oak vs in Sling. In Oak, globbing is restricted to ** matching 0-n
path elements and * matching exactly 1 path element, while in Sling it is
more generic in that * means 0-n characters excluding path boundaries.
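
To make the difference concrete, a small sketch that emulates the
Sling-style semantics via a regex (an illustration only - not Sling's
actual implementation, and the escaping is deliberately minimal):

public class SlingStyleGlob {
    // '*' spans 0-n characters but never crosses a '/' path boundary.
    public static boolean matches(String glob, String path) {
        String regex = glob
                .replace(".", "\\.")     // minimal escaping for the sketch
                .replace("*", "[^/]*");  // '*' = any characters except '/'
        return path.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matches("/content/a*", "/content/abc")); // true
        System.out.println(matches("/content/a*", "/content/a/b")); // false
        // Oak-style, by contrast, treats '*' as exactly one path element
        // and '**' as 0-n path elements.
    }
}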

IIUC the GlobbingPathFilter is basically where Oak implements this, and
it looks like it is not yet exposed, as that's internal to observation
filtering only.

So my suggestion would be to simply extend the GlobbingPathFilter's globbing
definition to match that of Sling.

Any objections?

Cheers,
Stefan
--
[0] - 
https://issues.apache.org/jira/browse/OAK-5021?focusedCommentId=15622005&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15622005
[1] - 
https://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/plugins/observation/filter/GlobbingPathFilter.html





Re: [observation] more options in JackrabbitEventFilter

2016-09-13 Thread Stefan Egli
On 13/09/16 15:27, "Davide Giannella" <dav...@apache.org> wrote:

>On 12/09/2016 09:48, Stefan Egli wrote:
>> IIUC then EventListeners are registered via either JCR's
>> ObservationManager or Jackrabbit's extension at [0]. If you want to do
>> this in Oak (ie not in Jackrabbit) then would you extend Oak's
>> Observationmanager ([1]) directly?
>Didn't look at the code and didn't think all the implications.
>
>Would it be an option to expose,
>javax.jcr.oak.observation.OakObservationManager that extends
>javax.jcr.observation.ObservationManager in which we expose what need?

Right, there are probably two options:

# (oak) add another variant of addEventListener to OakObservationManager
([0])
# (jackrabbit) integrate that into the JackrabbitEventFilter ([1])

I guess it comes down to API design and cleanliness; I don't have any
preference.

Cheers,
Stefan
--
[0] - 
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/observation/ObservationManagerImpl.java#L179
[1] - 
https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-api/src/main/java/org/apache/jackrabbit/api/observation/JackrabbitEventFilter.java




Re: [observation] more options in JackrabbitEventFilter

2016-09-12 Thread Stefan Egli
Hi Davide,

On 08/09/16 14:24, "Davide Giannella"  wrote:

>On 07/09/2016 14:04, Michael Dürig wrote:
>> No not open them. But make their functionality available through an
>> API. Since JCR is dead (hint hint) we probably have to come up with an
>> ad-hoc API here.
>FWIW, I'm for exposing this aspect as Oak API.

Would be fine for me; however, how would you do that?

IIUC then EventListeners are registered via either JCR's
ObservationManager or Jackrabbit's extension at [0]. If you want to do
this in Oak (ie not in Jackrabbit) then would you extend Oak's
Observationmanager ([1]) directly?


Cheers,
Stefan
--
[0] - 
https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-api/src/main/java/org/apache/jackrabbit/api/observation/JackrabbitObservationManager.java#L26
[1] - 
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/observation/ObservationManagerImpl.java

>
>Then in Oak we implement a few Filters for the already existing mechanism,
>so that the jcr layer can map the JCR API onto the Oak API.
>
>An application that needs to have complex filtering, will have to
>leverage the Oak API.
>
>Don't know whether it will be possible for an application to leverage
>*both* JCR and Oak APIs but I'm sure there are ways around it.
>
>Cheers
>Davide
>
>




Re: [observation] pure internal or external listeners

2016-09-02 Thread Stefan Egli
On 02/09/16 13:41, "Stefan Egli" <stefane...@apache.org> wrote:

>On 02/09/16 13:26, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote:
>
>>Listener for local Change
>>--
>>
>>Such a listener is more particular about type of change and is doing
>>some persisted state change i.e. like registering a job, invoking some
>>third party service to update the value. This listener is only
>>interested in local as it know same listener is also active on other
>>cluster node (homogeneous cluster setup) so if a node gets added it
>>only need to react on the cluster node where it got added.
>
>One thing this reminds me of is a use-case where you have, say, 3 cluster
>nodes, each one handling mainly local events. All fine. Then 1
>node crashes while its (local) observation queue likely wasn't entirely
>empty. Those events would then probably not get handled by anyone (and
>that node wouldn't necessarily be restarted as the cluster continues
>normally; it could be restarted with a new clusterNodeId..). So maybe
>there's an issue there.

I think this should be handled the same as today with (non-journaled)
listeners losing events on any crash: either upon restart or when an
instance leaves the cluster (which can be noticed eg via Sling's Discovery
API), someone (preferably the leader) should handle this and do a
repository scan of whatever interesting things the crashed instance might
have stored. Lacking journaled observation, that's probably the way to go.

Cheers,
Stefan




Re: [observation] pure internal or external listeners

2016-09-02 Thread Stefan Egli
Hi Chetan,

(see below)

On 02/09/16 13:26, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote:

>On Fri, Sep 2, 2016 at 4:00 PM, Stefan Egli <stefane...@apache.org> wrote:
>> If we
>> separate listeners into purely internal vs external, then a queue as a
>>whole
>> is either purely internal or external and we no longer have this issue.
>
>Not sure here on how this would work. The observation queue is made up
>of ContentChange which is a tuple of [root NodeState , CommitInfo
>(null for external)]
>
>--- NS1-L --- NS2-L --- NS3 --- NS4-L --- NS5-L --- NS6-L
>
>    a         /a/b      /a/c    /a/c
>              /a/b              /a/b
>                                /a/d
>
>So if we dedicate a queue for local changes only what would happen.
>
>If we drop NS3 then while diffing [NS2-L, NS4-L] /a/c would be
>reported as "added" and "local". Now we have a listener which listens
>for locally added nt:file node such it can start some processing job
>for it. Such a listener would then think its a locally added node and
>would start a duplicate job

Good point. We could probably fix this, though, by storing not only 1 root
NodeState per ContentChange but 2: a 'from' and a 'to' (the 'from'
is currently implicit, as it's taken from the previous entry, but if we
skip entries, then it needs to be re-added). With that, we could safely
drop external changes as 'uninteresting' and diffing would still report the
correct thing.
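
A minimal sketch of that 'from/to' idea (the simplified shape and names
are illustrative, not the actual BackgroundObserver code):

import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
import org.apache.jackrabbit.oak.spi.state.NodeState;

// With an explicit 'from', each queue entry is self-describing, so
// entries can be skipped (eg external changes dropped) and diffing the
// remaining entries still reports the right thing.
final class ContentChangeSketch {
    final NodeState from;  // today implicit: the previous entry's root
    final NodeState to;
    final CommitInfo info; // null for external commits

    ContentChangeSketch(NodeState from, NodeState to, CommitInfo info) {
        this.from = from;
        this.to = to;
        this.info = info;
    }
}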

>
>In general I believe
>
>Listener for external Change
>--
>listener which are listening for external changes are maintaining some
>state and purge/refresh it upon detecting change in interested paths.
>They would work fine if multiple content change occurrences are merged
>
>[NS4-L, NS5-L] + [NS5-L,NS6-L] = [NS4, NS6] (external) as they would
>still detect the change
>
>An example of this is LuceneIndexObserver which sets queue size to 5
>and does not care its local or not. It just interested in if index
>node is updated
>
>Listener for local Change
>--
>
>Such a listener is more particular about type of change and is doing
>some persisted state change i.e. like registering a job, invoking some
>third party service to update the value. This listener is only
>interested in local as it know same listener is also active on other
>cluster node (homogeneous cluster setup) so if a node gets added it
>only need to react on the cluster node where it got added.

One thing this reminds me of is a use-case where you have, say, 3 cluster
nodes, each one handling mainly local events. All fine. Then 1
node crashes while its (local) observation queue likely wasn't entirely
empty. Those events would then probably not get handled by anyone (and
that node wouldn't necessarily be restarted as the cluster continues
normally; it could be restarted with a new clusterNodeId..). So maybe
there's an issue there.

>
>So for such it needs to be ensured that mixed content changes are not
>compacted. So its fine to
>
>[NS4-L, NS5-L] + [NS5-L,NS6-L] = [NS4, NS6] (can be treated as
>local with loss of user identity which caused the change)
>[NS2-L, NS3]+ [NS3, NS4-L] = [NS2-L, NS4-L] (cannot be treated as
>local)

I think keeping the 'from/to' tuple instead of just 1 root NodeState would
make the above picture simpler.

Cheers,
Stefan

>
>Just thinking out loud here to understand the problem space better :)
>
>Chetan Mehrotra




Re: [observation] pure internal or external listeners

2016-09-02 Thread Stefan Egli
Perhaps for backwards compatibility we could auto-create 2 listeners for
the case where a listener is registered without ExcludeInternal or
ExcludeExternal - and issue a corresponding, loud WARN.

On 02/09/16 12:30, "Stefan Egli" <stefane...@apache.org> wrote:

>Hi,
>
>As you're probably aware there are currently several different issues
>being
>worked upon related to the observation queue limit problem ([0], epic
>[1]).
>I wanted to discuss yet another improvement and first ask what the list
>thinks.
>
>What about requiring observation listeners to either consume only internal
>or only external events, but never both together, we wouldn't support that
>anymore. (And if you're in a cluster you want to be very careful with
>consuming external events in the first place - but that's another topic)
>
>The root problem of the 'queue hitting the limit' as of today is that it
>throws away the CommitInfo, thus doesn't know anymore if it's an internal
>or
>an external event (besides actually loosing the CommitInfo details). If we
>separate listeners into purely internal vs external, then a queue as a
>whole
>is either purely internal or external and we no longer have this issue. We
>could continue to throw away the CommitInfo (or avoid that using a
>persisted
>obs queue ([2])), but we could then still say with certainty if it's an
>internal or an external event.
>
>A user that would want to receive both internal and external events could
>simply create two listeners. Those would both receive events as expected.
>The only difference would be that the two stream of events would not be in
>sync - but I doubt that this would be a big loss.
>
>We could thus introduce 'ExcludeInternal' and demand in
>ObservationManager.addEventListener that the listener is flagged with one
>of
>ExcludeInternal or ExcludeExternal.
>
>Wdyt?
>
>Cheers,
>Stefan
>--
>[0] - https://issues.apache.org/jira/browse/OAK-2683
>[1] - https://issues.apache.org/jira/browse/OAK-4614
>[2] - https://issues.apache.org/jira/browse/OAK-4581
>
>
>




[wip][review] persistent observation queue - OAK-4581

2016-08-31 Thread Stefan Egli
Hi,

As an FYI: I'm working on persisting the observation queue - OAK-4581 - and
have attached a patch and a comment [0] to the ticket indicating current
progress. I would welcome some early feedback/review.

The main idea is that it would introduce a 'PersistedBlockingQueue' that
would be plugged (as the 'queue') into the BackgroundObserver, which can
then remain largely unchanged. The whole logic of persisting is thus hidden
in the PersistedBlockingQueue.
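
In rough shape, that looks like the following (very much a sketch with
invented names - the actual patch is attached to OAK-4581):

// Exposes the usual blocking put/take contract, but entries live in a
// backing store instead of on the heap.
final class PersistedQueueSketch<E> {

    interface Backing<E> {      // assumed persistence abstraction, eg
        void append(E element); // built on a dedicated SegmentNodeStore
        E removeFirst();
        boolean isEmpty();
    }

    private final Backing<E> store;

    PersistedQueueSketch(Backing<E> store) {
        this.store = store;
    }

    synchronized void put(E element) {
        store.append(element);
        notifyAll();
    }

    synchronized E take() throws InterruptedException {
        while (store.isEmpty()) {
            wait();
        }
        return store.removeFirst();
    }
}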

PS1: The stored data would all be discarded on restart - this is just to
work around the 'limit' aspect of the current in-memory queue at runtime.
Nothing related to journaled observation.

PS2: The current v0 implementation is a bit dumb, early progress - no tests,
no generational-gc, not much batching/caching, but already uses a secondary
(or is that thirdary?) SegmentNodeStore just for storing ContentChange objs.

Thanks!
Cheers,
Stefan
--
[0] 
https://issues.apache.org/jira/browse/OAK-4581?focusedCommentId=15452460&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15452460




Re: [suggestion] introduce oak compatibility levels

2016-07-28 Thread Stefan Egli
Hi Michael,

On 28/07/16 10:54, "Michael Marth"  wrote:

>I think we should simply stick to SemVer of the released artefacts to
>signal those changes to upstream.

IIUC the difference would be that one version (eg oak 1.6) could contain
multiple compatibility versions (eg 1.2/1.4) - some perhaps marked as
deprecated - whereas using SemVer you'd have to have multiple versions of
oak concurrently in an OSGi stack (which is likely not going to work) to
achieve the same. Compatibility levels would be more flexible than SemVer.

>On the more specific topic of session behaviour: could we use session
>attributes to let the app specify session behaviour? [1]

Yes, that would work too.
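
A hedged sketch of that idea; the attribute name is invented for
illustration - JCR only specifies the transport (Credentials attributes
plus Session.getAttribute):

import javax.jcr.Repository;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

public class SessionBehaviourExample {
    public static Session loginWithCompatLevel(Repository repository)
            throws RepositoryException {
        SimpleCredentials creds =
                new SimpleCredentials("admin", "admin".toCharArray());
        // Invented attribute name: the app declares the behaviour it expects.
        creds.setAttribute("oak.compatibility.level", "1.2");
        Session session = repository.login(creds);
        // The repository could read it back and adapt its session behaviour:
        Object level = session.getAttribute("oak.compatibility.level");
        return session;
    }
}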

>[1] 
>https://docs.adobe.com/docs/en/spec/javax.jcr/javadocs/jcr-2.0/javax/jcr/Session.html#getAttribute(java.lang.String)

Cheers,
Stefan




Re: Requirements for multiple Oak clients on the same backend (was: [suggestion] introduce oak compatibility levels)

2016-07-28 Thread Stefan Egli
Don't have an answer, but there was a similar question recently on this
list:

"Does Oak core check the repository version ?"
http://markmail.org/thread/sbvjydwdu3g2eze5

Cheers,
Stefan

On 28/07/16 10:45, "Bertrand Delacretaz" <bdelacre...@apache.org> wrote:

>Hi,
>
>On Thu, Jul 28, 2016 at 10:23 AM, Stefan Egli <stefane...@apache.org>
>wrote:
>>...we could introduce a concept of
>> 'compatibility levels' which are a set of features/behaviours that a
>> particular oak version has and that application code relies upon
>
>Good timing, I have a related question about multiple client apps
>connecting to the same Oak backend.
>
>Say I have two Java apps A and B which use the same Oak/Mongo/BlobStore
>configuration, are there defined requirements as to the Oak library
>versions or other settings that A and B use?
>
>Do they need to use the exact same versions of the Oak bundles, and
>are violations to that or to other compatibility requirements
>detected?
>
>-Bertrand




Re: [suggestion] introduce oak compatibility levels

2016-07-28 Thread Stefan Egli
(typo)

On 28/07/16 10:23, "Stefan Egli" <stefane...@apache.org> wrote:

>One concrete case where this could have been useful is the
>backwards-compatible behaviour where a session is auto-refreshed when
>changes are done in another session.

.. in the same thread, that is ..




Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)

2016-07-19 Thread Stefan Egli
I'd go for #A to limit cross-effects between oak and other layers.

The reason one would want to use the default pool for #4 is probably the
idea that you'd want to avoid "wasting" a thread in the oak-thread-pool
and rather rely on a shared one. But arguably, that should be an
optimization of the thread pool provider itself: that provider could be
more intelligent and allocate threads from an under-used pool elsewhere -
if that were more performant.

But from a logical point of view, I'd argue it's better to have an
oak-dedicated thread-pool.

Cheers,
Stefan

On 19/07/16 10:06, "Chetan Mehrotra"  wrote:

>On Tue, Jul 19, 2016 at 1:21 PM, Michael Dürig  wrote:
>> Not sure as I'm confused by your description of that option. I don't
>> understand which of 1, 2, 3 and 4 would run in the "default pool" and
>>which
>> should run in its own dedicated pool.
>
>#1, #2 and #3 would run in a dedicated pool, each using the same pool.
>The pool name would be 'oak'. Also see OAK-4563 for the patch.
>
>While for #4 the default pool would be used, as those are non-blocking
>and short tasks.
>
>Chetan Mehrotra




Re: Requirement to support multiple NodeStore instance in same setup (OAK-4490)

2016-06-22 Thread Stefan Egli
On 22/06/16 12:21, "Chetan Mehrotra"  wrote:

>On Tue, Jun 21, 2016 at 4:52 PM, Julian Sedding 
>wrote:
>> Not exposing the secondary NodeStore in the service registry would be
>> backwards compatible. Introducing the "type" property potentially
>> breaks existing consumers, i.e. is not backwards compatible.
>
>I had a similar concern, so I proposed a new interface as part of OAK-4369.
>However, later with further discussion I realized that we might have a
>similar requirement going forward, i.e. the presence of multiple NodeStore
>impls, so it might be better to make the setup handle such a case.
>
>So at this stage we have 2 options
>
>1. Use a new interface to expose such "secondary" NodeStore
>2. OR Use a new service property to distinguish between different roles
>
>Not sure which one to go for. Maybe we go for a merged approach, i.e. have
>a new interface as in #1 but also mandate that it provides its "role/type"
>as a service property to allow clients to select the correct one.
>
>Thoughts?

If the 'SecondaryNodeStoreProvider' is a non-public interface which can
later 'easily' be replaced with another mechanism, then for me this would
sound more straightforward at this stage, as it would not break any
existing consumers (as mentioned by Julian).

Perhaps once those 'other use cases going forward' of multiple NodeStores
become clearer, it might be more obvious how the generalization into
perhaps a type property should look.

my 2cents,
Cheers,
Stefan




Re: [VOTE] Please vote for the final name of oak-segment-next

2016-04-26 Thread Stefan Egli
Hi,

On 26/04/16 14:00, "Thomas Mueller"  wrote:

>I would keep the "oak-segment-*" name, so that it's clear what it is based
>on. So:
>
>-1 oak-local-store
>-1 oak-embedded-store
>
>+1 oak-segment-*
>
>Within the oak-segment-* options, I don't have a preference.

+1

(I do like 'oak-segment-v2' though, so +1 to that too)

Cheers,
Stefan




Re: [discuss][scalability] oak:asyncConflictResolution

2016-03-22 Thread Stefan Egli
Hi,

On 21/03/16 21:23, "Michael Dürig"  wrote:
> There is org.apache.jackrabbit.oak.spi.commit.PartialConflictHandler and
> a couple of its implementations already. Maybe this could be leveraged
> here by somehow connecting it to the mix-ins you propose.

Yes, I think it should be something like a PartialConflictHandler that is
either configurable or customizable.

On 22/03/16 11:35, "Davide Giannella"  wrote:
> I'd go for the mixin, with a default chain/order of conflict resolution
> and allow to define such in a multivalue property. So that in case
> needed the user can define its own chain of conflict resolution, or even
> custom one if needed.

Right, sounds like a mixin rather than (just) a property would be more
appropriate.

Cheers,
Stefan




Re: [discuss][scalability] oak:asyncConflictResolution

2016-03-21 Thread Stefan Egli
On 21/03/16 21:03, "Stefan Egli" <stefane...@apache.org> wrote:

>...a third one could again be 'strict' (which would correspond to JCR
>semantics
>as are the default today) ..

actually that would not be possible asynchronously, scratch that..

Cheers,
Stefan




[discuss][scalability] oak:asyncConflictResolution

2016-03-21 Thread Stefan Egli
Hi oak-devs,

tl;dr: the suggestion is to introduce a new property (or mixin) that enables
async merge for a subtree in a cluster case while at the same time
pre-defining conflict resolution, since conflicts currently prevent
trouble-free async merging.

In case this has been discussed/suggested before, please point me to the
discussion, in case not, here's the suggestion:

When it comes to handling conflicts, we either deal with them in a
synchronous way (we throw a CommitFailedException right away) or have no
feasible/implemented solution for handling them asynchronously (we'd have
the possibility of leaving :conflict markers persisted, which would in
theory allow asynchronous merges, but so far we don't have anything built
on top of that).

In any case, for cluster scalability it's critical that we avoid
'synchronous' checks and instead switch to asynchronous merging wherever
possible: while for some parts of the content (eg '/var') it is always
necessary to have synchronous checks, the assumption is that other areas (eg
'/content') might well live with something asynchronous - as normally no
conflicts occur, and if they do, a predefined schema that then kicks in is fine.

And one way to tackle this would be to mark nodes (and thus implicitly its
subtree) in a way that says "from here on below it's ok to do asynchronous
conflict resolution of type X". Something that could be solved by
introducing an explicit marker in the form of eg a mixin or a property
'oak:asyncConflictResolution' (that could either refer to a globally defined
resolution or further detail 'how' that resolution should look like). If a
transaction would involve both normal as well as async conflict resolution,
then not much is gained as you'd still have to do conflict checks at least
for that 'normal/sync' part. But if the expectation is that there are cases
of transactions that include only such async marked areas, then you can
avoid the synchronous checks.

Examples of these pre-defined resolutions are: 'delete-wins, then
latest-change-wins' (which might be the easiest), or 'latest-change-wins'
(which might be more tricky, as those 'changeDeleted' cases would
resurrect deleted data magically - possible but perhaps too magic); a
third one could again be 'strict' (which would correspond to the JCR
semantics that are the default today) - or again
'no-resolution-but-persist-conflict-marker', etc...
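
What setting such a marker might look like at the JCR level (the property
name and value are the proposal itself, not an existing Oak feature):

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class AsyncConflictResolutionExample {
    public static void markSubtree(Session session) throws RepositoryException {
        Node content = session.getNode("/content");
        // From here on below, asynchronous conflict resolution of the
        // named type would be acceptable (proposed semantics).
        content.setProperty("oak:asyncConflictResolution",
                "delete-wins,latest-change-wins");
        session.save();
    }
}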

Having such pre-defined conflict resolution and at the same time clearly
indicating that doing conflict-checking asynchronously is OK would allow to
have truly parallel writes into the NodeStore from different instance's pov.

Wdyt?

Cheers,
Stefan




Re: oak-resilience

2016-03-07 Thread Stefan Egli
Hi Tomek,

It would also be interesting to see the effect on the leases, and thus on
discovery-lite, under high memory load and network problems.

Cheers,
Stefan

On 04/03/16 11:13, "Tomek Rekawek"  wrote:

>Hello,
>
>For some time I've worked on a little project called oak-resilience. It
>aims to be a resilience testing framework for the Oak. It uses
>virtualisation to run Java code in a controlled environment, that can be
>spoilt in different ways, by:
>
>* resetting the machine,
>* filling the JVM memory,
>* filling the disk,
>* breaking or deteriorating the network.
>
>I described currently supported features in the README file [1].
>
>Now, once I have a hammer I'm looking for a nail. Could you share your
>thoughts on areas/features in Oak which may benefit from being
>systematically tested for the resilience in the way described above?
>
>Best regards,
>Tomek
>
>[1] 
>https://github.com/trekawek/jackrabbit-oak/tree/resilience/oak-resilience
>
>-- 
>Tomek Rękawek | Adobe Research | www.adobe.com
>reka...@adobe.com
>




Re: OAK-4006 : Enable cloning of repo for shared data store and discovery-lite

2016-02-15 Thread Stefan Egli
Thanks for the various comments and review on OAK-4006. I've attached a
final version of the patch and will push that later this afternoon
(together with OAK-4007) unless I hear fresh concerns.

Cheers,
Stefan

On 11/02/16 20:16, "Stefan Egli" <stefane...@apache.org> wrote:

>Hi all,
>
>The recent clusterId-discussions around OAK-3935 together with the cloning
>problem it shares with discovery.oak made me rethink the current
>two-clusterId-approach. After some offline discussions with Thomas and
>Marcel I've created OAK-4006 which suggests reusing the SharedDataStore
>way
>of a hidden :clusterId property, providing a dedicated 'after clone'
>offline
>reset tool in oak-run and using that same clusterId also in discovery-lite
>(thus discovery.oak). This should leave us with only 1 clusterId in the
>stack.
>
>Since 1.4 will be the first to support discovery.oak, and to allow for
>enough testing, it would be important to have this in 1.3.16. I will
>therefore work on a patch tomorrow and would highly appreciate comments on
>the approach and patch. If +1-ed, it should delay 1.3.16 a few hours or a
>day.
>
>https://issues.apache.org/jira/browse/OAK-4006
>
>Cheers,
>Stefan
>
>




Re: Oak 1.3.16 release plan

2016-02-12 Thread Stefan Egli
Hi Davide,

As mentioned on the list, OAK-4006 is in discussion and in the works. So,
depending on the outcome, it might require a small delay.

Cheers,
Stefan

On 11/02/16 11:45, "Davide Giannella"  wrote:

>Hello team,
>
>I'm planning to cut Oak 1.3.16 on Monday 15th February more or less 10am
>GMT.
>
>If there are any objections please let me know. Otherwise I will
>re-schedule any non-resolved issue for the next iteration.
>
>Thanks
>Davide
>
>




Re: OAK-4006 : Enable cloning of repo for shared data store and discovery-lite

2016-02-11 Thread Stefan Egli
On 11/02/16 20:29, "Vikas Saurabh"  wrote:

>we'd really have to shout in the
>documentation that after this, clone use-case requires
>oak-run->reset_id

Agreed. (Side note: but we'd otherwise have had to do that for OAK-3935,
right?)

> (I'm assuming that the approach obviates the need to
>delete sling id file)

Not sure about this one, as deleting the sling.id.file is still required
and is likely a separate task: that's on the Sling level, and you can't
combine it into the oak-run tool from a separation-of-concerns pov.

Cheers,
Stefan




OAK-4006 : Enable cloning of repo for shared data store and discovery-lite

2016-02-11 Thread Stefan Egli
Hi all,

The recent clusterId-discussions around OAK-3935 together with the cloning
problem it shares with discovery.oak made me rethink the current
two-clusterId-approach. After some offline discussions with Thomas and
Marcel I've created OAK-4006 which suggests reusing the SharedDataStore way
of a hidden :clusterId property, providing a dedicated 'after clone' offline
reset tool in oak-run and using that same clusterId also in discovery-lite
(thus discovery.oak). This should leave us with only 1 clusterId in the
stack.

Since 1.4 will be the first to support discovery.oak, and to allow for
enough testing, it would be important to have this in 1.3.16. I will
therefore work on a patch tomorrow and would highly appreciate comments on
the approach and patch. If +1-ed, it should delay 1.3.16 a few hours or a
day.

https://issues.apache.org/jira/browse/OAK-4006

Cheers,
Stefan




Re: OAK-4006 : Enable cloning of repo for shared data store and discovery-lite

2016-02-11 Thread Stefan Egli
On 11/02/16 20:42, "Vikas Saurabh"  wrote:

>probably I mis-understood sling id file as
>cluster id... while I think that's persistent instance id, right?

correct.

Cheers,
Stefan
> 




Re: travis needs more memory

2016-02-10 Thread Stefan Egli
On 10/02/16 14:59, "Davide Giannella" <dav...@apache.org> wrote:

>On 10/02/2016 10:22, Stefan Egli wrote:
>> Re NonLocalObservationIT, that one creates like 160'000 nodes in-memory
>> and that seems not to fit the default VM settings.
>
>Shall we move this to a SegmentFixture?

Or DocumentFixture because the test needs to simulate a cluster. It looks
like OAK-3803 removed 'cluster support' from the NodeStoreFixtures - they
all return null now in createNodeStore(clusterNodeId) - which I originally
worked around by creating a new in-memory fixture. But yes, actually the
test should run against mongo.

Was there a particular reason for removing cluster support in OAK-3803?

Cheers,
Stefan




Re: travis needs more memory

2016-02-10 Thread Stefan Egli
Re NonLocalObservationIT: that one creates something like 160'000 nodes
in memory, and that doesn't seem to fit the default VM settings.
Re the other test (ConcurrentAddIT): I don't know.

Cheers,
Stefan

On 10/02/16 09:04, "Marcel Reutegger" <mreut...@adobe.com> wrote:

>Hi,
>
>this may solve the immediate issue with the test
>failure, but it probably also hides a memory problem
>with our tests. In the past I tried to first identify
>and fix memory leaks and only then increase the heap
>if really necessary. Do you know what is holding on
>to the memory?
>
>Regards
> Marcel   
>
>On 09/02/16 19:17, "Stefan Egli" wrote:
>
>>Hi,
>>
>>Looks like we need to give our travis run [0] more memory. OAK-3986 was
>>likely partly slowing down due to memory becoming low. Now it looks like
>>ConcurrentAddIT is failing [1] for the same reason too (can reproduce
>>this
>>locally: default memory settings result in OOME). I'm guessing adding
>>this
>>to the .travis.yml would do the trick?
>>
>>env:
>>  global:
>>    - JAVA_OPTS="-Xmx1G"
>>
>>
>>Cheers,
>>Stefan
>>--
>>[0] - https://travis-ci.org/apache/jackrabbit-oak/builds
>>[1] - 
>>Running org.apache.jackrabbit.oak.jcr.ConcurrentAddIT
>>No output has been received in the last 10 minutes, this potentially
>>indicates a stalled build or something wrong with the build itself.
>>
>>The build has been terminated
>>
>>
>




travis needs more memory

2016-02-09 Thread Stefan Egli
Hi,

Looks like we need to give our travis run [0] more memory. OAK-3986 was
likely partly slowing down due to memory becoming low. Now it looks like
ConcurrentAddIT is failing [1] for the same reason too (I can reproduce
this locally: default memory settings result in an OOME). I'm guessing
adding this to the .travis.yml would do the trick?

env:
  global:
    - JAVA_OPTS="-Xmx1G"


Cheers,
Stefan
--
[0] - https://travis-ci.org/apache/jackrabbit-oak/builds
[1] - 
Running org.apache.jackrabbit.oak.jcr.ConcurrentAddIT
No output has been received in the last 10 minutes, this potentially
indicates a stalled build or something wrong with the build itself.

The build has been terminated




Re: [discuss] persisting cluster (view) id for discovery-lite-descriptor

2016-02-01 Thread Stefan Egli
Having thought about and discussed this some more, an even simpler
solution is:

d) the discovery-lite descriptor *can* contain an id, in which case it
should be used. But *neither tarMk nor mongoMk set this*.

+ The advantage is that tarMk and mongoMk then behave the same, and even
similarly to discovery.impl: discovery.oak stores a 'clusterId' property
under /var/discovery/oak, thus being easily visible/manageable in all
cases.

- The disadvantages are in the same area that led to choosing c)
originally: conceptually, defining the id and who is a member etc. are all
aspects of the same concern and should not be separated, as otherwise you
open the door to possible inconsistencies between these aspects. So if this
is separated, it needs to be seen as a trade-off against what is gained,
namely easier visibility and manageability of this id. A known place where
this separation, and thus loss of synchronization, can be a problem is the
first time the id is defined. That should however be handled by mongoMk's
conflict handling. Another potential place is when this id is redefined
(eg deleted). That must be managed separately and is one consequence of d)
versus c). At this stage I'm not seeing any other negative consequences, so
overall d) still sounds better than c).

Unless I hear vetoes, I'll implement this change before tomorrow's 1.3.15
release (also in OAK-3672, which I'll then rename).

Cheers,
Stefan

On 27/01/16 10:45, "Stefan Egli" <stefane...@apache.org> wrote:

>Hi,
>
>Following up on the OAK-3672 discussion again, and taking a step back, I
>see three possible classes of solutions:
>
>a) the (cluster)id is always defined by discovery-lite, be it cluster or
>singlevm
>b) the (cluster)id is entirely removed and it is up to discovery.oak (in
>sling) to define it
>c) the (cluster)id is only set by discovery-lite when feasible, eg only
>for the cluster case
>
>I'm in favour of c) with the following arguments:
>* a) requires tarMk (!) to store this id somewhere. It can either store it
>in the filesystem (which makes failover support harder), store it as a
>hidden property in the node store (which is not manageable as it's hidden)
>or store it as a normal property in the repository (which sounds hacky, as
>discovery-lite is in the NodeStore layer while this would require it to
>simulate writing a JCR property)
>* removing the id altogether (b) would be going too far imv: the logical
>unit that defines the cluster view (its members) is the best place to also
>define an id for that unit. And that logical unit is discovery-lite in
>this case.
>* what speaks for returning null for the singleVm case (c) is the fact
>that it is a special case (it is not a cluster). So treating the special
>case separately doesn't break the separation of concern rule in my view.
>(c) would imply that the id is set when we're in a cluster case, and not
>otherwise (but that would not be a hard requirement, the specification
>would just be that the id *can* be null).
>
>So long story short: I suggest to change the definition of this id so that
>it *can* be null - in which case upper layers must define their own id.
>Which means Sling's discovery.oak would then store a clusterId under
>/var/discovery/oak. That would automatically support cold-standby/failover
>- fix the original bug - and simplify cleaning this property up for the
>clone case (as that would correspond to how this case was dealt with in
>discovery.impl times already).
>
>WDYT?
>
>Cheers,
>Stefan
>
>On 26/11/15 11:32, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote:
>
>>On Thu, Nov 26, 2015 at 3:56 PM, Stefan Egli <e...@adobe.com> wrote:
>>> which would
>>> then be on the Sling level thus could more simply use the slingId.
>>
>>That also sounds good. While we are at it also have a look at OAK-3529
>>where system needs to know a clusterId. Looks like some overlap so
>>keep that usecase also in mind
>>
>>
>>Chetan Mehrotra
>
>




Re: [DISCUSS] avoid bad commits with mis-behaving clock

2016-01-28 Thread Stefan Egli
On 14/01/16 18:34, "Julian Reschke"  wrote:

>On 2016-01-14 17:36, Vikas Saurabh wrote:
>>@Julian, if I understand correctly, OAK-2682 currently is about
>> warning, right? It mentions a self-desctruct option but I think it
>> wasn't implemented.
>
>It is implemented in trunk, see r1695671 (might be only on startup,
>though).

The current model is that this is enforced at startup, but at runtime it is
not enforced at the oak level. What we currently have is a JMX method
which should be hooked into some sort of runtime monitoring (be that
external or internal, eg via health checks). Such runtime monitoring
would not be enough though, as it would certainly not react fast enough.

So if we're saying we need clocks to be in sync at any given time, we
probably have to combine checking clocks upon every lease update with
restricting valid revisions to be within the lease window.
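
A rough sketch of the 'check clocks upon every lease update' part (names
and threshold are illustrative; where the reference time comes from - eg
the DB server - is the actual design question):

public class ClockDriftCheck {
    // Refuse the lease update when local time has drifted too far from an
    // externally observed reference time.
    public static void checkDrift(long localMillis, long referenceMillis,
                                  long maxDriftMillis) {
        long drift = Math.abs(localMillis - referenceMillis);
        if (drift > maxDriftMillis) {
            throw new IllegalStateException("Clock drift of " + drift
                    + "ms exceeds allowed " + maxDriftMillis
                    + "ms; refusing lease update");
        }
    }
}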

Cheers,
Stefan




Re: [discuss] persisting cluster (view) id for discovery-lite-descriptor

2016-01-27 Thread Stefan Egli
Hi,

Following up on the OAK-3672 discussion again, and taking a step back, I
see three possible classes of solutions:

a) the (cluster)id is always defined by discovery-lite, be it cluster or
singlevm
b) the (cluster)id is entirely removed and it is up to discovery.oak (in
sling) to define it
c) the (cluster)id is only set by discovery-lite when feasible, eg only
for the cluster case

I'm in favour of c) with the following arguments:
* a) requires tarMk (!) to store this id somewhere. It can either store it
in the filesystem (which makes failover support harder), store it as a
hidden property in the node store (which is not manageable as it's hidden)
or store it as a normal property in the repository (which sounds hacky, as
discovery-lite is in the NodeStore layer while this would require it to
simulate writing a JCR property)
* removing the id altogether (b) would be going too far imv: the logical
unit that defines the cluster view (its members) is the best place to also
define an id for that unit. And that logical unit is discovery-lite in
this case.
* what speaks for returning null for the singleVm case (c) is the fact
that it is a special case (it is not a cluster). So treating the special
case separately doesn't break the separation of concern rule in my view.
(c) would imply that the id is set when we're in a cluster case, and not
otherwise (but that would not be a hard requirement, the specification
would just be that the id *can* be null).

So long story short: I suggest to change the definition of this id so that
it *can* be null - in which case upper layers must define their own id.
Which means Sling's discovery.oak would then store a clusterId under
/var/discovery/oak. That would automatically support cold-standby/failover
- fix the original bug - and simplify cleaning this property up for the
clone case (as that would correspond to how this case was dealt with in
discovery.impl times already).

WDYT?

Cheers,
Stefan

On 26/11/15 11:32, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote:

>On Thu, Nov 26, 2015 at 3:56 PM, Stefan Egli <e...@adobe.com> wrote:
>> which would
>> then be on the Sling level thus could more simply use the slingId.
>
>That also sounds good. While we are at it also have a look at OAK-3529
>where system needs to know a clusterId. Looks like some overlap so
>keep that usecase also in mind
>
>
>Chetan Mehrotra




Re: [discuss] persisting cluster (view) id for discovery-lite-descriptor

2015-11-26 Thread Stefan Egli
I'm not sure how feasible kung fu or voodoo would be but one alternative
could be that discovery-lite would 'signal' that this is a standalone
instance (either by just setting id=null or by something a bit more
explicit) and discovery.oak could then react accordingly - which would
then be on the Sling level thus could more simply use the slingId.

Not sure about making the "discovery-lite API" weaker re this point
though...

Cheers,
Stefan

On 26/11/15 04:37, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote:

>There is another option to avoid extra effort when running within
>Sling. Have an optional implementation which makes use of
>SlingSettingsService to get fetch SlingId. With little bit of OSGi
>kung fu you can have an implementation which uses SlingId when running
>in Sling otherwise maintains its own id using File based approach.
>
>This would reduce operational complexity
>Chetan Mehrotra
>
>
>On Wed, Nov 25, 2015 at 6:23 PM, Stefan Egli <stefane...@apache.org>
>wrote:
>> Right, I'm not sure it is indeed a requirement. But without automatic
>> support it might get forgotten and thus the cluster id would change upon
>> failover.
>>
>> Cheers,
>> Stefan
>>
>> On 25/11/15 13:40, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote:
>>
>>>On Wed, Nov 25, 2015 at 6:00 PM, Stefan Egli <stefane...@apache.org>
>>>wrote:
>>>>> * disadvantage: cold standby would require an explicit copying of
>>>>>this
>>>>>file
>>>>> (during initial hand-shake?)
>>>
>>>Why is that a requirement? Cold standby is just a backup and currently
>>>there is no automatic failover support.
>>>
>>>For such cases we can allow passing the id as a system/framework
>>>property
>>>also
>>>
>>>Chetan Mehrotra
>>
>>



[discuss] persisting cluster (view) id for discovery-lite-descriptor

2015-11-25 Thread Stefan Egli
Hi,

Noticed that for TarMK the discovery-lite descriptor does not currently
persist the cluster-view-id [0]. It should, however, as otherwise this
causes the upper-level discovery.oak to break the discovery API, which
demands a persisted cluster id. (Note that this id is not to be confused
with the 'cluster node id' that identifies an instance within a document
node store cluster.)

I wanted to get some ideas from the list as to how this should be
implemented. Current options are:
1. storing a 'cluster.id.file' (or 'discovery.cluster.id.file') similar to
the 'sling.id.file' (via BundleContext.getDataFile).
   * cloning a repository would therefore require deleting both the
     sling.id.file and this new file
   * disadvantage: cold standby would require an explicit copying of this
     file (during the initial hand-shake?)
2. storing the id as a property somewhere in the repository.
   * disadvantage: cloning a repository would clone this id as well, and
     there might not be an easy enough way for a user to reset it

Opinions? Alternatives?

Cheers,
Stefan
--
[0] https://issues.apache.org/jira/browse/OAK-3672




Re: [discuss] persisting cluster (view) id for discovery-lite-descriptor

2015-11-25 Thread Stefan Egli
Right, I'm not sure it is indeed a requirement. But without automatic
support it might get forgotten, and thus the cluster id would change upon
failover.

Cheers,
Stefan

On 25/11/15 13:40, "Chetan Mehrotra" <chetan.mehro...@gmail.com> wrote:

>On Wed, Nov 25, 2015 at 6:00 PM, Stefan Egli <stefane...@apache.org>
>wrote:
>>> * disadvantage: cold standby would require an explicit copying of this
>>>file
>>> (during initial hand-shake?)
>
>Why is that a requirement? Cold standby is just a backup and currently
>there is no automatic failover support.
>
>For such cases we can allow passing the id as a system/framework property
>also
>
>Chetan Mehrotra




Re: Observation: External vs local - Load distribution

2015-10-13 Thread Stefan Egli
Hi Carsten,

For external events the commit info is indeed not provided, yup.
For internal ones it is - except for those 'overflow' ones, which collapse
into a pseudo-external one.

Cheers,
Stefan

On 13/10/15 15:17, "Carsten Ziegeler"  wrote:

>Am 17.06.15 um 10:35 schrieb Carsten Ziegeler:
>> Ok, just to recap. In Sling we can implement the Observer interface (and
>> not use the BackgroundObserver base class). This will give us reliably
>> user id for all local events.
>> 
>> Does anyone see a problem with this approach?
>> 
>Getting back to this problem, it seems the above does not work, as the
>DocumentNodeStore is not passing on the commit info to the observer in
>the case of external events.
>So no matter how I implement my observer, I don't get the info passed in.
>
>Can someone please confirm this?
>
>Thanks
>Carsten
>-- 
>Carsten Ziegeler
>Adobe Research Switzerland
>cziege...@apache.org




Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j

2015-09-14 Thread Stefan Egli
On 10/09/15 18:43, "Stefan Egli" <stefane...@apache.org> wrote:

>additionally/independently:
>
>[...]
>
>* also, we should probably increase the lease thread's priority to reduce
>the likelihood of the lease timing out (same would be true for
>discovery.impl's heartbeat thread)
>
>* plus increasing the lease time from 1min to perhaps 5min as the default
>would also reduce the number of cases that hit problems dramatically

FYI: Put these suggested improvements into:

https://issues.apache.org/jira/browse/OAK-3398

most noteworthy: I suggest to increase the lease timeout by default to
120sec. (not 5min, I think that's too much)

Cheers,
Stefan




Re: Oak 1.3.6 release plan

2015-09-14 Thread Stefan Egli
As the 1.3.6 is already in the voting phase, it would mean -1 for that
release - not sure if it's enough of an issue for that though? (mind you,
the issue was already there in 1.3.5..)

Cheers,
Stefan

On 14/09/15 12:29, "Julian Reschke" <julian.resc...@gmx.de> wrote:

>On 2015-09-14 10:17, Julian Reschke wrote:
>> On 2015-09-14 10:03, Stefan Egli wrote:
>>> On 14/09/15 09:51, "Marcel Reutegger" <mreut...@adobe.com> wrote:
>>>
>>>> ...would it
>>>> make sense to just disable the lease check for the diagnostics
>>>> in oak-run? ...
>>>
>>> +1 as a short-term fix
>>>
>>> Cheers,
>>> Stefan
>>
>> I agree that this would have been broken by the other wrappers, and the
>> approach in itself wasn't smart in the first place. My point being: can
>> we please come up with a proper solution that will address all the uses
>> cases?
>>
>> Best regards, Julian
>
>...essentially we are introducing a new feature (improving resilience)
>that breaks existing code assumptions, potentially causing performance
>degradations. I believe the right thing to do *now* is to disable the
>new feature, make the 1.3.6 release, then fix things properly and turn
>it on in 1.3.7.
>
>
>Best regards, Julian




Re: Oak 1.3.6 release plan

2015-09-14 Thread Stefan Egli
On 14/09/15 09:51, "Marcel Reutegger"  wrote:

>...would it
>make sense to just disable the lease check for the diagnostics
>in oak-run? ...

+1 as a short-term fix

Cheers,
Stefan




Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j

2015-09-14 Thread Stefan Egli
My vote would also be (b) for the short-term. If we figure out a way to
properly restart the nodestore (c) we can still come back to that at a
later time.

Hence I've created https://issues.apache.org/jira/browse/OAK-3397 and
unless the list vetoes I'll follow up on that next.

Cheers,
Stefan

On 11/09/15 11:38, "Julian Sedding" <jsedd...@gmail.com> wrote:

>My preference is (b), even though I think stopping the NodeStore
>service should be sufficient (it may not currently be sufficient, I
>don't know).
>
>Particularly, I believe that "trying harder" is detrimental to the
>overall stability of a cluster/topology. We are dealing with a
>possibly faulty instance, so who can decide that it is ok again after
>trying harder? The faulty instance itself?
>
>"Read-only" doesn't sound too useful either, because that may fool
>clients into thinking they are dealing with a "healthy" instance for
>longer than necessary and thus can lead to bigger issues downstream.
>
>I believe that "fail early and fail often" is the path to a stable
>cluster.
>
>Regards
>Julian
>
>On Thu, Sep 10, 2015 at 6:43 PM, Stefan Egli <stefane...@apache.org>
>wrote:
>> On 09/09/15 18:11, "Stefan Egli" <stefane...@apache.org> wrote:
>>
>>>On 09/09/15 18:01, "Stefan Egli" <stefane...@apache.org> wrote:
>>>
>>>>I think if the observers would all be 'OSGi-ified' then this could be
>>>>achieved. But currently eg the BackgroundObserver is just a pojo and
>>>>not
>>>>an osgi component (thus doesn't support any activate/deactivate method
>>>>hooks).
>>>
>>>.. I take that back - going via OsgiWhiteboard should work as desired -
>>>so
>>>perhaps implementing deactivate/activate methods in the
>>>(Background)Observer(s) would do the trick .. I'll give it a try ..
>>
>> ootb this wont work as the BackgroundObserver, as one example, is not an
>> OSGi component, so wont get any deactivate/activate calls atm. so to
>> achieve this, it would have to be properly OSGi-ified - something which
>> sounds like a bigger task and not only limited to this one class - which
>> means making DocumentNodeStore 'restart capable' sounds like a bigger
>>task
>> too and the question is indeed if it is worth while ('will it work?') or
>> if there are alternatives..
>>
>> which brings me back to the original question as to what should be done
>>in
>> case of a lease failure - to recap the options left (if System.exit is
>>not
>> one of them) are:
>>
>> a) 'go read-only': prevent writes by throwing exceptions from this
>>moment
>> until eternity
>>
>> b) 'stop oak': stop the oak-core bundle (prevent writes by throwing
>> exceptions for those still reaching out for the nodeStore)
>>
>> c) 'try harder': try to reactivate the lease - continue allowing writes
>>-
>> and make sure the next backgroundWrite has correctly updated the
>> 'unsavedLastRevisions' (cos others could have done a recover of this
>>node,
>> so unsavedLastRevisions contains superfluous stuff that must no longer
>>be
>> written). this would open the door for edge cases ('change of longer
>>time
>> window with multiple leaders') but perhaps is not entirely impossible...
>>
>> additionally/independently:
>>
>> * in all cases the discovery-lite descriptor should expose this lease
>> failure/partitioning situation - so that anyone can react who would like
>> to, esp should anyone no longer assume that the local instance is leader
>> or part of the cluster - and to support that optional Sling Health Check
>> which still does a System.exit :)
>>
>> * also, we should probably increase the lease thread's priority to
>>reduce
>> the likelihood of the lease timing out (same would be true for
>> discovery.impl's heartbeat thread)
>>
>>
>> * plus increasing the lease time from 1min to perhaps 5min as the
>>default
>> would also reduce the number of cases that hit problems dramatically
>>
>> wdyt?
>>
>> Cheers,
>> Stefan
>>
>>




Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j

2015-09-10 Thread Stefan Egli
On 09/09/15 18:11, "Stefan Egli" <stefane...@apache.org> wrote:

>On 09/09/15 18:01, "Stefan Egli" <stefane...@apache.org> wrote:
>
>>I think if the observers would all be 'OSGi-ified' then this could be
>>achieved. But currently eg the BackgroundObserver is just a pojo and not
>>an osgi component (thus doesn't support any activate/deactivate method
>>hooks).
>
>.. I take that back - going via OsgiWhiteboard should work as desired - so
>perhaps implementing deactivate/activate methods in the
>(Background)Observer(s) would do the trick .. I'll give it a try ..

OOTB this won't work as the BackgroundObserver, as one example, is not an
OSGi component, so it won't get any deactivate/activate calls atm. So to
achieve this, it would have to be properly OSGi-ified - something which
sounds like a bigger task and not only limited to this one class - which
means making DocumentNodeStore 'restart capable' sounds like a bigger task
too, and the question is indeed whether it is worthwhile ('will it work?')
or whether there are alternatives..

which brings me back to the original question as to what should be done in
case of a lease failure - to recap the options left (if System.exit is not
one of them) are:

a) 'go read-only': prevent writes by throwing exceptions from this moment
until eternity

b) 'stop oak': stop the oak-core bundle (prevent writes by throwing
exceptions for those still reaching out for the nodeStore)

c) 'try harder': try to reactivate the lease - continue allowing writes -
and make sure the next backgroundWrite has correctly updated the
'unsavedLastRevisions' (because others could have done a recovery of this
node, so unsavedLastRevisions contains superfluous entries that must no
longer be written). This would open the door for edge cases ('chance of a
longer time window with multiple leaders') but is perhaps not entirely
impossible...
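
To make the 'prevent writes' part of (a)/(b) concrete, a rough sketch of
what I have in mind (hypothetical guard, not existing code):

    import java.util.concurrent.atomic.AtomicBoolean;

    public class LeaseGuard {

        private final AtomicBoolean leaseLost = new AtomicBoolean(false);

        /** called by the lease-update background thread when renewal fails */
        public void markLeaseLost() {
            leaseLost.set(true);
        }

        /** to be invoked at the start of every DocumentStore operation */
        public void checkLease() {
            if (leaseLost.get()) {
                throw new IllegalStateException(
                        "lease expired - this instance must no longer write to the DocumentStore");
            }
        }
    }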

additionally/independently:

* in all cases the discovery-lite descriptor should expose this lease
failure/partitioning situation - so that anyone can react who would like
to, esp should anyone no longer assume that the local instance is leader
or part of the cluster - and to support that optional Sling Health Check
which still does a System.exit :)

* also, we should probably increase the lease thread's priority to reduce
the likelihood of the lease timing out (same would be true for
discovery.impl's heartbeat thread)


* plus increasing the lease time from 1min to perhaps 5min as the default
would also reduce the number of cases that hit problems dramatically

wdyt?

Cheers,
Stefan




Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j

2015-09-09 Thread Stefan Egli
Hi all,

I'd like to follow up on the idea to restart DocumentNodeStore as a result
of a lease failure [0]: I suggest we don't do that and instead just stop
the oak-core bundle.

After some prototyping and running into OAK-3373 [1] I'm no longer sure if
restarting the DocumentNodeStore is a feasible path to go, esp in the
short term. The problem encountered so far is that Observers cannot be
easily switched from old to (restarted/)new store due to:

 * as pointed out by MichaelD they could have a backlog yet to process
towards the old store - which they cannot access anymore as that one would
be forcibly closed
 * there is not yet a proper way to switch from old to new ('reset') - esp
is there a risk that there could be a gap (this part we might be able to
fix though, not sure)
 * both above carry the risk that Observers miss some changes - something
which would be unacceptable I guess.

I think the more KISS approach would be to just forcibly close the
DocumentNodeStore - or actually to stop the entire oak-core bundle - with
appropriate errors logged so that the issue becomes clear. The instance
would basically become unusable, mostly, but at least it would not be a
System.exit.
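
(stopping the bundle itself is plain OSGi, roughly - sketch, error handling
omitted:)

    import org.osgi.framework.Bundle;
    import org.osgi.framework.BundleException;
    import org.osgi.framework.FrameworkUtil;

    public final class OakCoreStopper {

        /** stops the bundle that loaded the given oak-core class */
        public static void stopBundleOf(Class<?> oakCoreClass) throws BundleException {
            Bundle bundle = FrameworkUtil.getBundle(oakCoreClass);
            if (bundle != null) {
                bundle.stop();
            }
        }
    }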

What do ppl think?

Cheers,
Stefan
--
[0] https://issues.apache.org/jira/browse/OAK-3250
[1] https://issues.apache.org/jira/browse/OAK-3373

On 18/08/15 16:45, "Stefan Egli" <e...@adobe.com> wrote:

>I've created OAK-3250 to follow up on the DocumentNodeStore-restart idea.
>
>Cheers,
>Stefan
>--
>https://issues.apache.org/jira/browse/OAK-3250
>
>On 18/08/15 15:59, "Marcel Reutegger" <mreut...@adobe.com> wrote:
>
>>On 18/08/15 15:38, "Stefan Egli" wrote:
>>>On 18/08/15 13:43, "Marcel Reutegger" <mreut...@adobe.com> wrote:
>>>>On 18/08/15 11:14, "Stefan Egli" wrote:
>>>>>b) Oak does not do the System.exit but refuses to update anything
>>>>>towards
>>>>>the document store (thus just throws exceptions on each invocation) -
>>>>>and
>>>>>upper level code detects this situation (eg a Sling Health Check) and
>>>>>would do a System.exit based on how it is configured
>>>>>
>>>>>c) same as b) but upper level code does not do a System.exit (I'm not
>>>>>sure
>>>>>if that makes sense - the instance is useless in such a situation)
>>>>
>>>>either b) or c) sounds reasonable to me.
>>>>
>>>>but if possible I'd like to avoid a System.exit(). would it be possible
>>>>to detect this situation in the DocumentNodeStoreService and restart
>>>>the DocumentNodeStore without the need to restart the JVM
>>>
>>>Good point. Perhaps restarting DocumentNodeStore is a valid alternative
>>>indeed. Is that feasible from a DocumentNodeStore point of view?
>>
>>it probably requires some changes to the DocumentNodeStore, because
>>we want it to tear down without doing any of the cleanup it
>>may otherwise perform. it must not release the cluster node info
>>nor update pending _lastRevs, etc.
>>
>>> What would be the consequences of a restarted DocumentNodeStore?
>>
>>to the DocumentNodeStore it will look like it was killed and it will
>>perform recovery (e.g. for the pending _lastRevs).
>>
>>Regards
>> Marcel
>>
>




Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j

2015-09-09 Thread Stefan Egli
Hi,

On 09/09/15 17:39, "Marcel Reutegger"  wrote:

>>* as pointed out by MichaelD they could have a backlog yet to process
>>towards the old store - which they cannot access anymore as that one
>>would
>>be forcibly closed
>
>in my view, those observers should be unregistered from the store before
>it is shut down and any backlog cleared, i.e. it will be lost.

yes, they do get unregistered right away indeed - but atm there's no handle
to prevent e.g. the BackgroundObserver from still having entries in the
queue and continuing to process them. So those queued entries will indeed
fail as the store is closed.

>>* there is not yet a proper way to switch from old to new ('reset') - esp
>>is there a risk that there could be a gap (this part we might be able to
>>fix though, not sure)
>
>I don't see a requirement for this. if you restart the entire stack you
>will also have a gap.

the difference is perhaps that if you restart the stack this is done as an
explicit admin operation, knowingly. Whereas what we're trying to achieve
here is something automated, 'under the hood', which has a different
quality requirement imv.

>>* both above carry the risk that Observers miss some changes - something
>>which would be unacceptable I guess.
>
>same as above. I don't think observers must survive a node store restart.
>I even think it is wrong. Every client of the node store should be
>restarted in that case, including Observers.

I think if the observers would all be 'OSGi-ified' then this could be
achieved. But currently eg the BackgroundObserver is just a pojo and not
an osgi component (thus doesn't support any activate/deactivate method
hooks).

Cheers,
Stefan




Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j

2015-09-09 Thread Stefan Egli
On 09/09/15 18:01, "Stefan Egli" <stefane...@apache.org> wrote:

>I think if the observers would all be 'OSGi-ified' then this could be
>achieved. But currently eg the BackgroundObserver is just a pojo and not
>an osgi component (thus doesn't support any activate/deactivate method
>hooks).

.. I take that back - going via OsgiWhiteboard should work as desired - so
perhaps implementing deactivate/activate methods in the
(Background)Observer(s) would do the trick .. I'll give it a try ..

Cheers,
Stefan




Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 381 - Failure

2015-09-07 Thread Stefan Egli
before it does the exit it issues a loud log.error - so we'd have to have
access to the log output..

besides resolving OAK-3250 when we know a test fails because of it, the
easiest is to disable the leaseCheck as eg done in [0]
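
for reference, in a test that looks roughly like this (setLeaseCheck being,
as I understand it, the builder flag that came with the lease check in
OAK-2739):

    import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
    import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;

    public class LeaseCheckDisabledFixture {

        // defaults to an in-memory DocumentStore; real tests would pass their fixture's store
        public static DocumentNodeStore createStore() {
            return new DocumentMK.Builder()
                    .setLeaseCheck(false) // no lease-timeout shutdown during the test
                    .getNodeStore();
        }
    }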

but now test results of '381' are deleted so we can't find out anymore

Cheers,
Stefan
--
[0] http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-core/src/test/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollectorIT.java?r1=1700741&r2=1700740&pathrev=1700741

On 07/09/15 10:25, "Michael Dürig" <mdue...@apache.org> wrote:

>
>
>On 7.9.15 10:03 , Stefan Egli wrote:
>> so perhaps it's a lease timeout case..
>
>Any way to confirm this on Jenkins? E.g. could we place a println in
>front of it? Or replace it with a throws?
>
>Michael




Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 381 - Failure

2015-09-07 Thread Stefan Egli
'... System.exit called ...'

what we currently have until OAK-3250 is fixed is a System.exit when the
lease cannot be updated.

so perhaps it's a lease timeout case..

Cheers,
Stefan

On 31/08/15 16:00, "Michael Dürig"  wrote:

>
>"The forked VM terminated without saying properly goodbye. VM crash or
>System.exit called ?" [2]. This happens quite often lately. See log
>files with -X option [1]. Not much information though. Any ideas what
>could be causing this?
>
>Michael
>
>[1] 
>https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/381/jdk=j
>dk1.8.0_11,label=Ubuntu,nsfixtures=SEGMENT_MK,profile=integrationTesting/c
>onsole
>
>[2]
>[ERROR] Failed to execute goal
>org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test
>(default) on project oak-core: Execution default of goal
>org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test
>failed: The forked VM terminated without saying properly goodbye. VM
>crash or System.exit called ? -> [Help 1]
>org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
>execute goal 
>org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test
>(default) on project oak-core: Execution default of goal
>org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test
>failed: The forked VM terminated without saying properly goodbye. VM
>crash or System.exit called ?
>   at 
>org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java
>:224)
>   at 
>org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java
>:153)
>   at 
>org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java
>:145)
>   at 
>org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(Li
>fecycleModuleBuilder.java:108)
>   at 
>org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(Li
>fecycleModuleBuilder.java:76)
>   at 
>org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedB
>uilder.build(SingleThreadedBuilder.java:51)
>   at 
>org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStar
>ter.java:116)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:361)
>   at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155)
>   at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
>   at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:213)
>   at org.apache.maven.cli.MavenCli.main(MavenCli.java:157)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
>62)
>   at 
>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm
>pl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
>org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.
>java:289)
>   at 
>org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229
>)
>   at 
>org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launche
>r.java:415)
>   at 
>org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
>Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
>default of goal 
>org.apache.maven.plugins:maven-failsafe-plugin:2.12.4:integration-test
>failed: The forked VM terminated without saying properly goodbye. VM
>crash or System.exit called ?
>   at 
>org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuild
>PluginManager.java:144)
>   at 
>org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java
>:208)
>   ... 19 more
>Caused by: java.lang.RuntimeException: The forked VM terminated without
>saying properly goodbye. VM crash or System.exit called ?
>   at 
>org.apache.maven.plugin.surefire.booterclient.output.ForkClient.close(Fork
>Client.java:257)
>   at 
>org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter
>.java:301)
>   at 
>org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.
>java:116)
>   at 
>org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(Abst
>ractSurefireMojo.java:740)
>   at 
>org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAllProviders(
>AbstractSurefireMojo.java:682)
>   at 
>org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPrecondi
>tionsChecked(AbstractSurefireMojo.java:648)
>   at 
>org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSure
>fireMojo.java:586)
>   at 
>org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuild
>PluginManager.java:133)
>   ... 20 more
>
>
>On 31.8.15 3:56 , Apache Jenkins Server wrote:
>> The Apache Jenkins build system has built Apache Jackrabbit Oak matrix
>>(build #381)
>>
>> Status: Failure
>>
>> Check console output at

Re: System.out.println used in unit tests in oak-core

2015-08-27 Thread Stefan Egli
which you might have noticed since I disabled redirectTestOutputToFile [0]
to debug OAK-3292, so we now have System.out during test runs. I intend to
put that flag back once the OAK-3292 dust has settled..

Cheers,
Stefan
--
[0] - http://svn.apache.org/r1697676
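
(for the record, the mechanical fix Alex asks for below is a one-liner per
class - a sketch, using one of the culprits as an example:)

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class SerializerTest {

        private static final Logger LOG = LoggerFactory.getLogger(SerializerTest.class);

        void report(int size, String id) {
            // instead of System.out.println("Size " + size + " " + id):
            LOG.debug("Size {} {}", size, id);
        }
    }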


On 27/08/15 13:51, Alex Parvulescu alex.parvule...@gmail.com wrote:

Hi,

I noticed there are quite a few tests using
System.out.println to display various data.
Please replace these calls by proper logging.

Culprits:
 - org.apache.jackrabbit.oak.plugins.document.cache.SerializerTest [0]
 - org.apache.jackrabbit.oak.plugins.document.ClusterViewTest [1]
 - org.apache.jackrabbit.oak.plugins.document.HierarchyConflictTest [2]
 - org.apache.jackrabbit.oak.plugins.document.NodeStoreDiffTest [3]
 - org.apache.jackrabbit.oak.security.user.MembershipProviderTest [4]

thanks,
alex

[0]
Running org.apache.jackrabbit.oak.plugins.document.cache.SerializerTest
Size 7 null
Size 18 b1
Size 301 b1
Size 9 r14f6ef05471-1-5
Size 9 br14f6ef05472-1-5

[1]
Running org.apache.jackrabbit.oak.plugins.document.ClusterViewTest
{"seq":10,"final":true,"id":"a2b9d562-9536-436f-9b67-5efbb85fbed4","me":21,"active":[21],"deactivating":[],"inactive":[]}
{"seq":10,"final":true,"id":"b8e70adb-4b30-4319-aec8-b28fb1679a4c","me":2,"active":[2],"deactivating":[],"inactive":[3]}
{"seq":10,"final":true,"id":"341ca74f-a2cd-4d81-8a1b-6565f90e22a2","me":2,"active":[2,5,6],"deactivating":[],"inactive":[3]}
{"seq":10,"final":true,"id":"7f0cfb9e-27eb-47fc-934b-89194a18ac0c","me":2,"active":[2],"deactivating":[],"inactive":[3,4,5,6]}
{"seq":10,"final":true,"id":"07ac1a64-dd2c-4d02-add8-e345d808aa7a","me":2,"active":[2,3],"deactivating":[4],"inactive":[5,6]}
{"seq":10,"final":false,"id":"3cd780f0-16ff-4100-a842-c15fcf43e339","me":2,"active":[2,3],"deactivating":[4,5],"inactive":[6]}

[2]
Running org.apache.jackrabbit.oak.plugins.document.HierarchyConflictTest
expected: org.apache.jackrabbit.oak.api.CommitFailedException: OakOak:
do not retry merge in this test
expected: org.apache.jackrabbit.oak.api.CommitFailedException: OakOak:
do not retry merge in this test

[3]
Running org.apache.jackrabbit.oak.plugins.document.NodeStoreDiffTest
Root at r1-0-1 (r1-0-1)
Root at r2-0-1 (r2-0-1)
Root at r3-0-1 (r3-0-1)
Root at r4-0-1 (r4-0-1)
Root at r1-0-1 (r1-0-1)
Root at r2-0-1 (r2-0-1)
Root at r3-0-1 (r3-0-1)
Root at r4-0-1 (r4-0-1)

[4]
Running org.apache.jackrabbit.oak.security.user.MembershipProviderTest
created 1 groups, 99 users.
created 1 groups, 199 users.
created 1 groups, 299 users.
created 1 groups, 399 users.
created 1 groups, 499 users.
created 1 groups, 599 users.
created 1 groups, 699 users.
created 1 groups, 799 users.
created 1 groups, 899 users.
created 1 groups, 999 users.
created 99 groups, 1 users.
created 199 groups, 1 users.
created 299 groups, 1 users.
created 399 groups, 1 users.
created 499 groups, 1 users.
created 599 groups, 1 users.
created 699 groups, 1 users.
created 799 groups, 1 users.
created 899 groups, 1 users.
created 999 groups, 1 users.
created 11 groups, 89 users.
created 21 groups, 179 users.
created 31 groups, 269 users.
created 41 groups, 359 users.
created 51 groups, 449 users.
created 61 groups, 539 users.
created 71 groups, 629 users.
created 81 groups, 719 users.
created 91 groups, 809 users.
created 100 groups, 900 users.
created 110 groups, 990 users.
created 99 groups, 1 users.
created 1 groups, 99 users.
created 1 groups, 199 users.
created 1 groups, 299 users.
created 1 groups, 399 users.
created 1 groups, 499 users.
created 1 groups, 599 users.
created 1 groups, 699 users.
created 1 groups, 799 users.
created 1 groups, 899 users.
created 1 groups, 999 users.
created 1 groups, 99 users.
created 1 groups, 199 users.
created 1 groups, 299 users.
created 1 groups, 399 users.
created 1 groups, 499 users.
created 1 groups, 599 users.
created 1 groups, 699 users.
created 1 groups, 799 users.
created 1 groups, 899 users.
created 1 groups, 999 users.
created 1 groups, 99 users.
created 1 groups, 199 users.
created 1 groups, 299 users.
created 1 groups, 399 users.
created 1 groups, 499 users.
created 1 groups, 599 users.
created 1 groups, 699 users.
created 1 groups, 799 users.
created 1 groups, 899 users.
created 1 groups, 999 users.




Re: Jenkins notifications

2015-08-26 Thread Stefan Egli
yep, very useful, thx!

Cheers,
Stefan

On 26/08/15 11:47, Michael Dürig mdue...@apache.org wrote:


Hi,

As you might have seen, Jenkins notifications now contain the change
list since the last build as well as the list of failed tests. This
should make it easier for everyone to find out what caused a build to
fail and to take appropriate actions.

Michael




[travis] console output of failed tests

2015-08-25 Thread Stefan Egli
Hi,

I'm chasing a test failure on travis ([0]) currently but it's virtually
impossible to find the root cause without having the console (or file)
output of the test in case it fails.

Does anyone know if/how to get the surefire files on travis? or should we
tweak the pom (redirectTestOutputToFile)?

Cheers,
Stefan
--
[0] - https://travis-ci.org/apache/jackrabbit-oak/builds/77114814




Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j

2015-08-18 Thread Stefan Egli
On 18/08/15 13:43, Marcel Reutegger mreut...@adobe.com wrote:

On 18/08/15 11:14, Stefan Egli wrote:
b) Oak does not do the System.exit but refuses to update anything towards
the document store (thus just throws exceptions on each invocation) - and
upper level code detects this situation (eg a Sling Health Check) and
would do a System.exit based on how it is configured

c) same as b) but upper level code does not do a System.exit (I'm not sure
if that makes sense - the instance is useless in such a situation)

either b) or c) sounds reasonable to me.

but if possible I'd like to avoid a System.exit(). would it be possible
to detect this situation in the DocumentNodeStoreService and restart
the DocumentNodeStore without the need to restart the JVM

Good point. Perhaps restarting DocumentNodeStore is a valid alternative
indeed. Is that feasible from a DocumentNodeStore point of view? What
would be the consequences of a restarted DocumentNodeStore?


or would this
lead to an illegal state from a discovery POV?

Have to think through the scenarios but perhaps this is fine (I was indeed
initially under the assumption that it would not be fine, but that might
have been wrong). The important bit is that any topology-related activity
stops - and this can be achieved by sending TOPOLOGY_CHANGING (which in
turn could be achieved by setting the own instance into 'deactivating'
state in the discovery-lite-descriptor) and only coming back with
TOPOLOGY_CHANGED once the restart would be settled and the local instance
is back in the cluster with a valid, new lease.

Cheers,
Stefan




Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j

2015-08-18 Thread Stefan Egli
I've created OAK-3250 to follow up on the DocumentNodeStore-restart idea.

Cheers,
Stefan
--
https://issues.apache.org/jira/browse/OAK-3250

On 18/08/15 15:59, Marcel Reutegger mreut...@adobe.com wrote:

On 18/08/15 15:38, Stefan Egli wrote:
On 18/08/15 13:43, Marcel Reutegger mreut...@adobe.com wrote:
On 18/08/15 11:14, Stefan Egli wrote:
b) Oak does not do the System.exit but refuses to update anything
towards
the document store (thus just throws exceptions on each invocation) -
and
upper level code detects this situation (eg a Sling Health Check) and
would do a System.exit based on how it is configured

c) same as b) but upper level code does not do a System.exit (I'm not sure
if that makes sense - the instance is useless in such a situation)

either b) or c) sounds reasonable to me.

but if possible I'd like to avoid a System.exit(). would it be possible
to detect this situation in the DocumentNodeStoreService and restart
the DocumentNodeStore without the need to restart the JVM

Good point. Perhaps restarting DocumentNodeStore is a valid alternative
indeed. Is that feasible from a DocumentNodeStore point of view?

it probably requires some changes to the DocumentNodeStore, because
we want it to tear down without doing any of the cleanup it
may otherwise perform. it must not release the cluster node info
nor update pending _lastRevs, etc.

 What would be the consequences of a restarted DocumentNodeStore?

to the DocumentNodeStore it will look like it was killed and it will
perform recovery (e.g. for the pending _lastRevs).

Regards
 Marcel




Re: 1.3.4 blocked as failing tests

2015-08-17 Thread Stefan Egli
my fault, I'm looking into it now

On 17/08/15 12:02, Davide Giannella dav...@apache.org wrote:

Hello team,

trying to release Oak 1.3.4 but it's constantly failing on my local.
Details can be found here

https://issues.apache.org/jira/secure/attachment/12750782/oak-1.3.4-failin
g-1439805620.log

looking into it but if you know the answer ping me please.

Davide






Re: [discovery] Introducing a simple mongo-based discovery-light service (to circumvent mongoMk's eventual consistency delays)

2015-08-17 Thread Stefan Egli
Hi all,

I've attached a 'final final' version of discovery lite to OAK-2844 ready
for a final review - depending on feedback I plan to push that to trunk
once 1.3.4 is out.

Cheers,
Stefan
--
https://issues.apache.org/jira/browse/OAK-2844
https://issues.apache.org/jira/secure/attachment/12750833/OAK-2844.v4.patch

On 07/07/15 12:45, Stefan Egli stefane...@apache.org wrote:

FYI: I've attached a suggested 'final draft' version of the discovery lite
to OAK-2844 for review. Comments very welcome!

Cheers,
Stefan
--
https://issues.apache.org/jira/browse/OAK-2844?focusedCommentId=14616496&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14616496

On 5/6/15 3:22 PM, Stefan Egli stefane...@apache.org wrote:

Hi,

Pls note a suggestion of a new 'discovery-light' API in OAK-2844.

Would appreciate comments and reviews from this list.

Thanks,
Cheers,
Stefan







[document] lease check activated (OAK-2739)

2015-08-17 Thread Stefan Egli
Hi all,

Just a quick heads-up: I've activated a 'lease check' with OAK-2739 in
trunk: this checks upon every invocation of the DocumentStore whether the
local lease is still valid. If it is not, it means that the instance is
misbehaving and that others potentially have seen it as inactive. Thus the
local instance will automatically shut down and not do any further writes
towards the DocumentStore.

Cheers,
Stefan




Re: Release dates

2015-08-13 Thread Stefan Egli
I'd find it more useful (for us) if it were the cut date.

Cheers,
Stefan

On 13/08/15 10:08, Davide Giannella dav...@apache.org wrote:

Hello team,

a trivia question about release dates.

Normally in jira I set the release date on a future release for when we
plan to cut it. But we have the voting process of 72hrs that means the
actual release date will be 3 days after the cut.

Shall we put on jira then the release date as the actual announcement or
stick it to the cut?

Cheers
Davide

 




Re: [discuss] handling of 'wish list' issues - introduce 'wish list' fix version?

2015-07-29 Thread Stefan Egli
perhaps 'unscheduled' and 'wish list' are very similar indeed - even
though I'd have thought of 'unscheduled' more as 'it should be
scheduled soon-ish' - whereas 'wish list' would already have gone through
the decision process of 'no, we don't do this anytime soon, but it's a good
idea, so let's not forget it'.

Cheers,
Stefan

On 7/29/15 9:27 AM, Angela Schreiber anch...@adobe.com wrote:

why not simply marking it as 'unscheduled'? IMO that pretty much
expresses that this is is not yet scheduled but still considered
a valid improvement/bug that we want to address at some point.

i only resolve issues 'later' or 'wontfix' that i am confident
that will never be fixed.

adding a 'wish list' fix version will just be another huge container
that we hardly ever look at and i would find it hard to understand
the difference between 'unscheduled' and 'wish list'.

if something is on your wishlist, i would suggest you assign the
issue to yourself in order to keep track of it (compared to the
whole bunch of other unscheduled issues). or flag it with a
label that allows you to find all your wishes.

so, rather -1 from my side.

kind regards
angela

On 29/07/15 08:58, Stefan Egli stefane...@apache.org wrote:

Hi,

Just came across a ticket [0] that has no urgent priority to be fixed in
1.3
but would be a good candidate to be put into the general 'wish list pod'.

Now currently we seem to handle such cases by just closing the ticket.
This
imv has the downside of it getting completely lost and forgotten.

We could thus introduce a new 'wish list' fix version that can be set on
those tickets instead of just closing them.

Wdyt?

Cheers,
Stefan
--
https://issues.apache.org/jira/browse/OAK-2613







Re: Do not add comments when bulk moves are performed in JIRA

2015-07-29 Thread Stefan Egli
+1

There's always the jira history to figure out when what was modified

Cheers,
Stefan

On 7/29/15 8:17 AM, Chetan Mehrotra chetan.mehro...@gmail.com wrote:

Hi Team,

Currently most of the issues scheduled for 1.3.x release have comments
like 'Bulk Move to xxx'. This creates unnecessary noise in the comment
log. Would it be possible to move the issues to next version silently
i.e. just get fix version changed and not add any comment

Chetan Mehrotra




Re: [discovery] Introducing a simple mongo-based discovery-light service (to circumvent mongoMk's eventual consistency delays)

2015-07-07 Thread Stefan Egli
FYI: I've attached a suggested 'final draft' version of the discovery lite
to OAK-2844 for review. Comments very welcome!

Cheers,
Stefan
--
https://issues.apache.org/jira/browse/OAK-2844?focusedCommentId=14616496&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14616496

On 5/6/15 3:22 PM, Stefan Egli stefane...@apache.org wrote:

Hi,

Pls note a suggestion of a new 'discovery-light' API in OAK-2844.

Would appreciate comments and reviews from this list.

Thanks,
Cheers,
Stefan






Re: Error handling during AsyncIndexUpdate

2015-06-22 Thread Stefan Egli
+1 to report and continue.

There was a similar issue earlier where the async indexing would fail with
an OOME - in which case the 'rinse and repeat' even made it worse (as each
time more and more data-to-be-indexed accumulates and the likelihood of an
OOME would just increase)
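
fwiw, the 'report and continue' pattern Julian describes would roughly look
like this (generic sketch, not the actual AsyncIndexUpdate code):

    import java.util.List;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ReportAndContinue {

        private static final Logger LOG = LoggerFactory.getLogger(ReportAndContinue.class);

        interface Indexer {
            void index(String path) throws Exception;
        }

        static void indexAll(List<String> paths, Indexer indexer) {
            for (String path : paths) {
                try {
                    indexer.index(path);
                } catch (Exception e) {
                    // report the culprit but keep making progress
                    LOG.warn("Indexing failed for {}, skipping", path, e);
                }
            }
        }
    }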

Cheers,
Stefan

On 6/22/15 10:54 AM, Julian Sedding jsedd...@gmail.com wrote:

Hi all

On a freshly migrated Oak setup (AEM 6.1), I recently observed that
async indexing was running all the time. At first I did not worry,
because there were ~14mio nodes to be indexed, but eventually I got
the impression that there was an endless loop.

Here's my take on what's happening, and please feel free to correct
any wrong assumptions I make:

- after a migration there is no checkpoint for async indexing to start
at, so it indexes everything
- a migration is a single commit, so async indexing is all or nothing
(not sure the single commit is relevant, anyone?)
- due to an oddity in the metadata of a PDF file, async indexing
failed with an exception
- async indexing recommences to see if the error persists on any
subsequent run
- rinse and repeat

If my interpretation is correct, I would suggest to review the error
handling.

If an error is not recoverable, the current behaviour basically
prevents any documents to be indexed and the AsyncIndexUpdate stops to
make any progress.

It may be a better trade off to report the paths of failing documents
and continue despite the failure.

What do others think?

Regards
Julian




Re: [mongoNs] using bulk operation for backgroundupdate?

2015-06-22 Thread Stefan Egli
Ok, created a separate OAK-3018 for adapting backgroundWrite to use the
batch-update (once available)

Cheers,
Stefan

On 6/22/15 10:05 AM, Marcel Reutegger mreut...@adobe.com wrote:

Hi,

this is currently not possible because the DocumentStore API
does not have such a method. There's an existing issue closely
related to your request:

https://issues.apache.org/jira/browse/OAK-2066


I think in general it makes sense to add such a method. As
you can see in the issue, the background write is not the only
application that would benefit from it.

Regards
 Marcel

On 18/06/15 17:24, Stefan Egli wrote:

Hi,

This might have been discussed before - but just so I understand:

The DocumentNodeStore.backgroundWrite goes through the heavy work of
updating the lastRev for all pending changes and does so in a
hierarchical-depth-first manner. Unfortunately, if the pending changes
all
come from separate commits (as does not sound so unlikely), the updates
are
sent in individual update calls to mongo (whenever the lastRev differs).
Which, if there are many changes, results in many calls to mongo.

What about replacing that mechanism using mongo's bulk functionality (eg
initializeOrderedBulkOperation)? Is this for some reason not possible or
already in the jira-queue (which ticket)?

Cheers,
Stefan
--
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#initializeOrderedBulkOperation--







[mongoNs] using bulk operation for backgroundupdate?

2015-06-18 Thread Stefan Egli
Hi,

This might have been discussed before - but just so I understand:

The DocumentNodeStore.backgroundWrite goes through the heavy work of
updating the lastRev for all pending changes and does so in a
hierarchical-depth-first manner. Unfortunately, if the pending changes all
come from separate commits (which does not sound so unlikely), the updates are
sent in individual update calls to mongo (whenever the lastRev differs).
Which, if there are many changes, results in many calls to mongo.

What about replacing that mechanism using mongo's bulk functionality (eg
initializeOrderedBulkOperation)? Is this for some reason not possible or
already in the jira-queue (which ticket)?
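
For illustration, with the 2.x Java driver the batched variant would look
roughly like this (sketch; the _lastRev sub-key and update construction are
simplified):

    import com.mongodb.BasicDBObject;
    import com.mongodb.BulkWriteOperation;
    import com.mongodb.DBCollection;
    import java.util.Map;

    public class BulkLastRevUpdate {

        static void updateLastRevs(DBCollection nodes, Map<String, String> lastRevById) {
            BulkWriteOperation bulk = nodes.initializeOrderedBulkOperation();
            for (Map.Entry<String, String> e : lastRevById.entrySet()) {
                bulk.find(new BasicDBObject("_id", e.getKey()))
                    .updateOne(new BasicDBObject("$set",
                            new BasicDBObject("_lastRev.r0-0-1", e.getValue())));
            }
            bulk.execute(); // one round-trip instead of one call per document
        }
    }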

Cheers,
Stefan
--
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#initializeOrderedBulkOperation--




Re: Observation: External vs local - Load distribution

2015-06-15 Thread Stefan Egli
On 6/15/15 2:40 PM, Carsten Ziegeler cziege...@apache.org wrote:

On 15.06.15 14:23, Marcel Reutegger wrote:
 Hi,
 
 you can write a CommitEditor, which is called with every
 local commit.
 

Is it easy to calculate the changed nodes/properties in this editor?

As I understand yes, the Editor gets callback for all changed nodes and
properties.
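
For illustration, a minimal editor could look like this (sketch based on
oak-core's DefaultEditor; the counting is just an example):

    import org.apache.jackrabbit.oak.api.PropertyState;
    import org.apache.jackrabbit.oak.spi.commit.DefaultEditor;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    public class CountingEditor extends DefaultEditor {

        private int changes;

        @Override
        public void propertyAdded(PropertyState after) {
            changes++;
        }

        @Override
        public void propertyChanged(PropertyState before, PropertyState after) {
            changes++;
        }

        @Override
        public CountingEditor childNodeAdded(String name, NodeState after) {
            changes++;
            return this; // descend into the added subtree as well
        }

        @Override
        public CountingEditor childNodeChanged(String name, NodeState before, NodeState after) {
            return this; // keep diffing below changed children
        }
    }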


I guess the question is how that is encapsulated towards upper layers as
you probably do not want (too much) application code using commit editors.

Cheers,
Stefan




Re: Observation: External vs local - Load distribution

2015-06-15 Thread Stefan Egli
On 6/15/15 4:29 PM, Carsten Ziegeler cziege...@apache.org wrote:

On 15.06.15 16:21, Chetan Mehrotra wrote:
 On Mon, Jun 15, 2015 at 1:13 PM, Carsten Ziegeler
cziege...@apache.org wrote:
 Now, with Oak there is still this distinction, however if I remember
 correctly under heavy load it might happen that local events are
 reported as external events. And in that case the above pattern fails.
 Regardless of how rare this situation might be, if it can happen it
will
 eventually happen.
 
 This is an implementation detail of BackgroundObserver (BO) which is
 used by OakResourceListener in Sling. BO keeps a queue of changed
 NodeState tuples and if it gets filled it is collapsed. If you want to
 avoid that at *any* cost that you can used a different impl which uses
 say LinkedBlockingQueue and does not enforce any limit. That would be
 similar to how JcrResourceListener works which uses an unbound in
 memory queue

Indeed a good point!

 
Ah, thanks Chetan, that's the first time I hear this - so basically if
we implement our own observer, we can reliably get:
a) all changes
b) local/external info
c) user id

Is that correct?

the way I understand it is:  ;)

* for local changes yes, you'd get all local changes incl user id
* for external changes you'd get them all, but without user id and they
would typically be collapsed (as external changes are only periodically
written by the background updater)

So given this, you could indeed have an Observer that throws away all
external events (which are easily spottable as they have commitInfo==null)
and only process internal ones. And for such a 'local-only' observer I
think this could be a feasible approach.

Speaking more generally however: I guess to support scaling to very large
number of instances, the goal should be that external events are filtered
as much as possible too. Providing fast processing alone (as is the goal
eg with OAK-2829) would not suffice. I think for this we'd need 'oak level
observation filtering'. Such a filter could be applied to the journal
(filling only 'interested' paths into the diff caches).

At which point I wonder if it would not be beneficial to do both 'local vs
external' as well as 'path-filtering' on an oak level, rather than one or
both on the sling level...

Re the commit editor use case: I think that would still be the only option
if you'd want 'local-guaranteed' events, ie local events that would not
get lost even in case of a crash. At the moment there are no solutions for
this - local events just get lost. I think we could have three different
event types (local-filtered, local-guaranteed-filtered, external-filtered).

Cheers,
Stefan




DocumentNodeStore background read/update operations synchronized?

2015-05-07 Thread Stefan Egli
Hi,

Just realized that DocumentNodeStore background read and update operations
are synchronized - which basically makes them execute sequentially -
which somewhat works against OAK-2624.

@Marcel, @Chetan, wdyt, do they have to be synchronized? Could this not be a
bottleneck concurrency-wise?

Cheers,
Stefan




[discovery] Introducing a simple mongo-based discovery-light service (to circumvent mongoMk's eventual consistency delays)

2015-05-06 Thread Stefan Egli
Hi,

Pls note a suggestion of a new 'discovery-light' API in OAK-2844.

Would appreciate comments and reviews from this list.

Thanks,
Cheers,
Stefan




Re: Efficiently process observation event for local changes

2015-03-25 Thread Stefan Egli
Related to this, I've created

https://issues.apache.org/jira/browse/OAK-2683

which is about an issue that happens when the observation queue limit is
reached.

Cheers,
Stefan

On 3/23/15 4:03 PM, Chetan Mehrotra chetan.mehro...@gmail.com wrote:

After discussing this further with Marcel and Michael we came to the
conclusion that we can achieve similar performance by making use of the
persistent cache for storing the diff. This would require a slight change
in the way we interpret the diff JSOP. This should not require any change
in the current logic related to observation event generation. Opened
OAK-2669 to track that.

One thing that we might still want to do is to use separate queue sizes for
listeners interested in local events only and those which can work with
external events. On a system like AEM there are 180 listeners which listen
for external changes and ~20 which only listen to local changes. So it
makes sense to have bigger queues for such listeners.

Chetan Mehrotra

On Mon, Mar 23, 2015 at 4:09 PM, Michael Dürig mdue...@apache.org wrote:



 On 23.3.15 11:03, Stefan Egli wrote:

 Going one step further we could also discuss completely moving the
 handling of the 'observation queues' to an actual messaging system.
 Whether this would be embedded to an oak instance or whether it would be
 shared between instances in an oak cluster might be a different question
 (the embedded variant would have less implication on the overall oak
 model, esp also timing-wise). But the observation model quite exactly
 matches the publish-subscribe semantics - it actually matches pub-sub more
 than it fits into the 'cache semantics' to me.


 Definitely something to try out, given someone finds the time for it. ;-)
 Mind you that some time ago I implemented persisting events to Apache Kafka
 [1], which wasn't greeted with great enthusiasm though...

 OTOH the same concern regarding pushing the bottleneck to IO applies here.
 Furthermore filtering the persisted events through access control is
 something we yet need to figure out, as AC is a) session scoped and b)
 depends on the tree hierarchy.

 Michael


 [1] https://github.com/mduerig/oak-kafka



 .. just saying ..

 On 3/23/15 10:47 AM, Michael Dürig mdue...@apache.org wrote:


 On 23.3.15 5:04, Chetan Mehrotra wrote:

 B - Proposed Changes
 ---

 1. Move the notion of listening to "local" events to Observer level - So upon
 any new change detected we only push the change to a given queue if it's
 local and the bounded listener is only interested in "local". Currently we push
 all changes which later do get filtered out, but we avoid doing that at the
 first level itself and keep queue content limited to local changes only


 I think there is no change needed in the Observer API itself as you can
 already figure out from the passed CommitInfo whether a commit is
 external or not. BTW please take care with the term "local" as there is
 also the concept of "session local" commits.


 2. Attach the calculated diff as part of commit info which is attached to
 the given change. This would allow eliminating the chances of the cache
 miss altogether and would ensure observation is not delayed due to slow
 processing of diff. This can be done on a best-effort basis: if the diff
 is too large then we do not attach it, and in that case we diff again

 3. For listeners which are only interested in local events we can use a
 different queue size limit, i.e. allow larger queues for such listeners.

 Later we can also look into using a journal (or persistent queue) for
 local event processing.


 Definitely something to try out. A few points to consider:

 * There doesn't seem to be too much of a difference to me whether this
 is routed via a cache or directly attached to commits. Either way it
 adds additional memory requirements and churn, which need to be managed.

 * When introducing persisted queuing we need to be careful not to just
 move the bottleneck to IO.

 * An eventual implementation should not break the fundamental design.
 Either hide it in the implementation or find a clean way to put this
 into the overall design.

 Michael








Re: Efficiently process observation event for local changes

2015-03-23 Thread Stefan Egli
Going one step further we could also discuss completely moving the
handling of the 'observation queues' to an actual messaging system.
Whether this would be embedded to an oak instance or whether it would be
shared between instances in an oak cluster might be a different question
(the embedded variant would have less implication on the overall oak
model, esp also timing-wise). But the observation model quite exactly
matches the publish-subscribe semantics - it actually matches pub-sub more
than it fits into the 'cache semantics' to me.

.. just saying ..
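
(for the sake of argument, the bridge could be as small as this - toy
sketch, Publisher being a hypothetical abstraction over JMS/Kafka/...:)

    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.commit.Observer;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    public class PublishingObserver implements Observer {

        /** hypothetical pub-sub abstraction (JMS topic, Kafka producer, ...) */
        public interface Publisher {
            void publish(String topic, NodeState root, CommitInfo info);
        }

        private final Publisher publisher;

        public PublishingObserver(Publisher publisher) {
            this.publisher = publisher;
        }

        @Override
        public void contentChanged(NodeState root, CommitInfo info) {
            // hand the change over to the messaging system; the observation
            // queues become subscriptions that consume at their own pace
            publisher.publish("oak/observation", root, info);
        }
    }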

On 3/23/15 10:47 AM, Michael Dürig mdue...@apache.org wrote:



On 23.3.15 5:04, Chetan Mehrotra wrote:
 B - Proposed Changes
 ---

 1. Move the notion of listening to "local" events to Observer level - So upon
 any new change detected we only push the change to a given queue if it's
 local and the bounded listener is only interested in "local". Currently we push
 all changes which later do get filtered out, but we avoid doing that at the
 first level itself and keep queue content limited to local changes only

I think there is no change needed in the Observer API itself as you can
already figure out from the passed CommitInfo whether a commit is
external or not. BTW please take care with the term "local" as there is
also the concept of "session local" commits.


 2. Attach the calculated diff as part of commit info which is attached to
 the given change. This would allow eliminating the chances of the cache
 miss altogether and would ensure observation is not delayed due to slow
 processing of diff. This can be done on a best-effort basis: if the diff
 is too large then we do not attach it, and in that case we diff again

 3. For listeners which are only interested in local events we can use a
 different queue size limit, i.e. allow larger queues for such listeners.

 Later we can also look into using a journal (or persistent queue) for
 local event processing.

Definitely something to try out. A few points to consider:

* There doesn't seem to be too much of a difference to me whether this
is routed via a cache or directly attached to commits. Either way it
adds additional memory requirements and churn, which need to be managed.

* When introducing persisted queuing we need to be careful not to just
move the bottleneck to IO.

* An eventual implementation should not break the fundamental design.
Either hide it in the implementation or find a clean way to put this
into the overall design.

Michael




Re: [segment] offline compaction broken?

2015-01-27 Thread Stefan Egli
Hi Alex,

There's only 1 checkpoint, so that looks good. I still see the same..
oak-run 1.0.8 compacts fine, but the latest trunk will instead start
filling up tar file after tar file.. (tested with java 1.7 against a
segmentstore-repo that was created with oak 1.1.4)

Cheers,
Stefan

On 1/26/15 7:13 PM, Alex Parvulescu alex.parvule...@gmail.com wrote:

Hi Stefan,

Offline compaction should work properly.
Can you quickly check the number of checkpoints?

alex




On Mon, Jan 26, 2015 at 6:12 PM, Stefan Egli stefane...@apache.org
wrote:

 Hi,

 Before I dig too deep - I built the latest trunk and tried to run
offline
 compaction but see a weird behavior where oak-run starts filling one tar
 file after the other ­ basically increasing seemingly endlessly.

 Is this known or only me?

 Cheers,
 Stefan






Re: [segment] offline compaction broken?

2015-01-27 Thread Stefan Egli
It looks like no compaction strategy is set in oak-run.

Created https://issues.apache.org/jira/browse/OAK-2449

Cheers,
Stefan

On 1/27/15 9:58 AM, Stefan Egli e...@adobe.com wrote:

Hi Alex,

There's only 1 checkpoint, so that looks good. I still see the same..
oak-run 1.0.8 compacts fine, but the latest trunk will instead start
filling up tar file after tar file.. (tested with java 1.7 against a
segmentstore-repo that was created with oak 1.1.4)

Cheers,
Stefan

On 1/26/15 7:13 PM, Alex Parvulescu alex.parvule...@gmail.com wrote:

Hi Stefan,

Offline compaction should work properly.
Can you quickly check the number of checkpoints?

alex




On Mon, Jan 26, 2015 at 6:12 PM, Stefan Egli stefane...@apache.org
wrote:

 Hi,

 Before I dig too deep - I built the latest trunk and tried to run
offline
 compaction but see a weird behavior where oak-run starts filling one
tar
 file after the other ­ basically increasing seemingly endlessly.

 Is this known or only me?

 Cheers,
 Stefan








[segment] offline compaction broken?

2015-01-26 Thread Stefan Egli
Hi,

Before I dig too deep - I built the latest trunk and tried to run offline
compaction but see a weird behavior where oak-run starts filling one tar
file after the other ­ basically increasing seemingly endlessly.

Is this known or only me?

Cheers,
Stefan




Re: Scalability of JCR observation

2013-04-18 Thread Stefan Egli
Hi,

On 4/16/13 4:26 PM, Dominik Süß dominik.su...@gmail.com wrote:

I see some overlap with the latest work of Carsten in Sling regarding
Discovery API[0]. Since Sling typically should work upon JCR / Oak it
might be good not to follow different patterns. For a combined solution I
do think it would be great to have one pluggable mediating system instead
of two which might have strange sideeffects for rejoin scenarios in a
cluster.

+1

If there was a jms/messaging client available in oak (pluggable) that an
implementation of the discovery.api (at the sling level..) could reuse,
that would definitely result in a more reliable 'cluster view' than having
separate mechanisms. How the 'cross cluster' aspect of the discovery's
topology would be implemented in that case is yet another question, but I
suppose it could just as well use jms cross-cluster...

Cheers,
Stefan


Just my 2 cents
Dominik

[0]http://markmail.org/thread/w3kgl7jxvhki3oqj


On Tue, Apr 16, 2013 at 11:51 AM, Michael Dürig mdue...@apache.org
wrote:



 On 15.4.13 9:46, Julian Reschke wrote:

 On 2013-04-15 10:32, Bertrand Delacretaz wrote:


  So I'm wondering if using an existing distributed message queue
 service (ActiveMQ/RabbitMQ etc) would help implement this. IIUC this
 is only a problem in very large Oak setups, so having to install
 additional components might not be an issue.


 Could that also help with implementing proper JCR Locking (or are we
 there already???).


 Probably. The idea of making external coordinators pluggable has come up
 before: https://issues.apache.org/jira/browse/OAK-150?focusedCommentId=13401328&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13401328
401328

 Michael