Re: Migration from 2.x

2017-10-28 Thread Michael Marth
Hi Jean-Francois,

Oak works fine without OSGi. Here’s a reference: 
https://jackrabbit.apache.org/oak/docs/construct.html
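
For illustration, a minimal non-OSGi setup along the lines of that page looks roughly like this (a sketch; the MemoryNodeStore is only for demonstration, a SegmentNodeStore or DocumentNodeStore can be plugged in the same way):

import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.jackrabbit.oak.Oak;
import org.apache.jackrabbit.oak.jcr.Jcr;
import org.apache.jackrabbit.oak.plugins.memory.MemoryNodeStore;

public class StandaloneOak {
    public static void main(String[] args) throws Exception {
        // plain Java, no OSGi container involved
        Repository repository = new Jcr(new Oak(new MemoryNodeStore())).createRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            System.out.println("Logged in to workspace " + session.getWorkspace().getName());
        } finally {
            session.logout();
        }
    }
}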

I cannot answer your security questions, but there is rather extensive 
documentation on the topic. Maybe you will find an answer there:
https://jackrabbit.apache.org/oak/docs/security/overview.html
https://github.com/apache/jackrabbit-oak/tree/trunk/oak-exercise

Re the performance benchmarks: from the benchmarks I have seen (internal to my 
employer) Oak beats Jackrabbit 2 in every respect. However, I recommend running
your own benchmarks that reflect your actual use case. There is a benchmark
suite in Oak that can get you started:
https://github.com/apache/jackrabbit-oak/tree/trunk/oak-benchmarks


HTH
Michael



On 18/10/17 11:27, "Melian, Jean-Francois"  wrote:

Hi

I am not sure this is the best place to ask a question, but Oak does not have
a user mailing list.

I have to study a migration from Jackrabbit 2.X to Jackrabbit Oak.

At the moment I do not use OSGI.
Is this a good way to proceed?

I was able to use MongoDB and migrate nodes from 2.x (non-datastore) and 
create a Lucene index.
Security is the second step.
In 2.x we have developed our own LoginModule (two LDAPs) and our own access
management, parameterized as follows in workspace.xml (without any JAAS setting):

  
  


Is there a way to reuse these developments with Oak?

In the case of using an external LoginModule, why is synchronization with
internal user management needed? Is it not possible to delegate user management too?

Where can I find examples of programming the security configuration 
(authentication and authorization) without OSGI?

Are there performance benchmarks between Jackrabbit 2.x and Oak?

Regards,
Jean-François Melian




Re: frozen node modifications

2017-08-01 Thread Michael Marth
Hi Marco,

A better (or at least alternative) way to approach this is to avoid the
problem in the first place :) In my experience it is preferable not to create
custom node types, because typically the business requirements are not
understood well enough up front - which means you later want to add things and then end up in
the situation you just described. You could simply use unstructured nodes and 
have one String property describe the “type”.
A good reason to have node types is if you require the repository to enforce 
integrity (as opposed to the application doing that).
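
To illustrate the unstructured approach with plain JCR (a sketch; node and property names are made up, and “session” is assumed to be an existing javax.jcr.Session):

// instead of a custom node type: nt:unstructured plus a String "type" property
Node orders = session.getRootNode().addNode("orders", "nt:unstructured");
Node order = orders.addNode("order-4711", "nt:unstructured");
order.setProperty("type", "order");   // the application-level "type"
order.setProperty("status", "open");  // new properties can be added later without touching any CND
session.save();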

The unstructured approach is equivalent to what became popular as “schemaless” a
couple of years ago. Obviously there are pros and cons, so please take the
above as a reflection of what has worked well for me (but your circumstances
might be different).

Cheers
Michael



On 24/07/17 11:19, "Marco Piovesana"  wrote:

>Hi all,
>I'm working on the upgrade module for my application based on Oak. The
>module modifies the custom node types to reflect the modifications in the
>application.
>Some of those modifications may be adding new properties to custom nodes,
>and implicitly to all versions of that node.
>Since frozen nodes are read-only we ended up recreating the node history.
>This, however, makes the system more complex because we have
>weak-references between those nodes and recreating the history means new
>ids for the versions.
>Is there really no way to modify the frozen node? Is there a better way to
>solve this problem?
>
>Marco.


Re: upgrade repository structure with backward-incompatible changes

2017-05-22 Thread Michael Marth
Marco,

Judging from your original question: is there a problem with the JCR-based 
approach for migrating the content? If so, can you paste the code and explain 
what the problem is?

Cheers
Michael



On 19/05/17 09:04, "Julian Sedding"  wrote:

>Hi Marco
>
>In this case I think you should use the JCR API to implement your
>content changes.
>
>I am not aware of a pure JCR toolkit that helps with this, so you may
>just need to write something yourself.
>
>Regards
>Julian
>
>
>
>On Fri, May 19, 2017 at 5:00 PM, Marco Piovesana  wrote:
>> Hi Julian,
>> I meant I'm using Oak, not Sling. Yes, I'm using the JCR API.
>>
>> Marco.
>>
>> On Fri, May 19, 2017 at 2:22 PM, Julian Sedding  wrote:
>>
>>> Hi Marco
>>>
>>> On Fri, May 19, 2017 at 2:10 PM, Marco Piovesana 
>>> wrote:
>>> > Hi Julian, Michael and Robert
>>> > first of all thanks for the suggestions.
>>> > I'm using Oak directly inside my application,
>>>
>>> Do you mean you are not using the JCR API?
>>>
>>> > so I guess the Sling Pipes
>>> > are not something I can use, or not? Is the concept of Pipe already
>>> defined
>>> > in some way inside oak?
>>>
>>> No, Oak has no such concept. Sling Pipes is an OSGi bundle that is
>>> unrelated to Oak but uses the JCR and Jackrabbit APIs (both are
>>> implemented by Oak).
>>>
>>> Regards
>>> Julian
>>>
>>> >
>>> > Marco.
>>> >
>>> > On Fri, May 19, 2017 at 10:39 AM, Julian Sedding 
>>> wrote:
>>> >
>>> >> Hi Marco
>>> >>
>>> >> It sounds like you are dealing with a JCR-based application and thus
>>> >> you should be using the JCR API (directly or indirectly, e.g. via
>>> >> Sling) to change your content.
>>> >>
>>> >> CommitHook is an Oak internal API that does not enforce any JCR
>>> >> semantics. So if you were to go down that route, you would need to be
>>> >> very careful not to change the content structure in a way  that
>>> >> essentially corrupts JCR semantics.
>>> >>
>>> >> Regards
>>> >> Julian
>>> >>
>>> >>
>>> >> On Tue, May 16, 2017 at 6:33 PM, Marco Piovesana 
>>> >> wrote:
>>> >> > Hi Tomek,
>>> >> > yes I'm trying to upgrade within the same repository type but I can
>>> >> decide
>>> >> > whether to migrate the repository or not based on what makes the
>>> upgrade
>>> >> > easier.
>>> >> > The CommitHooks can only be used inside an upgrade to a new
>>> repository?
>>> >> > What is the suggested way to apply backward-incompatible changes if i
>>> >> don't
>>> >> > want to migrate the data from one repository to another but I want to
>>> >> apply
>>> >> > the modifications to the original one?
>>> >> >
>>> >> > Marco.
>>> >> >
>>> >> > On Tue, May 16, 2017 at 4:04 PM, Tomek Rekawek
>>> >> >> >
>>> >> > wrote:
>>> >> >
>>> >> >> Hi Marco,
>>> >> >>
>>> >> >> the main purpose of the oak-upgrade is to migrate a Jackrabbit 2 /
>>> CRX2
>>> >> >> repository into Oak or to migrate one Oak node store (eg. segment) to
>>> >> >> another (like Mongo). On the other hand, it’s not a good choice to
>>> use
>>> >> it
>>> >> >> for the application upgrades within the same repository type. You
>>> didn’t
>>> >> >> mention if your upgrade involves the repository migration (in this
>>> case
>>> >> >> choosing oak-upgrade would be justified) or not.
>>> >> >>
>>> >> >> If you still want to use oak-upgrade, it allows to use custom
>>> >> CommitHooks
>>> >> >> [1] during the migration. They should be included in the class path
>>> with
>>> >> >> the ServiceLoader mechanism [2].
>>> >> >>
>>> >> >> Regards,
>>> >> >> Tomek
>>> >> >>
>>> >> >> [1] http://jackrabbit.apache.org/oak/docs/architecture/
>>> >> >> nodestate.html#The_commit_hook_mechanism
>>> >> >> [2] https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html
>>> >> >>
>>> >> >> --
>>> >> >> Tomek Rękawek | Adobe Research | www.adobe.com
>>> >> >> reka...@adobe.com
>>> >> >>
>>> >> >> > On 14 May 2017, at 12:20, Marco Piovesana 
>>> >> wrote:
>>> >> >> >
>>> >> >> > Hi all,
>>> >> >> > I'm trying to deal with backward-incompatible changes on my
>>> repository
>>> >> >> > structure. I was looking at the oak-upgrade module but, as far as I
>>> >> could
>>> >> >> > understand, I can't really make modifications that require some
>>> logic
>>> >> >> (e.g.
>>> >> >> > remove a property and add a new mandatory property with a value
>>> based
>>> >> on
>>> >> >> > the removed one).
>>> >> >> > I saw that one of the options might be the "namespace migration":
>>> >> >> > - remap the current namespace to a different prefix;
>>> >> >> > - create a new namespace with original prefix;
>>> >> >> > - port all nodes from old namespace to new namespace applying the
>>> >> >> required
>>> >> >> > modifications.
>>> >> >> >
>>> >> >> > I couldn't find much documentation on the topic, so my question
>>> is: is
>>> >> >> this
>>> >> >> > the right way to do it? There are other suggested approaches to the
>>> >> >> 

Re: upgrade repository structure with backward-incompatible changes

2017-05-17 Thread Michael Marth
Hi Marco,

Maybe I don’t understand your use case correctly, but would it be easier to
simply write a script using the JCR API to do the changes in the repo?
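
For illustration, such a script could look roughly like this (a sketch only; the query and property names are made up, and “session” is assumed to be an existing javax.jcr.Session with write permission):

QueryManager qm = session.getWorkspace().getQueryManager();
// find all nodes that still carry the old property (illustrative names)
Query q = qm.createQuery(
        "SELECT * FROM [nt:base] AS n WHERE n.[oldProperty] IS NOT NULL",
        Query.JCR_SQL2);
NodeIterator nodes = q.execute().getNodes();
long count = 0;
while (nodes.hasNext()) {
    Node node = nodes.nextNode();
    String oldValue = node.getProperty("oldProperty").getString();
    node.setProperty("newMandatoryProperty", "migrated-" + oldValue);  // whatever logic is needed
    node.getProperty("oldProperty").remove();
    if (++count % 1000 == 0) {
        session.save();   // save in batches to keep the transient space small
    }
}
session.save();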

Michael




On 16/05/17 18:33, "Marco Piovesana"  wrote:

>Hi Tomek,
>yes I'm trying to upgrade within the same repository type but I can decide
>whether to migrate the repository or not based on what makes the upgrade
>easier.
>The CommitHooks can only be used inside an upgrade to a new repository?
>What is the suggested way to apply backward-incompatible changes if i don't
>want to migrate the data from one repository to another but I want to apply
>the modifications to the original one?
>
>Marco.
>
>On Tue, May 16, 2017 at 4:04 PM, Tomek Rekawek 
>wrote:
>
>> Hi Marco,
>>
>> the main purpose of the oak-upgrade is to migrate a Jackrabbit 2 / CRX2
>> repository into Oak or to migrate one Oak node store (eg. segment) to
>> another (like Mongo). On the other hand, it’s not a good choice to use it
>> for the application upgrades within the same repository type. You didn’t
>> mention if your upgrade involves the repository migration (in this case
>> choosing oak-upgrade would be justified) or not.
>>
>> If you still want to use oak-upgrade, it allows to use custom CommitHooks
>> [1] during the migration. They should be included in the class path with
>> the ServiceLoader mechanism [2].
>>
>> Regards,
>> Tomek
>>
>> [1] http://jackrabbit.apache.org/oak/docs/architecture/
>> nodestate.html#The_commit_hook_mechanism
>> [2] https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html
>>
>> --
>> Tomek Rękawek | Adobe Research | www.adobe.com
>> reka...@adobe.com
>>
>> > On 14 May 2017, at 12:20, Marco Piovesana  wrote:
>> >
>> > Hi all,
>> > I'm trying to deal with backward-incompatible changes on my repository
>> > structure. I was looking at the oak-upgrade module but, as far as I could
>> > understand, I can't really make modifications that require some logic
>> (e.g.
>> > remove a property and add a new mandatory property with a value based on
>> > the removed one).
>> > I saw that one of the options might be the "namespace migration":
>> > - remap the current namespace to a different prefix;
>> > - create a new namespace with original prefix;
>> > - port all nodes from old namespace to new namespace applying the
>> required
>> > modifications.
>> >
>> > I couldn't find much documentation on the topic, so my question is: is
>> this
>> > the right way to do it? There are other suggested approaches to the
>> > problem? There's already a tool that can be used to define how to map a
>> > source CND definition into a destination CND definition and then apply
>> the
>> > modifications to a repository?
>> >
>> > Marco.
>>


Re: Storing all documents under Root - severe slowness on start-up

2017-02-23 Thread Michael Marth
Hi Eugene,

There are 2 aspects here:
The repository definitely should not take such a long time for startup (or 
presumably read all/many children of root at startup).
But the reason why this was not seen before pertains to your second question:
It is definitely not advised to store that many documents anywhere in the repo,
let alone directly under the root. The preferred way to use the repo is by having some kind of
hierarchy that puts the documents into different folders. This allows for 
access control on these folders. It also allows humans a more manageable way to 
browse the documents.
If there is no apparent hierarchy, one can introduce an artificial one, like the
day/month of creation or a hash of a property of the document, etc.
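
To illustrate the hash-based variant (a sketch; paths, names and the bucketing scheme are illustrative):

// assumes an existing javax.jcr.Session "session" and an existing /documents node
String documentId = "invoice-2017-0042";                        // illustrative stable key
String hash = Integer.toHexString(documentId.hashCode() & 0x7fffffff);
String bucket = (hash + "00").substring(0, 2);                  // first hash byte as folder name

Node documents = session.getNode("/documents");
Node folder = documents.hasNode(bucket)
        ? documents.getNode(bucket)
        : documents.addNode(bucket, "nt:unstructured");
Node doc = folder.addNode(documentId, "nt:unstructured");
doc.setProperty("type", "invoice");
session.save();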

HTH
Michael




On 23/02/17 19:11, "Eugene Prystupa"  wrote:

>Can anyone confirm if this is a viable storage option? We have an
>Oak-backed repository with ~100,000 documents in it (and growing), and all
>documents are stored as children nodes of the root node.
>
>We are seeing severe delays on start-up (20 minutes+) when repository is
>created (new Jcr(oak).createRepository()). It appears that for whatever
>reason Oak is trying to read all direct root children on start-up and that
>takes a lot of time. Is storing documents not directly under root a better
>choice?
>
>-- 
>Thanks,
>Eugene


Re: issues introducing non-reversible changes in the repository

2017-01-11 Thread Michael Marth
Hi Tomek,

This was discussed a while (roughly a year) back on the list and rejected at 
that time due to the complexity that comes with supporting downgrades (sorry, 
don’t have the link to that thread handy).
Maybe a middle ground would be using the “non_reversible” tag so that users can 
make an informed decision about upgrading, but not supporting downgrades?

my2c

Cheers
Michael



On 11/01/17 11:26, "Tomek Rekawek"  wrote:

>Hi,
>
>Some of the Oak users are interested in rolling back the Oak upgrade within a 
>branch (like 1.4.10 -> 1.4.1). As far as I understand, it should work, unless 
>some of the commits in (1.4.1, 1.4.10] introduce a repository format change
>that is not compatible with the previous version (eg. modifies the format of a 
>property in the DocumentMK).
>
>Right now there’s no way to check this other than reviewing all the issues in 
>the given version range related to the given components.
>
>Maybe it’d be useful to mark such issues with a label (like 
>“breaks_compatibility”, “non_reversible", “updates_schema”, etc.)?
>
>WDYT? Which label should we choose and how we can make sure that it’s really 
>used in appropriate cases?
>
>Regards,
>Tomek
>
>-- 
>Tomek Rękawek | Adobe Research | www.adobe.com
>reka...@adobe.com
>


Re: Mandatory property jcr:primaryType not found in a new node issue

2017-01-03 Thread Michael Marth
Hi Bommasani,

I would also like to understand the rationale of this move.
AFAICT the Oak API is meant for building additional language bindings (like JCR)
on top. For Java-based apps that language binding would clearly be the JCR API.
What problem are you trying to solve?

Cheers
Michael



On 22/12/16 09:49, "Julian Reschke"  wrote:

>On 2016-12-22 06:02, Bommasani Gangadhar wrote:
>> Hi Team
>>
>> We are planning to migrate from JCR API to OAK API. Earlier we have used
>> JCR file for file operations. Now started eliminating JCR API Code and
>> replacing with OAK API.
>> ...
>
>Why?
>
>Best regards, Julian


Re: segment-tar depending on oak-core

2016-10-27 Thread Michael Marth
fwiw: last year a concrete proposal was made that seemed to have consensus

“Move NodeStore implementations into their own modules"
http://markmail.org/message/6ylxk4twdi2lzfdz

Agree that nothing happened - but I believe that this move might again find
consensus.



On 27/10/16 10:49, "Francesco Mari"  wrote:

>We keep having this conversation regularly but nothing ever changes.
>As much as I would like to push the modularization effort forward, I
>recognize that the majority of the team is either not in favour or
>openly against it. I don't want to disrupt the way most of us are used
>to work. Michael Dürig already provided an extensive list of what we
>will be missing if we keep writing software the way we do, so I'm not
>going to repeat it. The most sensible thing to do is, in my humble
>opinion, accept the decision of the majority.
>
>2016-10-27 11:05 GMT+02:00 Davide Giannella :
>> On 27/10/2016 08:53, Michael Dürig wrote:
>>>
>>> +1.
>>>
>>> It would also help re. backporting, continuous integration, releasing,
>>> testing, longevity, code reuse, maintainability, reducing technical
>>> debt, deploying, stability, etc, etc...
>>
>> While I can agree on the above, and the fact that now we have
>> https://issues.apache.org/jira/browse/OAK-5007 in place, just for the
>> sake or argument I would say that if we want to have any part of Oak
>> with an independent release cycle we need to
>>
>> Have proper API packages that abstract things. Especially from oak-core
>>
>> As soon as we introduce a separate release cycle for a single module we
>> have to look at a wider picture. What other modules are affected?
>>
>> Taking the example of segment-tar we saw that we need
>>
>> - oak-core-api (name can be changed)
>> - independent releases of the oak tools: oak-run, oak-upgrade, ...
>> - independent release cycle for parent/pom.xml
>> - anything I'm missing?
>>
>> So if we want to go down that route then we have to do it properly and
>> for good. Not half-way.
>>
>> Davide
>>
>>


Re: How does having lot of ACL (for different principals) on a single node affect performance

2016-10-25 Thread Michael Marth
Here are some benchmarks, some of them covering access control
performance:
https://github.com/apache/jackrabbit-oak/tree/trunk/oak-run/src/main/java/org/apache/jackrabbit/oak/benchmark
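
For reference, adding an explicit ACE for one principal on a node with plain JCR looks roughly like this (a sketch; the path, principal and privileges are illustrative):

// assumes an existing javax.jcr.Session "session" and a java.security.Principal "principal"
AccessControlManager acm = session.getAccessControlManager();
String path = "/content/projects/project-a";

// fetch an applicable ACL (if a policy is already bound, fetch it via acm.getPolicies(path) instead)
AccessControlList acl = null;
for (AccessControlPolicyIterator it = acm.getApplicablePolicies(path); it.hasNext();) {
    AccessControlPolicy policy = it.nextAccessControlPolicy();
    if (policy instanceof AccessControlList) {
        acl = (AccessControlList) policy;
        break;
    }
}

Privilege[] privileges = { acm.privilegeFromName(Privilege.JCR_READ) };
acl.addAccessControlEntry(principal, privileges);
acm.setPolicy(path, acl);
session.save();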

HTH
Michael




On 24/10/16 17:01, "Vikas Saurabh"  wrote:

>Hi,
>
>In a project I'm working on, we have some personas which represent the
>kind of operations members of those personas are allowed to do over a
>given node.
>
>The most trivial idea was to have a
>synthetic-group-per-persona-per-such-node and add/remove members to
>these groups. This approach has obvious side-effects:
>* systems gets flooded with system-generated-groups thus requiring
>special UI for user/group management
>* can potentially affect login performance - I haven't checked how
>OAK-3003 works.. maybe, it's a non-issue
>* eerie feeling to require additional groups :)
>
>The other end of the spectrum is to provide explicit ACLs on the node
>per principal. It's ok for us to go this way... but we ended up with
>an open question, the one in the subject of this mail. Do we know how ACL
>evaluation performance behaves wrt the number of ACLs on a node - assuming
>ACLs-per-principal won't be a big number?
>
>I was thinking of writing a benchmark to see but wanted to copy some
>closely related existing benchmark. It'd be great if there are some
>pointers for this :).
>
>Thanks,
>Vikas


Re: segment-tar depending on oak-core

2016-10-24 Thread Michael Marth
Sorry, my email client made a mess.
Resending my text only




On 24/10/16 13:48, "Michael Marth" <mma...@adobe.com> wrote:

>Hi Tommaso,
>
>I agree with your assessment that this discussion is actually about the 
>delivery granularity and user’s consumption of Oak. Taking the freedom to 
>re-phrase what you said above:
>
>  *   either a complete library that is consumed as a whole (and where the 
> various internal modules are implementation details)
>  *   Or a set of modules where users are expected and allowed to access the 
> modules directly and deploy arbitrary subsets of modules
>
>At least so far, in my view we saw Oak as one library. If we were to change 
>that then we would need to be much more careful about the interactions between 
>the various (internal) modules. So far, these are an implementation detail 
>which we can change at will if needed  - which obviously allows for quite some 
>flexibility on internal changes.
>
>my2c
>Michael


Re: segment-tar depending on oak-core

2016-10-24 Thread Michael Marth
Hi Tommaso,

In my opinion what we're discussing is our view on how Oak should be
architected, either as a big (layered) black box or as a set of reusable
(and interoperable) software components.
The "release all at once with version x.y" approach sounds to me more
inline with the former while the "release every module separately and
abstract APIs as much as possible" sounds more inline with the latter.

I agree with your assessment that this discussion is actually about the 
delivery granularity and user’s consumption of Oak. Taking the freedom to 
re-phrase what you said above:

  *   either a complete library that is consumed as a whole (and where the 
various internal modules are implementation details)
  *   Or a set of modules where users are expected and allowed to access the 
modules directly and deploy arbitrary subsets of modules

At least so far, in my view we saw Oak as one library. If we were to change 
that then we would need to be much more careful about the interactions between 
the various (internal) modules. So far, these are an implementation detail 
which we can change at will if needed  - which obviously allows for quite some 
flexibility on internal changes.

my2c
Michael


Re: [REVIEW] Configuration required for node bundling config for DocumentNodeStore - OAK-1312

2016-10-21 Thread Michael Marth
Hi Chetan,

Re “Should we ship with a default config”:

I vote for a small default config:
- default because: if the feature is always-on in trunk we will get better 
insights in day-to-day work (as opposed to switching it on only occasionally)
- small because: the optimal bundling is probably very specific to the 
application and its read-write patterns. Your suggestion to include nt:file 
(and maybe rep:AccessControllable) looks reasonable to me, though.

Cheers
Michael



On 21/10/16 08:31, "Chetan Mehrotra"  wrote:

>Hi Team,
>
>Work for OAK-1312 is now in trunk. To enable this feature user has to
>provision some config as content in repository. The config needs to be
>created under '/jcr:system/rep:documentStore/bundlor' [1]
>
>Example
>-
>jcr:system
>  rep:documentStore
>bundlor
>  app:Asset{pattern = [jcr:content/metadata, jcr:content/renditions,
>  jcr:content/renditions/**, jcr:content]}
>  nt:file{pattern = [jcr:content]}
>-
>
>Key points
>
>
>* This config is only required when system is using DocumentNodeStore
>* Any change here would be picked via Observation
>* Config is supposed to be changed only by system admin. So needs to
>be secured (OAK-4959)
>* Config can be changed anytime and would impact only newly created nodes.
>
>Open Questions
>
>
>Bootstrap default config
>---
>
>Should we ship with a default config for nt:file (may be other like
>rep:AccessControllable). If yes then how to do that. One way can be to
>introduce a new 'WhiteboardRepositoryInitializer' and then
>DocumentNodeStore can register one which bootstraps a default config
>
>Chetan Mehrotra
>[1] 
>https://issues.apache.org/jira/browse/OAK-1312?focusedCommentId=15387241=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15387241


On adding new APIs

2016-10-20 Thread Michael Marth
Hi all,

I had a discussion with Michael Dürig about adding new Oak-specific APIs for 
public consumption. There is a concern that these might be too “ad-hoc” and 
might come back to bite us later.
OTOH I would not like to add too much process or ceremony.
So I have a proposal: when a new public API is added the developer should drop 
an email with subject tag [REVIEW] onto the dev list, so that others are aware 
and can chime in if needed.

WDYT?
Michael


Re: Seekable access to a Binary

2016-09-07 Thread Michael Marth
Hi,

I believe Oak has no notion of requests - the 1-1 binding of a request to a 
session is done in Sling.
However, having said that: I was not aware of all the complexities you mention. 
To add one more: probably the design would have to account for different
clustered Sling instances (that share 1 repository) that receive chunks 
belonging to the same binary. Is that right?

Afaik branches are not exposed into userland, but are an implementation detail. 
 When I made my comment below, I did not realize that in order for this to work 
branches would have to be exposed. I am not sure if that's a good idea. Also not sure
if it would even solve the problem.
Maybe a better approach could be to persist the chunks in a temp space, similar 
to what Marcel suggested. But maybe that temp space could be a functionality of 
the datastore (I believe Marcel suggested to create a temp location by the user 
itself via the JCR API)

Michael

Sent from a mobile device

_
From: Ian Boston <i...@tfd.co.uk>
Sent: Wednesday, September 7, 2016 9:36 AM
Subject: Re: Seekable access to a Binary
To: <oak-dev@jackrabbit.apache.org>


Hi,

On 6 September 2016 at 18:00, Michael Marth 
<mma...@adobe.com> wrote:

> Hi,
>
> I think it would be neat if we could utilize our existing mechanism rather
> than a new flag. In particular, MVCC and branches for session isolation.
> And also simply use session.save() to indicate that an upload is complete
> (and the branch containing the binaries/chunks can be merged).
>

Do branches and sessions hang around between requests ?

Each body part will come from different requests, sometimes separated by
hours and possibly even from different source IP addresses, especially
under upload restart conditions. At present, in streaming mode, as each
body part is encountered a session.save is performed to cause JCR/Oak to
read that input stream from the request, since JCR does not expose anything
that can be used to write binary data to the repository.

Best Regards
Ian



>
> Michael
>
> Sent from a mobile device
>
>
>
>
> On Tue, Sep 6, 2016 at 1:15 PM +0200, "Marcel Reutegger" <mreut...@adobe.com> wrote:
>
> Hi,
>
> On 06/09/16 12:34, Bertrand Delacretaz wrote:
> > On Tue, Sep 6, 2016 at 9:49 AM, Marcel Reutegger 
> > <mreut...@adobe.com>
> wrote:
> >> ...we'd still have to add
> >> Jackrabbit API to support it. E.g. something like:
> >>
> >> valueFactory.createBinary(existingBinary, appendThisInputStream); ...
> >
> > And maybe a way to mark the binary as "in progress" to avoid
> > applications using half-uploaded binaries?
>
> This can easily be prevented if the 'in progress' binary is
> uploaded to a temporary location first and then copied over
> to the correct location once complete. Keep in mind that
> copying a large existing binary in Oak is simply a cheap
> copy of the reference.
>
> Regards
> Marcel
>




Property index replacement / evolution

2016-08-05 Thread Michael Marth
Hi,

I have noticed OAK-4638 and OAK-4412 – which both deal with particular 
problematic aspects of property indexes. I realise that both issues deal with 
slightly different problems and hence come to different suggested solutions.
But still I felt it would be good to take a holistic view on the different 
problems with property indexes. Maybe there is a unified approach we can take.

To my knowledge there are 3 areas where property indexes are problematic or not 
ideal:

1. Number of nodes: Property indexes can create a large number of nodes. For 
properties that are very common the number of index nodes can be almost as 
large as the number of the content nodes. A large number of nodes is not 
necessarily a problem in itself, but if the underlying persistence is e.g. 
MongoDB then those index nodes (i.e. MongoDB documents) cause pressure on 
MongoDB’s mmap architecture which in turn affects reading content nodes.

2. Write performance: when the persistence (i.e. MongoDB) and Oak are “far away 
from each other” (i.e. high network latency or low throughput) then synchronous 
property indexes affect the write throughput as they may cause the payload to 
double in size.

3. I have no data on this one – but think it might be a topic: property index 
updates usually cause commits to have / as the commit root. This results on 
pressure on the root document.

Please correct me if I got anything wrong  or inaccurate in the above.

My point is, however, that at the very least we should have clarity on which of
the items above we intend to tackle with Oak improvements. Ideally we would
have a unified approach.
(I realize that property indexes come in various flavours like unique index or 
not, which makes the discussion more complex)
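
For context, a synchronous property index is itself just content below /oak:index; a minimal definition created via JCR looks roughly like this (a sketch; the property name is illustrative):

// assumes an existing administrative javax.jcr.Session "session"
Node oakIndex = session.getNode("/oak:index");
Node idx = oakIndex.addNode("myProperty", "oak:QueryIndexDefinition");
idx.setProperty("type", "property");
idx.setProperty("propertyNames", new String[] { "myProperty" }, PropertyType.NAME);
idx.setProperty("reindex", true);
session.save();
// from now on every commit touching "myProperty" also writes nodes under /oak:index/myProperty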

my2c
Michael


Session refresh behaviour (was: [suggestion] introduce oak compatibility levels)

2016-07-28 Thread Michael Marth
(Spawning this thread as it diverges away from Stefan’s suggestion on signaling
breaking changes)

Hi Thomas,

Looking at [1] I am surprised that the session would get refreshed in your 
example. 
Is that because in your example both sessions live in the same thread?

Thanks for clarifying!
Michael


[1] http://jackrabbit.apache.org/oak/docs/dos_and_donts.html



On 28/07/16 13:19, "Thomas Mueller"  wrote:

>Hi,
>
>>I agree it conflicts conceptually with MVCC. However: is there an actual
>>problem with the auto-refresh behaviour?
>
>Yes. For example with queries. If changes are made while iterating over
>the result of a query, the current behavior is problematic. Example code
>(simplified):
>
>RowIterator it = xxx.createQuery(...).execute().getRows();
>while (it.hasNext()) {
>    otherSession.getNode(...).remove();
>    otherSession.save();
>    Row row = it.nextRow();
>    Node node = row.getNode();
>    // -> node can be null here!
>}
>
>
>So basically the query result contains entries that get removed (by
>another session) while iterating over the result. So this can lead to
>NullPointerException and other strange behavior (you could get nodes that
>no _longer_ match the query constraints), depending on what you do
>exactly. Arguably it would be better if the session is isolated from
>changes done in another session in the same thread. By the way if using
>the same session to remove nodes and iterate over the result, the query
>result has to reflect the changes done by the session (I think this is
>required by the JCR spec).
>
>Regards,
>Thomas
>


Re: Requirements for multiple Oak clients on the same backend (was: [suggestion] introduce oak compatibility levels)

2016-07-28 Thread Michael Marth
Hi Bertrand,

I believe this is uncharted territory.
It is (usually?) safe to assume that the persistence state written by Oak 
version X can be read and modified by version Y if Y > X.
However: version Y might introduce new features or perform changes on the 
state’s format, etc. When such a change is introduced it is not considered that 
version X might still operate on the same state.
For many values of X and Y your setup would probably work in practice. But to 
my knowledge there is no formal way to find out which values of X and Y are 
safe - at least so far. 

Michael




On 28/07/16 10:45, "Bertrand Delacretaz"  wrote:

>Hi,
>
>On Thu, Jul 28, 2016 at 10:23 AM, Stefan Egli  wrote:
>>...we could introduce a concept of
>> 'compatibility levels' which are a set of features/behaviours that a
>> particular oak version has and that application code relies upon
>
>Good timing, I have a related question about multiple client apps
>connecting to the same Oak backend.
>
>Say I have two Java apps A and B which use the same Oak/Mongo/BlobStore
>configuration, are there defined requirements as to the Oak library
>versions or other settings that A and B use?
>
>Do they need to use the exact same versions of the Oak bundles, and
>are violations to that or to other compatibility requirements
>detected?
>
>-Bertrand


Re: [suggestion] introduce oak compatibility levels

2016-07-28 Thread Michael Marth
Hi Stefan,

On the general question of deprecating features and breaking changes: I think 
we should simply stick to SemVer of the released artefacts to signal those 
changes to upstream.

On the more specific topic of session behaviour: could we use session 
attributes to let the app specify session behaviour? [1]

And on the very specific topic of auto-refresh: yes, this was introduced to 
ease application transition onto Oak. I agree it conflicts conceptually with
MVCC. However: is there an actual problem with the auto-refresh behaviour?

Cheers
Michael

[1] 
https://docs.adobe.com/docs/en/spec/javax.jcr/javadocs/jcr-2.0/javax/jcr/Session.html#getAttribute(java.lang.String)
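
To sketch what that could look like with the existing API (plain JCR except for the attribute name, which I am quoting from memory and which should be double-checked - treat it as an assumption):

// assumes an existing javax.jcr.Repository "repository"
SimpleCredentials credentials = new SimpleCredentials("admin", "admin".toCharArray());
// "oak.refresh-interval" is the session attribute I remember oak-jcr honouring for the
// refresh behaviour - the name and the value type are assumptions, please verify
credentials.setAttribute("oak.refresh-interval", 0L);
Session session = repository.login(credentials);
// the app can inspect what it asked for via the standard JCR call from [1]
Object refreshInterval = session.getAttribute("oak.refresh-interval");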





On 28/07/16 10:23, "Stefan Egli"  wrote:

>Hi,
>
>Here's an idea that just came up in an offline discussion, not sure if it
>has been around elsewhere. But we could introduce a concept of
>'compatibility levels' which are a set of features/behaviours that a
>particular oak version has and that application code relies upon. When
>creating a session by default the 'newest compatibility level' would be
>used, but applications could opt to use an older, compatibility level 1.2
>for example, when they still rely on the feature-set/behaviour oak had at
>that time. (This could also be based on service-user properties for
>example.) As such, compatibility levels could be a vehicle to help properly
>deprecate features over time (when you'd say eg oak 1.10 doesn't support
>compatibility level oak 1.0 anymore).
>
>One concrete case where this could have been useful is the
>backwards-compatible behaviour where a session is auto-refreshed when
>changes are done in another session. This seems counter-intuitive given
>MVCC, but was apparently introduced to remain jackrabbit-2-compatible.
>
>(Another slightly different example could be the warn about a session being
>open for too long without a refresh. This is likely an oversight from an
>application, but it could also be done on purpose (although I wouldn't know
>an example right now) - in which case this could be indicated via another
>flag on the session - but probably doesn't quite fit the compatibility-level
>approach).
>
>Opinions?
>
>Cheers,
>Stefan
>
>


Re: Internals of Apache Jackrabbit OAK repository & OOTB SOLR indexing

2016-07-20 Thread Michael Marth
Some presentations that might also be of interest in this context:

http://de.slideshare.net/teofili/oak-solr-integration
http://de.slideshare.net/teofili/scaling-search-in-oak-with-solr
http://de.slideshare.net/teofili/flexible-search-oakmin

Cheers
Michael



On 20/07/16 09:55, "Michael Marth" <mma...@adobe.com> wrote:

>Hi,
>
>Are you aware of this docu: 
>http://jackrabbit.apache.org/oak/docs/query/solr.html
>?
>It points to some of the relevant classes.
>
>Cheers
>Michael
>
>
>
>On 20/07/16 09:03, "sri vaths" <rsriva...@yahoo.co.in.INVALID> wrote:
>
>>Hi All,
>>Please share details on how OAK repository triggers node updates to SOLR for 
>>indexing, guessing it is based on the OAK NodeState model but not sure how the
>>flow happens. http://jackrabbit.apache.org/oak/docs/architecture/nodestate.html
>>And also want to know the list of node states the index updates can happen to
>>SOLR with.
>>
>>regards, Sri


Re: Internals of Apache Jackrabbit OAK repository & OOTB SOLR indexing

2016-07-20 Thread Michael Marth
Hi,

Are you aware of this docu: 
http://jackrabbit.apache.org/oak/docs/query/solr.html
?
It points to some of the relevant classes.

Cheers
Michael



On 20/07/16 09:03, "sri vaths"  wrote:

>Hi All,
>Please share details on how OAK repository triggers node updates to SOLR for 
>indexing, guessing it is based on the OAK NodeState model but not sure how the
>flow happens. http://jackrabbit.apache.org/oak/docs/architecture/nodestate.html
>And also want to know the list of node states the index updates can happen to
>SOLR with.
>
>regards, Sri


Re: multilingual content and indexing

2016-07-12 Thread Michael Marth
Hi Lukas,

I am not entirely sure what you want to achieve (or what exactly you mean by
“dealing with multi-language content”), but let me try to answer a bit:

Let’s say you have distinct content trees for different languages, like e.g.
/content/en
/content/jp
Etc.

You can choose to index all these trees in one (Lucene) index for full text 
search and filter the results in your query, i.e. put the burden on the query
engine.
This is a simple setup which leads to a large index (although I personally have 
not seen this to be a problem)

Alternatively, you can create different index definitions for each subtree (see 
[1]), e.g. using the “includedPaths” property. This would lead to smaller
indexes, with the downside that you would have to create an index definition if
you add a new language tree.
This approach has the additional benefit that you can define language-specific 
Lucene analyzers for each subtree, so that, e.g., in the example above the
Japanese index would have its own analyzer.

HTH
Michael

[1] http://jackrabbit.apache.org/oak/docs/query/lucene.html
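
To illustrate the second option, a per-language Lucene index definition could look roughly like this (a sketch based on my reading of [1]; the index name, paths and analyzer class are illustrative and should be checked against the documentation):

// assumes an existing administrative javax.jcr.Session "session"
Node idx = session.getNode("/oak:index").addNode("luceneJp", "oak:QueryIndexDefinition");
idx.setProperty("type", "lucene");
idx.setProperty("async", "async");
idx.setProperty("compatVersion", 2L);
idx.setProperty("includedPaths", new String[] { "/content/jp" });

// full-text rule covering all properties below /content/jp
Node props = idx.addNode("indexRules", "nt:unstructured")
                .addNode("nt:base", "nt:unstructured")
                .addNode("properties", "nt:unstructured");
Node all = props.addNode("allProps", "nt:unstructured");
all.setProperty("name", ".*");
all.setProperty("isRegexp", true);
all.setProperty("analyzed", true);

// language-specific analyzer for this subtree (class name is an assumption)
Node analyzer = idx.addNode("analyzers", "nt:unstructured").addNode("default", "nt:unstructured");
analyzer.setProperty("class", "org.apache.lucene.analysis.ja.JapaneseAnalyzer");

session.save();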



On 12/07/16 10:15, "Lukas Kahwe Smith"  wrote:

>Aloha,
>
>I did a bit of search but didn’t find anything specific on any plans to 
>dealing with multi language content in any specific way inside Oak. 
>Specifically I am wondering as indexing all content from different languages 
>together can lead to suboptimal sorting and needless overhead. So are there 
>any plans to deal with this specifically?
>
>If not inside Oak, are there any projects on top of Oak (or inside AEM) that 
>deal with this?
>
>Or is this basically considered to be a case where one needs to plugin a 
>custom indexer and figure it out on your own?
>
>regards,
>Lukas Kahwe Smith
>sm...@pooteeweet.org
>
>
>


Re: JCR Binary Usecase - UC7 - Random write access in binaries

2016-06-03 Thread Michael Marth
Hi Jukka,

Thanks - this is very helpful context. If I read the issue correctly then a) 
the use case was already requested back then and b) random writes did not make 
it into the spec at least partly because some backing implementations would 
have struggled to implement it.

I think it makes a lot of sense to re-visit this feature as part of the broader 
list of use cases for efficiently handling binaries that Chetan was kind enough 
to compile. Especially if this extension is implemented specifically for Oak
then the considerations on different backing impls would be much simplified.

Cheers
Michael 




On 02/06/16 17:02, "Jukka Zitting" <jukka.zitt...@gmail.com> wrote:

>Hi,
>
>See https://java.net/jira/browse/JSR_283-19 for more background on the
>decision to not have this feature in JCR 2.0.
>
>That said, I remember this feature request coming up every now and then,
>and we did for example design handling binary values in the segment store
>in a way that would allow random write access to be implemented efficiently
>if there's enough demand.
>
>Best,
>
>Jukka Zitting
>
>
>On Thu, Jun 2, 2016 at 10:25 AM Michael Marth <mma...@adobe.com> wrote:
>
>>
>> >
>> >...but the limitation is also present in the JCR API, right?
>>
>> yes, that is my understanding
>>


Re: JCR Binary Usecase - UC7 - Random write access in binaries

2016-06-02 Thread Michael Marth

>
>...but the limitation is also present in the JCR API, right?

yes, that is my understanding


Re: JCR Binary Usecase - UC7 - Random write access in binaries

2016-06-02 Thread Michael Marth
Hi Julian,

While WebDAV would be really preferable, I take your word that WebDAV as a 
protocol is not possible (I believe you know a thing or two about WebDAV).

However, the transport protocol is only one aspect. For the sake of the 
argument assume a custom protocol for transport.
What is still missing from a repository perspective is the capability for 
random writes into binaries.
I think it is the latter that we should look at in the context of this use case.

Cheers
Michael



On 01/06/16 14:23, "Julian Reschke"  wrote:

>
>
>> UC7 - Random write access in binaries
>>
>> Think: a video file exposed onto the desktop via WebDAV. Desktop tools would 
>> do random writes in that file. How can we cover this use case without 
>> up/downloading the large file. (essentially: random write access in binaries)
>
>I don't think we can construct a use case here, as there is no standard 
>HTTP (or WebDAV) way to *write* byte ranges.
>
>The WebDAV drivers I'm aware of handle this by caching the complete file 
>locally, applying the change, and re-uploading the whole file.
>
>Best regards, Julian


Re: Another question about oak: different storage support

2016-05-01 Thread Michael Marth
Hi Francesco,

To my knowledge one cannot add (on Oak level) a number of additional discs.
As a workaround one could mount (on OS-level) different discs into the file 
store’s directory. This would somewhat help with increasing disc size. It would 
not help with storing different types of binaries in different locations as the 
binaries are stored content-addressed (so the context of file type etc is lost 
at the datastore level).

In order to have multiple (chained) data stores: I am sure there would be great 
interest in the Oak community for such a feature. IIRC there is such a chained 
implementation in Jackrabbit 2, but I cannot find it right now.
The interesting aspect of multiple datastores: reading is simple (just go
along the chain). But the rules for where to write are more involved because of the
content-addressed nature of the DS.

There is an improvement issue regarding related work in
https://issues.apache.org/jira/browse/OAK-3140
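
For completeness, wiring a file system data store path into a DocumentNodeStore in a non-OSGi setup looks roughly like this (a sketch; class names quoted from memory - org.apache.jackrabbit.core.data.FileDataStore and org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore - and the path is illustrative):

// file system data store on a dedicated (possibly separately mounted) volume
FileDataStore fds = new FileDataStore();
fds.setPath("/mnt/datastore");
fds.init("/mnt/datastore");

// expose the data store to Oak as a BlobStore
BlobStore blobStore = new DataStoreBlobStore(fds);

// hand it to the DocumentNodeStore builder
DocumentMK.Builder builder = new DocumentMK.Builder();
builder.setBlobStore(blobStore);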

Cheers
Michael

From: Ancona Francesco
Reply-To: "oak-dev@jackrabbit.apache.org"
Date: Friday 29 April 2016 18:43
To: "oak-dev@jackrabbit.apache.org"
Cc: Morelli Alessandra, Carboniero Enrico, Diquigiovanni Simone
Subject: Another question about oak: different storage support

Hi,
as I explained in another mail, we are building an ECM on top of OAK.

A frequent business question about storage management is the following: in an Oak
configuration in which we have Mongo as document store and a filesystem blob store,
we set a specific path where we mount the storage; if we have a space problem, can we
mount another storage? In other words, can Oak manage multiple BlobStores and so
use different filesystem paths to store binary data?

It could be useful if I want to manage different types of storage linked to
different types of documents: for example I could store one document type on a NAS
while another goes on a SAN if I want better performance.

Thanks in advance,
best regards

Francesco Ancona | Software Dev. Dept. (SP) - Software Architect
tel. +39 049 8979797 | fax +39 049 8978800 | cel. +39 3299060325
e-mail: francesco.anc...@siav.it | www.siav.it




Re: Jackrabbit 2.10 vs Oak 1.2.7

2016-04-10 Thread Michael Marth
Hi Domenic,

My point was that *very* roughly speaking Oak is expected to outperform JR for 
mixed read-write test cases, especially (but not only) in clustered deployments.

My 2nd point was: if you need to optimise pure write throughput then TarMK in 
Oak is expected to get best results.

Not knowing your application, I cannot judge if your test cases make sense.
Just wanted to comment on what can be expected.

Re
“FYI 1000 and 10 node creation these are realistic use cases as our
application generates very large datasets (it is common to see 500gb/1000
files or more get added to a repo in one user session)."

Interesting. In my experience when you deploy DocumentMK (Mongo or RDBMK) and 
need to optimise for file upload throughput then it is beneficial to use the 
file system data store (FSDS), not the data stores within Mongo/RDB.
Btw: I had a quick look at your test case [1]. It uploads the same file again 
and again. Binaries are internally stored content-addressed, so the test case 
does not quite reflect what would go on IRL in your app. But also in JR data 
store was content-addressed, so I do not expect a big impact in terms of 
comparing JR and Oak.

Michael


[1] 
https://github.com/Domenic-Ansys/Jackrabbit2-Oak-Tests/blob/master/Oak-boot/src/main/java/com/test/oak/JCRTests.java




On 08/04/16 07:50, "Domenic DiTano" <domenic.dit...@ansys.com> wrote:

>Hi Michael,
>
>First thank you for your response.
>
>My POV:
>"You are essentially testing how fast Oak or JR can put nodes into
>MySQL/Postgres/Mongo. IMO Oak’s design does not suggest that there should
>be fundamental differences between JR and Oak for this isolated case. (*)"
>
>Are you saying there should not be a difference for this test case between
>oak/jcr?  I understand your point that I am testing how fast Oak/JR puts
>things into a database, but from my perspective I am doing simple JCR
>operations like creating/updating/moving a reasonable number of nodes and
>JR seems to be performing significantly better.  I also ran the tests at
>100 nodes and in general Jackrabbit 2's performance in particular around
>copy, updates, and moves are generally better (I understand why for
>moves) .  Is this expected?
>
>FYI 1000 and 10 node creation these are realistic use cases as our
>application generates very large datasets (it is common to see 500gb/1000
>files or more get added to a repo in one user session).
>
>"To explain:
>Re 1: in reality you would usually have many reading threads for each
>writing thread. Oak’s MVCC design caters for performance for such test
>cases.
>Can you point me to any test cases where I can see the configuration for
>something like this?
>
>Re 2: If you have many cluster nodes the MVCC becomes even more pronounced
>(not only different threads but different processes).
>Also, if you have observation listeners and many cluster nodes then I
>expect to see substantial differences between Oak and JR.
>
>Are there any performance metrics out there for Oak that use
>DocumentNodestore/Filedatastore that someone could share?  If I am
>understanding correctly, I need to add nodes/horizontally scale for Oak's
>performance to improve.  My overall goal here is to determine whether it
>benefits us to upgrade from JR, but is it fair to compare the two?  FYI our
>application can be deployed as one or mult nodes on premise or in a cloud.
>
>thanks,
>Domenic
>
>On Thu, Apr 7, 2016 at 11:04 AM, Michael Marth <mma...@adobe.com> wrote:
>
>> Hi Domenic,
>>
>> My POV:
>> You are essentially testing how fast Oak or JR can put nodes into
>> MySQL/Postgres/Mongo. IMO Oak’s design does not suggest that there should
>> be fundamental differences between JR and Oak for this isolated case. (*)
>>
>> However, where Oak is expected to outperform JR is when
>> 1) the test case reflects realistic usage patterns and
>> 2) horizontal scalability becomes a topic.
>>
>> To explain:
>> Re 1: in reality you would usually have many reading threads for each
>> writing thread. Oak’s MVCC design caters for performance for such test
>> cases.
>> Re 2: If you have many cluster nodes the MVCC becomes even more pronounced
>> (not only different threads but different processes). Also, if you have
>> observation listeners and many cluster nodes then I expect to see
>> substantial differences between Oak and JR.
>>
>> Cheers
>> Michael
>>
>> (*) with the notable exception of TarMK which I expect to outperform
>> anything on any test case ;)
>>
>>
>>
>> On 06/04/16 16:20, "Domenic DiTano" <domenic.dit...@ansys.com> wrote:
>>
>> >Hi Marcel,
>> >
>> >I upload

Re: Jackrabbit 2.10 vs Oak 1.2.7

2016-04-07 Thread Michael Marth
Hi Domenic,

My POV:
You are essentially testing how fast Oak or JR can put nodes into 
MySQL/Postgres/Mongo. IMO Oak’s design does not suggest that there should be 
fundamental differences between JR and Oak for this isolated case. (*)

However, where Oak is expected to outperform JR is when
1) the test case reflects realistic usage patterns and
2) horizontal scalability becomes a topic.

To explain:
Re 1: in reality you would usually have many reading threads for each writing 
thread. Oak’s MVCC design caters for performance for such test cases.
Re 2: If you have many cluster nodes the MVCC becomes even more pronounced (not 
only different threads but different processes). Also, if you have observation 
listeners and many cluster nodes then I expect to see substantial differences 
between Oak and JR.

Cheers
Michael

(*) with the notable exception of TarMK which I expect to outperform anything 
on any test case ;)



On 06/04/16 16:20, "Domenic DiTano"  wrote:

>Hi Marcel,
>
>I uploaded all the source to github along with a summary spreadsheet.  I
>would appreciate any time you have to review.
>
>https://github.com/Domenic-Ansys/Jackrabbit2-Oak-Tests
>
>As you stated the move is a non goal, but in comparison to Jackrabbit 2 I
>am also finding in my tests that create, update, and copy are all faster
>in Jackrabbit 2 (10k nodes).  Any input would be appreciated...
>
>Also, will MySql not be listed as "Experimental" at some point?
>
>Thanks,
>Domenic
>
>
>-Original Message-
>From: Marcel Reutegger [mailto:mreut...@adobe.com]
>Sent: Thursday, March 31, 2016 6:14 AM
>To: oak-dev@jackrabbit.apache.org
>Subject: Re: Jackrabbit 2.10 vs Oak 1.2.7
>
>Hi Domenic,
>
>On 30/03/16 14:34, "Domenic DiTano" wrote:
>>"In contrast to Jackrabbit 2, a move of a large subtree is an expensive
>>operation in Oak"
>>So should I avoid doing a move of a large number of items using Oak?
>>If we are using Oak then should we avoid operations with a large number
>>of items in general?
>
>In general it is fine to have a large change set with Oak. With Oak you
>can even have change sets that do not fit into the heap.
>
>>  As a FYI - there are other benefits for us to move to Oak, but our
>>application uses executes JCR operations with a large number of items
>>quite often.  I am worried about the performance.
>>
>>The move method is pretty simple - should I be doing it differently?
>>
>>public static long moveNodes(Session session, Node node, String newNodeName)
>>        throws Exception {
>>    long start = System.currentTimeMillis();
>>    session.move(node.getPath(), "/" + newNodeName);
>>    session.save();
>>    long end = System.currentTimeMillis();
>>    return end - start;
>>}
>
>No, this is fine. As mentioned earlier, with Oak a move operation is not
>cheap and is basically implemented as copy to new location and delete at
>the old location.
>
>A cheap move operation was considered a non-goal when Oak was designed:
>https://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrabbit%203
>
>
>Regards
> Marcel


Re: R: R: R: Critical questions about OAK

2016-03-12 Thread Michael Marth
Hi Francesco,

Query Engine
1. I didn't understand how Traverse physically recovers the graph to traverse.
Is it provided in memory? Does it make a search on the filesystem or DB to obtain the
correct portion of the graph and then traverse it?
2. Can you point out the Traverse classes? Or a unit test?

The Traversing Index is a fallback that Oak’s built-in query engine uses if no
“real” index is able to answer a specific query (this implies that all your
queries should be backed by indexes). If the traversal index is used then the 
query engine will traverse the relevant parts of the tree (relevant == the tree 
specified in your query). Whether this traversal happens in memory, on disc or 
else is a concern of the lower level persistence layer and thus transparent to 
the query engine.
You can find related code here: 
https://github.com/apache/jackrabbit-oak/search?utf8=%E2%9C%93=traversingindex=Code
(but please note: if you see traversals in the log, this means that you should
add an index)
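
As a concrete example (a sketch; the query and property names are illustrative): without a matching index the following query is answered by the Traversing Index and visits every node in the given subtree, while with a property index on idCard it becomes an index lookup.

// assumes an existing javax.jcr.Session "session"
QueryManager qm = session.getWorkspace().getQueryManager();
Query q = qm.createQuery(
        "SELECT * FROM [nt:base] AS doc "
        + "WHERE doc.[idCard] = 'ABC-123' "
        + "AND ISDESCENDANTNODE(doc, '/content/documents')",
        Query.JCR_SQL2);
QueryResult result = q.execute();
// without an index on idCard: traversal of /content/documents (logged as a traversal warning)
// with a property index on idCard: a direct lookup in the index content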

Instead, for the RDBMS question: I noticed that with our simple class, the first time
I can add the node. The second time we obtain an error loading RepositoryImpl.
In detail, when MutableTree tries to execute beforeWrite, it throws an
IllegalStateException ("this tree does not exist").

It is hard to give a proper answer, but you mention “MutableTree”, which leads
me to suspect that you have initialized/used Oak-internal classes. On the
application layer you should only use the JCR API to interact with the
repository.

HTH
Michael


On 07/03/16 15:15, "Ancona Francesco" 
> wrote:

Hi,
sorry if I continue to ask you about these critical questions, but we'd like to
build on OAK a platform that manages over 200M documents, so we'd like to know
in depth how OAK works.

Query Engine
1. I didn't understand how Traverse physically recovers the graph to traverse.
Is it provided in memory? Does it make a search on the filesystem or DB to obtain the
correct portion of the graph and then traverse it?
2. Can you point out the Traverse classes? Or a unit test?

Instead, for the RDBMS question: I noticed that with our simple class, the first time
I can add the node. The second time we obtain an error loading RepositoryImpl.
In detail, when MutableTree tries to execute beforeWrite, it throws an
IllegalStateException ("this tree does not exist").

Thanks in advance,
best regards

-----Original Message-----
From: Julian Reschke [mailto:julian.resc...@gmx.de]
Sent: Friday, 4 March 2016 08:09
To: oak-dev@jackrabbit.apache.org
Subject: Re: R: R: Critical questions about OAK

On 2016-03-03 15:48, Ancona Francesco wrote:
Yes, but I'm asking if there is a way or a configuration to use an RDBMS through the
JCR repository like the Oak examples in Getting Started.

final DocumentMK.Builder builder = new DocumentMK.Builder();
builder.setBlobStore(createFileSystemBlobStore());
final DocumentNodeStore ns = getRDBDocumentNodeStore(builder);
Oak oak = new Oak(ns);
Jcr jcr = new Jcr(oak);
Repository repo = jcr.createRepository();

Thanks.

It looks like some RepositoryInitializer is missing (AFAIU, it would take care 
of creating the initial content).

Best regards, Julian









Re: oak-resilience

2016-03-08 Thread Michael Marth
Love it!

Ideas in addition what was already mentioned:
* network deterioration in Cold Standby setups
* OOM in TarMK setups (either on-heap or off-heap)
* out of disc space in TarMK setups
* out of disc space for persistent cache



On 07/03/16 09:30, "Chetan Mehrotra"  wrote:

>Cool stuff Tomek! This was something which was discussed in last
>Oakathon so great to have a way to do resilience testing
>programatically. Would give it a try
>Chetan Mehrotra
>
>
>On Mon, Mar 7, 2016 at 1:49 PM, Stefan Egli  wrote:
>> Hi Tomek,
>>
>> Would also be interesting to see the effect on the leases and thus
>> discovery-lite under high memory load and network problems.
>>
>> Cheers,
>> Stefan
>>
>> On 04/03/16 11:13, "Tomek Rekawek"  wrote:
>>
>>>Hello,
>>>
>>>For some time I've worked on a little project called oak-resilience. It
>>>aims to be a resilience testing framework for the Oak. It uses
>>>virtualisation to run Java code in a controlled environment, that can be
>>>spoilt in different ways, by:
>>>
>>>* resetting the machine,
>>>* filling the JVM memory,
>>>* filling the disk,
>>>* breaking or deteriorating the network.
>>>
>>>I described currently supported features in the README file [1].
>>>
>>>Now, once I have a hammer I'm looking for a nail. Could you share your
>>>thoughts on areas/features in Oak which may benefit from being
>>>systematically tested for the resilience in the way described above?
>>>
>>>Best regards,
>>>Tomek
>>>
>>>[1]
>>>https://github.com/trekawek/jackrabbit-oak/tree/resilience/oak-resilience
>>>
>>>--
>>>Tomek Rękawek | Adobe Research | www.adobe.com
>>>reka...@adobe.com
>>>
>>
>>


Re: R: info on queries and index

2016-02-27 Thread Michael Marth
Hi,

the simplest approach is to just use the built-in Lucene. That pretty much 
rules out the problems you mention (external server overloaded or not 
reachable).
Losing an index is a problem in any architecture. Re-indexing would happen
faster with the built-in Lucene as no content has to be transported over a network.
Solr is a useful option if you intend to leverage Solr-specific features that 
do not exist in Lucene.

HTH
Michael




On 26/02/16 02:17, "Ancona Francesco"  wrote:

>Hello,
>so if Lucene or Solr have a problem or are busy for some reasons, we can't 
>search anything, if i understand.
>
>So, I imagine, we have to be very careful with the search engine, which is a
>potential single point of failure if it goes down or if it loses the index and so
>has to make a full reindex.
>
>What kind of topology (application and search engine) do you suggest to 
>mitigate this problem ?
>
>Thanks in advance,
>best regards
>
>-----Original Message-----
>From: Davide Giannella [mailto:dav...@apache.org]
>Sent: Friday, 26 February 2016 10:17
>To: oak-dev@jackrabbit.apache.org
>Subject: Re: info on queries and index
>
>On 25/02/2016 16:40, Ancona Francesco wrote:
>>
>> Hello,
>>
>> we'd like to study queries and indexes in depth. In particular it is not so 
>> clear, from the documentation, what is indexed by default.
>>
>> For instance if i create a new type of document (IdentityCard with 
>> name IDC) with 2 new properties (idCard and idGeneralAnagrafic) are 
>> these data (ie metadata) indexed ?
>>
>Short answer: oak does not index anything by default.
>
>Long one. It depends by how you construct the repository. For example if you 
>build a JCR repository by providing the InitialContent RepositoryInitializer 
>(0), you'll see that it creates some index definitions (1): uuid, nodetype and 
>counter.
>
>(0) https://goo.gl/MNpam7
>(1) https://goo.gl/G6RChL
>>
>>  
>>
>> And in that case the search is delegated to db (mongo or postgres that 
>> store metadata) or is delegated to solr or lucene ?
>>
>As it is now, Oak does not delegate to the persistence any of the searches. We 
>don't have plans to do so as far as I know. In oak we have mainly 2 types of 
>indexs: PropertyIndex and  LuceneIndex. You can find more details starting 
>from (3)
>
>(3) http://goo.gl/vfMJm3
>
>>  Finally, if we store a few million documents, what kind of strategy 
>> would you suggest for the search?
>>
>>  
>>
>The main strategy around searches is that the smaller the index, the faster the 
>query. So fine tuning the indexes is the main strategy for fast queries. 
>Depending on the index you use, one strategy will make more sense than the 
>other. As a rule of thumb I'd say it doesn't matter which index you use, as 
>long as you keep the content in a decent structure. For example the 
>LucenePropertyIndex can evaluate multiple conditions and path restrictions as 
>well.
>
>When defining an index you can specify which paths it should index, therefore 
>making the index as accurate as possible. It's a tradeoff you'll have to find 
>yourself, as with all performance tuning. Again I'd start with (3).
>
>HTH
>Davide
>
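
As a concrete illustration of the index definitions discussed above, a custom
property index can be created as plain content under /oak:index via the JCR API.
This is only a sketch: the idCard property name is taken from the original
question, and session is assumed to be an existing admin session:

import javax.jcr.Node;
import javax.jcr.PropertyType;
import javax.jcr.Session;
import javax.jcr.Value;

// sketch: define a property index for the "idCard" property
Node oakIndex = session.getNode("/oak:index");
Node idx = oakIndex.addNode("idCard", "oak:QueryIndexDefinition");
idx.setProperty("type", "property");
Value name = session.getValueFactory().createValue("idCard", PropertyType.NAME);
idx.setProperty("propertyNames", new Value[] { name });
idx.setProperty("reindex", true);
session.save();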


Re: R: info about jackrabbitoak.

2016-02-23 Thread Michael Marth
Hi,

I am not a committer, so this is not an authoritative answer:


1.   Could you give jackrabbit and jackrabbitoak roadmap (when the fusion 
is expected)?


Jackrabbit is the reference implementation for JSR-283. As such it needs to 
cover the breadth of the full spec. Oak on the other hand implements a part of 
the spec only (what the committers consider the most useful features) and 
focusses on scalability. As such, I do not think those 2 efforts should merge, 
given the different goals.


2.   Have you got clients that use jackrabbit oak ?


Oak is being used in Adobe’s Experience Manager product which has a lot of 
deployments


3.   Is there any company that can give support on jackrabbit oak ?


Not to my knowledge.


4.   Could you give us stress test result (if you have)


Scalability and performance tests depend very much on the exact workload 
(read, write, payload type, query, etc.). To get meaningful results for your 
case it makes sense to tweak the existing test suite to your needs. [1]



Re a non-Osgi setup please see [2]


HTH

Michael



[1] 
https://github.com/apache/jackrabbit-oak/tree/trunk/oak-run/src/main/java/org/apache/jackrabbit/oak/benchmark

[2] https://github.com/apache/jackrabbit-oak/tree/trunk/oak-examples/standalone


From: Ancona Francesco
Reply-To: "oak-dev@jackrabbit.apache.org"
Date: Tuesday 23 February 2016 03:09
To: "oak-dev@jackrabbit.apache.org"
Subject: R: info about jackrabbitoak.

Hi,
in the previous mail I forgot the following questions:

· Our runtime target is a J2EE container, but I read that the correct 
runtime for Oak is OSGi. Is that true?

If an OSGi runtime is the best solution, what do you suggest for running Oak in 
our environment:

-  WildFly with the OSGi subsystem to also run the Oak services?

-  Embedded Felix?

Thanks in advance
Best regards

From: Ancona Francesco
Sent: Monday, 22 February 2016 16:29
To: 'oak-dev@jackrabbit.apache.org'
Subject: info about jackrabbitoak.

Hello.
I’m Francesco Ancona; I’m a Senior Software Architect at Siav S.p.A.
We are a company in the ECM arena and we are upgrading and extending our business 
and technical offer to our clients, either in SaaS or on premise.
We are currently in a .NET environment but we wish to migrate to Java and we are 
looking for an open source ECM platform on which to build our services.

We have, for instance, clients such as Bank of Italy or Menarini; we have 
installations that manage up to 250M documents.

So we are looking for an ECM framework to start the implementation. The current 
software selection involves: ModeShape, Jackrabbit and Jackrabbit Oak.
In particular we’d like to know:

1.   Could you give jackrabbit and jackrabbitoak roadmap (when the fusion 
is expected)?

2.   Have you got clients that use jackrabbitoak ?

3.   Is there any company that can give support on jackrabbitoak ?

4.   Could you give us stress test result (if you have)

Thanks in advance
Best regards


Francesco Ancona | Software Dev. Dept. (SP) - Software Architect
tel. +39 049 8979797 | fax +39 049 8978800 | cel. +39 3299060325
e-mail: francesco.anc...@siav.it | www.siav.it




Re: bulk updates heuristics

2015-12-16 Thread Michael Marth
Hi Tomek,

Trying to wrap my head around this… So this is just a thought dump :)

First off, my example of the root document was probably a bad one, as direct 
root modifications will be rare. The root node will mostly be modified by the 
background thread. A better example might be a property index’s root. Is that 
correct?
(not that it matters a lot - just for understanding the problem better).

I wondered if we could find optimal parameters through tests, i.e. Find the 
value at which applying the fallback right away is overall cheaper than 
re-trying bulk updates 3 times. The problem of course is that I imagine this to 
depend heavily on the write pattern.
Related to this: do you have numbers on the performance difference between a) 
going to fallback directly and b) trying 3 (failing) bulk updates first? My 
point being: I wonder how much value is in tweaking the exact parameters.

Cheers
Michael



On 15/12/15 14:04, "Tomek Rekawek" <reka...@adobe.com> wrote:

>Hi Michael,
>
>The algorithm forgets history after 1h, so yes, it’ll include the root 
>document again when it no longer has 20 fresh records about failures/successes.
>
>Let’s assume that there’re 5 bulk operations every minute and root conflicts 
>in 4 of them:
>
>12:00 - root failed 5 times (success: 1, failures: 4)
>12:01 - root failed 5 times (s: 2, f: 8)
>12:02 - root failed 5 times (s: 3, f: 12)
>12:03 - root failed 5 times (s: 4, f: 16)
>
>At this point root won’t be included in the bulk update (as we have 20 samples 
>with 75% failure rate). At 13:00 we’ll forget about 5 failures from the 12:00. 
>The history will be too small (15 entries) to make a decision, so the root will 
>be included again in the bulk update.
>
>
>I thought that there may be cases in which “being a hotspot” is a temporary 
>condition, that’s why I didn’t want to block documents forever. We can improve 
>this by increasing history TTL depending on the failure rate. For instance, a 
>document failing in 100% may be blocked for 3 hours, not just one.
>
>Also, it’s worth mentioning that a conflicting document doesn’t cause the 
>whole bulk update to fail. The batch result contains a list of successful and 
>failed modifications and we’re trying to re-apply only the latter. There are 3 
>iterations of the bulk updates and after that there’s a sequential fallback 
>for the remaining ones. The above algorithm redirects hotspots directly to the 
>fallback.
>
>Best regards,
>Tomek
>
>On 15/12/15 12:47, "Michael Marth" <mma...@adobe.com> wrote:
>
>>Hi Tomek,
>>
>>I like the statistical approach to finding the hotspot documents.
>>However, I have a question about the criterion “conflicted in more than 50% 
>>cases”:
>>
>>Let’s say root conflicts often (more than 50%). In the proposed algorithm you 
>>would then remove it from bulk updates. So for the next 1h there would not be 
>>conflicts on root in bulk updates. But, after that: would the algorithm 
>>basically start with fresh data, find that there are no conflicts in root and 
>>therefore re-add it to bulk updates? Meaning that conflicting documents would 
>>move in and out of bulk updates periodically?
>>Or do you envision that removal from bulk updates would be forever, once a 
>>document is removed?
>>
>>Michael
>>
>>
>>
>>
>>On 15/12/15 11:35, "Tomek Rekawek" <reka...@adobe.com> wrote:
>>
>>>Hello,
>>>
>>>The OAK-2066 contains a number of patches, which finally will lead to use 
>>>batch insert/update operations available in RDB and Mongo. It’ll increase 
>>>the performance of applying a commit, especially when we have many small 
>>>updates of different documents.
>>>
>>>There are some documents that shouldn’t be included in the batch update, 
>>>because they are changing too often (like root). Otherwise, they’ll cause a 
>>>conflict and we need to send another bulk update, containing only failing 
>>>documents, etc. (detailed description can be found in OAK-3748). It would be 
>>>good to find such documents, extract them from the bulk operation and update 
>>>them sequentially, one after another.
>>>
>>>I prepared OAK-3748, which uses following way to find the hotspots: if the 
>>>document was included in at least 20 bulk operations during the last 1h and 
>>>it conflicted in more than 50% cases, it should be extracted from the future 
>>>bulk updates. The first two constraints makes it self refreshing - after a 
>>>while the number of bulk operations in which the “blocked" document was 
>>>included during the last hour will be less than 20 (all constants are 
>>>configurable).
>>>
>>>I’d appreciate a feedback, both on the “algorithm” and on the implementation 
>>>in OAK-3748.
>>>
>>>Best regards,
>>>Tomek
>>>
>>>-- 
>>>Tomek Rękawek | Adobe Research | www.adobe.com
>>>reka...@adobe.com
>>>
>>>
>>>
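
To make the discussed heuristic a bit more concrete, the bookkeeping could be
sketched as follows (illustrative Java with made-up names; the actual OAK-3748
implementation may differ):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class HotspotDetector {
    private static final int MIN_SAMPLES = 20;
    private static final double FAILURE_THRESHOLD = 0.5;
    private static final long TTL_MS = 60L * 60 * 1000; // samples older than 1h are forgotten

    // per document id: (timestamp, failed?) samples from recent bulk updates
    private final Map<String, Deque<long[]>> samples = new HashMap<>();

    void record(String id, boolean failed, long now) {
        samples.computeIfAbsent(id, k -> new ArrayDeque<>())
               .addLast(new long[] { now, failed ? 1 : 0 });
    }

    // true if the document should skip bulk updates and go straight to the sequential fallback
    boolean isHotspot(String id, long now) {
        Deque<long[]> s = samples.getOrDefault(id, new ArrayDeque<>());
        while (!s.isEmpty() && now - s.peekFirst()[0] > TTL_MS) {
            s.removeFirst(); // self-refreshing: old samples drop out after 1h
        }
        if (s.size() < MIN_SAMPLES) {
            return false; // not enough fresh history to make a decision
        }
        long failures = s.stream().filter(e -> e[1] == 1).count();
        return failures > FAILURE_THRESHOLD * s.size();
    }
}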


Re: bulk updates heuristics

2015-12-15 Thread Michael Marth
Hi Tomek,

I like the statistical approach to finding the hotspot documents.
However, I have a question about the criterion “conflicted in more than 50% 
cases”:

Let’s say root conflicts often (more than 50%). In the proposed algorithm you 
would then remove it from bulk updates. So for the next 1h there would not be 
conflicts on root in bulk updates. But, after that: would the algorithm 
basically start with fresh data, find that there are no conflicts in root and 
therefore re-add it to bulk updates? Meaning that conflicting documents would 
move in and out of bulk updates periodically?
Or do you envision that removal from bulk updates would be forever, once a 
document is removed?

Michael




On 15/12/15 11:35, "Tomek Rekawek"  wrote:

>Hello,
>
>The OAK-2066 contains a number of patches, which finally will lead to use 
>batch insert/update operations available in RDB and Mongo. It’ll increase the 
>performance of applying a commit, especially when we have many small updates 
>of different documents.
>
>There are some documents that shouldn’t be included in the batch update, 
>because they are changing too often (like root). Otherwise, they’ll cause a 
>conflict and we need to send another bulk update, containing only failing 
>documents, etc. (detailed description can be found in OAK-3748). It would be 
>good to find such documents, extract them from the bulk operation and update 
>them sequentially, one after another.
>
>I prepared OAK-3748, which uses following way to find the hotspots: if the 
>document was included in at least 20 bulk operations during the last 1h and it 
>conflicted in more than 50% cases, it should be extracted from the future bulk 
>updates. The first two constraints makes it self refreshing - after a while 
>the number of bulk operations in which the “blocked" document was included 
>during the last hour will be less than 20 (all constants are configurable).
>
>I’d appreciate a feedback, both on the “algorithm” and on the implementation 
>in OAK-3748.
>
>Best regards,
>Tomek
>
>-- 
>Tomek Rękawek | Adobe Research | www.adobe.com
>reka...@adobe.com
>
>
>


Re: Oak planning boards

2015-11-18 Thread Michael Marth
Hi,

I have added some new Quick Filters to the issue board to make it easier to 
navigate:
TestFailure/Not-TestFailure and
Bug/Improvement/New Feature

Cheers
Michael



On 13/11/15 11:15, "Michael Marth" <mma...@adobe.com> wrote:

>Right - I fixed that.
>
>Cheers
>Michael
>
>
>
>On 12/11/15 16:01, "Davide Giannella" <dav...@apache.org> wrote:
>
>>On 11/11/2015 15:19, Michael Marth wrote:
>>> Hi all,
>>>
>>> In order to get a better view on planned work for Oak 1.4 I have created 2 
>>> Kanban boards:
>>>
>>>   *   Epics: 
>>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=93
>>>   *   Issues: 
>>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=20
>>>
>>> This should hopefully reduce the usage of fix version as a planning tool.
>>>
>>> Only 2 swim lanes in both boards: one for fix version 1.4 (and 1.3.x for 
>>> now) and one for the rest. The issues in the board are ordered by “rank” so 
>>> you can re-arrange according to prio.
>>>
>>> If you need additional quick filters or anything else let me know.
>>>
>>
>>Speaking of the Oak Board I think we have to do something with the
>>`done` column. It should include all resolved and or closed with
>>fixversion 1.4 or 1.3.x.
>>
>>Cheers
>>Davide


Re: Oak planning boards

2015-11-13 Thread Michael Marth
Right - I fixed that.

Cheers
Michael



On 12/11/15 16:01, "Davide Giannella" <dav...@apache.org> wrote:

>On 11/11/2015 15:19, Michael Marth wrote:
>> Hi all,
>>
>> In order to get a better view on planned work for Oak 1.4 I have created 2 
>> Kanban boards:
>>
>>   *   Epics: 
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=93
>>   *   Issues: 
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=20
>>
>> This should hopefully reduce the usage of fix version as a planning tool.
>>
>> Only 2 swim lanes in both boards: one for fix version 1.4 (and 1.3.x for 
>> now) and one for the rest. The issues in the board are ordered by “rank” so 
>> you can re-arrange according to prio.
>>
>> If you need additional quick filters or anything else let me know.
>>
>
>Speaking of the Oak Board I think we have to do something with the
>`done` column. It should include all resolved and or closed with
>fixversion 1.4 or 1.3.x.
>
>Cheers
>Davide


Oak planning boards

2015-11-11 Thread Michael Marth
Hi all,

In order to get a better view on planned work for Oak 1.4 I have created 2 
Kanban boards:

  *   Epics: https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=93
  *   Issues: https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=20

This should hopefully reduce the usage of fix version as a planning tool.

Only 2 swim lanes in both boards: one for fix version 1.4 (and 1.3.x for now) 
and one for the rest. The issues in the board are ordered by “rank” so you can 
re-arrange according to prio.

If you need additional quick filters or anything else let me know.

Cheers
Michael


Re: Oak planning boards

2015-11-11 Thread Michael Marth
Aye, done




On 11/11/15 16:25, "Marcel Reutegger" <mreut...@adobe.com> wrote:

>Hi Michael,
>
>thanks for setting up the boards.
>
>Can you please create a new quick filter for the 'documentmk' component?
>
>See also: 
>http://mail-archives.apache.org/mod_mbox/jackrabbit-oak-dev/201510.mbox/%3CD257AD91.42FA0%25mreutegg%40adobe.com%3E
>
>Regards
> Marcel
>
>On 11/11/15 16:19, "Michael Marth" wrote:
>
>Hi all,
>
>In order to get a better view on planned work for Oak 1.4 I have created 2 
>Kanban boards:
>
>  *   Epics: https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=93
>  *   Issues: 
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=20
>
>This should hopefully reduce the usage of fix version as a planning tool.
>
>Only 2 swim lanes in both boards: one for fix version 1.4 (and 1.3.x for now) 
>and one for the rest. The issues in the board are ordered by "rank" so you can 
>re-arrange according to prio.
>
>If you need additional quick filters or anything else let me know.
>
>Cheers
>Michael
>


Re: Lucene auto-tune of cost

2015-11-09 Thread Michael Marth
Hi,

Afaiu your proposal would mix 2 different concerns: a) the query plan, which 
should be guided by the cost to execute a query and b) how up-to-date an index 
is.
I do not see why b) should interfere with a) unless you have a situation where 
an async index is so far behind that it should not be used at all.

From your original mail:

Sometimes we could have the same query better served by a property
index, or traversing for example.

Could you please specify “better” in the proposal above?

Thanks!
Michael

On 04/11/15 17:37, "Davide Giannella" wrote:

On 04/11/2015 00:49, Ian Boston wrote:
...
Going down the property index route, for a DocumentMK instance will bloat
the DocumentStore further. That already consumes 60% of a production
repository and like many in DB inverted indexes is not an efficient storage
structure. It's probably ok for TarMK.

Traversals are a problem for production. They will create random outages
under any sort of concurrent load.

---
If the way the indexing was performed is changed, it could make the index
NRT or real time depending on your point of view. eg. Local indexes, each
Oak index in the cluster becoming a shard with replication to cover
instance unavailability. No more indexing cycles, soft commits with each
instance using a FS Directory and a update queue replacing the async
indexing queue. Query by map reduce. It might have to copy on write to seed
new instances where the number of instances falls below 3.


I didn't mean to replace the lucene indexes with property or traversing.
The index definitions remain the same. It's only about the cost of the plan:
Lucene itself is aware of when the last successful indexing happened and, if
lagging behind, corrects its own cost accordingly.

Something like saying: hey, I'm normally ok in serving this query but I
know I may be slightly/highly out of date and therefore I'm giving room
to any other potential index that could serve the query.

So if there won't be any other indexes to serve the query, lucene will
still do it.

Davide





Re: v1.3.9 - http state of play?

2015-11-09 Thread Michael Marth
Hi Francesco,

How does the API mentioned below relate to the work you did in OAK-2502 ?

Michael




On 09/11/15 10:28, "Lukas Kahwe Smith"  wrote:

>
>> On 09 Nov 2015, at 10:14, Davide Giannella  wrote:
>> 
>>> On 09/11/2015 00:55, Jason Harrop wrote:
>>> Hi guys,
>>> 
>>> I have a standalone server running:
>>> 
>>>java -jar oak-run-1.3.9.jar server
>>> 
>>> What's the current state of play for talking to it via http?
>> 
>> oak-run provides a very primitive http interface that is not really
>> maintained. If you want a fully functional HTTP interface I would
>> suggest to use the latest Sling.
>> 
>> Team, shall we deprecate/remove the server runmode from oak-run? I don't
>> recall we use it anywhere.
>
>I think this question is really a key question that the Oak community needs to 
>answer:
>Does the Oak community care for use cases outside of the Java world?
>
>The http interface and standalone server option are key there to allow 
>interaction with Oak and to make at least the initial setup easy for non Java 
>experts.
>
>Obviously as one of the maintainers of PHPCR I would love to hear that Oak is 
>fully committed to providing a content repository for the general world and 
>not just the Java community.
>
>regards,
>Lukas
>


Re: JCR node name and non space whitespace chars like line break etc

2015-09-14 Thread Michael Marth
Hi Chetan,

Given that JR2 did not allow those characters I see no good reason why Oak 
should.

my2c
Michael




On 14/09/15 11:47, "Chetan Mehrotra"  wrote:

>Hi Team,
>
>While looking into OAK-3395 it was realized that in Oak we allow node
>names with non-space whitespace chars like \t, \r etc. This is
>currently causing problems in the DocumentNodeStore logic (which can be
>fixed).
>
>However it might be better to prevent such node names from being created, as
>they can cause other problems. Especially since JR2 does not allow creation
>of such node names [1]
>
>So the question is
>
>Should Oak allow node names with non space whitespace chars like \t, \r etc
>
>Chetan Mehrotra
>[1] 
>https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-spi-commons/src/main/java/org/apache/jackrabbit/spi/commons/conversion/PathParser.java#L257


Re: Queries related to Oak

2015-09-03 Thread Michael Marth
Hi Soumya,

Welcome to the list. This is the right place to ask questions.

Best regards
Michael


From: "Banerjee, Soumya J"
Reply-To: "oak-dev@jackrabbit.apache.org"
Date: Thursday 3 September 2015 08:01
To: "oak-dev@jackrabbit.apache.org"
Subject: Queries related to Oak

Hi,
My team in Tesco, is currently working on developing a platform that uses 
Jackrabbit Oak as a repository. We are facing a few problems with Oak and have 
a few queries as well.
Also, we are doing a few things in Oak that we are not very sure about (if it 
is the right way to do it). Can we get in touch with someone who may help us 
resolve our doubts and queries?
Thanks in advance.

Thanks & Regards,
Soumya Jyoti Banerjee
(Sr. Software Engineer)





Re: Repo Inconsistencies due to OAK-3169

2015-08-28 Thread Michael Marth
Upon further consideration, here’s an alternative proposal.

To recap the problem:
Say version 1.0.x has an issue that leads to repo inconsistencies or even 
(repairable) data loss
The issue is fixed in 1.0.y, but users running an inbetween version might have 
experienced that data loss (maybe without noticing).

If we only put the repair code in oak-run then users first need to notice the 
problem, before even running the repair. Otoh I still think that we should stay 
clear of complicating the core code.

So, alternative proposal: put such repair code in a new module (bundle), say, 
oak-selfheal. That way we could at least keep core clean from that.
In this case oak-jcr could automatically detect the problem and invoke the 
repair code.

However, I am not sure if we have many of these situations where a) the problem 
can be detected and b) repaired automatically to warrant such a new module.

Thoughts?
Michael




On 24/08/15 10:55, Marcel Reutegger mreut...@adobe.com wrote:

Hi,

On 24/08/15 10:02, Michael Marth wrote:
IMO we should collect such repair actions in oak-run. This should be a
one-off action that should not complicate code in oak-core I think.

agreed. in addition, the tool could be split into two phases.
in the first phase the tool just checks the repository and finds
all the dangling references. this phase could also be run on
a copy of the data. in a second phase the tool fixes nodes
based on the output from the first phase. this reduces impact
on the production instance.

Regards
 Marcel



Re: System.exit()???? , was: svn commit: r1696202 - in /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document: ClusterNodeInfo.java DocumentMK.java DocumentNodeStore.j

2015-08-18 Thread Michael Marth
I think option b) would be not so bad - maybe by starting a commit hook that 
denies any commit?
(and screaming loudly in the logs)
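
For illustration, such a "deny all commits" hook could look roughly like this
(a sketch against the CommitHook SPI in oak-core; wiring it up once a lease
failure is detected is left out):

import org.apache.jackrabbit.oak.api.CommitFailedException;
import org.apache.jackrabbit.oak.spi.commit.CommitHook;
import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
import org.apache.jackrabbit.oak.spi.state.NodeState;

public class DenyCommitsHook implements CommitHook {

    @Override
    public NodeState processCommit(NodeState before, NodeState after, CommitInfo info)
            throws CommitFailedException {
        // scream loudly in the logs and refuse every commit
        throw new CommitFailedException(CommitFailedException.STATE, 0,
                "Lease for this cluster node has expired - refusing all commits");
    }
}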




On 18/08/15 11:24, Julian Reschke julian.resc...@gmx.de wrote:

On 2015-08-18 11:14, Stefan Egli wrote:
 On 18/08/15 10:57, Julian Reschke wrote:
 ...
 Hi Julian,

 The idea is indeed that if an instance fails to update the lease then it
 will be considered by other instances in the cluster as dead/crashed -
 even though it still continues to function. It is the only one that is
 able to detect such a situation. Imv letting the instance shutdown is at
 this moment the only reasonable reaction as upper level code might
 otherwise continue to function on the assumption it is part of the cluster
 - to which the other instances do not agree, the others consider this
 instance as died.

 So taking one step back: the lease becomes a vital part of the functioning
 of Oak indeed.

 I see three alternatives:

 a) Oak itself behaves fail-safe and does the System.exit (that's the path
 I have suggested for now)

 b) Oak does not do the System.exit but refuses to update anything towards
 the document store (thus just throws exceptions on each invocation) - and
 upper level code detects this situation (eg a Sling Health Check) and
 would do a System.exit based on how it is configured

 c) same as b) but upper level code does not do a System.exit (I'm not sure
 if that makes sense - the instance is useless in such a situation)

 d) none of the above and Oak tries to rejoin the cluster and continues to
 function (in my view this will not result in unmanageable edge cases)
 ...

Yes, we need to think about how to stop Oak in this case. However I do 
not think that stopping the *VM* is something we can do here. Keep in 
mind that there might be many other things running in the VM which have 
nothing to do with the content repository.

Best regards, Julian


Re: Release dates

2015-08-13 Thread Michael Marth
+1




On 13/08/15 10:17, Stefan Egli stefanegli.apa...@gmail.com wrote:

I'd find it more useful (for us) when it would be the cut-date.

Cheers,
Stefan

On 13/08/15 10:08, Davide Giannella dav...@apache.org wrote:

Hello team,

a trivia question about release dates.

Normally in jira I set the release date on a future release for when we
plan to cut it. But we have the voting process of 72hrs that means the
actual release date will be 3 days after the cut.

Shall we put on jira then the release date as the actual announcement or
stick it to the cut?

Cheers
Davide

 




Re: Release dates

2015-08-13 Thread Michael Marth
+1




On 13/08/15 14:15, Marcel Reutegger mreut...@adobe.com wrote:

I'd leave it at the date the release was cut. this makes it
possible to roughly judge whether a commit made it into the
release or not.


Re: [discuss] Near real time search to account for latency in background indexing

2015-07-24 Thread Michael Marth
Hi Chetan,

Question about the indexing step:
From your description I am not sure how the indexing would be triggered for 
local changes. Probably not through the Async Indexer (this would not gain us 
much, right?). Would this be a Commit Hook?

Michael




On 23/07/15 13:48, Chetan Mehrotra chetan.mehro...@gmail.com wrote:

Hi Team,

As the use of async index like lucene is growing we would need to
account for delay in showing updated result due to async nature of
indexing. Depending on system load the async indexer might lag behind
the latest state by some margin. We have improved quite a bit in terms
of performance but by design there would be a lag and with load that
lag would increase at times.

For e.g. a typical flow in content authoring involves the user
uploading some asset to application. And after uploading the asset he
goes to the authoring view and look for that uploaded asset via
content finder kind of ui. That ui relies on query to show the
available assets. Due to delay introduced by async indexer it would
take some time (10-15 sec)

To account for that we can go for a near real time (NRT*) in memory
indexing which would complement the actual persisted async indexer and
would exploit the fact the request from same user in a give session
would most likely hit same cluster node.

Below is brief proposal - This would require changes in layer above in
Oak but for now focus is on feasibility.

Proposal
===

A - Indexing Side
--

The Lucene index can be configured to support NRT mode. If this mode
is enabled then on each cluster node we would perform AsyncIndex only
for local changes. For such indexer LuceneIndexEditor would use a
RAMDirectory. This directory would only have *recently* modified/added
documents.

B - Query Side
---

On Query side the LucenePropertyIndex would perform search against two
IndexSearcher

1. IndexSearcher based on persisted OakDirectory
2. IndexSearcher obtained from the current active IndexWrite used with
RAMDirectory

Query would be performed against both and a merged cursor [2] would be
returned back

C - Benefits


This approach would allow the user to at least see his modifications
appear quickly in search results and would make the search results
accuracy more deterministic.

This feature need not be enabled globally but can be enabled on per
index basis. Based on business requirement

D- Challenges
---
1. Ensuring that RAMDirectory is bounded and only contain recently
modified documents. The lower limit can be based on last indexed time
from AsyncIndexer. Periodically we would need to prune old documents
from this RAMDirectory

2. IndexUpdate would need to be adapted to support this hybrid model
for same index type - So something to be looked into

Thoughts?

Chetan Mehrotra

NRT - Near real Time is technically a Lucene term
https://wiki.apache.org/lucene-java/NearRealtimeSearch. However, the
approach used here is a bit similar!

[2] Such a merged cursor and performing query against multiple
searcher would anyway be required to support zero downtime kind of
requirement where index content would be split across local and global
instance
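
For illustration only, on the plain Lucene level a query against the persisted
index plus the in-memory one could look roughly like this (this is not the Oak
merged-cursor implementation described above, just the underlying idea):

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

// persistedDir: the directory backed by the persisted async index
// ramDir: the RAMDirectory holding recently indexed local changes
TopDocs search(Directory persistedDir, Directory ramDir, Query query) throws Exception {
    IndexReader persisted = DirectoryReader.open(persistedDir);
    IndexReader recent = DirectoryReader.open(ramDir);
    // one searcher over both indexes; hits from either side show up together
    IndexSearcher searcher = new IndexSearcher(new MultiReader(persisted, recent));
    return searcher.search(query, 10);
}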


Re: [discuss] Near real time search to account for latency in background indexing

2015-07-24 Thread Michael Marth
Ah OK - makes sense!

One more thing I pondered when thinking about the use case:
Would the indexer need to be Lucene-based? It seems what is needed is that 
nodes can be found quickly based on certain properties, but not the additional 
features Lucene provides. So, an alternative for the “fast local indexer” could 
maybe be our PropertyIndex implementation. But it would not persist in the repo 
- only keep the index in-mem.

Reason why I thought about this: it *might* be easier with such an impl to 
evict old items. Also, it *might* be a more lightweight solution.
I have no real arguments or proof for either of these considerations, though.
WDYT?


On 24/07/15 09:09, Chetan Mehrotra chetan.mehro...@gmail.com wrote:

On Fri, Jul 24, 2015 at 12:15 PM, Michael Marth mma...@adobe.com wrote:
 From your description I am not sure how the indexing would be triggered for 
 local changes. Probably not through the Async Indexer (this would not gain 
 us much, right?). Would this be a Commit Hook?

My thought was to use an Observer so as to not add cost to the commit
call. The Observer would listen only for local changes and would invoke
IndexUpdate on the diff

Chetan Mehrotra


Re: [discuss] Near real time search to account for latency in background indexing

2015-07-24 Thread Michael Marth

The reason I preferred using Lucene is that the current
property index only supports single condition evaluation.

I did not know this. That’s a strong argument in favour of using Lucene.


Re: Branches

2015-07-16 Thread Michael Marth
Hi Jim,

I think the most accurate comparison is in Jira, query roughly like this:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20OAK%20AND%20fixVersion%20not%20in%20%281.0.1%2C%201.0.2%2C%201.0.3%2C%201.0.4%2C%201.0.5%2C%201.0.6%2C%201.0.7%2C%201.0.8%2C%201.0.9%2C%201.0.10%2C%201.0.13%2C%201.0.14%2C%201.0.15%2C%201.0.16%29%20AND%20fixVersion%20in%20%281.1.0%2C%201.1.1%2C%201.1.2%2C%201.1.3%2C%201.1.4%2C%201.1.5%2C%201.1.6%2C%201.1.7%2C%201.1.8%2C%201.2.0%2C%201.2.1%2C%201.2.2%2C%201.2.3%29%20and%20resolution%20not%20in%20%28Duplicate%2C%20%22Won%27t%20Fix%22%2C%20%22Cannot%20Reproduce%22%29%20ORDER%20BY%20created%20ASC
 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20OAK%20AND%20fixVersion%20not%20in%20%281.0.1%2C%201.0.2%2C%201.0.3%2C%201.0.4%2C%201.0.5%2C%201.0.6%2C%201.0.7%2C%201.0.8%2C%201.0.9%2C%201.0.10%2C%201.0.13%2C%201.0.14%2C%201.0.15%2C%201.0.16%29%20AND%20fixVersion%20in%20%281.1.0%2C%201.1.1%2C%201.1.2%2C%201.1.3%2C%201.1.4%2C%201.1.5%2C%201.1.6%2C%201.1.7%2C%201.1.8%2C%201.2.0%2C%201.2.1%2C%201.2.2%2C%201.2.3%29%20ORDER%20BY%20created%20ASC

HTH
Michael





On 14/07/15 18:41, Jim.Tully jim.tu...@target.com wrote:

Would it be possible for the Oak documentation to give some kind of indication 
as to the differences between the 1.0.x and 1.2.x branches?

Thanks,

Jim Tully


Re: Managing backport work for issues fixed in trunk

2015-07-03 Thread Michael Marth
+1




On 03/07/15 15:27, Davide Giannella dav...@apache.org wrote:

On 03/07/2015 12:21, Chetan Mehrotra wrote:
 ...I propose we use following labels

 candidate_oak_1_0
 candidate_oak_1_2

sounds good to me.

Davide




Re: S3DataStore leverage Cross-Region Replication

2015-06-30 Thread Michael Marth
Shashank,

In case we think it’s needed to implement multiple chained S3 DSs then I think 
we should model it after Jackrabbit’s Multidatastore which allows arbitrary DS 
implementations to be chained:
http://jackrabbit.510166.n4.nabble.com/MultiDataStore-td4655772.html

Michael




On 30/06/15 12:11, Shashank Gupta shgu...@adobe.com wrote:

Hi Tim,
There is no time bound SLA provided by AWS when a given binary would be 
successfully replicated to destination S3 bucket.  There would be cases of 
missing binaries if mongo nodes sync faster than S3 replication.  Also S3 
replication works between a given pair of buckets. So one S3 bucket can 
replicate to a single S3 destination bucket. 

I think we can implement a tiered S3Datastore which writes/reads to/from 
multiple S3 buckets. The tiered S3DS first tries to read from same-region 
bucket and if not found than fallback to cross-geo buckets. 

 Has this been tested already ? Generally, wdyt ?
No. I suggest to first test cross geo mongo deployment with single S3 bucket. 
There shouldn't be functional issue in using single S3 bucket. Few customers 
use single shared S3 bucket between non-clustered cross-geo jackrabbit2 
repositories in production. 

Thanks,
-shashank
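
A rough sketch of the tiered read described above (BlobSource is a hypothetical
interface used only for illustration, not an Oak or AWS SDK API):

import java.io.IOException;
import java.io.InputStream;

// hypothetical abstraction over a single S3 bucket
interface BlobSource {
    InputStream read(String blobId) throws IOException; // null if not present
}

class TieredBlobSource implements BlobSource {
    private final BlobSource sameRegionBucket; // replicated copy, low latency
    private final BlobSource writingBucket;    // shared cross-geo bucket, always complete

    TieredBlobSource(BlobSource sameRegionBucket, BlobSource writingBucket) {
        this.sameRegionBucket = sameRegionBucket;
        this.writingBucket = writingBucket;
    }

    @Override
    public InputStream read(String blobId) throws IOException {
        InputStream in = sameRegionBucket.read(blobId);
        // fall back to the writing bucket if replication has not caught up yet
        return in != null ? in : writingBucket.read(blobId);
    }
}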




-Original Message-
From: maret.timot...@gmail.com [mailto:maret.timot...@gmail.com] On Behalf Of 
Timothée Maret
Sent: Monday, June 29, 2015 4:05 PM
To: oak-dev@jackrabbit.apache.org
Subject: S3DataStore leverage Cross-Region Replication

Hi,

In a cross region setup using the S3 data store, it may make sense to leverage 
the Cross-Region auto replication of S3 buckets [0,1].

In order to avoid data replication issues it would make sense IMO to allow 
configuring the S3DataStore with two S3 buckets, one for writing and one for 
reading.
The writing bucket would be shared among all instance (from all regions) while 
the reading bucket would be in each region (thus decreasing the latency).
The writing bucket would auto replicate to the reading buckets.

Has this been tested already ? Generally, wdyt ?

Regards,

Timothee



[0]
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
[1] https://docs.aws.amazon.com/AmazonS3/latest/dev/crr.html


Re: [VOTE] Release Apache Jackrabbit Oak 1.2.0

2015-04-09 Thread Michael Marth
 I think the question is rather, are we OK releasing 1.2 with
 this known issue and fix it e.g. in 1.2.1.
 
 IIUC this issue was already present in many previous Oak releases
 and we still published them.
 

(non-binding) +1 to that



Re: [DISCUSS] Enable CopyOnRead feature for Lucene indexes by default

2015-03-31 Thread Michael Marth
+1

 On 31 Mar 2015, at 12:54, Amit Jain am...@ieee.org wrote:
 
 +1
 
 On Tue, Mar 31, 2015 at 4:18 PM, Chetan Mehrotra chetan.mehro...@gmail.com
 wrote:
 
 Hi Team,
 
 The CopyOnRead feature was provided as part of the 1.0.9 release and has been
 in use in quite a few customer deployments. Of late we have had to recommend
 enabling this setting on most of the deployments where queries are found to
 be performing slowly, and it provides considerably better performance.
 
 I would like to enable this feature by default now [1]. Both in trunk and
 in branch.
 
 Would it be fine to do that?
 
 Chetan Mehrotra
 [1] https://issues.apache.org/jira/browse/OAK-2708
 



Re: Efficiently process observation event for local changes

2015-03-30 Thread Michael Marth
fwiw: I think separating queues for listeners interested in local events from a 
queue for listeners interested in global events is a very promising approach.

Cheers
Michael

 On 23 Mar 2015, at 16:03, Chetan Mehrotra chetan.mehro...@gmail.com wrote:
 
 After discussing this further with Marcel and Michael we came to the conclusion
 that we can achieve similar performance by making use of the persistent cache
 for storing the diff. This would require a slight change in the way we interpret
 the diff JSOP. This should not require any change in the current logic related
 to observation event generation. Opened OAK-2669 to track that.
 
 One thing that we might still want to do is to use a separate queue size for
 listeners interested in local events only and those which can work with
 external events. On a system like AEM there are 180 listeners which listen for
 external changes and ~20 which only listen to local changes. So it makes sense
 to have bigger queues for such listeners.
 
 Chetan Mehrotra
 
 On Mon, Mar 23, 2015 at 4:09 PM, Michael Dürig mdue...@apache.org wrote:
 
 
 
 On 23.3.15 11:03 , Stefan Egli wrote:
 
 Going one step further we could also discuss completely moving the
 handling of the 'observation queues' to an actual messaging system.
 Whether this would be embedded to an oak instance or whether it would be
 shared between instances in an oak cluster might be a different question
 (the embedded variant would have less implication on the overall oak
 model, esp also timing-wise). But the observation model quite exactly
 matches the publish-subscribe semantics - it actually matches pub-sub more
 than it fits into the 'cache semantics' to me.
 
 
 Definitely something to try out, given someone find the time for it. ;-)
 Mind you that some time ago I implemented persisting events to Apache Kafka
 [1], which wasn't greeted with great enthusiasm though...
 
 OTOH the same concern regarding pushing the bottleneck to IO applies here.
 Furthermore filtering the persisted events through access control is
 something we need yet to figure out as AC is a) sessions scoped and b)
 depends on the tree hierarchy.
 
 Michael
 
 
 [1] https://github.com/mduerig/oak-kafka
 
 
 
 .. just saying ..
 
 On 3/23/15 10:47 AM, Michael Dürig mdue...@apache.org wrote:
 
 
 
 On 23.3.15 5:04 , Chetan Mehrotra wrote:
 
 B - Proposed Changes
 ---
 
 1. Move the notion of listening to local events to Observer level - So
 upon
 any new change detected we only push the change to a given queue if its
 local and bounded listener is only interested in local. Currently we
 push
 all changes which later do get filter out but we avoid doing that first
 level itself and keep queue content limited to local changes only
 
 
 I think there is no change needed in the Observer API itself as you can
 already figure out from the passed CommitInfo whether a commit is
 external or not. BTW please take care with the term local as there is
 also the concept of session local commits.
 
 
 2. Attach the calculated diff as part of commit info which is attached
 to
 the given change. This would allow eliminating the chances of the cache
 miss altogether and would ensure observation is not delayed due to slow
 processing of diff. This can be done on best effort basis if the diff
 is to
 large then we do not attach it and in that case we diff again
 
 3. For listener which are only interested in local events we can use a
 different queue size limit i.e. allow larger queues for such listener.
 
 Later we can also look into using a journal (or persistent queue) for
 local
 event processing.
 
 
 Definitely something to try out. A few points to consider:
 
 * There doesn't seem to be too much of a difference to me whether this
 is routed via a cache or directly attached to commits. Either way it
 adds additional memory requirements and churn, which need to be managed.
 
 * When introducing persisted queuing we need to be careful not to just
 move the bottleneck to IO.
 
 * An eventual implementation should not break the fundamental design.
 Either hide it in the implementation or find a clean way to put this
 into the overall design.
 
 Michael
 
 
 
 



Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Marth
Could the Lucene indexer explicitly track these files (e.g. as a property in 
the index definition)? And also take care of removing them? (the latter part is 
assuming that the same index file is not identical across various definitions)

 On 10 Mar 2015, at 12:18, Chetan Mehrotra chetan.mehro...@gmail.com wrote:
 
 On Tue, Mar 10, 2015 at 4:12 PM, Michael Dürig mdue...@apache.org wrote:
 The problem is that you don't even have a list of all previous revisions of
 the root node state. Revisions are created on the fly and kept as needed.
 
 hmm yup. Then we would need to think of some other approach to know
 all the blobId referred to by the Lucene Index files
 
 
 Chetan Mehrotra



Re: Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

2015-03-10 Thread Michael Marth
Hi Chetan,

I like the idea.
But I wonder: how do you envision that this new index cleanup would locate 
indexes in the content-addressed DS?

Michael

 On 10 Mar 2015, at 07:46, Chetan Mehrotra chetan.mehro...@gmail.com wrote:
 
 Hi Team,
 
 With storing of Lucene index files within DataStore our usage pattern
 of DataStore has changed between JR2 and Oak.
 
 With JR2 the writes were mostly application based, i.e. if the application
 stores a pdf/image file then that would be stored in the DataStore. JR2 by
 default would not write its own data to the DataStore. Further, in deployments
 where a large amount of binary content is present, systems tend to
 share the DataStore to avoid duplication of storage. In such cases
 running Blob GC is a non-trivial task as it involves a manual step and
 coordination across multiple deployments. Due to this, systems tend to
 reduce the frequency of GC.
 
 Now with Oak apart from application the Oak system itself *actively*
 uses the DataStore to store the index files for Lucene and there the
 churn might be much higher i.e. frequency of creation and deletion of
 index file is lot higher. This would accelerate the rate of garbage
 generation and thus put lot more pressure on the DataStore storage
 requirements.
 
 Any thoughts on how to avoid/reduce the requirement to increase the
 frequency of Blob GC?
 
 One possible way would be to provide a special cleanup tool which can
 look for such old Lucene index files and deletes them directly without
 going through the full fledged MarkAndSweep logic
 
 Thoughts?
 
 Chetan Mehrotra



Re: Emulating multiple workspaces in Oak?

2015-01-20 Thread Michael Marth
Hi John,

thought about this for a while, but I don’t have a good answer.
Afaics your use case would be best served with workspaces. Until these are 
implemented in Oak one possibility would be to emulate in the app as mentioned 
before.
Under the hood Oak indeed uses branches that are designed to work much like git 
branches, i.e. they are lightweight. However, a) these are not exposed on API 
level and b) the current intended usage for those is for large transactions, 
i.e. they are rather short-lived in nature.
a) could be fixed, but I am not so sure about b). The question is if the 
current branch design would work fine for longer lived branches, as I expect 
that there would be (much) more complicated merge logic, maybe even 
app-specific merge logic.
Maybe someone else has a thoughts about this?

Cheers
Michael


 On 06 Jan 2015, at 23:41, Lukas Kahwe Smith sm...@pooteeweet.org wrote:
 
 
 On 06 Jan 2015, at 16:17, John Gretz jgre...@yahoo.com.INVALID wrote:
 
 Hi Michael,
 What I mean is allowing for multiple authors to work in parallel on the same 
 set of assets and eventually merge the changes back in the main branch after 
 several days. In my mind this roughly translates to Jackrabbit's workspaces 
 (or Git branches).
 
 I agree that this is one use case of workspaces that I would like to see 
 supported, ideally with a copy on write approach that would make user 
 specific workspaces cheap in terms of creation time and storage space.
 
 That being said Jackrabbit 2.x (and I guess therefore JCR) is kind of limited 
 when it comes to merging, for example merging only changes in a parent 
 without also merging changes in the children is afaik not supported, which is 
 likely why afaik none of the big Jackrabbit based CMS use the native merging 
 capabilities and instead reimplement the logic in user land.
 
 regards,
 Lukas Kahwe Smith
 sm...@pooteeweet.org
 
 
 



Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-12 Thread Michael Marth
Hi,

Davide’s proposal (let users specify maximum number of entries per facet) is 
basically a generalisation of my proposal to return a facet if there is more 
than 1 entry in the facet. I think we can try either, but we might want to test 
the performance on cases with large result sets where only few results are 
readable by the user.
AFAIR Amit and Davide have been working on a “micro scalability test framework” 
(measuring how queries scale with content). We could maybe add these tests 
there.

On Ard’s suggestion “possibly incorrect, fast counts”: I think this is only 
feasible if “incorrect” is guaranteed to always be lower than the exact amount. 
Otherwise facets would lead to information leakage as users could find 
information about nodes they otherwise cannot read.

Cheers
Michael


On 10 Dec 2014, at 11:12, Tommaso Teofili tommaso.teof...@gmail.com wrote:

 2014-12-10 10:17 GMT+01:00 Ard Schrijvers a.schrijv...@onehippo.com:
 
 On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org
 wrote:
 On 09/12/2014 17:10, Michael Marth wrote:
 ...
 
 The problematic use cases for counting the facets I have in mind
 are when a query returns millions of results. This is problematic when one
 wants to retrieve the exact size of the result set (taking ACLs into
 account, obviously). When facets are to be retrieved this will be an even
 harder problem (meaning when the exact number is to be calculated per
 facet).
 As an illustration consider a digital asset management application that
 displays mime type as facets. A query could return 1 million images and,
 say, 10 video.
 
 Is there a way we could support such scenarios (while still counting
 results per facet) and have a performant implementation?
 
 We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
 we're done within it, then we can output the actual number. In case
 after 1000 nodes checked we still have some left we can leave the number
 either empty or with something like many, +, or any other fancy way
 if we want.
 
 In the end is the same approach taken by Amazon (as Tommaso already
 pointed) or for example google. If you run a search, their facets
 (Searches related to...) are never with results.
 
 I don't think Amazon and Google have customers that can demand them to
 show correct facet counts...our customers typically do :).
 
 
 I see, however something along the lines of what Davide was proposing
 doesn't sound too bad to me even for such use cases (but I may be wrong).
 
 
 My take on
 on this would be to have a configurable option between
 
 1) exact and possibly slow counts
 2) unauthorized, possibly incorrect, fast counts
 
 Obviously, the second just uses the faceted navigation counts from the
 backing search implementation (with node by node access manager
 check), whether it is the internal lucene index, solr or Elastic
 Search. If you opt for the second option, then, depending on your
 authorization model you can get fast exact authorized counts as well :
 When the authorization model can be translated into a search query /
 filter that is AND-ed with every normal search. For ES this is briefly
 written at [1]. Most likely the filter is internally cached so even
 for very large authorization queries (like we have at Hippo because of
 fine grained ACL model) it will just perform. Obviously it depends
 quite heavily on your authorization model whether it can be translated
 to a query. If  it relies on an external authorization check or has
 many hierarchical constraints, it will be very hard. If you choose to
 have it based on, say, nodetype, nodename, node properties and
 jcr:path (fake pseudo property) it can be easily translated to a
 query. Note that for the jcr:path hierarchical ACL (eg read everything
 below /foo) it is not possible to write a lucene query easily unless
 you index path information as wellthis results in that moves of
 large subtree's are slow because the entire subtree needs to be
 re-indexed. A different authorization model might be based on groups,
 where every node also gets the groups (the token of the group) indexed
 that can read that node. Although I never looked much into the code, I
 suspect [2] does something like this.
 
 
 that's what I had in mind in my proposal #4, the hurdles there relate to
 the fact that each index implementation aiming at providing facets would
 have to implement such an index and search with ACLs which is not trivial.
 One possibly good thing is that this is for sure not a new issue, as you
 pointed out Apache ManifoldCF has something like that for Solr (and I think
 for ES too). One the other hand this would differ quite a bit from the
 approach taken so far (indexes see just node and properties, the
 QueryEngine post filters results on ACLs, node types, etc.), so that'd be a
 significant change.
 
 
 
 So, instead of second guessing which might be acceptable (slow
 queries, wrong counts, etc) for which customers/users I'd try

Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-09 Thread Michael Marth
Hi,

I agree that facets *with* counts are better than without counts, but disagree 
that they are worthless without counts (see the Amazon link Tommaso posted 
earlier on this thread). There is value in providing the information that 
*some* results will appear when a user selects a facet .

The problematic use cases for counting the facets I have in mind are when 
a query returns millions of results. This is problematic when one wants to 
retrieve the exact size of the result set (taking ACLs into account, 
obviously). When facets are to be retrieved this will be an even harder problem 
(meaning when the exact number is to be calculated per facet).
As an illustration consider a digital asset management application that 
displays mime type as facets. A query could return 1 million images and, say, 
10 video.

Is there a way we could support such scenarios (while still counting results 
per facet) and have a performant implementation?

(I should note that I have not tested how long it takes to retrieve and 
ACL-check 1 million nodes - maybe my concern is invalid)

Best regards
Michael


On 09 Dec 2014, at 09:57, Thomas Mueller muel...@adobe.com wrote:

 Hi,
 
 I would like the counts.
 
 I agree. I guess this feature doesn't make much sense without the counts.
 
 1, 2, and 4 seem like
 bad ideas
 
 1 undercuts the idea that we'd use lucene/solr to get decent
 performance. 
 
 Sorry I don't understand... This is just about the API to retrieve the
 data. It still uses Lucene/Solr (the same as all other options). I'm not
 sure if you talk about the performance overhead of converting the facet
 data to a string and back? This performance overhead is very very small (I
 assume not measurable).
 
 Regards,
 Thomas
 



Re: Oak queries for a particular branch?

2014-11-27 Thread Michael Marth
Hi John,

as Thomas mentioned, MK branches are an implementation detail of the micro 
kernels that cannot (should not) be leveraged on higher level.
Your use  case might be solvable by access control (so that authors 1 and 2 see 
different parts of the tree). The query engine honours such ACLs so the 
corresponding result sets would be different.

In Adobe’s AEM6 the use case you describe has been solved on application-level 
(i.e. above the JCR API) by creating a complete copy of the tree under edit and 
merging back after edit (the merge happening on application level again).

Best regards
Michael

On 27 Nov 2014, at 11:12, Thomas Mueller muel...@adobe.com wrote:

 Hi,
 
 Oak internally branches in some cases, but this is not exposed in the JCR
 API (you can not enforce creating a branch).
 
 Regards,
 Thomas
 
 On 26/11/14 18:56, John Gretz jgre...@yahoo.com.INVALID wrote:
 
 Hey guys, a newbie Oak question:
 Since there is support for a single workspace only in Oak, is there a
 good mechanism to simulate scenarios like below using branches in Oak?
 The JCR-compliant queries are unaware of Oak MK branches, so is there a
 way to use the Oak's Query to query on a particular branch only?
 Say in the context of a multi-user authoring env. where users can work on
 different projects simultaneously:
 - author1 runs a query and gets some result
 - author2 runs the same query and sees the same result
 - author1 changes some data in the result set and commits to a branch
 - author1 then runs the query in the context of his branch and sees his changed nodes
 - author2 runs the query and still sees the unchanged nodes from the base revision
 How do people usually manage such scenarios with Oak? Adobe's AEM 6 has
 been using Oak for a few months now, wonder how they handle such
 concurrent projects scenarios...
 Thx!
 



Application hints for conflict handling

2014-11-27 Thread Michael Marth
Hi,

OAK-2287 made me think if we should generalise the approach implemented in 
OAK-2287 (which is just for jcr:lastModified) in order to make conflict 
handling more clever.

For example, consider a property myapp:lastActivated (a time stamp). Should we 
allow such application logic to specify for the conflict handler that the last 
write (highest value) always wins?
There could be complications like another property myapp:lastActivatedBy (a 
user name) where also the last value should win for the node to be in a 
consistent state.

So overall, I am really not sure if this is a good idea or if we need it at all.
Any thoughts?

Michael
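
To make the example concrete, the intended "last write wins" semantics could be
sketched like this (plain Java, not an actual Oak conflict handler API; the
property names are the ones from the example above):

import java.util.Map;

// resolve a conflict on a group of related properties by taking the whole side
// with the higher myapp:lastActivated timestamp, so that the timestamp and
// myapp:lastActivatedBy stay consistent with each other
final class LastActivatedResolver {

    static Map<String, Object> resolve(Map<String, Object> ours, Map<String, Object> theirs) {
        long ourTs = (Long) ours.get("myapp:lastActivated");
        long theirTs = (Long) theirs.get("myapp:lastActivated");
        return ourTs >= theirTs ? ours : theirs;
    }
}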

Re: Getting Started With JackRabbit Oak - A Complete Beginner

2014-11-07 Thread Michael Marth
Hi Bruce,

I can take some of these questions:

 The oak JCR itself is fairly low level and requires a lot of additional
 infrastructure to provide a functional application. I've looked into some
 more enterprisey type systems, like alfresco, magnolia, eXo, etc, and they
 all appear to use the older jackrabbit non-oak version.
 Are there any more comprehensive apps that are currently using oak as a
 foundation?

Adobe’s Experience Manager uses Oak as its foundation (starting with version 6)

 Our needs include:
 - a JCR for binary content, text, pdf, video, audio, etc. All kinds of
 media files, grouped in a hierarchical fashion.

Perfect fit for the JCR content model

 - RBAC for controlling access to content as well as tracking changes by
 user and providing an audit path

Access control is provided.
Tracking changes is not, but it would be simple to write an observation 
listener that writes to a log or so.
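
A hedged sketch of such a listener using the standard JCR observation API (path and log output are examples only):

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.ObservationManager;

public class AuditListener implements EventListener {

    @Override
    public void onEvent(EventIterator events) {
        while (events.hasNext()) {
            Event event = events.nextEvent();
            try {
                // replace with the logging framework of your choice
                System.out.println(event.getUserID() + " changed " + event.getPath());
            } catch (RepositoryException e) {
                e.printStackTrace();
            }
        }
    }

    public static void register(Session session) throws RepositoryException {
        ObservationManager om = session.getWorkspace().getObservationManager();
        om.addEventListener(new AuditListener(),
                Event.NODE_ADDED | Event.NODE_REMOVED | Event.PROPERTY_CHANGED,
                "/",        // watch the whole workspace
                true,       // isDeep
                null, null, // no uuid / node type filter
                false);     // noLocal = false: also report this session's own changes
    }
}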

 - Some way of managing users and permissions - this alone is an argument
 for using a higher level app than direct jcr coding.

User management (as in the UI) is a concern for higher layers. There is an API 
on Jackrabbit level to manage users and groups on repo level (see [1])
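
For illustration, a minimal sketch against that API (user id, password and group name are made up):

import javax.jcr.RepositoryException;
import javax.jcr.Session;

import org.apache.jackrabbit.api.JackrabbitSession;
import org.apache.jackrabbit.api.security.user.Group;
import org.apache.jackrabbit.api.security.user.User;
import org.apache.jackrabbit.api.security.user.UserManager;

public class UserAdmin {

    public static void createEditor(Session session) throws RepositoryException {
        UserManager userManager = ((JackrabbitSession) session).getUserManager();
        User user = userManager.createUser("jdoe", "changeit");
        Group editors = userManager.createGroup("editors");
        editors.addMember(user);
        session.save();
    }
}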

 - Allowing users direct access to the webDAV view of the repo for content
 editing, while tracking edits by user and generating events on edit
 commits.

WebDAV is supported, the same security and user management considerations 
apply. Again, tracking could be implemented as a listener.
One strength of JCR is that these mechanism are independent of the access 
channel (Java API or WebDAV)

 - Some form of workflow management, again, this has been done 1e6 times
 already. Why re-invent. What's available that works with oak/sling?

I am not aware of an open source WF engine that works with JCR content ootb.

 - and of course the push/pull of data into the jcr. This is the primary
 reason for looking at oak, but it's the associated support tasks that are
 pushing for a more fully functional framework.

There is a full import/export feature (via XML). Of course, you can also use 
the Java API for that.
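
A small sketch of the XML route using the standard JCR calls (paths and file names are examples):

import java.io.FileInputStream;
import java.io.FileOutputStream;

import javax.jcr.ImportUUIDBehavior;
import javax.jcr.Session;

public class ContentTransfer {

    public static void exportContent(Session session) throws Exception {
        try (FileOutputStream out = new FileOutputStream("content.xml")) {
            // system view keeps all JCR properties; skipBinary = false, noRecurse = false
            session.exportSystemView("/content", out, false, false);
        }
    }

    public static void importContent(Session session) throws Exception {
        try (FileInputStream in = new FileInputStream("content.xml")) {
            session.importXML("/imported", in, ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW);
            session.save();
        }
    }
}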

 What are the package blocks people are using with oak? Does everyone use
 sling? Is that the only option for oak or are there others?

In my view Sling is very popular, but there is also a Spring connector for JCR.

HTH
Michael


[1] 
http://jackrabbit.apache.org/api/2.4/org/apache/jackrabbit/api/security/user/UserManager.html


On 27 Oct 2014, at 17:19, Bruce Edge bruce.e...@nextissuemedia.com wrote:

 I'm in the same boat as the OP. I'm also having a hard time getting my
 head around both the components within oak, but more so, the question of
 wrapper components that sit on top of the JCR. My apologies for hijacking
 your thread, but I thought it may help to consolidate related rookie info.
 Plus, the subject fits exactly.
 
 The oak JCR itself is fairly low level and requires a lot of additional
 infrastructure to provide a functional application. I've looked into some
 more enterprisey type systems, like alfresco, magnolia, eXo, etc, and they
 all appear to use the older jackrabbit non-oak version.
 Are there any more comprehensive apps that are currently using oak as a
 foundation?
 
 Our needs include:
 - a JCR for binary content, text, pdf, video, audio, etc. All kinds of
 media files, grouped in a hierarchical fashion.
 - RBAC for controlling access to content as well as tracking changes by
 user and providing an audit path
 - Some way of managing users and permissions - this alone is an argument
 for using a higher level app than direct jcr coding.
 - Allowing users direct access to the webDAV view of the repo for content
 editing, while tracking edits by user and generating events on edit
 commits.
 - Some form of workflow management, again, this has been done 1e6 times
 already. Why re-invent. What's available that works with oak/sling?
 - and of course the push/pull of data into the jcr. This is the primary
 reason for looking at oak, but it's the associated support tasks that are
 pushing for a more fully functional framework.
 
 What are the package blocks people are using with oak? Does everyone use
 sling? Is that the only option for oak or are there others?
 
 thanks in advance.
 
 -Bruce
 
 
 
 From:  Michael Dürig mdue...@apache.org
 Reply-To:  oak-dev@jackrabbit.apache.org oak-dev@jackrabbit.apache.org
 Date:  Monday, September 8, 2014 at 1:42 AM
 To:  oak-dev@jackrabbit.apache.org oak-dev@jackrabbit.apache.org
 Subject:  Re: Getting Started With JackRabbit Oak - A Complete Beginner
 
 
 
 Hi Aman,
 
 On 8.9.14 7:44 , Aman Arora wrote:
 
 1.   For a complete beginner to start developing on Jackrabbit
 Oak, we didn't find sufficient resources online.
 
 Unfortunately there is currently not much more than the Oak
 documentation web site at http://jackrabbit.apache.org/oak/docs/, which
 is still work in progress. Fortunately however, Oak implements the JCR
 specification. So unless you want to customise 

Re: Features to be supported while enabling boost support in Lucene Full text index

2014-11-07 Thread Michael Marth
Chetan,

Given existing config is part of 1.0.8 we would need to support both
but users would not be allowed to mix both approaches.

When you refer to “existing config” do you mean only the part that configures 
boosting or more?
If the former: given that 1.0.8 is out only for a couple of days I do not think 
we would create big problems by changing the config syntax in 1.0.9

Michael


Re: Oak documentation and features added in specific versions

2014-10-16 Thread Michael Marth
Hi,

I opt for 2

Michael

On 16 Oct 2014, at 08:01, Chetan Mehrotra chetan.mehro...@gmail.com wrote:

 Hi Team,
 
 I need to update documentation for Lucene based property indexes. This
 is currently in trunk and is planned to be part of Oak 1.0.8. So while
 updating the docs should
 
 1. Update in trunk and then merge to master but deploy to
website from trunk
 2. OR Update in trunk and mention that its a 1.0.8+ feature
 3. OR Have the docs at apache site include version in url.
http://jackrabbit.apache.org/oak/1.0.8/docs/osgi_config.html
 
 Chetan Mehrotra



Re: questions

2014-10-16 Thread Michael Marth
Hi,

If I had to do that I would probably model the ACLs for those state changes on 
application level (in your Workflow engine), not in the repository.

But if you really want to do it in the repository I see 2 possible ways:
1. model the states as child nodes of the item in workflow, e.g.
|
-item
-- draft
Then, you could probably use wild card ACLs such that e.g. only a given group 
can remove nodes named “draft” and add nodes named “approved”.
2. another possible approach is to add your own SecurityProvider (Angela would 
know what the actual name is) that evaluates writes based on your logic.

HTH
Michael


On 13 Oct 2014, at 18:58, TALHAOUI Mohamed m.talha...@rsd.com wrote:

 Hi,
 
 Most probably for the states.
 What about enforcing allowed transition and permissions ?
 Ex :
 state cannot change from DRAFT to APPROVED
 only users with approve privilege can set the state to APPROVE  
 
 What would be your recommendation here ?
 
 Thanks
 
 -Original Message-
 From: Michael Marth [mailto:mma...@adobe.com] 
 Sent: lundi 13 octobre 2014 17:43
 To: oak-dev@jackrabbit.apache.org
 Subject: Re: questions
 
 Hi,
 
 My use case is very basic, I need to bind some LC states to a node type 
 (something like DRAFT, PENDING, REJECTED, APPROVED) and allow a node to 
 follow LC transition in response to a user action or a workflow action.
 
 I would simply add a property with those values to these nodes. Would that 
 work?
 
 Cheers
 Michael
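
For illustration, a minimal sketch of the property-based approach suggested above (property name and values are examples; a real application would use a properly namespaced property):

import javax.jcr.Node;

public class WorkflowState {

    public static void setState(Node item, String state) throws Exception {
        // state is one of DRAFT, PENDING, REJECTED, APPROVED
        item.setProperty("state", state);
        item.getSession().save();
    }

    public static boolean isApproved(Node item) throws Exception {
        return item.hasProperty("state")
                && "APPROVED".equals(item.getProperty("state").getString());
    }
}

Transition rules (e.g. DRAFT must not jump straight to APPROVED) and the check that only privileged users may approve would then live in the workflow engine, as suggested at the top of this thread.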



Re: JCR sorting and array properties

2014-10-15 Thread Michael Marth
Hi,

should we not check what the spec says about sorting MVPs? (and if allowed: 
model the behaviour after JR2?)

Cheers
Michael

On 15 Oct 2014, at 16:20, Amit Jain am...@apache.org wrote:

 What should be the output with
 
 /a {v: [1, 10]}
 /b {v: [2,9]}
 
 
 Shouldn't it be /a because its encountered first for both ascending and
 descending?



Re: Setup Guide

2014-09-23 Thread Michael Marth
Hi,

  - Requirements are to be able to store different first level paths under
  different NFS mounts, i.e. multi-tenancy based on paths.

Not sure if you refer to the NodeStore (where the nodes are stored) or the 
DataStore (binaries).
For the DataStore: this is not implemented and I believe it will be 
conceptually difficult because the DataStore has no notion of the path where a 
binary is referenced (could be many).
For the NodeStore: the most actively maintained NodeStores are based on Tar and 
MongoDB, respectively (upcoming is the RDBMK for relational databases). For Tar 
I cannot see how to easily satisfy that requirement. For MongoDB it would be 
possible by e.g. tag-aware sharding in Mongo.


 - All Files must be stored on a disk with no chunking/segmentation
 FileDataStore??

The File DataStore does not chunk.
(at least the version that was moved from Jackrabbit2, there is also a new 
chunking version)

 Please correct me if I am wrong as this is where it gets confusing.
  - Is there any existing MicroKernel implementation which will provide me
 these requirements. if so, with which implementations

As above: I do not think so.

 - What part of the source code should I start looking/extending  to do this.

To me it is not clear what physical storage mechanism you are aiming for (e.g. 
file system for binaries and nodes or is Mongo an option or is S3 an option or 
does it not matter at all). If you could clarify that I could better comment on 
the above.

Best regards
Michael


On 16 Sep 2014, at 19:10, Eren Erdemli erenerde...@gmail.com wrote:

 Hi All,
 
 
 First of all I am sorry, if my question is already answered and will really
 appreciate a link to it.
 
 I would like to eventually replace my Jackrabbit installation with oak
 however I am having difficulties understanding the concepts.
 
 I would like to setup a OAK Cluster (Load Balanced) with Shared and
 Distributed Storage for a very large repository  over several terabytes at
 the moment,
 
 My Current Setup is ,
 
 I have a REST-based web application which embeds Jackrabbit and uses
 DBFileStore, FileDataStore and DatabaseJournal to gain a load-balanced
 repository.
 
 One of the reasons we would like to move to OAK is its distributed
 architecture.
 
  - Requirements are to be able to store different first level paths under
  different NFS mounts, i.e. multi-tenancy based on paths.
 
 - All Files must be stored on a disk with no chunking/segmentation
 FileDataStore??
 
 - Be able to embed it in existing war project and use it on a load balanced
 rest server or be able to move my web services to installation.
 
 Please correct me if I am wrong as this is where it gets confusing.
  - Is there any existing MicroKernel implementation which will provide me
 these requirements. if so, with which implementations
 If not
 - What part of the source code should I start looking/extending  to do this.
 
 How would one start implementing such scenario, your pointers are
 appreciated.
 
 Thanks in advance
 Eren



unstable releases from trunk

2014-09-12 Thread Michael Marth
Hi all,

I would like to propose that we start to regularly release Oak releases from 
trunk (marked as “unstable”). This would allow downstream projects to start 
incorporating and testing new Oak features without having to use snapshots. The 
next stable release branch could then be 1.2. I believe this is the same 
pattern we already follow in Jackrabbit 2.

WDYT?
Michael

Re: Using Cassandra as Back End for publish

2014-09-04 Thread Michael Marth
Hi Abhijit,

I assume you refer to replication as implemented in Sling and AEM. Those work 
on top of the JCR API, so they are independent of the Micro Kernel 
implementation.

For running Oak on Cassandra you would need a specific MK implementation 
(presumably based on the DocumentMK). Is that something you intend to work on 
(I am sure there would be a lot interest in such an impl).

Best regards
Michael

On 04 Sep 2014, at 11:07, Abhijit Mazumder abhijit.mazum...@gmail.com wrote:

 Hi,
  We are considering using Cassandra as back end for the publish
 environment. In author we are using mongo.
 What are the options we have to customize replication agent to achieve
 this?
 Regards,
 Abhijit



Re: Using Cassandra as Back End for publish

2014-09-04 Thread Michael Marth
Hi,

I think your best guess would be
http://jackrabbit.apache.org/oak/docs/nodestore/documentmk.html
as a general overview (even if skewed towards MongoDB) and looking into
http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/
There is the Mongo impl, as well as the upcoming impl for relational DBs

Cheers
Michael


On 04 Sep 2014, at 17:18, Abhijit Mazumder abhijit.mazum...@gmail.com wrote:

 Hi Michael,
 I would love to. Currently we are designing with Mongo as back end for
 author, with Scene 7 cloud for the image repository. For author, with its
 asset-heavy operations, Mongo is the automatic choice. However, for publish we
 now tend to think Cassandra would be better for user generated content and linear
 scalability.
 I went through some of the documentation but could not find a Getting
 Started for custom MK implementation. Could you point me to some relevant
 documentation which would help us get started?
 
 Regards,
 Abhijit
 
 
 On Thu, Sep 4, 2014 at 4:14 PM, Michael Marth mma...@adobe.com wrote:
 
 Hi Abhijit,
 
 I assume you refer to replication as implemented in Sling and AEM. Those
 work on top of the JCR API, so they are independent of the Micro Kernel
 implementation.
 
 For running Oak on Cassandra you would need a specific MK implementation
 (presumably based on the DocumentMK). Is that something you intend to work
 on (I am sure there would be a lot interest in such an impl).
 
 Best regards
 Michael
 
 On 04 Sep 2014, at 11:07, Abhijit Mazumder abhijit.mazum...@gmail.com
 wrote:
 
 Hi,
 We are considering using Cassandra as back end for the publish
 environment. In author we are using mongo.
 What are the options we have to customize replication agent to achieve
 this?
 Regards,
 Abhijit
 
 



Re: [DISCUSS] - QueryIndex selection

2014-06-28 Thread Michael Marth
Hi,

I looked a bit into how MongoDB selects indexes (query plans) and think we 
could take some inspiration.

So, the way MongoDB does it afaiu:
* query gets parsed into Abstract Syntax Tree (so that parameters can get 
stripped out)
* the first time this query is performed then the query is executed against 
*all* available indexes
* the fastest index is put into a cache, so that when the same query 
(abstracted, regardless of parameters) comes in, then only that fastest index 
will be used (will be looked up from cache)
* after a number of modifications that index-selection-cache is flushed. 
Process starts at beginning.
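
A hedged sketch of that selection process (all types are made up for illustration and unrelated to Oak's QueryIndex SPI):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

interface CandidateIndex {
    List<String> execute(String abstractQuery);
}

final class PlanCache {

    private final Map<String, CandidateIndex> fastest = new ConcurrentHashMap<>();
    private final AtomicLong modifications = new AtomicLong();
    private final long flushAfter;

    PlanCache(long flushAfter) {
        this.flushAfter = flushAfter;
    }

    List<String> run(String abstractQuery, List<CandidateIndex> candidates) {
        CandidateIndex cached = fastest.get(abstractQuery);
        if (cached != null) {
            return cached.execute(abstractQuery);
        }
        // first execution: run against all candidates and remember the fastest one
        long best = Long.MAX_VALUE;
        List<String> result = null;
        for (CandidateIndex candidate : candidates) {
            long start = System.nanoTime();
            List<String> r = candidate.execute(abstractQuery);
            long elapsed = System.nanoTime() - start;
            if (elapsed < best) {
                best = elapsed;
                result = r;
                fastest.put(abstractQuery, candidate);
            }
        }
        return result;
    }

    void onModification() {
        // flush the cached selection every flushAfter modifications
        if (modifications.incrementAndGet() % flushAfter == 0) {
            fastest.clear();
        }
    }
}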

What I dislike about this process is that the first query puts a lot more load on
the system (because all indexes have to execute the query). Moreover, the
first execution of that query could be disturbed by noise, so the selection 
could be wrong.

What I like, though, (if we ignore the noise issue from above) is that the 
selected index is the one that has actually proven to be the fastest.

So, for Oak: maybe we could enhance the deterministic selection process we have 
right now. We could run queries in the background to determine if the cost 
factors that the indexers claim to have are actually correct (and if not, 
correct them in the query engine). Those background queries could be the ones 
“most often executed” by users on that repo that have multiple indexes capable 
of answering the query.

Consider such a scenario: you have the same nodes indexed in the local property 
index (on the same machine that also serves requests) and a remote SolrCloud 
cluster. If we only reason about index size etc then we can never account for 
the fact that the local machine’s index might be much slower than those 
external machines that are used exclusively for answering queries. We could 
though, if we actually run those queries a number of times on both indexes.

Cheers
Michael




Re: Oak 1.0.1 release plan

2014-06-12 Thread Michael Marth
Hi,

given the problems caused by the TarMK disc size growth in production I would 
prefer to release a 1.0.1 as soon as that issue is resolved.

Best regards
Michael

On 12 Jun 2014, at 08:53, Davide Giannella giannella.dav...@gmail.com wrote:

 On 10/06/2014 22:47, Jukka Zitting wrote:
 Hi,
 
 It's a few weeks since 1.0 was released, and we already have a bunch
 of feedback on the release and various bug fixes ready to go out. Thus
 I think it's time to cut Oak 1.0.1 later this week.
 
 Please use Jira to tag any existing bug fixes that should go into this
 maintenance release.
 
 I have a bunch of issues to merge. would it be possible to cut next week?
 
 Still chasing up on email and tomorrow this week will be over.
 
 D.
 
 



Re: svn commit: r1577449 - in /jackrabbit/oak/trunk/oak-core/src: main/java/org/apache/jackrabbit/oak/plugins/segment/ main/java/org/apache/jackrabbit/oak/plugins/segment/file/ main/java/org/apache/ja

2014-04-02 Thread Michael Marth

On 02 Apr 2014, at 08:06, Jukka Zitting jukka.zitt...@gmail.com wrote:

That design gets broken if components
start storing data separately in the repository folder.

Agree with that design principle, but the (shared) file system DS is a valid 
exception IMO (same for the S3 DS).

Later we would probably store the config files when using Oak outside
of std OSGi env like with PojoSR

@Chetan: why would the configs not be stored in the repo? I do not see how this 
relates to non-OSGi environments


Re: Making merging changes from concurrent commits more intelligent

2014-03-20 Thread Michael Marth
IMO the benefits (less avoidable conflicts for concurrent writes or unexpected 
creation of SNSs) outweigh the downside (reproduce JR2 behaviour).

my 2c
Michael

On 20 Mar 2014, at 16:38, Michael Dürig mdue...@apache.org wrote:

 
 Hi,
 
 This came up with OAK-1541 where nodes are being added from multiple sessions 
 concurrently:
 
 Session 1: root.addNode(a).addNode(b);
 Session 2: root.addNode(a).addNode(c);
 
 This currently fails for whichever session saves last because node a is 
 different from the already existing node a. The MicroKernel contract makes 
 this precise: addExistingNode: a node has been added that is different from 
 a node of the same name that has been added to the trunk.
 
 In OAK-1553 I proposed to relax this contract such that concurrently added 
 nodes could be merged. For the above case the resulting tree would then be
 
 root:{a:{b:{}, c:{}}}
 
 However note that this differs very much from what you usually would get on 
 Jackrabbit 2 when same name siblings enter the stage:
 
 root:{a[1]/b:{}, a[2]/c:{}}
 
 I understand that to be able to add users concurrently (OAK-1541) we need 
 more intelligent merging. However doing so will change the behaviour wrt. 
 Jackrabbit 2 quite significantly.
 
 Thoughts?
 
 Michael
 
 
 [1] https://issues.apache.org/jira/browse/OAK-1541
 [2] https://issues.apache.org/jira/browse/OAK-1553



Re: Queries related to various BlobStore implementations

2014-03-11 Thread Michael Marth
Hi,

Q2 - For a system which is getting upgraded from JR2 to Oak. Would
2.1 It continue to use its existing DataStore implementation.
2.2 Migrate all the content first then switch to one of the BlobStore
implementations
2.3 Both DataStore and BlobStore would be used together

I think we need at least 1 option where existing users can share a 
filesystem-based DS across Oak instances. Not sure if 2.1 is the only option
that is supported, though.

Michael


Re: Adding a 'restore' method to the NodeStore apis

2014-01-28 Thread Michael Marth
Hi,

re
 This assumption was challenged via OAK-1357 with the idea that the MongoDB
 backup can already produce consistent (non-blocking) backups so there's
 nothing to be done in this area. Also restoring means only replacing the
 old db with the backup one.

I am not sure if MongoDB is able to produce a consistent backup as it cannot be 
aware of  Oak’s document model. For example, a consistent revision might 
involve changes to documents located in /content (in Oak’s POV) and also the 
corresponding index nodes.

There is one caveat (and I actually might be wrong with the above): afaik Oak 
commits the new revision as the last document to MongoDB. If MongoMK can
guarantee that, and additionally guarantee that changes committed without the
corresponding revision-ID commit are ignored, we might be able to do without
explicitly marking a checkpoint.

Michael


On 28 Jan 2014, at 15:30, Alex Parvulescu alex.parvule...@gmail.com wrote:

 Hi,
 
 I'm back trying to gather some feedback from the people involved in the
 MongoDB store impl.
 
 This is about creating a backup of the current repository using a native
 backup tool, a tool that is running at databasel level.
 It was my understanding that if you run a (non blocking) backup at any
 given time, you might catch a repository in an inconsistent state (large
 transactions half-way completed maybe?), so you might need a way to mark
 the latest stable head before basically copying everything.
 Next on restore you would simply need to reset the head to the last known
 stable state and you get the full circle.
 
 I've found that the checkpoint mechanism we use for the async indexing fits
 this model nicely, and I was planning on using it in this context as well:
 marking the last state with a checkpoint, then using the same checkpoint id
 as a reference for the restore.
 This would work both in the case of a MongoDB store (also the RDB one) but
 also in the cases where the repository is too big and out backup code
 cannot handle it efficiently (think huge repo + file system snapshots).
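
For illustration, a sketch of that checkpoint-based approach against the Oak NodeStore interface, assuming its checkpoint/retrieve methods; the lifetime value and error handling are made up:

import java.util.concurrent.TimeUnit;

import org.apache.jackrabbit.oak.spi.state.NodeState;
import org.apache.jackrabbit.oak.spi.state.NodeStore;

public class CheckpointBackup {

    // Mark a stable head before the native (database-level) backup starts
    // and persist the returned id alongside the backup.
    public static String beforeBackup(NodeStore store) {
        // keep the checkpoint alive long enough for the backup to finish
        return store.checkpoint(TimeUnit.HOURS.toMillis(24));
    }

    // On restore, resolve the stable state the recorded checkpoint id refers to.
    public static NodeState afterRestore(NodeStore store, String checkpointId) {
        NodeState stable = store.retrieve(checkpointId);
        if (stable == null) {
            throw new IllegalStateException("checkpoint expired or unknown: " + checkpointId);
        }
        return stable;
    }
}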
 
 This assumption was challenged via OAK-1357 with the idea that the MongoDB
 backup can already produce consistent (non-blocking) backups so there's
 nothing to be done in this area. Also restoring means only replacing the
 old db with the backup one.
 
 If this is true, I'm as happy as it gets, I can already close down a bunch
 of issues :) but I want clear confirmation that this is in fact the way it
 works and that everybody agrees with it and so there are no loose ends.
 
 thanks for your attention,
 alex
 
 
 
 
 
 On Mon, Jan 27, 2014 at 1:22 PM, Alex Parvulescu
 alex.parvule...@gmail.comwrote:
 
 Hi,
 
 I've created OAK-1357 asking for a new method on the NodeStore apis:
 'restore', please add your thoughts to the issue.
 
 thanks,
 alex
 



Re: Strategies around storing blobs in Mongo

2013-10-30 Thread Michael Marth
Hi Chetan,

 
 3. Bring back the JR2 DataStore implementation and just save metadata
 related to binaries in Mongo. We already have S3 based implementation
 there and they would continue to work with Oak also
 

I think we will need the data store impl for Oak in any case (regardless the 
outcome of this discussion) in order to enable the migration of large repos 
from JR2 where the data store cannot be moved. That would include the 
filesystem based DS and S3 DS.

When you write

 Mongo also provides GridFS[2]. However it also uses a similar strategy
 like we are currently using and such a support is built into the
 Driver. For server they are just collection entries.

do you imply that you consider a GridFS-backed DS implementation not doable or
not ideal? I am referring to the "However" :)

Michael

Re: Migration without an embedded Jackrabbit

2013-10-13 Thread Michael Marth
Hi Jukka,

I think the situation is slightly different between OAK-458 and OAK-805 
(although I come to roughly the same conclusion in both cases)

OAK-805: JR users that cannot upgrade their existing data store will be stuck 
eternally after a migration (I think it is a fair assumption, and a reasonable
deployment case, that a large amount of binaries might never be feasible to
move). So, in turn this means that we would be eternally stuck with all the
mentioned JR dependencies in Oak unless we re-implement the existing 
functionality. However, it also means, that this re-implementation needs to be 
production quality for read/write (as opposed to the second case). So, I 
think a re-implementation makes sense.

OAK-458: In this case the functionality that needs to be re-implemented is 
certainly a subset, as it requires only read-access to the repo (and does not 
need to cover a lot of edge cases that a full fledged read-write support needs 
to cover). So, I am inclined to also opt for a re-implementation if this subset 
is sufficiently easy to implement.
If not, we could pull in the whole enchilada of dependencies as we know that we 
can drop them later: IMO it would be fair to drop (or make optional) the 
ability to upgrade an existing JR repo to Oak at a certain point in the future 
and thus remove the additionally needed deps.

Michael


On Oct 11, 2013, at 4:28 PM, Jukka Zitting wrote:

Hi,

I've been thinking about the upgrade/migration code (oak-upgrade,
OAK-458) over the past few days, and trying to figure out how we could
achieve that without having to keep the full Jackrabbit 2.x codebase
as dependency. The same question comes up for the support for
Jackrabbit 2.x datastores (OAK-805).

The key problem here is that the Jackrabbit 2.x codebase is already so
convoluted that it's practically impossible to just pick up say
something like an individual persistence manager or data store
implementation and access it directly without keeping the rest of the
2.x codebase around. This is troublesome for many reasons, for example
using such components require lots of extra setup code (essentially a
full RepositoryImpl instance) and the size of the required extra
dependencies is about a dozen megabytes.

Thus I'm inclined to instead just implement the equivalent
functionality directly in Oak. This requires some code duplication
(we'd for example need the same persistence managers in both Oak and
Jackrabbit), but the versions in Oak could be a lot simpler and more
streamlined as only a subset of the functionality is needed. To reduce
the amount of duplication we could push some of the shared utility
code (like NodePropBundle, etc.) to jackrabbit-jcr-commons or to a new
jackrabbit-shared component.

WDYT?

BR,

Jukka Zitting



Re: Rethinking access control evaluation

2013-10-07 Thread Michael Marth
Hi Jukka,

you are right that the majority of repositories we see (or at least that I see) 
have few principals and few ACLs. But as Angela mentioned there is a 
not-so-small number of cases with a very large number of principals
(e.g. a public portal or forum) and/or a large number of ACLs (e.g. an
Intranet where ACLs are not hierarchic).
From my POV it makes sense (as it was suggested on this thread) to optimize 
for the normal case (few ACLs) out of the box, but make the ACL evaluation 
pluggable, so that different strategies could be used in the different 
scenarios.

my2c
Michael

On Oct 5, 2013, at 5:31 AM, Jukka Zitting wrote:

Do we have real-world examples of such ACL-heavy repositories? Do they
also require optimum performance? I'm not aware of any such use cases,
but that of course doesn't mean they don't exist.

If possible I'd rather avoid the extra complexity and go with just a
single strategy that's optimized for the kinds of repositories we
normally see.



Re: When optimistic locking fails

2013-03-08 Thread Michael Marth
Jukka, Marcel,

you saw this problem for SegmentMK which uses branch-merge extensively, but is 
this not a problem all distributed MK implementations will run into? After all, 
branch-merge is part of the MK API. Unless the MK impl uses pessimistic locking 
from the start, of course.
In particular, I wonder about the MongoMK.

Michael

On Mar 7, 2013, at 11:06 AM, Jukka Zitting wrote:

 Hi,
 
 There are a few scenarios where the optimistic locking approach used
 by the SegmentMK fails in practice:
 
 1) A large batch operation while other smaller changes are being committed.
 
 2) Lots of concurrent changes being committed against the same journal.
 
 In scenario 1 the large operation like an import can't complete itself
 since while it is rebasing itself and re-applying commit hooks other
 smaller operations have already updated the journal, triggering new
 rounds of rebasing and hook processing for the large operation until
 it bails out with the System overloaded exception.
 
 In scenario 2 the same System overloaded exception occurs once there
 are too many concurrent changes for the system to keep up with when
 using just a single journal. As noted by Marcel and others, this case
 comes up pretty quickly in a benchmark that explicitly tries to push
 the system to the limit.
 
 While in scenario 2 the System overloaded exception is a valid
 alternative to a potentially prolonged wait until the commit can go
 through, in scenario 1 it is clearly troublesome. Thus I'd like to
 address it, and the solution I have in mind actually works for both
 cases:
 
 When encountering a case where the optimistic locking mechanism can't
 push a commit through in say one second, instead of waiting for a
 longer while I'd have the SegmentMK fall back to pessimistic locking
 where it explicitly acquires a hard lock on the journal and does the
 rebase/hook processing one more time while holding that lock. This
 guarantees that all commits will go through eventually (unless there's
 a conflict or a validation failure), while keeping the benefits of
 optimistic locking for most cases. And even for scenario 1 the bulk of
 the commit has already been persisted when the pessimistic locking
 kicks in, so the critical section should still be much smaller than
 with Jackrabbit 2.x where the lock is held also while the change set
 is being persisted.
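
A simplified, SegmentMK-agnostic sketch of that fallback (all types are placeholders, not the actual Oak classes):

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.UnaryOperator;

final class Journal {

    private final AtomicReference<String> head = new AtomicReference<>("r0");
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    String commit(UnaryOperator<String> rebaseAndApplyHooks, long optimisticMillis) {
        long deadline = System.currentTimeMillis() + optimisticMillis;
        // optimistic phase: many writers may rebase and compare-and-set concurrently
        while (System.currentTimeMillis() < deadline) {
            lock.readLock().lock();
            try {
                String base = head.get();
                String rebased = rebaseAndApplyHooks.apply(base);
                if (head.compareAndSet(base, rebased)) {
                    return rebased;
                }
            } finally {
                lock.readLock().unlock();
            }
        }
        // pessimistic fallback: exclusive lock, rebase and apply hooks one more time
        lock.writeLock().lock();
        try {
            String rebased = rebaseAndApplyHooks.apply(head.get());
            head.set(rebased);
            return rebased;
        } finally {
            lock.writeLock().unlock();
        }
    }
}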
 
 BR,
 
 Jukka Zitting



Re: Conflict handling in Oak

2012-12-18 Thread Michael Marth
Agree with Felix, we should stay away from MAY especially if we want to achieve 
clarity for Oak-Core about what it can expect the MK to do

On Dec 18, 2012, at 9:49 AM, Felix Meschberger wrote:

 Hi,
 
 Just remember that MAY is difficult to handle by developers: Can I depend 
 on it or not ? What if the MAY feature does not exist ? What if I develop 
 on an implementation providing the MAY feature and then running on an 
 implementation not providing the MAY feature ?
 
 In essence, a MAY feature basically must be considered as non-existing :-(
 
 All in all, please don't use MAY. Thanks from a developer ;-)
 
 Regards
 Felix
 
 Am 18.12.2012 um 09:37 schrieb Marcel Reutegger:
 
 Hi,
 
 To address 1) I suggest we define a set of clear cut cases where any
 Microkernel implementations MUST merge. For the other cases I'm not sure
 whether we should make them MUST NOT, SHOULD NOT or MAY merge.
 
 I agree and I think three cases are sufficient. MUST, MUST NOT and MAY.
 MUST is for conflicts we know are easy and straight forward to resolve.
 MUST NOT is for conflicts that are known to be problematic because there's
 no clean resolution strategy.
 MAY is for conflicts that have a defined resolution but we think happen
 rarely and is not worth implementing.
 
 I don't see how SHOULD NOT is useful in this context.
 
 regards
 marcel
 



Re: [MongoMK] BlobStore garbage collection

2012-11-06 Thread Michael Marth
this might be a weird question from the leftfield, but are we actually sure 
that the existing data store concept is worth the trouble? afaiu it saves us 
from storing the same binary twice, but leads into the DSGC topic. would it be 
possible to make it optional to store/address binaries by hash (and thus not 
need DSGC for these configurations)?
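
For context, a minimal sketch of what storing/addressing binaries by hash means: the binary's identifier is derived from its content, which gives de-duplication but also creates the garbage collection problem discussed here (the hash algorithm is chosen arbitrarily for illustration):

import java.io.InputStream;
import java.security.MessageDigest;

public class ContentHash {

    // Two identical binaries produce the same identifier and therefore
    // collapse into a single data store record.
    public static String identifierFor(InputStream in) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}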

In any case we should definitely avoid to require repo traversal for DSGC. This 
would operationally limit the repo sizes Oak can support.


--
Michael Marth | Engineering Manager
+41 61 226 55 22 | mma...@adobe.com
Barfüsserplatz 6, CH-4001 Basel, Switzerland

On Nov 6, 2012, at 9:24 AM, Thomas Mueller wrote:

Hi,

1- What's considered an old node or commit? Technically, anything other
than the head revision is old but can we remove them right away or do we
need to retain a number of revisions? If the latter, then how far back do
we need to retain?

we discussed this a while back, no good solution back then[1]

Yes. Somebody has to decide which revisions are no longer needed. Luckily
it doesn't need to be us :-) We might set a default value (10 minutes or
so), and then give the user the ability to change that, depending on
whether he cares more about disk space or the ability to read old data /
roll back to an old state.

To free up disk space, BlobStore garbage collection is actually more
important, because usually 90% of the disk space is used by the BlobStore.
So it would be nice if items (files) in the BlobStore are deleted as soon
as possible after deleting old revisions. In Jackrabbit 2.x we have seen
that node and data store garbage collection that has to traverse the whole
repository is problematic if the repository is large. So garbage
collection can be a scalability issue: if we have to traverse all
revisions of all nodes in order to delete unused data, we basically tie
garbage collection speed with repository size, unless if we find a way to
run it in parallel. But running mark  sweep garbage collection completely
in parallel is not easy (is it even possible? if yes I would have guessed
modern JVMs should have it since a long time). So I think if we don't need
to traverse the repository to delete old nodes, but just traverse the
journal, this would be much less of a problem.

Regards,
Thomas




Re: The destiny of Oak (Was: [RESULT] [VOTE] Codename for the jr3 implementation effort)

2012-10-05 Thread Michael Marth
+1 to Bertrand's suggestion: "same project name, different software name"

This would keep the community together, but also allows us to have different 
aims for Jackrabbit (reference impl) and Oak (some level of compliance, not 
reference impl).

On Oct 3, 2012, at 7:23 PM, Bertrand Delacretaz wrote:

 On Wed, Oct 3, 2012 at 1:53 PM, Tommaso Teofili teof...@adobe.com wrote:
 ...there is a goal in Oak to be less strict with regard to JCR spec 
 compatibility which, in my opinion, makes a
 possibly important point of distinction.
 If this understanding is correct then I think it'd make sense to have 
 separate projects
 
 You can have separate software projects/modules in a single Apache
 project - I agree with others that keeping Oak (the software
 project/module) within Jackrabbit the Apache project is less risky in
 terms of community.
 
 Saying that Jackrabbit 2 is the JCR reference implementation, and
 Jackrabbit Oak (or whatever it's called) is a mostly compliant JCR
 repository is perfectly fine as long as that's clear.
 
 -Bertrand



Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-12 Thread Michael Marth
 As an alternative: we could use a separate method getSize(int max) which
 
 * if called with max == -1 returns the exact size if quickly available,
 * returns -1 otherwise, and
 * returns the exact size but not more than max when called with max >= 0.
 
 This allows for estimates but leaves the caller in control.

+1

(and getSize() would still return -1 I guess?)
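
To illustrate the proposed contract (interface name made up; the exact parameter type is open):

interface SizedResult {

    // max == -1: return the exact size if quickly available, -1 otherwise.
    // max >= 0: return the exact size, but never more than max.
    long getSize(long max);
}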


[jira] [Commented] (OAK-210) granularity of persisted data

2012-07-27 Thread Michael Marth (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13423847#comment-13423847
 ] 

Michael Marth commented on OAK-210:
---

Do you see this as something to be implemented (or not) by each MK 
independently (i.e. something like an MK implementation detail)?

 granularity of persisted data
 -

 Key: OAK-210
 URL: https://issues.apache.org/jira/browse/OAK-210
 Project: Jackrabbit Oak
  Issue Type: Bug
  Components: mk
Reporter: Stefan Guggisberg
Assignee: Stefan Guggisberg

 the current persistence granularity is _single nodes_ (a node consists of 
 properties and child node information). 
 instead of storing/retrieving single nodes it would IMO make sense to store 
 subtree aggregates of specific nodes. the choice of granularity could be 
 based on simple filter criteria (e.g. property value).
 dynamic persistence granularity would help reducing the number of records and 
 r/w operations on the underlying store, thus improving performance.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira