Re: Safe read after write

2015-12-06 Thread Chetan Mehrotra
Does the above setup involve multiple Oak instances (say in separate
JVMs), or do the message producer and the consumer of the queue use the
same Oak instance?


Chetan Mehrotra


On Thu, Dec 3, 2015 at 8:33 PM,   wrote:
> I am using Oak with a DocumentNodeStore.  I am storing content then adding a
> message onto a queue.  The consumer of the message uses an id to retrieve
> the content.  I am seeing frequent failures in the consumer (node not
> available/does not exist).  If I add a Thread.sleep after I store the node I
> do not see these failures.  My initial thought was this was related to the
> default Mongo WriteConcern of Acknowledged, so I changed my code:
>
> public Repository getRepository() throws ClassNotFoundException,
> RepositoryException {
> DB db = new MongoClient(mongoHost, mongoPort).getDB(mongoOakDbName);
> db.setWriteConcern(WriteConcern.JOURNALED); // I also tried using
> FSYNC
> DocumentNodeStore ns = new
> DocumentMK.Builder().setMongoDB(db).getNodeStore();
> return new Jcr(new Oak(ns)).createRepository();
> }
>
> but I still see the problem.  Am I missing something?
>
> Thanks


Re: Lucene index speed

2015-12-06 Thread Chetan Mehrotra
Hi Jim,

How does the indexing perform if you, say, just run a single webapp node?

Chetan Mehrotra


On Sat, Dec 5, 2015 at 7:18 AM, Jim.Tully  wrote:
> We are using Oak embedded in a web application, and are now experiencing 
> significant delays in async indexing.  New nodes added are sometimes not 
> available by query for up to an hour.  I’m hoping you can identify areas I 
> might explore to improve this performance.
>
> We have multiple instances of the web application running with the same 
> Mongodb cluster connected via SSL.  Our Repository constructor is:
>
>
>
> ns = new DocumentMK.Builder().setMongoDB(createMongoDB()).getNodeStore();
>
>
> Oak oak = new Oak(ns);
>
>
> LuceneIndexProvider provider = new LuceneIndexProvider();
>
> Jcr jcr = new Jcr(oak).with((QueryIndexProvider) provider).with((Observer) 
> provider)
>
> .with(new LuceneIndexEditorProvider()).withAsyncIndexing();
>
> repository = jcr.createRepository();
>
>
> The web application creates the repository at start up, and disposes of it as 
> shutdown.  We have no observers registered at all, but do have 6 lucene 
> indexes defined.  The index that is currently giving me heartburn looks like 
> below.  Where would I start to find what is dragging performance down so 
> drastically?
>
>
> [The sv:node XML of the index definition was stripped of its markup by the
> mail archive. The recoverable details: an oak:QueryIndexDefinition named
> "PageIndex" located at /pages/oak:index/PageIndex, type "lucene", async,
> compatVersion 2, with index rules for nt:unstructured nodes covering a
> "Date" property (flagged true, e.g. ordered/propertyIndex), and an analyzer
> configured with org.apache.lucene.analysis.standard.StandardAnalyzer,
> LUCENE_47, "Standard".]
>
> Thanks,
>
> Jim
>


Re: New Oak Example - Standalone Runnable Example based on Spring Boot

2015-12-06 Thread Chetan Mehrotra
On Fri, Dec 4, 2015 at 11:33 AM, Torgeir Veimo  wrote:
> Do you have a similar example on how to configure with an actual embedded
> osgi container running?

Not yet, as that's not a very common use case! Feel free to open an issue
for that. If there is wider demand for such a use case then it can be
looked into.

Chetan Mehrotra


Re: Safe read after write

2015-12-07 Thread Chetan Mehrotra
Hi David,

To elaborate a bit on what Vikas and Davide said

Oak has an MVCC storage model which is eventually consistent. So any
change made on one cluster node would not be immediately visible on
other cluster nodes. Instead, each node periodically polls for changes
in the backend store (Mongo in the above case) and then updates its
head revision. Only after that do changes made in those revisions
become "visible" to that cluster node.

So in the above setup, if on cluster node N1 you add a node, and that
information is communicated to another cluster node N2 outside of Oak
(here via a message queue), and that other cluster node reacts to it,
then there is a chance that the change has not yet become visible on
that cluster node.

Currently there is no deterministic way around this other than
introducing polling as part of the queue consumer logic.

Chetan Mehrotra


On Mon, Dec 7, 2015 at 7:20 PM, Vikas Saurabh  wrote:
> On Mon, Dec 7, 2015 at 7:14 PM, David Marginian  wrote:
>> Yes, each node however is referencing the same mongo instance.  Is there a 
>> way to tell jackrabbit to grab the document from mongo instead of using the 
>> cluster cache (assuming that is what's going on).
>
> Each cluster node has a thread (the background read thread) which, in
> a crude sense, absorbs changes from other nodes. Although simultaneous
> conflicting writes are avoided, the state of a node that's visible to
> layers above (JCR, etc.) doesn't get to see changes from other nodes
> until the background read is done absorbing them.
>
> Thanks,
> Vikas


Re: Safe read after write

2015-12-07 Thread Chetan Mehrotra
On Mon, Dec 7, 2015 at 8:28 PM,   wrote:
> Are you recommending that my consumer attempts to retrieve the node until it
> is present?

Kind of. One approach I can think of (see the sketch below):

1. If your code is adding nodes under a specific path, say /workItems,
then have a JCR listener registered to monitor changes under that path

2. The queue consumer, upon getting a message, checks whether the node
is present or not. If not, it waits on a lock

3. The listener, upon receiving any event (specifically an external
event [1]), notifies such waiting consumers.

4. The waiting consumer then checks whether the required node is now
present; if not it goes back to sleep. Such a retry can be done 'n'
times before giving up
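
A minimal sketch of steps 1-4, assuming a hypothetical /workItems path and
illustrative retry counts and wait times (not a prescribed recipe):

import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;

// Registers a listener for NODE_ADDED events under /workItems and lets the
// queue consumer wait until the node becomes visible on this cluster node.
public class NodeWaiter implements EventListener {
    private final Session session;
    private final Object lock = new Object();

    public NodeWaiter(Session session) throws Exception {
        this.session = session;
        session.getWorkspace().getObservationManager().addEventListener(
                this, Event.NODE_ADDED, "/workItems",
                true /* isDeep */, null, null, false /* noLocal */);
    }

    @Override
    public void onEvent(EventIterator events) {
        // Any NODE_ADDED under /workItems (including external events, see
        // JackrabbitEvent#isExternal) wakes up the waiting consumer
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    // Returns true once the node is visible, false after 'retries' attempts
    public boolean waitFor(String path, int retries, long waitMillis)
            throws Exception {
        for (int i = 0; i < retries; i++) {
            session.refresh(false); // pick up the latest head revision
            if (session.nodeExists(path)) {
                return true;
            }
            synchronized (lock) {
                lock.wait(waitMillis);
            }
        }
        return false; // give up after 'n' retries
    }
}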

Chetan Mehrotra
[1] 
https://jackrabbit.apache.org/api/2.1/org/apache/jackrabbit/api/observation/JackrabbitEvent.html#isExternal()


Re: Lucene index speed

2015-12-07 Thread Chetan Mehrotra
On Mon, Dec 7, 2015 at 9:06 PM, Jim.Tully  wrote:
> When running locally with similar data, the indexing is nearly
> instantaneous.

Okay, that's what I was expecting. The problem here is that the
AsyncIndexer job is meant to be run as a singleton in a cluster. This is
done at [1]. This is an undocumented dependency on the Sling way of
scheduling things (SLING-2979), which allows one to schedule jobs as
singletons in a cluster.

The default scheduler used by Oak (outside of Sling) does not honor
this contract, which causes the job to be executed concurrently on
each cluster node, and that causes conflicts/retries etc. So in a way
Oak is outsourcing job execution in a cluster to the embedding
application. It would be good to document this aspect (if you can open
an issue that would be helpful).

Given the recent work on DocumentDiscoveryLiteService it might be
possible for Oak to manage such things on its own (@Stefan thoughts?).
But as of now this is not possible. So the only way out currently is to
provide your own Whiteboard implementation which can handle this kind
of singleton scheduled job. Doing this is certainly non-trivial!

Chetan Mehrotra
[1] 
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/spi/whiteboard/WhiteboardUtils.java#L59


Re: Lucene index speed

2015-12-07 Thread Chetan Mehrotra
Hi Jim,

The proper way to do this would be to have your own Whiteboard
implementation, implement the logic as present in the Oak whiteboard [1],
and then modify the logic around scheduling. However, given that
currently only the AsyncIndexing task requires to be run as a singleton,
you can disable the default async indexing and trigger it on your own.
Just use IndexMBeanRegistration (see the sketch below).
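
A rough sketch of the manual triggering part, assuming an
AsyncIndexUpdate/LuceneIndexEditorProvider wiring along the lines Jim
outlines below; your application's singleton scheduler has to ensure this
runs on only one cluster node:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate;
import org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorProvider;
import org.apache.jackrabbit.oak.spi.state.NodeStore;

public final class ManualAsyncIndexing {

    // Call this only on the cluster node elected by the application's own
    // singleton scheduler; the 5 second delay mirrors Oak's default.
    public static ScheduledExecutorService schedule(NodeStore ns) {
        AsyncIndexUpdate async =
                new AsyncIndexUpdate("async", ns, new LuceneIndexEditorProvider());
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(async, 5, 5, TimeUnit.SECONDS);
        return scheduler;
    }
}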

> Is there an optimal frequency for indexing that you would recommend?

The default is 5 seconds, which so far we have seen works fine.

> Why doesn’t the checkpoints prevent resource contention?  It would appear to 
> me that they should.

Checkpoints are not meant to prevent contention. The AsyncIndexer has
inbuilt "lease" support to prevent concurrent runs, but there have been
some issues like OAK-3436 which can result in complete reindexing at
times! They should be addressed soon.


Chetan Mehrotra
[1] 
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/Oak.java#L247


On Tue, Dec 8, 2015 at 8:45 AM, Jim.Tully  wrote:
> Chetan,
>
> It appears that I can at least trigger the async indexing from within my
> application.  That leaves me with two questions that I hope you can find
> time to answer:
>
> 1.  Is there an optimal frequency for indexing that you would recommend?
> 2.  Why doesn’t the checkpoints prevent resource contention?  It would
> appear to me that they should.
>
> Many thanks,
>
> Jim Tully
>
>
>
>
>
> On 12/7/15, 10:50 AM, "Jim.Tully"  wrote:
>
>>Chetan,
>>
>>I really appreciate the quick response.  Our application is capable of
>>running singleton scheduled jobs already, so I believe I can take care of
>>that aspect.  Would it be as simple as omitting the withAsyncIndexing()
>>argument to the constructor, and then
>>
>>- create an AsyncIndexUpdate instance
>>- schedule the instance to invoke its run() method
>>
>>
>>Jim
>>
>>
>>
>>
>>
>>On 12/7/15, 9:50 AM, "Chetan Mehrotra"  wrote:
>>
>>>On Mon, Dec 7, 2015 at 9:06 PM, Jim.Tully  wrote:
>>>> When running locally with similar data, the indexing is nearly
>>>> instantaneous.
>>>
>>>Okie thats what I was expecting. The problem here is that AsyncIndexer
>>>job is to be run as a singleton in a cluster. This is done at [1].
>>>This is undocumented dependency on Sling way of scheduling things
>>>(SLING-2979) which allows one to schedule jobs as singleton in a
>>>cluster.
>>>
>>>The default scheduler used by Oak (outside of Sling) does not honor
>>>this contract which causes this job to be executed concurrently on
>>>each cluster node and that causes conflict/retries etc. So in a way
>>>Oak is outsourcing the job execution in cluster to embedding
>>>application. Would be good to document this aspect (if you can open an
>>>issue that would be helpful)
>>>
>>>Given the recent work on DocumentDiscoveryLiteService it might be
>>>possible for Oak to manage such thing on its own (@Stefan thoughts?).
>>>But as of now this is not possible. So only way out currently is to
>>>provide your own Whiteboard implementation which can handle such kind
>>>of singleton scheduled jobs. Doing this is certainly non trivial!
>>>
>>>Chetan Mehrotra
>>>[1]
>>>https://github.com/apache/jackrabbit-oak/blob/trunk/oak-core/src/main/jav
>>>a
>>>/org/apache/jackrabbit/oak/spi/whiteboard/WhiteboardUtils.java#L59
>>>
>>
>


Re: [VOTE] Release Apache Jackrabbit Oak 1.0.25

2015-12-07 Thread Chetan Mehrotra
On Tue, Dec 8, 2015 at 10:43 AM, Amit Jain  wrote:
>   [X] +1 Release this package as Apache Jackrabbit Oak 1.0.25

Chetan Mehrotra


Re: fixVersions in jira

2015-12-09 Thread Chetan Mehrotra
On Tue, Dec 8, 2015 at 9:36 PM, Julian Reschke  wrote:
> So what's the correct JIRA state for something that has been fixed in 1.3.x,
> which is intended to be backported to 1.2, but hasn't been backported yet?
> Can I still set that to "resolved"?

So far the practice some of us follow is to add the label
candidate_oak_1_0 or candidate_oak_1_2. See [1] for some earlier
discussion around this.

Chetan Mehrotra
[1] http://markmail.org/thread/7sbse6lpgxaqgplv


Re: svn commit: r1718848 - /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeStateUtils.java

2015-12-09 Thread Chetan Mehrotra
On Wed, Dec 9, 2015 at 6:36 PM,   wrote:
> +private static String multiplier(String in, int times) {
> +StringBuilder sb = new StringBuilder();
> +for (int i = 0; i < times; i++) {
> +sb.append(in);
> +}
> +return sb.toString();
> +}
> +

Maybe use com.google.common.base.Strings#repeat?
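
For reference, a one-line sketch of the suggested replacement (assuming the
same in/times arguments as above):

import com.google.common.base.Strings;

String repeated = Strings.repeat(in, times); // same result as the loop above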

Chetan Mehrotra


Re: Oak crypto API

2015-12-09 Thread Chetan Mehrotra
On Tue, Dec 1, 2015 at 6:21 PM, Timothée Maret  wrote:
> The API would
>
> - Cover encrypting/decrypting data using an instance global master key
> - Make no assumption regarding how to get the master key

This looks useful! I think having just the decryption part should be
sufficient for the API which needs to be used by Oak code.

The encryption method can be part of the implementation, so that test
cases can make use of it to create test data. How the encrypted data is
created might vary depending on the embedding application, so making
that method part of the API may pose some problems.

Chetan Mehrotra


Remove/Disable ordered property indexes in trunk

2015-12-10 Thread Chetan Mehrotra
Given that ordered property indexes were deprecated a long time ago,
it would be better if we remove the corresponding code or at least
disable the OSGi components for it.

I would prefer removal of the code, as otherwise we also need to take
care of it whenever any cross-cutting refactoring is performed (say
some change which touches all indexes).

Thoughts?

Chetan Mehrotra


Re: Oak crypto API

2015-12-10 Thread Chetan Mehrotra
Hi Timothée,

On Thu, Dec 10, 2015 at 4:59 PM, Timothée Maret
 wrote:
> However, I think that encryption and decryption go in pair (use the same
> algo) and maybe it would be best to reflect it in the API.

From what I understand, its main use case is to allow components in Oak
to make use of encrypted credentials while interacting with third-party
services. For example, in LDAP the password to access the LDAP server
can be encrypted, and the Oak LDAP logic would need to use some API to
decrypt it. How it is encrypted and what the encryption algorithm is
are not the concern of this logic. The algorithm might be encoded in
the encrypted config itself.

So just having support for following call would be sufficient

-
//one can use a byte[] also as argument type but keeping it string
//as the key would be provided via some property file/OSGi
//config and hence would be expected to be encoded say in base64

byte[] decrypt(String cipherText)
-

And this is the api which can be used in other places also (say
decrypting Mongo connection credentials)

Further, the implementation might vary quite a bit:

1. Credentials obtained from a third-party service - cipherText might
be a logical name of some credential config, say prod1LdapPwd. In that
deployment there is support for some third-party credential storage
server which can provide the credentials at runtime. In such a
deployment even the encrypted key would not be present on the local
system, and the crypto implementation would use that service's SDK to
fetch the credential at runtime (using some out-of-band authentication
to that service)

2. cipherText having the algorithm encoded - For some implementations
the cipherText would be like '{AES/CBC/PKCS5Padding}' followed by the
encoded value. The implementation can then decode the value as
required

So how the encrypted key is created and managed is not a concern for
Oak logic. Oak just needs a way to get the plain-text credential given
some opaque key data. Any method related to encrypting would not be
used by other parts of Oak, so it need not be part of the API which we
expose as an extension point.
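
To make the shape concrete, here is a hedged sketch of variant 2, assuming
a cipherText of the form '{algo}base64(iv + payload)' and a master key
handed in by the embedding application; this is illustration only, not an
agreed Oak API:

import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import com.google.common.io.BaseEncoding;

// Hypothetical implementation of the proposed byte[] decrypt(String) call
public class AesCryptoSupport {
    private final SecretKeySpec masterKey; // provided by the embedding app

    public AesCryptoSupport(byte[] keyBytes) {
        this.masterKey = new SecretKeySpec(keyBytes, "AES");
    }

    public byte[] decrypt(String cipherText) throws Exception {
        // cipherText encodes the algorithm, e.g. "{AES/CBC/PKCS5Padding}..."
        int end = cipherText.indexOf('}');
        String algo = cipherText.substring(1, end);
        byte[] data = BaseEncoding.base64().decode(cipherText.substring(end + 1));

        // Assumption: the first 16 bytes are the IV, the rest is the payload
        IvParameterSpec iv = new IvParameterSpec(Arrays.copyOfRange(data, 0, 16));
        Cipher cipher = Cipher.getInstance(algo);
        cipher.init(Cipher.DECRYPT_MODE, masterKey, iv);
        return cipher.doFinal(data, 16, data.length - 16);
    }
}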

Chetan Mehrotra


Re: Remove/Disable ordered property indexes in trunk

2015-12-11 Thread Chetan Mehrotra
On Thu, Dec 10, 2015 at 7:43 PM, Davide Giannella  wrote:
> Can any of you please file an issue and assign it to myself?

Done with OAK-3768

Chetan Mehrotra


Re: Missing SessionStatistics Mbeans

2015-12-16 Thread Chetan Mehrotra
Hi Marc,

Thanks for reporting this. It looks like a regression due to changes
done for OAK-3477 (affects 1.3.11). Opened OAK-3802 for that.

Chetan Mehrotra


On Wed, Dec 16, 2015 at 9:51 PM, Marc Pfaff  wrote:
> Hi
>
> Using oak-1.3.11.r1716789, I have a situation, where I see the session
> counter, as per RepositoryStats#SessionCount, constantly increasing over
> time.
>
> This makes me wonder if I stumbled over a session leak. So far, I
> consulted the SessionStatistics beans in the system console in those cases
> to get an idea of suspicious sessions, by looking at
> SessionStatistics#InitStackTrace. But it looks like there are no
> SessionStatistics mbeans anymore in the system console.
>
> Now I wonder where have the SessionStatistics mbeans gone? Or is there an
> issue in the value reported by RepositoryStats#SessionCount and I don't
> have a session leak at all? What other options do I have to find a session
> leak in my code?
>
> The last checkpoint where I still have the SessionStatistics beans is with
> oak-1.3.10.r1713699.
>
> Thanks a lot.
>
> Regards
> Marc
>
>
>


Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 634 - Still Failing

2015-12-16 Thread Chetan Mehrotra
On Thu, Dec 17, 2015 at 9:45 AM, Apache Jenkins Server
 wrote:
> Stack Trace:
> junit.framework.ComparisonFailure: expected: hallo (1)], text:[hallo (1), hello (1), oh hallo (1)], text:[hallo (1), hello 
> (1), oh hallo (1)]]> but was:
> at junit.framework.Assert.assertEquals(Assert.java:100)
> at junit.framework.Assert.assertEquals(Assert.java:107)
> at junit.framework.TestCase.assertEquals(TestCase.java:269)
> at 
> org.apache.jackrabbit.oak.jcr.query.FacetTest.testFacetRetrievalWithAnonymousUser(FacetTest.java:102)

Looks like most failures are in the new Facet tests.

@Tommaso - Can you have a look

Chetan Mehrotra


Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 634 - Still Failing

2015-12-16 Thread Chetan Mehrotra
On Thu, Dec 17, 2015 at 12:07 PM, Tommaso Teofili
 wrote:
> can you guys reproduce locally?

I tried but it passes. Looking at the failure [1] it appears to be
coming from oak-solr-core (and not from oak-lucene). Just wondering if
there is some async behavior involved due to Solr. In such a case, if
we make any commit it might happen that changes made to the index are
not yet reflected to index readers.

I also remember quite a few failures in spell check support (OAK-3355)
which might have the same root cause. So maybe the Solr support for
facets, spell check and the suggester has some race condition
involved.

Chetan Mehrotra
[1] 
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/634/jdk=jdk1.8.0_11,label=Ubuntu,nsfixtures=SEGMENT_MK,profile=unittesting/console


JIRA issue not showing associated commits

2016-01-03 Thread Chetan Mehrotra
Hi Team,

Earlier, JIRA for Oak used to show commits related to an issue via the
FishEye integration. Now there is no such link. This makes it difficult to
determine what changes were done for an issue.

Any idea on how to get that integration back? It probably got lost when we
moved to the Epic-based model.

Chetan Mehrotra


Re: svn commit: r1722496 - /jackrabbit/oak/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/xml/ImporterImpl.java

2016-01-04 Thread Chetan Mehrotra
On Fri, Jan 1, 2016 at 7:56 PM,  wrote:

>  }
> -} else if (getDefinition(parent).isProtected()) {
> -if (pnImporter != null) {
> -pnImporter.end(parent);
> -// and reset the pnImporter field waiting for the next
> protected
> -// parent -> selecting again from available importers
> -pnImporter = null;
> -}
> +} else if ((pnImporter != null) &&
> getDefinition(parent).isProtected()) {
> +pnImporter.end(parent);
> +// and reset the pnImporter field waiting for the next
> protected
> +// parent -> selecting again from available importers
> +pnImporter = null;
>  }
>

The above change is causing a couple of test failures in CUG:

===
Failed tests:
testNestedCug(org.apache.jackrabbit.oak.spi.security.authorization.cug.impl.CugImportIgnoreTest)

testNestedCug(org.apache.jackrabbit.oak.spi.security.authorization.cug.impl.CugImportAbortTest)

testNestedCug(org.apache.jackrabbit.oak.spi.security.authorization.cug.impl.CugImportBesteffortTest)
===

It happens because `getDefinition(parent).isProtected()` has a side
effect of triggering an exception. With the above code change that call
is not made if 'pnImporter' is null, which causes a change in behaviour.
So it would be better to revert that change.

Chetan Mehrotra


Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 653 - Failure

2016-01-05 Thread Chetan Mehrotra
On Wed, Jan 6, 2016 at 11:13 AM, Apache Jenkins Server
 wrote:
> Stack Trace:
> java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
> at 
> org.apache.jackrabbit.oak.fixture.DocumentRdbFixture.toString(DocumentRdbFixture.java:82)

Looks like commons-lang is not available in oak-jcr. Added it as a
test dependency to see if this gets resolved

Chetan Mehrotra


Re: svn commit: r1724598 - in /jackrabbit/oak/trunk/oak-core/src: main/java/org/apache/jackrabbit/oak/api/ main/java/org/apache/jackrabbit/oak/plugins/document/rdb/ main/java/org/apache/jackrabbit/oak

2016-01-14 Thread Chetan Mehrotra
On Thu, Jan 14, 2016 at 6:40 PM,   wrote:
> 
> jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/api/Blob.java
> 
> jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBDocumentStore.java
> 
> jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/value/BinaryImpl.java

I see some changes to Blob/BinaryImpl. Are those changes related to
this issue? Most likely it's just noise, but I wanted to confirm.

Chetan Mehrotra


Re: svn commit: r1725250 - in /jackrabbit/oak/trunk: oak-core/src/main/java/org/apache/jackrabbit/oak/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/atomic/ oak-core/src/test/java/org/apach

2016-01-18 Thread Chetan Mehrotra
Hi Davide,

On Mon, Jan 18, 2016 at 5:46 PM,   wrote:
> + */
> +public AtomicCounterEditorProvider() {
> +clusterSupplier = new Supplier() {
> +@Override
> +public Clusterable get() {
> +return cluster.get();
> +}
> +};
> +schedulerSupplier = new Supplier() {
> +@Override
> +public ScheduledExecutorService get() {
> +return scheduler.get();
> +}
> +};
> +storeSupplier = new Supplier() {
> +@Override
> +public NodeStore get() {
> +return store.get();
> +}
> +};
> +wbSupplier = new Supplier() {
> +@Override
> +public Whiteboard get() {
> +return whiteboard.get();
> +}
> +};
> +}

Just curious about the use of the above approach. Is it for keeping the
dependencies non-static, or for using final instance variables? If you
mark the references as static then all those bind and unbind methods
would not be required, as by the time the component is active the
dependencies would be set.


Chetan Mehrotra


Re: Restructure docs

2016-01-20 Thread Chetan Mehrotra
On Wed, Jan 20, 2016 at 2:46 PM, Davide Giannella  wrote:
> When you change/add/remove an item from the left-hand menu, you'll have
> to redeploy the whole site as it will be hardcoded within the html of
> each page. Deploying the whole website is a long process. Therefore
> limiting the changes over there make things faster.

I mostly do a partial commit, i.e. only the modified page, and it has
worked well. Changing the left-side menu is not a very frequent task,
and for that I think doing a full deploy of the site is fine for now.

Chetan Mehrotra


Re: Issue using the text extraction with lucene

2016-01-23 Thread Chetan Mehrotra
On Sat, Jan 23, 2016 at 9:34 PM, Stephan Becker
 wrote:
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.commons.csv.CSVFormat.withIgnoreSurroundingSpaces()Lorg/apache/commons/csv/CSVFormat;

Looks like tika-app-1.11 is using commons-csv 1.0 [1] while Oak uses
1.1, and CSVFormat.withIgnoreSurroundingSpaces was added in v1.1. We
tested this earlier with Tika 1.6. So you can try adding the commons-csv
jar as the first one on the classpath:

java -cp commons-csv-1.1.jar:tika-app-1.11.jar:oak-run-1.2.4.jar

Chetan Mehrotra
[1]http://svn.apache.org/viewvc/tika/tags/1.11-rc1/tika-parsers/pom.xml?view=markup#l328


Re: Issue using the text extraction with lucene

2016-01-24 Thread Chetan Mehrotra
On Sun, Jan 24, 2016 at 2:28 AM, Stephan Becker
 wrote:
> How does it then further extract the
> text from added documents?

Currently the extracted-text support does not allow updates, i.e. it
only has the text extracted at the time the extraction is done via the
tool. Text extracted later would not be added. The primary aim was to
speed up indexing time during migration.

Chetan Mehrotra


Re: JUnit tests with FileDataStore

2016-01-27 Thread Chetan Mehrotra
To make use of the FileDataStore you would need to configure a
SegmentNodeStore, as the MemoryNodeStore does not allow plugging in a
custom BlobStore.

Have a look at the snippet [1] for a possible approach.
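
A rough sketch along the same lines; the exact FileStore/SegmentNodeStore
builder names shifted across Oak 1.x releases, so treat the calls and paths
below as assumptions to adapt to your version:

import java.io.File;
import javax.jcr.Repository;

import org.apache.jackrabbit.core.data.FileDataStore;
import org.apache.jackrabbit.oak.Oak;
import org.apache.jackrabbit.oak.jcr.Jcr;
import org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore;
import org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStore;
import org.apache.jackrabbit.oak.plugins.segment.file.FileStore;
import org.apache.jackrabbit.oak.spi.state.NodeStore;

public final class FileDataStoreFixture {

    // SegmentNodeStore backed by a FileDataStore, for tests that need
    // binary references; paths are placeholders.
    public static Repository create() throws Exception {
        FileDataStore fds = new FileDataStore();
        fds.init("target/datastore");

        FileStore fileStore = FileStore.builder(new File("target/segmentstore"))
                .withBlobStore(new DataStoreBlobStore(fds)) // plug in the DataStore
                .build();
        NodeStore nodeStore = SegmentNodeStore.builder(fileStore).build();
        return new Jcr(new Oak(nodeStore)).createRepository();
    }
}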

Chetan Mehrotra
[1] https://gist.github.com/chetanmeh/6242d0a7fe421955d456


On Wed, Jan 27, 2016 at 6:42 AM, Tobias Bocanegra  wrote:
> Hi,
>
> I have some tests in filevault that I want to run with the
> FileDataStore, but I couldn't figure out how to setup the repository
> correctly here [0]. I also looked at the tests in oak, but I couldn't
> find a valid reference.
>
> The reason for this is to test the binary references, which afaik only
> work with the FileDataStore.
> at least my test [1] works with jackrabbit, but not for oak.
>
> thanks.
> regards, toby
>
> [0] 
> https://github.com/apache/jackrabbit-filevault/blob/trunk/vault-core/src/test/java/org/apache/jackrabbit/vault/packaging/integration/IntegrationTestBase.java#L118-L120
> [1] 
> https://github.com/apache/jackrabbit-filevault/blob/trunk/vault-core/src/test/java/org/apache/jackrabbit/vault/packaging/integration/TestBinarylessExport.java


Re: svn commit: r1727311 - in /jackrabbit/oak/trunk/oak-core/src: main/java/org/apache/jackrabbit/oak/osgi/OsgiWhiteboard.java test/java/org/apache/jackrabbit/oak/osgi/OsgiWhiteboardTest.java

2016-01-29 Thread Chetan Mehrotra
On Fri, Jan 29, 2016 at 4:08 PM, Michael Dürig  wrote:
>
> Shouldn't we make this volatile?

Ack. Would do that

Chetan Mehrotra


Re: svn commit: r1728341 - /jackrabbit/oak/trunk/oak-segment/src/main/java/org/apache/jackrabbit/oak/plugins/segment/SegmentGraph.java

2016-02-03 Thread Chetan Mehrotra
On Wed, Feb 3, 2016 at 10:17 PM,   wrote:
> +private static String toString(Throwable e) {
> +StringWriter sw = new StringWriter();
> +PrintWriter pw = new PrintWriter(sw, true);
> +try {
> +e.printStackTrace(pw);
> +return sw.toString();
> +} finally {
> +pw.close();
> +}
>  }
> +

Maybe use com.google.common.base.Throwables#getStackTraceAsString?


Chetan Mehrotra


Re: svn commit: r1728341 - /jackrabbit/oak/trunk/oak-segment/src/main/java/org/apache/jackrabbit/oak/plugins/segment/SegmentGraph.java

2016-02-05 Thread Chetan Mehrotra
On Fri, Feb 5, 2016 at 2:54 PM, Michael Dürig  wrote:
> There's always another library ;-)

For utility stuff, well, almost!

Chetan Mehrotra


Re: R: info about jackrabbitoak.

2016-02-24 Thread Chetan Mehrotra
On Wed, Feb 24, 2016 at 2:46 PM, Ancona Francesco
 wrote:
> that the project depends on felix (osgi) dependency.

It does not depend on the Felix framework, only on some modules from the
Felix project. There is a webapp example [1] where you can deploy the war
on Tomcat or another web container and have your code in the war access
the repository instance.

Chetan Mehrotra
[1] https://github.com/apache/jackrabbit-oak/tree/trunk/oak-examples/webapp


Re: testing blob equality

2016-02-29 Thread Chetan Mehrotra
On Mon, Feb 29, 2016 at 6:42 PM, Tomek Rekawek  wrote:
> I wonder if we can switch the order of length and identity comparison in 
> AbstractBlob#equal() method. Is there any case in which the 
> getContentIdentity() method will be slower than length()?

That can be switched, but I am afraid it would not work as expected. In
JackrabbitNodeState#createBlob determining the contentIdentity involves
determining the length. You can give
org.apache.jackrabbit.oak.upgrade.blob.LengthCachingDataStore a try
(see OAK-2882 for details).

Chetan Mehrotra


Re: [1.4.0][blocked] oak-examples and circular dependencies on oak itself

2016-03-01 Thread Chetan Mehrotra
On Tue, Mar 1, 2016 at 10:51 PM, Davide Giannella  wrote:
> I'm kind-of stuck in the release process as oak-examples contains
> dependencies to oak-1.4-SNAPSHOT. The problem is the -SNAPSHOT bit.

I wonder how it has worked so far for the various 1.3.x releases.

One approach we can try is to get rid of oak.version and make use of
project.version. That would be similar to how oak-lucene depends on
oak-core and hence should work.

Chetan Mehrotra


Re: oak-resilience

2016-03-07 Thread Chetan Mehrotra
Cool stuff, Tomek! This was something which was discussed at the last
Oakathon, so it's great to have a way to do resilience testing
programmatically. I will give it a try.
Chetan Mehrotra


On Mon, Mar 7, 2016 at 1:49 PM, Stefan Egli  wrote:
> Hi Tomek,
>
> Would also be interesting to see the effect on the leases and thus
> discovery-lite under high memory load and network problems.
>
> Cheers,
> Stefan
>
> On 04/03/16 11:13, "Tomek Rekawek"  wrote:
>
>>Hello,
>>
>>For some time I've worked on a little project called oak-resilience. It
>>aims to be a resilience testing framework for the Oak. It uses
>>virtualisation to run Java code in a controlled environment, that can be
>>spoilt in different ways, by:
>>
>>* resetting the machine,
>>* filling the JVM memory,
>>* filling the disk,
>>* breaking or deteriorating the network.
>>
>>I described currently supported features in the README file [1].
>>
>>Now, once I have a hammer I'm looking for a nail. Could you share your
>>thoughts on areas/features in Oak which may benefit from being
>>systematically tested for the resilience in the way described above?
>>
>>Best regards,
>>Tomek
>>
>>[1]
>>https://github.com/trekawek/jackrabbit-oak/tree/resilience/oak-resilience
>>
>>--
>>Tomek Rękawek | Adobe Research | www.adobe.com
>>reka...@adobe.com
>>
>
>


Re: [VOTE] Release Apache Jackrabbit Oak 1.4.0 (take 3)

2016-03-08 Thread Chetan Mehrotra
On Mon, Mar 7, 2016 at 4:21 PM, Davide Giannella  wrote:
> [ ] +1 Release this package as Apache Jackrabbit Oak 1.4.0

All checks OK, including integration tests [1]

Chetan Mehrotra
[1] Run check-release.sh with following mvn command
mvn verify -fn -PintegrationTesting,unittesting,rdb-derby
-Drdb.jdbc-url=jdbc:derby:foo\;create=true


Re: parent pom env.OAK_INTEGRATION_TESTING

2016-03-22 Thread Chetan Mehrotra
On Tue, Mar 22, 2016 at 9:49 PM, Davide Giannella  wrote:
> I can't really recall why and if we use this.

It's referred to in the main README.md so as to allow a developer to
always enable running of the integration tests.

Chetan Mehrotra


Re: [VOTE] Release Apache Jackrabbit Oak 1.4.1

2016-03-27 Thread Chetan Mehrotra
On Thu, Mar 24, 2016 at 8:02 PM, Davide Giannella  wrote:
> [ ] +1 Release this package as Apache Jackrabbit Oak 1.4.1

+1 (ALL CHECKS OK)

Chetan Mehrotra


Re: Extracting subpaths from a DocumentStore repo

2016-03-29 Thread Chetan Mehrotra
Hi Robert,

On Mon, Mar 28, 2016 at 7:59 PM, Robert Munteanu  wrote:
> - create a repository (R1) , populate /foo and /bar with some content
> - extract data for /foo and /bar from R1
> - pre-populate a DS 'storage area' ( MongoDB collection or RDB table )
> with the data extracted above
> - configure a new repository (R2) to mount /foo and /bar with the data
> from above

Instead of relying on the DocumentStore API for "cloning" certain paths,
it might be easier to use Repository Sidegrade [1] style logic, which
works at the NodeState level. In that case you would not need to rely on
Document-level details.

Chetan Mehrotra
[1] https://jackrabbit.apache.org/oak/docs/migration.html


Re: svn commit: r1737349 - /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBConnectionHandler.java

2016-04-01 Thread Chetan Mehrotra
Hi Julian,

On Fri, Apr 1, 2016 at 5:19 PM,   wrote:
> +@Nonnull
> +private Connection getConnection() throws IllegalStateException, 
> SQLException {
> +long ts = System.currentTimeMillis();
> +Connection c = getDataSource().getConnection();
> +if (LOG.isDebugEnabled()) {
> +long elapsed = System.currentTimeMillis() - ts;
> +if (elapsed >= 100) {
> +LOG.debug("Obtaining a new connection from " + this.ds + " 
> took " + elapsed + "ms");
> +}
> +}
> +return c;
> +}

You can also use PerfLogger here, which is used in other places in
DocumentNodeStore as well:

---
final PerfLogger PERFLOG = new PerfLogger(
LoggerFactory.getLogger(DocumentNodeStore.class.getName()
+ ".perf"));

final long start = PERFLOG.start();
Connection c = getDataSource().getConnection();
PERFLOG.end(start, 100, "Obtaining a new connection from {} ", ds);
---

This would also avoid the call to System.currentTimeMillis() if debug
log is not enabled

Chetan Mehrotra


Re: svn commit: r1737349 - /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBConnectionHandler.java

2016-04-01 Thread Chetan Mehrotra
On Fri, Apr 1, 2016 at 6:40 PM, Julian Reschke  wrote:
> Did you benchmark System.currentTimeMillis() as opposed to checking the log
> level?

Well, the time taken by a single isDebugEnabled would always be less than
System.currentTimeMillis() + isDebugEnabled! In this case it does not
matter much anyway, as the remote call would have much more overhead.

The suggestion here was more about having a consistent way of doing such
things, not a hard requirement per se ...

Chetan Mehrotra


Re: [VOTE] Release Apache Jackrabbit Oak 1.2.14

2016-04-19 Thread Chetan Mehrotra
On Wed, Apr 20, 2016 at 10:25 AM, Amit Jain  wrote:

>   [ ] +1 Release this package as Apache Jackrabbit Oak 1.2.14


All checks ok

Chetan Mehrotra


Re: [VOTE] Please vote for the final name of oak-segment-next

2016-04-26 Thread Chetan Mehrotra
I missed sending a nomination on the earlier thread. If it's not too late,
here is one more proposal:

oak-segment-v2

This is somewhat similar to names used in Mongo mmapv1 and mmapv2.

Chetan Mehrotra

On Tue, Apr 26, 2016 at 2:32 PM, Tommaso Teofili 
wrote:

> oak-segment-store +1
>
> Regards,
> Tommaso
>
> Il giorno lun 25 apr 2016 alle ore 16:52 Vikas Saurabh <
> vikas.saur...@gmail.com> ha scritto:
>
> > > oak-embedded-store +1
> >
> >
> > Thanks,
> > Vikas
> >
>


API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-03 Thread Chetan Mehrotra
Hi Team,

For OAK-1963 we need to allow access to the actual Blob location, say in
the form of a File instance or an S3 object id, etc. This access is needed
to perform optimized IO operations around binary objects, e.g.

1. The File object can be used to spool the file content with zero copy
using NIO by accessing the FileChannel directly [1]

2. Client code can efficiently replicate a binary stored in S3 by having
direct access to the S3 object and using a copy operation

To allow such access we would need a new API in the form of
AdaptableBinary.

API
===

public interface AdaptableBinary {

    /**
     * Adapts the binary to another type like File, URL etc.
     *
     * @param <AdapterType> The generic type to which this binary is
     *            adapted to
     * @param type The Class object of the target type, such as
     *            File.class
     * @return The adapted target or null if the binary cannot
     *         adapt to the requested type
     */
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

Usage
=

Binary binProp = node.getProperty("jcr:data").getBinary();

// Check if Binary is of type AdaptableBinary
if (binProp instanceof AdaptableBinary) {
    AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;

    // Adapt it to a File instance
    File file = adaptableBinary.adaptTo(File.class);
}



The Binary instance returned by Oak,
i.e. org.apache.jackrabbit.oak.plugins.value.BinaryImpl, would then
implement this interface, and calling code can check the type, cast it
and then adapt it.

Key Points


1. Depending on the backing BlobStore the binary can be adapted to various
types. For the FileDataStore it can be adapted to a File. For the
S3DataStore it can be adapted either to a URL or to some
S3DataStore-specific type.

2. Security - Thomas suggested that for better security the ability to
adapt should be restricted based on session permissions. So adaptation
would work only if the user has the required permission; otherwise null
would be returned.

3. Adaptation proposal is based on Sling Adaptable [2]

4. This API is for now exposed only at the JCR level. I am not sure
whether we should do it at the Oak level, as Blob instances are currently
not bound to any session. So the proposal is to place this in the
'org.apache.jackrabbit.oak.api' package.

Kindly provide your feedback! Also, any suggestions/guidance around how
the access control should be implemented would be welcome.

Chetan Mehrotra
[1] http://www.ibm.com/developerworks/library/j-zerocopy/
[2]
https://sling.apache.org/apidocs/sling5/org/apache/sling/api/adapter/Adaptable.html


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Wed, May 4, 2016 at 10:07 PM, Ian Boston  wrote:

> If the File or URL is writable, will writing to the location cause issues
> for Oak ?
>

Yes, that would cause problems. The expectation here is that code using a
direct location needs to behave responsibly.

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Thu, May 5, 2016 at 1:31 PM, Davide Giannella  wrote:

> Would it be possible to avoid the `instaceof`? Which means, in my
> opinion, all our binaries should be Adaptable.
>

The Binary interface is part of the JCR API so it cannot be modified to
extend Adaptable. Hence the client code would need to cast and
special-case it.

> Plus I would add anyhow an oak.api interface Adaptable so that we can
then, if needed, apply the same concept anywhere else.

That can also be done. For now I was being conservative in the API being
introduced. If later we find that Adaptable-style support is needed
elsewhere, it can be introduced as a first-class API.

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
> This proposal introduces a huge leak of abstractions and has deep security
implications.

I understand the leak-of-abstractions concern. However, I would like to
understand the security concern a bit more.

One way I can think of that it could cause a security concern is if you
have some malicious code running in the same JVM which can then do bad
things with the file handle. Do note that the File handle would not get
exposed via any remoting API we currently support. And in this case, if
malicious code is already running in the same JVM then security is
breached anyway, and the code can use reflection to access internal
details.

So if there is any other possible security concern then I would like to
discuss it.

Coming to usecases

Usecase A - Image rendition generation
-

We have some bigger deployments where lots of images get uploaded to the
repository, and there are some conversions (rendition generation) which
are performed by OS-specific native executables. Such programs work
directly on a file handle. Without this change we currently need to first
spool the file content into some temporary location and then pass that to
the other program. This adds unnecessary overhead, which can be avoided
when a FileDataStore is being used, where we can provide direct access to
the file.

Usecase B - Efficient replication across regions in S3
--

This is for an AEM-based setup which is running on Oak with the
S3DataStore. There we have a global deployment where the author instance
is running in one region and binary content is to be distributed to
publish instances running in different regions. The DataStore size is
huge, say 100TB, and for efficient operation we need to use binary-less
replication. In most cases only a very small subset of the binary content
would need to be present in other regions. The current way (via a shared
DataStore) to support that would involve synchronizing the S3 bucket
across all such regions, which would increase the storage cost
considerably.

Instead of that, the plan is to replicate the specific assets via an S3
copy operation. This would ensure that big assets can be copied
efficiently at the S3 level, and that would require direct access to the
S3 object.

Again, in all such cases one can always resort to the current level of
support, i.e. copy over all the content via an InputStream into some
temporary store and then use that. But that would add considerable
overhead when assets are 100MB or more in size. So the proposed approach
would allow client code to do this efficiently, depending on the
underlying storage capability.

> To me sounds like breaching the JCR and NodeState layers to directly
> manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
> replication across different instances, but imho the right way to address
> that is extending one of the current DataStore implementations or create a
> new one.

The originally proposed approach in OAK-1963 was like that, i.e. introduce
this access method on the BlobStore, working on a reference. But in that
case client code would need to deal with the BlobStore API. In either
case access to the actual binary storage data would be required.

Chetan Mehrotra

On Thu, May 5, 2016 at 2:49 PM, Tommaso Teofili 
wrote:

> +1 to Francesco's concerns, exposing the location of a binary at the
> application level doesn't sound good from a security perspective.
> To me sounds like breaching the JCR and NodeState layers to directly
> manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
> replication across different instances, but imho the right way to address
> that is extending one of the current DataStore implementations or create a
> new one.
> I am also concerned that this Adaptable pattern would open room for other
> such hacks into the stack.
>
> My 2 cents,
> Tommaso
>
>
> Il giorno gio 5 mag 2016 alle ore 11:00 Francesco Mari <
> mari.france...@gmail.com> ha scritto:
>
> > This proposal introduces a huge leak of abstractions and has deep
> security
> > implications.
> >
> > I guess that the reason for this proposal is that some users of Oak would
> > like to perform some operations on binaries in a more performant way by
> > leveraging the way those binaries are stored. If this is the case, I
> > suggest those users to evaluate an applicative solution implemented on
> top
> > of the JCR API.
> >
> > If a user needs to store some important binary data (files, images, etc.)
> > in an S3 bucket or on the file system for performance reasons, this
> > shouldn't affect how Oak handles blobs internally. If some assets are of
> > special interest for the user, then the user should bypass Oak and take
> > care of the storage of those assets directly. Oak can be used to store

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Thu, May 5, 2016 at 4:38 PM, Francesco Mari 
wrote:

> The security concern is quite easy to explain: it's a bypass of our
> security model. Imagine that, using a session with the appropriate
> privileges, a user accesses a Blob and adapts it to a file handle, an S3
> bucket or a URL. This code passes this reference to another piece of code
> that modifies the data directly even if - in the same deployment - it
> shouldn't be able to access the Blob instance to begin with.
>

How is this different from the case where some code obtains a Node via an
admin session and passes that Node instance to other code which, say,
deletes important content via it? In the end we have to trust the client
code to do the correct thing when given appropriate rights. So in the
current proposal the code can only adapt the binary if the session has
the expected permissions. Past that point we need to trust the code to
behave properly.

> In both the use case, the customer is coupling the data with the most
> appropriate storage solution for his business case. In this case, customer
> code - and not Oak - should be responsible for the management of that
data.

Well, then it means that the customer implements their very own
DataStore-like solution, and all the application code does not make use
of JCR Binary and instead uses another service to resolve the references.
This would greatly reduce the usefulness of JCR for asset-heavy
applications which use JCR to manage binary content along with its
metadata.


Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Thu, May 5, 2016 at 5:07 PM, Francesco Mari 
wrote:

>
> This is a totally different thing. The change to the node will be committed
> with the privileges of the session that retrieved the node. If the session
> doesn't have enough privileges to delete that node, the node will not be
> deleted. There is no escape from the security model.


A "bad code" when passes a node backed via admin session can still do bad
thing as admin session has all the privileges. In same way if a bad code is
passed a file handle then it can cause issue. So I am still not sure on the
attack vector which we are defending against.

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
To highlight - as mentioned earlier, the user of the proposed API is tying
itself to implementation details of Oak, and if these change later then
that code would also need to be changed. Or, as Ian summed it up:

> if the API is introduced it should create an out of band agreement with
the consumers of the API to act responsibly.

The method is to be used for those important cases where you do rely on
implementation details to get optimal performance in very specific
scenarios. It's like the DocumentNodeStore making use of some
Mongo-specific API to perform some critical operation and achieve better
performance, after checking that the underlying DocumentStore is
Mongo-based.

I have seen the discussion on JCR-3534 and other related issues but still
do not see any conclusion on how to answer such queries where direct
access to blobs is required for performance reasons. This issue is not
about exposing the blob reference for remote access but more about an
optimal path for in-VM access.

> who owns the resource? Who coordinates (concurrent) access to it and how?
What are the correctness and performance implications here (races,
deadlock, corruptions, JCR semantics)?

The client code would need to be implemented in a proper way. It's more
like implementing a CommitHook: if implemented in an incorrect way it
would cause issues, deadlocks etc. But then we assume that anyone
implementing that interface would take proper care in the implementation.

>  it limits implementation freedom and hinders further evolution
(chunking, de-duplication, content based addressing, compression, gc, etc.)
for data stores.

As mentioned earlier, some parts of an API indicate a closer dependency on
how things work (like an SPI, or a ConsumerType API in OSGi terms). By
using such an API, client code definitely ties itself to Oak
implementation details, but it should not limit how those implementation
details evolve. So when they change, client code needs to adapt itself
accordingly. Oak can express that by incrementing the minor version of
the exported package to indicate the change in behavior.

> bypassing JCR's security model

I do not yet see the attack vector which we need to defend against
differently here. Again, the blob URL is not being exposed, say, as part
of WebDAV or any other remote call. So I would like to understand the
security concern better here (unless it is defending against malicious or
badly implemented client code, which we discussed above).

> Can't we come up with an API that allows the blobs to stay under control
of Oak?

The code needs to work either at the OS level (say on a file handle) or,
say, on an S3 object. So I do not see a way it can work without having
access to those details.

FWIW there is code out there which reverse-engineers the blobId to access
the actual binary. People do it so as to get decent throughput in image
rendition logic for large-scale deployments. The proposal here was to
formalize that approach by providing a proper API. If we do not provide
such an API then the only way for them would be to continue relying on
reverse-engineering the blobId!

> If not, this is probably an indication that those blobs shouldn't go into
Oak but just references to it as Francesco already proposed. Anything else
is whether fish nor fowl: you can't have the JCR goodies but at the same
time access underlying resources at will.

That's a fine argument to make. But users here have a real problem to
solve which we should not ignore. Oak-based systems are being proposed
for large asset deployments where one of the primary requirements is the
handling/processing of 100s of TB of binary data. So we would then have
to recommend for such cases not to use the JCR Binary abstraction and to
manage the binaries on your own. That would then solve both problems
(though it might break lots of tooling built on top of the JCR API to
manage those binaries)!

Thinking more - another approach I can then suggest is that people
implement their own BlobStore (maybe by extending ours) and provide this
API there, i.e. something which takes a blob id and provides the required
details. This way we "outsource" the problem. Would that be acceptable?

Chetan Mehrotra

On Mon, May 9, 2016 at 2:28 PM, Michael Dürig  wrote:

>
> Hi,
>
> I very much share Francesco's concerns here. Unconditionally exposing
> access to operation system resources underlying Oak's inner working is
> troublesome for various reasons:
>
> - who owns the resource? Who coordinates (concurrent) access to it and
> how? What are the correctness and performance implications here (races,
> deadlock, corruptions, JCR semantics)?
>
> - it limits implementation freedom and hinders further evolution
> (chunking, de-duplication, content based addressing, compression, gc, etc.)
> for data stores.
>
> - bypassing JCR's security model
>
> Pretty much all of this has been discussed in the scope of
> https://issues.apache.org/jira/browse/JCR-3534 and

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
Had an offline discussion with Michael on this and explained the use case
requirement in more detail. One concern that has been raised is that such
a generic adaptTo API is too inviting for improper use, and Oak does not
have any context around when this URL is exposed or for how long it is
used.

So instead of having a generic adaptTo API at the JCR level we can have a
BlobProcessor callback (Approach #B). Below is more of a strawman
proposal. Once we have a consensus we can go over the details.

interface BlobProcessor {
    void process(AdaptableBlob blob);
}

Where AdaptableBlob is

public interface AdaptableBlob {
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

The BlobProcessor instance can be passed via the BlobStore API. So the
client would look up a BlobStore service (so use the Oak-level API) and
pass it the ContentIdentity of the JCR Binary, aka the blobId

interface BlobStore {
    void process(String blobId, BlobProcessor processor);
}

The approach ensures:

1. That any blob handle exposed is only guaranteed for the duration of
the 'process' invocation
2. That there is no guarantee on the validity of the blob handle (File,
S3 object) beyond the callback. So one should not keep the passed File
handle for later use

Hopefully this should address some of the concerns raised in this thread.
Looking forward to feedback :)

Chetan Mehrotra

On Mon, May 9, 2016 at 6:24 PM, Michael Dürig  wrote:

>
>
> On 9.5.16 11:43 , Chetan Mehrotra wrote:
>
>> To highlight - As mentioned earlier the user of proposed api is tying
>> itself to implementation details of Oak and if this changes later then
>> that
>> code would also need to be changed. Or as Ian summed it up
>>
>> if the API is introduced it should create an out of band agreement with
>>>
>> the consumers of the API to act responsibly.
>>
>
> So what does "to act responsibly" actually means? Are we even in a
> position to precisely specify this? Experience tells me that we only find
> out about those semantics after the fact when dealing with painful and
> expensive customer escalations.
>
> And even if we could, it would tie Oak into very tight constraints on how
> it has to behave and how not. Constraints that would turn out prohibitively
> expensive for future evolution. Furthermore a huge amount of resources
> would be required to formalise such constraints via test coverage to guard
> against regressions.
>
>
>
>> The method is to be used for those important case where you do rely on
>> implementation detail to get optimal performance in very specific
>> scenarios. Its like DocumentNodeStore making use of some Mongo specific
>> API
>> to perform some important critical operation to achieve better performance
>> by checking if the underlying DocumentStore is Mongo based.
>>
>
> Right, but the Mongo specific API is a (hopefully) well thought through
> API where as with your proposal there are a lot of open questions and
> concerns as per my last mail.
>
> Mongo (and any other COTS DB) for good reasons also don't give you direct
> access to its internal file handles.
>
>
>
>> I have seen discussion of JCR-3534 and other related issue but still do
>> not
>> see any conclusion on how to answer such queries where direct access to
>> blobs is required for performance aspect. This issue is not about exposing
>> the blob reference for remote access but more about optimal path for in VM
>> access
>>
>
> One bottom line of the discussions in that issue is that we came to a
> conclusion after clarifying the specifics of the use case. Something I'm
> still missing here. The case you brought forward is too general to serve as
> a guideline for a solution. Quite to the contrary, to me it looks like a
> solution to some problem (I'm trying to understand).
>
>
>
>> who owns the resource? Who coordinates (concurrent) access to it and how?
>>>
>> What are the correctness and performance implications here (races,
>> deadlock, corruptions, JCR semantics)?
>>
>> The client code would need to be implemented in a proper way. Its more
>> like
>> implementing a CommitHook. If implemented in incorrect way it would cause
>> issues deadlocks etc. But then we assume that any one implementing that
>> interface would take proper care in implementation.
>>
>
> But a commit hook is an internal SPI. It is not advertised to the whole
> world as a public API.
>
>
>
>>  it limits implementation freedom and hinders further evolution
>>>
>> (chunking, de-duplication, content based addressing, compression, gc,
>> etc.)
>> for data stores.
>>
>> As mentioned earlier. Some part of API indicates a closer depend

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
On Mon, May 9, 2016 at 8:27 PM, Ian Boston  wrote:

> I thought the consumers of this api want things like the absolute path of
> the File in the BlobStore, or the bucket and key of the S3 Object, so that
> they could transmit it and use it for processing independently of Oak
> outside the callback ?
>

Most cases can still be handled; just do it within the callback:

blobStore.process("xxx", new BlobProcessor() {
    @Override
    public void process(AdaptableBlob blob) {
        File file = blob.adaptTo(File.class);
        transformImage(file);
    }
});

Doing this within the callback would allow Oak to enforce some safeguards
(more on that in the next mail) and still allow the user to perform
optimal binary processing.

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
Some more points around the proposed callback-based approach:

1. Possible security, or enforcing read-only access to the exposed file -
the file provided within the BlobProcessor callback can be a symlink
created with an OS user account which only has read-only access. The
symlink can be removed once the callback returns

2. S3DataStore security concern - for the S3DataStore we would only be
exposing the S3 object identifier, and the client code would still need
the AWS credentials to connect to the bucket and perform the required
copy operation

3. Possibility of further optimization in S3DataStore processing -
currently, when reading a binary from the S3DataStore, the binary content
is *always* spooled to some local temporary file (in the local cache) and
then an InputStream is opened on that file. So even if the code needs to
read only the first few bytes of the stream, the whole file has to be
read. This happens because with the current JCR Binary API we are not in
control of the lifetime of the exposed InputStream, so if we expose the
InputStream we cannot determine until when the backing S3 SDK resources
need to be held.

Also, the current S3DataStore always creates a local copy - with a
callback-based approach we can safely expose this file, which would allow
layers above to avoid spooling the content again locally for processing.
And with the callback boundary we can later do the required cleanup.


Chetan Mehrotra

On Mon, May 9, 2016 at 7:15 PM, Chetan Mehrotra 
wrote:

> Had an offline discussion with Michael on this and explained the usecase
> requirement in more details. One concern that has been raised is that such
> a generic adaptTo API is too inviting for improper use and Oak does not
> have any context around when this url is exposed for what time it is used.
>
> So instead of having a generic adaptTo API at JCR level we can have a
> BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
> Once we have a consensus then we can go over the details
>
> interface BlobProcessor {
>void process(AdaptableBlob blob);
> }
>
> Where AdaptableBlob is
>
> public interface AdaptableBlob {
>  AdapterType adaptTo(Class type);
> }
>
> The BlobProcessor instance can be passed via BlobStore API. So client
> would look for a BlobStore service (so use the Oak level API) and pass it
> the ContentIdentity of JCR Binary aka blobId
>
> interface BlobStore{
>  void process(String blobId, BlobProcessor processor)
> }
>
> The approach ensures
>
> 1. That any blob handle exposed is only guaranteed for the duration
> of  'process' invocation
> 2. There is no guarantee on the utility of blob handle (File, S3 Object)
> beyond the callback. So one should not collect the passed File handle for
> later use
>
> Hopefully this should address some of the concerns raised in this thread.
> Looking forward to feedback :)
>
> Chetan Mehrotra
>
> On Mon, May 9, 2016 at 6:24 PM, Michael Dürig  wrote:
>
>>
>>
>> On 9.5.16 11:43 , Chetan Mehrotra wrote:
>>
>>> To highlight - As mentioned earlier the user of proposed api is tying
>>> itself to implementation details of Oak and if this changes later then
>>> that
>>> code would also need to be changed. Or as Ian summed it up
>>>
>>> if the API is introduced it should create an out of band agreement with
>>>>
>>> the consumers of the API to act responsibly.
>>>
>>
>> So what does "to act responsibly" actually mean? Are we even in a
>> position to precisely specify this? Experience tells me that we only find
>> out about those semantics after the fact when dealing with painful and
>> expensive customer escalations.
>>
>> And even if we could, it would tie Oak into very tight constraints on how
>> it has to behave and how not. Constraints that would turn out prohibitively
>> expensive for future evolution. Furthermore a huge amount of resources
>> would be required to formalise such constraints via test coverage to guard
>> against regressions.
>>
>>
>>
>>> The method is to be used for those important cases where you do rely on
>>> implementation details to get optimal performance in very specific
>>> scenarios. It's like DocumentNodeStore making use of some Mongo specific
>>> API
>>> to perform some important critical operation to achieve better
>>> performance
>>> by checking if the underlying DocumentStore is Mongo based.
>>>
>>
>> Right, but the Mongo specific API is a (hopefully) well thought through
>> API where as with your proposal there are a lot of open questions and
>> concerns as per my last mail.
>>
>> Mongo (and any other COTS DB) for good reasons

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Chetan Mehrotra
> what guarantees do/can we give re. this file handle within this context.
Can it suddenly go away (e.g. because of gc or internal re-organisation)?
How do we establish, test and maintain (e.g. from regressions) such
guarantees?

Logically it should not go away suddenly. So the GC logic should be aware of
such "inUse" instances (there is already such support for inUse cases).
Such a requirement can be validated via an integration testcase.

>  and more concerningly, how do we protect Oak from data corruption by
misbehaving clients? E.g. clients writing on that handle or removing it?
Again, if this is public API we need ways to test this.

Not sure what is meant by misbehaving client - is it malicious (by design) or
badly written code? For the latter, yes, that might pose a problem but we can
have some defense. I would expect the code making use of the API to behave
properly. In addition, as proposed above [1], for the FileDataStore we can
provide a symlinked file reference which exposes a read-only file handle. For
the S3DataStore the code would need access to the AWS credentials to perform
any write operation, which should be a sufficient defense.

> In an earlier mail you quite fittingly compared this to commit hooks,
which for good reason are an internal SPI.

Bit of a nitpick here ;) As per the Jcr class [2] one can provide a CommitHook
instance, so I am not sure we can term it internal. However, the point that I
wanted to emphasize is that Oak does provide some critical extension points,
and with misbehaving code one can shoot oneself in the foot; as an
implementation, only so much can be done.

regards
Chetan
[1]
http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:237kzuhor5y3tpli+state:results
[2]
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/Jcr.java#L190

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Chetan Mehrotra
Hi Angela,

On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber  wrote:

> Quite frankly I would very much appreciate if took the time to collect
> and write down the required (i.e. currently known and expected)
> functionality.
>
> Then look at the requirements and look what is wrong with the current
> API that we can't meet those requirements:
> - is it just missing API extensions that can be added with moderate effort?
> - are there fundamental problems with the current API that we needed to
> address?
> - maybe we even have intrinsic issues with the way we think about the role
> of the repo?
>
> IMHO, sticking to kludges might look promising on a short term but
> I am convinced that we are better off with a fundamental analysis of
> the problems... after all the Binary topic comes up on a regular basis.
> That leaves me with the impression that yet another tiny extra and
> adaptables won't really address the core issues.
>

Makes sense.

Have a look at the initial mail in the thread at [1] which talks about
the 2 usecases I know of. The image rendition usecase manifests itself in one
form or another, basically providing access to native programs via a file path
reference.

The approach proposed so far would be able to address them and is hence closer
to "is it just missing API extensions that can be added with moderate
effort?". If there is any other approach with which we can address both of the
referred usecases then we can implement that.

Let me know if more details are required. If required I can put it up on a
wiki page also.

Chetan Mehrotra
[1]
http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:zv5dzsgmoegupd7l+state:results


Usecases around Binary handling in Oak

2016-06-01 Thread Chetan Mehrotra
Hi Team,

Recently we had a discussion around a new API proposal for binary access
[1]. From the discussion it was determined that we should first have a
collection of the kind of usecases which cannot be easily met by current
JCR Binary support in Oak so as to get better understanding of various
requirements. That would help us in coming up with a proper solution to
enable such usecases going forward

To move forward on that I have tried to collect the various usecases at [2]
which I have seen in the past.

UC1 - processing a binary in JCR with a native library that only has access
  to the file system
UC2 - Efficient replication across regions in S3
UC3 - Text Extraction without temporary File with Tika
UC4 - Spooling the binary content to socket output via NIO
UC5 - Transferring the file to FileDataStore with minimal overhead
UC6 - S3 import
UC7 - Random write access in binaries
UC8 - X-SendFile


I would like to get teams feedback on the various usecases and then come up
with the list of usecases which we would like to properly support in Oak.

Once that is determined we can discuss the possible solutions and decide on
how it gets finally implemented.

Kindly provide your feedback!

Chetan Mehrotra
[1] http://markmail.org/thread/6mq4je75p64c5nyn
[2] https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-06-01 Thread Chetan Mehrotra
I have started a new mail thread around "Usecases around Binary handling in
Oak" so as to first collect the kind of usecases we need to support. Once
we decide that we can discuss the possible solution.

So lets continue the discussion on that thread

Chetan Mehrotra

On Tue, May 17, 2016 at 12:31 PM, Angela Schreiber 
wrote:

> Hi Oak-Devs
>
> Just for the record: This topic has been discussed in a Adobe
> internal Oak-coordination call last Wednesday.
>
> Michael Marth first provided some background information and
> we discussed the various concerns mentioned in this thread
> and tried to identity the core issue(s).
>
> Marcel, Michael Duerig and Thomas proposed alternative approaches
> on how to address the original issues that lead to the API
> proposal, which all would avoid leaking out information about
> the internal blob handling.
>
> Unfortunately we ran out of time and didn't conclude the call
> with an agreement on how to proceed.
>
> From my perception the concerns raised here could not be resolved
> by the additional information.
>
> I would suggest that we try to continue the discussion here
> on the list. Maybe with a summary of the alternative proposals?
>
> Kind regards
> Angela
>
> On 11/05/16 15:38, "Ian Boston"  wrote:
>
> >Hi,
> >
> >On 11 May 2016 at 14:21, Marius Petria  wrote:
> >
> >> Hi,
> >>
> >> I would add another use case in the same area, even if it is more
> >> problematic from the point of view of security. To better support load
> >> spikes an application could return 302 redirects to  (signed) S3 urls
> >>such
> >> that binaries are fetched directly from S3.
> >>
> >
> >Perhaps that question exposes the underlying requirement for some
> >downstream users.
> >
> >This is a question, not a statement:
> >
> >If the application using Oak exposed a RESTfull API that had all the same
> >functionality as [1], and was able to perform at the scale of S3, and had
> >the same security semantics as Oak, would applications that are needing
> >direct access to S3 or a File based datastore be able to use that API in
> >preference ?
> >
> >Is this really about issues with scalability and performance rather than a
> >fundamental need to drill deep into the internals of Oak ? If so,
> >shouldn't
> >the scalability and performance be fixed ? (assuming its a real concern)
> >
> >
> >
> >
> >>
> >> (if this can already be done or you think is not really related to the
> >> other two please disregard).
> >>
> >
> >AFAIK this is not possible at the moment. If it was deployments could use
> >nginX X-SendFile and other request offloading mechanisms.
> >
> >Best Regards
> >Ian
> >
> >
> >1 http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html
> >
> >
> >>
> >> Marius
> >>
> >>
> >>
> >> On 5/11/16, 1:41 PM, "Angela Schreiber"  wrote:
> >>
> >> >Hi Chetan
> >> >
> >> >IMHO your original mail didn't write down the fundamental analysis
> >> >but instead presented the solution. For each of the 2 cases I was
> >> >lacking the information _why_ this is needed.
> >> >
> >> >Both have been answered in private conversations only (1 today in
> >> >the oak call and 2 in a private discussion with tom). And
> >> >having heard them didn't make me more confident that the solution
> >> >you propose is the right thing to do.
> >> >
> >> >Kind regards
> >> >Angela
> >> >
> >> >On 11/05/16 12:17, "Chetan Mehrotra" 
> wrote:
> >> >
> >> >>Hi Angela,
> >> >>
> >> >>On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber 
> >> >>wrote:
> >> >>
> >> >>> Quite frankly I would very much appreciate if took the time to
> >>collect
> >> >>> and write down the required (i.e. currently known and expected)
> >> >>> functionality.
> >> >>>
> >> >>> Then look at the requirements and look what is wrong with the
> >>current
> >> >>> API that we can't meet those requirements:
> >> >>> - is it just missing API extensions that can be added with moderate
> >> >>>effort?
> >> >>> - are there fundamental problems with the current API that we
> >>needed to
> >>

Requirement to support multiple NodeStore instance in same setup (OAK-4490)

2016-06-21 Thread Chetan Mehrotra
Hi Team,

As part of OAK-4180 feature around using another NodeStore as a local
cache for a remote Document store I would need to register another
NodeStore instance (for now a SegmentNodeStore - OAK-4490) with the
OSGi service registry.

This instance would then be used by SecondaryStoreCacheService to save
NodeState under certain paths locally and use it later for reads.

With this change we would have a situation where there would be
multiple NodeStore instances in the same service registry. This can confuse
some components which have a dependency on NodeStore as a reference, and
we need to ensure they bind to the correct NodeStore instance.

Proposal A - Use a 'type' service property to distinguish
==

Register the NodeStore with a 'type' property. For now the value can
be 'primary' or 'secondary'. When any component registers the
NodeStore it also provides the type property.

On the user side the reference needs to specify which type of NodeStore it
needs to be bound to.

This would ensure that users of NodeStore get bound to the correct type.

If we use service.ranking then it can cause a race condition where the
secondary instance may get bound until the primary comes up.
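
For illustration, a rough sketch of how Proposal A could look in code (the
property name and the use of a plain BundleContext registration are
illustrative assumptions only):

import java.util.Dictionary;
import java.util.Hashtable;
import org.apache.jackrabbit.oak.spi.state.NodeStore;

// Registration side: publish the secondary store with a 'type' property
Dictionary<String, Object> props = new Hashtable<>();
props.put("type", "secondary");
bundleContext.registerService(NodeStore.class, secondarySegmentNodeStore, props);

// Consumer side: a Declarative Services reference that only binds to the
// primary instance
// @Reference(target = "(type=primary)")
// private NodeStore nodeStore;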

Looking for feedback on what approach to take

Chetan Mehrotra


Re: Requirement to support multiple NodeStore instance in same setup (OAK-4490)

2016-06-22 Thread Chetan Mehrotra
On Tue, Jun 21, 2016 at 4:52 PM, Julian Sedding  wrote:
> Not exposing the secondary NodeStore in the service registry would be
> backwards compatible. Introducing the "type" property potentially
> breaks existing consumers, i.e. is not backwards compatible.

I had a similar concern so proposed a new interface as part of OAK-4369.
However, later with further discussion I realized that we might have
similar requirements going forward, i.e. the presence of multiple NodeStore
impls, so it might be better to make the setup handle such a case.

So at this stage we have 2 options

1. Use a new interface to expose such "secondary" NodeStore
2. OR Use a new service property to distinguish between different roles

Not sure which one to go for. Maybe we go for a merged approach, i.e. have a new
interface as in #1 but also mandate that it provides its "role/type"
as a service property to allow clients to select the correct one.

Thoughts?

Chetan Mehrotra


Re: Requirement to support multiple NodeStore instance in same setup (OAK-4490)

2016-06-24 Thread Chetan Mehrotra
Okie, would go with the SecondaryNodeStoreProvider approach and also have a
role property for that. For now this interface would live in the plugins
package and be exported as it needs to be used in oak-segment and
oak-segment-tar. Later we can decide if we need to move it to the SPI
package as a supported extension point.
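
A minimal sketch of what such a provider could look like (interface shape,
name and the 'role' property are assumptions at this point, not a settled
API):

// illustrative only; NodeStore is org.apache.jackrabbit.oak.spi.state.NodeStore
public interface SecondaryNodeStoreProvider {
    NodeStore getNodeStore();
}

// registered with a role so that consumers can filter on it, e.g.
// props.put("role", "secondary");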
Chetan Mehrotra


On Wed, Jun 22, 2016 at 4:44 PM, Stefan Egli  wrote:
> On 22/06/16 12:21, "Chetan Mehrotra"  wrote:
>
>>On Tue, Jun 21, 2016 at 4:52 PM, Julian Sedding 
>>wrote:
>>> Not exposing the secondary NodeStore in the service registry would be
>>> backwards compatible. Introducing the "type" property potentially
>>> breaks existing consumers, i.e. is not backwards compatible.
>>
>>I had similar concern so proposed a new interface as part of OAK-4369.
>>However later with further discussion realized that we might have
>>similar requirement going forward i.e. presence of multiple NodeStore
>>impl so might be better to make setup handle such case.
>>
>>So at this stage we have 2 options
>>
>>1. Use a new interface to expose such "secondary" NodeStore
>>2. OR Use a new service property to distinguish between different roles
>>
>>Not sure which one to go. May be we go for merged i.e. have a new
>>interface as in #1 but also mandate that it provides its "role/type"
>>as a service property to allow client to select correct one
>>
>>Thoughts?
>
> If the 'SecondaryNodeStoreProvider' is a non-public interface which can
> later 'easily' be replaced with another mechanism, then for me this would
> sound more straight forward at this stage as it would not break any
> existing consumers (as mentioned by Julian).
>
> Perhaps once those 'other use cases going forward' of multiple NodeStores
> become more clear, then it might be more obvious as to how the
> generalization into perhaps a type property should look like.
>
> my 2cents,
> Cheers,
> Stefan
>
>


Re: [VOTE] Release Apache Jackrabbit Oak 1.4.4

2016-06-26 Thread Chetan Mehrotra
On Mon, Jun 27, 2016 at 10:43 AM, Amit Jain  wrote:
[X] +1 Release this package as Apache Jackrabbit Oak 1.4.4

Chetan Mehrotra


Re: [Oak origin/1.4] Apache Jackrabbit Oak matrix - Build # 992 - Still Failing

2016-06-27 Thread Chetan Mehrotra
On Sat, Jun 25, 2016 at 10:24 AM, Apache Jenkins Server
 wrote:
> Caused by: java.lang.IllegalArgumentException: No enum constant 
> org.apache.jackrabbit.oak.commons.FixturesHelper.Fixture.SEGMENT_TAR
> at java.lang.Enum.valueOf(Enum.java:238)
> at 
> org.apache.jackrabbit.oak.commons.FixturesHelper$Fixture.valueOf(FixturesHelper.java:45)
> at 
> org.apache.jackrabbit.oak.commons.FixturesHelper.(FixturesHelper.java:58)

The tests are failing due to the above issue. Is this related to the presence
of the new segment-tar module in trunk but not in the branch?

Chetan Mehrotra


Re: [Oak origin/1.4] Apache Jackrabbit Oak matrix - Build # 992 - Still Failing

2016-06-28 Thread Chetan Mehrotra
Thanks for the link. Would follow up on the issue and have it fixed in the branches.
Chetan Mehrotra


On Mon, Jun 27, 2016 at 5:11 PM, Julian Reschke  wrote:
> On 2016-06-27 13:31, Chetan Mehrotra wrote:
>>
>> On Sat, Jun 25, 2016 at 10:24 AM, Apache Jenkins Server
>>  wrote:
>>>
>>> Caused by: java.lang.IllegalArgumentException: No enum constant
>>> org.apache.jackrabbit.oak.commons.FixturesHelper.Fixture.SEGMENT_TAR
>>> at java.lang.Enum.valueOf(Enum.java:238)
>>> at
>>> org.apache.jackrabbit.oak.commons.FixturesHelper$Fixture.valueOf(FixturesHelper.java:45)
>>> at
>>> org.apache.jackrabbit.oak.commons.FixturesHelper.(FixturesHelper.java:58)
>>
>>
>> The test are failing due to above issue. Is this related to presence
>> of new segment-tar module in trunk but not in branch?
>>
>> Chetan Mehrotra
>
>
> -> <https://issues.apache.org/jira/browse/OAK-4475>


[multiplex] - Review the proposed SPI interface MountInfoProvider and Mount for OAK-3404

2016-06-28 Thread Chetan Mehrotra
Hi Team,

As we start integrating the work done related to multiplexing
support into trunk, I would like your thoughts on the new SPI interface
MountInfoProvider [1] being proposed as part of OAK-3404.

This would be used by various parts of Oak to determine the Mount information.

Kindly provide your feedback on the issue.

Chetan Mehrotra
[1] 
https://github.com/rombert/jackrabbit-oak/tree/features/docstore-multiplex/oak-core/src/main/java/org/apache/jackrabbit/oak/spi/mount


Re: svn commit: r1750601 - in /jackrabbit/oak/trunk: oak-segment-tar/ oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/ oak-segment-tar/src/test/java/org/apache/jackrabbit/oak/segment/

2016-06-29 Thread Chetan Mehrotra
Hi Francesco,

On Wed, Jun 29, 2016 at 12:49 PM, Francesco Mari
 wrote:
> Please do not change the "oak.version" property to a snapshot version. If
> your change relies on code that is only available in the latest snapshot of
> Oak, please revert this commit and hold it back until a proper release of
> Oak is performed.

I can do that but want to understand the impact here if we switch to a
SNAPSHOT version.

For example, in the past we had done some changes in Jackrabbit which were
needed in Oak; we then switched to a snapshot version of JR2 and later
reverted to the released version once the JR2 release was done. That has worked
fine so far and we did not have to hold back the feature work for that. So I
want to understand why it should be different here.

Chetan Mehrotra


Re: svn commit: r1750601 - in /jackrabbit/oak/trunk: oak-segment-tar/ oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/ oak-segment-tar/src/test/java/org/apache/jackrabbit/oak/segment/

2016-06-29 Thread Chetan Mehrotra
On Wed, Jun 29, 2016 at 1:25 PM, Francesco Mari
 wrote:
> oak-segment-tar should be releasable at any time. If I had to launch a
quick patch release this morning, I would have to either revert your commit
or postpone my release until Oak is released.

Given the current release frequency on trunk (2 weeks) I do not think
it should be a big problem, and holding off commits breaks the continuity
and increases work. But then that might just be an issue for me!

For now I have reverted the changes from oak-segment-tar

Chetan Mehrotra


OAK-4475 - CI failing on branches due to unknown fixture SEGMENT_TAR

2016-06-29 Thread Chetan Mehrotra
Hi Team,

Some time back the build was failing for branches because of the new trunk-only
fixture usage of SEGMENT_TAR. As this fixture was not present on the
branch it caused the build to fail.

My initial attempt to fix this was to ignore the exception when
FixturesHelper resolves an enum like SEGMENT_TAR on a branch [1]. With this
the build comes out fine, but I have a hunch that the current fix would lead to
all fixtures getting activated and that would cause a waste of time.

A- Which solution to use


So have 2 options

1. Treat SEGMENT_TAR as SEGMENT_MK for branches - This would cause
tests to run 2 times against SEGMENT_MK

2. Create separate build profile for branches

B - Use of nsfixtures system property
==

However, before doing that I am trying to understand how the fixture
gets set. From the CI logs the command that gets fired is

---
/home/jenkins/tools/maven/apache-maven-3.2.1/bin/mvn
-Dnsfixtures=DOCUMENT_NS -Dlabel=Ubuntu -Djdk=jdk1.8.0_11
-Dprofile=integrationTesting clean verify -PintegrationTesting
-Dsurefire.skip.ut=true -Prdb-derby -DREMOVEMErdb.jdbc-
---

It sets the system property 'nsfixtures' to the required fixture. However in
our parent pom we rely on the system property 'fixtures' which defaults to
SEGMENT_MK, and in no place do we override 'fixtures' in our CI. Looking
at all of this it appears to me that currently all tests are only
running against the SEGMENT_MK fixture and the other fixtures are not getting
used. But then the exception should not have come with the usage of
SEGMENT_TAR. So I am missing some connection here in the build process.

From my test it appears that if we specify a system property on the mvn
command line and the same property is configured in maven-surefire-plugin,
then the property specified on the command line is used and the one in pom.xml
is ignored. That would explain why the settings in pom.xml are not used for
the fixture.

So what should we opt for in #A?

My vote would be for A1!

Chetan Mehrotra

[1] 
https://github.com/apache/jackrabbit-oak/commit/319433e9400429592065d4b3997dd31f93b6c549
[2] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-parent/pom.xml#L289


<plugin>
  <artifactId>maven-failsafe-plugin</artifactId>
  <configuration>
    <argLine>${test.opts}</argLine>
    <systemPropertyVariables>
      <known.issues>${known.issues}</known.issues>
      <mongo.host>${mongo.host}</mongo.host>
      <mongo.port>${mongo.port}</mongo.port>
      <mongo.db>${mongo.db}</mongo.db>
      <mongo.db2>${mongo.db2}</mongo.db2>
      <fixtures>${fixtures}</fixtures>
      <derby.stream.error.file>${project.build.directory}/derby.log</derby.stream.error.file>
    </systemPropertyVariables>
  </configuration>
</plugin>

Re: svn commit: r1750809 - /jackrabbit/oak/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LucenePropertyIndex.java

2016-06-30 Thread Chetan Mehrotra
Hi Tommaso,

On Thu, Jun 30, 2016 at 8:20 PM,   wrote:
> Modified:
> 
> jackrabbit/oak/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LucenePropertyIndex.java

Can we have some backing testcase for this? It would ensure future
refactoring does not break this requirement

Chetan Mehrotra


Re: multilingual content and indexing

2016-07-12 Thread Chetan Mehrotra
On Tue, Jul 12, 2016 at 3:53 PM, Lukas Kahwe Smith  wrote:
>> Alternatively, you can create different index definitions for each subtree 
>> (see [1]), e.g. Using the “includedPaths” property. This would lead to 
>> smaller indexes at the downside that you would have to create an index 
>> definition if you add a new language tree.

Another way would be to have your index definition under each node

/content/en/oak:index/fooIndex
/content/jp/oak:index/fooIndex

And have each index config analyzer configured as per the language.

Chetan Mehrotra


Re: svn commit: r1752601 - in /jackrabbit/oak/trunk/oak-segment-tar: pom.xml src/main/java/org/apache/jackrabbit/oak/segment/SegmentWriter.java

2016-07-14 Thread Chetan Mehrotra
On Thu, Jul 14, 2016 at 2:04 PM,   wrote:
>
> +commons-math3

commons-math is a 2.1 MB jar. Would it be possible to avoid embedding
it whole and only have some parts embedded/copied. (See [1] for an
example)

Chetan Mehrotra
[1] https://issues.apache.org/jira/browse/SLING-2361


[proposal] New oak:Resource nodetype as alternative to nt:resource

2016-07-15 Thread Chetan Mehrotra
In most cases where code uses JcrUtils.putFile [1] it leads to
creation of the below content structure:

+ foo.jpg (nt:file)
   + jcr:content (nt:resource)
   - jcr:data

Due to the usage of nt:resource each nt:file node creates an entry in the uuid
index as nt:resource is referenceable [2]. So if a system has 1M
nt:file nodes then we would have 1M entries in /oak:index/uuid, as in
most cases the files are created via [1] and hence all such files are
referenceable.

The nodetype definition for nt:file [3] does not mandate that
jcr:content be nt:resource.

So should we register a new oak:Resource nodetype which is the same as
nt:resource but not referenceable? This would be similar to
oak:Unstructured.

Also, what should we do for [1]? Should we provide an overloaded method
which also accepts a nodetype for the jcr:content node, as [1] itself cannot
default to oak:Resource?
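
For illustration, a rough sketch of what such an overload could look like
(the signature and the behaviour are assumptions, not an agreed API):

import java.io.InputStream;
import javax.jcr.Binary;
import javax.jcr.Node;
import javax.jcr.RepositoryException;

// Sketch only: same idea as JcrUtils.putFile but the caller picks the
// jcr:content nodetype, e.g. "oak:Resource" once that type is registered.
public static Node putFile(Node parent, String name, String mime,
        InputStream data, String contentNodeType) throws RepositoryException {
    Binary binary = parent.getSession().getValueFactory().createBinary(data);
    Node file = parent.hasNode(name)
            ? parent.getNode(name)
            : parent.addNode(name, "nt:file");
    Node content = file.hasNode("jcr:content")
            ? file.getNode("jcr:content")
            : file.addNode("jcr:content", contentNodeType);
    content.setProperty("jcr:mimeType", mime);
    content.setProperty("jcr:data", binary);
    return file;
}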

Chetan Mehrotra
[1] 
https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-jcr-commons/src/main/java/org/apache/jackrabbit/commons/JcrUtils.java#L1062

[2]
[nt:resource] > mix:lastModified, mix:mimeType, mix:referenceable
  primaryitem jcr:data
   - jcr:data (binary) mandatory

[3]

[nt:file] > nt:hierarchyNode
  primaryitem jcr:content
  + jcr:content (nt:base) mandatory


Re: [proposal] New oak:Resource nodetype as alternative to nt:resource

2016-07-18 Thread Chetan Mehrotra
Thanks for the feedback. Opened OAK-4567 to track the change


On Mon, Jul 18, 2016 at 12:14 PM, Angela Schreiber  wrote:
> Additionally or alternatively we could create a separate method (e.g.
> putOakFile
> or putOakResource or something explicitly mentioning the non-referenceable
> nature of the content) that uses 'oak:Resource' and state that it requires
> the
> node type to be registered and will fail otherwise... that would be as easy
> to use as 'putFile', which is IMO important.

@Angela - What about Justin's later suggestion around changing the
current putFile implementation: have it use oak:Resource if present,
otherwise fall back to nt:resource. This can lead to a compatibility
issue though, as the javadoc of putFile says it would use nt:resource.

Chetan Mehrotra


Specifying threadpool name for periodic scheduled jobs (OAK-4563)

2016-07-18 Thread Chetan Mehrotra
Hi Team,

While running Oak in Sling we rely on the Sling Scheduler [1] to execute
the periodic jobs. By default the Sling Scheduler uses a pool of 5 threads
to run all such periodic jobs in the system. Recently we saw an issue
(OAK-4563) where for some reason the pool got exhausted for a long time,
which prevented the async indexing job from running for a long time and
hence affected the query results.

To address that, Sling now provides a new option (SLING-5831) where one
can specify the pool name to be used to execute a specific job. So we
can specify a custom pool to be used for Oak related jobs.

Now currently in Oak we use following types of periodic jobs

1. Async indexing - (Cluster Singleton)
2. Document Store - Journal GC (Cluster Singleton)
3. Document Store - LastRevRecovery
4. Statistic Collection - For timeseries data update in ChangeProcessor,
SegmentNodeStore GCMonitor

Now should we use

A - one single pool for all of the above
B - use the dedicated pool only for 1-3 (the default pool of 5 would keep
serving the rest). So even if #2 and #3 are running it would not hamper #1

This assumes #4 is not that critical to run and may consist of lots of jobs.

My suggestion would be to go for #B.
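
For illustration, a rough sketch of how a periodic Oak job could be registered
against a dedicated pool via the scheduler whiteboard (the property names, in
particular 'scheduler.threadPool', are my assumption of what SLING-5831
exposes; please double check before relying on them):

import java.util.Dictionary;
import java.util.Hashtable;

Dictionary<String, Object> props = new Hashtable<>();
props.put("scheduler.name", "AsyncIndexUpdate");   // illustrative job name
props.put("scheduler.period", 5L);                 // run every 5 seconds
props.put("scheduler.concurrent", false);
props.put("scheduler.threadPool", "oak");          // dedicated pool instead of the default one
bundleContext.registerService(Runnable.class, asyncIndexTask, props);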

Chetan Mehrotra
[1] 
https://sling.apache.org/documentation/bundles/scheduler-service-commons-scheduler.html


Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)

2016-07-19 Thread Chetan Mehrotra
On Tue, Jul 19, 2016 at 12:54 PM, Michael Dürig  wrote:
> For blocking or time intensive tasks I would go for a dedicated thread pool.

So wrt current issue that means option #B ?

Chetan Mehrotra


Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)

2016-07-19 Thread Chetan Mehrotra
On Tue, Jul 19, 2016 at 1:21 PM, Michael Dürig  wrote:
> Not sure as I'm confused by your description of that option. I don't
> understand which of 1, 2, 3 and 4 would run in the "default pool" and which
> should run in its own dedicated pool.

#1, #2 and #3 would run in a dedicated pool, all using the same pool.
The pool name would be 'oak'. Also see OAK-4563 for the patch.

For #4 the default pool would be used as those are non-blocking and
short tasks.

Chetan Mehrotra


Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)

2016-07-19 Thread Chetan Mehrotra
On Tue, Jul 19, 2016 at 1:44 PM, Stefan Egli  wrote:
> I'd go for #A to limit cross-effects between oak and other layers.

Note that for #4 there can be multiple tasks scheduled. So if a system
has 100 JCR listeners then there would be 1 task per listener to manage
the time series stats. These should be quick and non-blocking though.

All other tasks are much more critical for the repository to function
properly. Hence the thought to go for #B where we have a dedicated pool
for those 'n' tasks, where n is much smaller, i.e. the number of async lanes
+ 2 from DocumentNodeStore so far. So it's easy to size.

Chetan Mehrotra


Re: Why is nt:resource referencable?

2016-07-20 Thread Chetan Mehrotra
On Wed, Jul 20, 2016 at 2:49 PM, Bertrand Delacretaz
 wrote:
> but the JCR spec (JSR 283 10 August 2009) only has
>
>   [nt:resource] > mix:mimeType, mix:lastModified
> primaryitem jcr:data
> - jcr:data (BINARY) mandatory

That's interesting. I did not know it's not mandated in JCR 2.0. However it
looks like for backward compatibility we need to support it. See [1]
where this was changed.

@Marcel - I did not understand JCR-2170 properly. But any chance we
can switch to the newer version of nt:resource, not modify existing
nodes, and let the new definition be enforced only on new nodes?

Chetan Mehrotra
[1] 
https://issues.apache.org/jira/browse/JCR-2170?focusedCommentId=12754941&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12754941


Re: Why is nt:resource referencable?

2016-07-20 Thread Chetan Mehrotra
On Wed, Jul 20, 2016 at 4:04 PM, Marcel Reutegger  wrote:
> Maybe we would keep the jcr:uuid property on the referenceable node and add
> the mixin?

What if we do not add any mixin and just have the jcr:uuid property
present? The node would anyway be indexed so search would still work.
Not sure if the API semantics require that nodes looked up by UUID have to
be referenceable.

For now I think oak:Resource is the safest way. But just exploring other
options if possible!


Chetan Mehrotra


Re: Why is nt:resource referencable?

2016-07-20 Thread Chetan Mehrotra
Thanks for all the details Marcel and Angela. That helps ... so it looks
like oak:Resource is the way to go.


On Wed, Jul 20, 2016 at 6:17 PM, Angela Schreiber  wrote:
> I am pretty sure that there was good
> intention behind the change in nt-definition between JCR 1.0 and
> JCR 2.0... but maybe not fully thought through when it comes to
> backwards compatibility

Digging further it appears this concern was raised but not answered [1]

===
Since referenceable nodes are optional the following changes should be
made (decided at F2F):

nt:resource change to NOT referenceable
mix:simpleVersionable change to NOT referenceable
mix:versionable change to referenceable
nt:frozenNode property jcr:frozenUuid change to NOT mandatory
===


Chetan Mehrotra
[1] https://java.net/jira/browse/JSR_283-428


Using same index definition for both async and sync indexing

2016-08-02 Thread Chetan Mehrotra
Hi Team,

Currently one can set the "async" flag on an index definition to indicate
whether a given index should be effective for synchronous commits or be
used for async indexing. For the Hybrid Lucene indexing case [1] I need
a way where the same index definition gets used in both.

So if an index definition at /oak:index/fooLuceneIndex is marked as
"hybrid" [2] then we need to have the LuceneIndexEditorProvider invoked for
both

1. Commit time - here the editor would just create the Document and not add it to the index
2. Async indexing time - here the currently implemented approach of
indexing would happen

And in doing that the LuceneIndexEditorProvider needs to be informed
in which mode it is being invoked. So to support that we need some
enhancement in the IndexUpdate logic whereby the same index definition is
used in both modes and the editor knows the indexing mode.

Probably this would require a new interface for IndexEditorProvider.

So looking for thoughts on how this can be implemented!

Chetan Mehrotra
[1] 
https://issues.apache.org/jira/browse/OAK-4412?focusedCommentId=15405340&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15405340

[2] Naming convention to be decided/discussed


Way to capture metadata related to commit as part of CommitInfo from within CommitHook

2016-08-02 Thread Chetan Mehrotra
Hi Team,

Currently as part of a commit the caller can provide a CommitInfo
instance which captures some metadata related to the commit being
performed. Note that the CommitInfo instance passed to the NodeStore is
immutable.

For some usecases we need a way to add some more metadata to the ongoing
commit from within the CommitHook.

OAK-4586 - Collect affected node types on commit

Here we need to record the nodetypes of nodes which got modified as part
of the current commit.

OAK-4412 - Lucene hybrid index

Here we want to generate Documents for the modified nodestates (per index
definition) and "attach" them to the current commit.

This meta information would later be used by an Observer. Currently there is
no standard way in the API to achieve that.

#A - Probably we can introduce a new type CommitAttributes which can be
attached to the CommitInfo and which can be modified by the CommitHooks.
The CommitAttributes can then later be accessed by the Observer.

OR

#B - We can just add a mutable attribute map to the CommitInfo
instance and that can be populated by the CommitHooks.
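
To make #A a bit more concrete, a minimal sketch (type and method names are
assumptions for illustration only):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.annotation.CheckForNull;

// Illustrative sketch of approach #A
public class CommitAttributes {
    private final Map<String, Object> attrs = new ConcurrentHashMap<>();

    // called by a CommitHook/Editor while the commit is being processed
    public void set(String name, Object value) {
        attrs.put(name, value);
    }

    // read later, e.g. by an Observer, once the commit is visible
    @CheckForNull
    public Object get(String name) {
        return attrs.get(name);
    }
}

// The CommitInfo would then expose the (mutable) attributes instance, e.g.
// commitInfo.getCommitAttributes().set("affectedNodeTypes", types);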

Thoughts on which approach to go forward

Chetan Mehrotra


Re: Using same index definition for both async and sync indexing

2016-08-03 Thread Chetan Mehrotra
On Wed, Aug 3, 2016 at 2:23 PM, Alex Parvulescu
 wrote:
> extend the current index definition
> for the 'async' property and allow multiple values.

That should work and looks like a natural extension of the flag. Just
that having an empty value in the array does not look good (might confuse
people in the UI). So we can have a marker value instead of the empty one.
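
For illustration, such a definition could then look roughly like this (the
"sync" value is just a placeholder for whatever marker we settle on; the
naming convention is still to be decided):

/oak:index/fooLuceneIndex
  - type = "lucene"
  - async = ["async", "sync"]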

>What about overloading the 'IndexUpdateCallback' with a 'isSync()' method
> coming from the 'IndexUpdate' component. This will reduce the change
> footprint and only components that need to know this information will use
> it.

That can be done. Going forward we also need to pass in the CommitInfo or
something like that (see the other mail).

Another option can be to have a new interface for IndexEditorProvider
(on the same lines as AdvancedQueryIndex > QueryIndex). So an editor
implementing the new interface would have the extra params passed in. And
there we introduce something like IndexingContext which folds in the
IndexUpdateCallback, indexing mode, index path, CommitInfo etc.

Chetan Mehrotra


Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook

2016-08-03 Thread Chetan Mehrotra
So would it be ok to make the map within CommitInfo mutable ?
Chetan Mehrotra


On Wed, Aug 3, 2016 at 7:29 PM, Michael Dürig  wrote:
>
>>
>> #A -Probably we can introduce a new type CommitAttributes which can be
>> attached to CommitInfo and which can be modified by the CommitHooks.
>> The CommitAttributes can then later be accessed by Observer
>
>
> This is already present via the CommitInfo.info map. It is even used in a
> similar way. See CommitInfo.getPath() and its usages. AFAIU the only part
> where your cases would differ is that the information is assembled by some
> commit hooks instead of being provided at the point the commit was
> initiated.
>
>
> Michael


Re: Using same index definition for both async and sync indexing

2016-08-03 Thread Chetan Mehrotra
On Wed, Aug 3, 2016 at 7:52 PM, Alex Parvulescu
 wrote:
> sounds interesting, this looks like a good option.
>

Now comes the hard part ... what should be the name of this new
interface ;) ContextualIndexEditorProvider?

Chetan Mehrotra


Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook

2016-08-03 Thread Chetan Mehrotra
That would depend on the CommitHook impl, which client code would not
be aware of. And the commit hook would also only know as the commit traversal
is done. So it needs to be some mutable state.
Chetan Mehrotra


On Wed, Aug 3, 2016 at 8:27 PM, Michael Dürig  wrote:
>
> Couldn't we keep the map immutable and instead add some "WhateverCollector"
> instances as values? E.g. add a AffectedNodeTypeCollector right from the
> beginning?
>
> Michael
>
>
>
> On 3.8.16 4:06 , Chetan Mehrotra wrote:
>>
>> So would it be ok to make the map within CommitInfo mutable ?
>> Chetan Mehrotra
>>
>>
>> On Wed, Aug 3, 2016 at 7:29 PM, Michael Dürig  wrote:
>>>
>>>
>>>>
>>>> #A -Probably we can introduce a new type CommitAttributes which can be
>>>> attached to CommitInfo and which can be modified by the CommitHooks.
>>>> The CommitAttributes can then later be accessed by Observer
>>>
>>>
>>>
>>> This is already present via the CommitInfo.info map. It is even used in a
>>> similar way. See CommitInfo.getPath() and its usages. AFAIU the only part
>>> where your cases would differ is that the information is assembled by
>>> some
>>> commit hooks instead of being provided at the point the commit was
>>> initiated.
>>>
>>>
>>> Michael


Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook

2016-08-03 Thread Chetan Mehrotra
On Wed, Aug 3, 2016 at 8:57 PM, Michael Dürig  wrote:
> I would suggest to add an new, internal mechanism to CommitInfo for your
> purpose.

So introduce a new CommitAttributes instance which would be returned
by CommitInfo ... ?

Chetan Mehrotra


Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook

2016-08-03 Thread Chetan Mehrotra
Opened OAK-4640 to track this
Chetan Mehrotra


On Wed, Aug 3, 2016 at 9:36 PM, Michael Dürig  wrote:
>
>
> On 3.8.16 5:58 , Chetan Mehrotra wrote:
>>
>> On Wed, Aug 3, 2016 at 8:57 PM, Michael Dürig  wrote:
>>>
>>> I would suggest to add an new, internal mechanism to CommitInfo for your
>>> purpose.
>>
>>
>> So introduce a new CommitAttributes instance which would be returned
>> by CommitInfo ... ?
>
>
> Probably the best of all ugly solutions yes ;-) (Meaning I don't have a
> better idea neither...)
>
> Michael
>
>>
>> Chetan Mehrotra
>>
>


Re: Using same index definition for both async and sync indexing

2016-08-03 Thread Chetan Mehrotra
Opened OAK-4641 for this enhancement
Chetan Mehrotra


On Wed, Aug 3, 2016 at 8:00 PM, Chetan Mehrotra
 wrote:
> On Wed, Aug 3, 2016 at 7:52 PM, Alex Parvulescu
>  wrote:
>> sounds interesting, this looks like a good option.
>>
>
> Now comes the hard part ... what should be the name of this new
> interface ;) ContextualIndexEditorProvider?
>
> Chetan Mehrotra


Provide a way to pass indexing related state to IndexEditorProvider (OAK-4642)

2016-08-04 Thread Chetan Mehrotra
Hi Team,

As a follow up to the previous mail around "Using same index definition
for both async and sync indexing" I wanted to discuss the next step. We
need to provide a way to pass indexing related state to
IndexEditorProvider (OAK-4642).

Over the period of time I have seen the need for extra state like

1. reindexing - currently the index implementations use some heuristic
like checking whether the before root state is empty to determine if they are
running in reindexing mode
2. indexing mode - sync or async
3. index path of the index (see OAK-4152)
4. CommitInfo (see OAK-4640)

For #1 and #3 we have done some kind of workaround but it would be
better to have first class support for that.

So we would need to introduce some sort of IndexingContext and have
the api for IndexEditorProvider like below

=
@CheckForNull
Editor getIndexEditor(
@Nonnull String type, @Nonnull NodeBuilder definition,
@Nonnull NodeState root,
@Nonnull IndexingContext context) throws CommitFailedException;
=

To introduce such a change I see 3 options

* O1 - Introduce a new interface which takes an {{IndexingContext}}
instance which provides access to such datapoints. This would require
some broader changes.
** Wherever the IndexEditorProvider is invoked it would need to check
if the instance implements the new interface. If yes then the new method needs
to be used.

Overall it introduces noise.

* O2 - Here we can introduce such data points as part of the callback
interface. With this we would need to implement such methods in the places
where code constructs the callback.

* O3 - Make a backward incompatible change and just modify the
existing interface and adapt the various implementations.

I am in favour of going for O3 and making this backward incompatible change.

Thoughts?

Chetan Mehrotra


Re: Provide a way to pass indexing related state to IndexEditorProvider (OAK-4642)

2016-08-04 Thread Chetan Mehrotra
I have updated OAK-4642 with one more option.

===
O4 - Similar to O2, but here instead of modifying the existing
IndexUpdateCallback we can introduce a new interface
ContextualCallback which extends IndexUpdateCallback and provides
access to the IndexingContext. The editor provider implementation can then
check if the callback implements this new interface, cast it
and access the context. So only those clients which are interested in the
new capability make use of it.
===
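
For illustration, a minimal sketch of what O4 could look like (interface and
method names are assumptions at this point, see OAK-4642 for the final shape):

// Marker extension of the existing callback; editors can check for it and
// cast to get hold of the indexing context.
public interface ContextualCallback extends IndexUpdateCallback {
    IndexingContext getIndexingContext();
}

// Holds the extra state listed above (points 1-4)
public interface IndexingContext {
    String getIndexPath();
    CommitInfo getCommitInfo();
    boolean isReindexing();
    boolean isAsync();
}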

So provide your feedback there or in this thread
Chetan Mehrotra


On Thu, Aug 4, 2016 at 12:35 PM, Chetan Mehrotra
 wrote:
> Hi Team,
>
> As a follow up to previous mail around "Using same index definition
> for both async and sync indexing" wanted to discuss the next step. We
> need to provide a way to pass indexing related state to
> IndexEditorProvider (OAK-4642)
>
> Over the period of time I have seen need for extra state like
>
> 1. reindexing - Currently the index implementation use some heuristic
> like check before root state being empty to determine if they are
> running in reindexing mode
> 2. indexing mode - sync or async
> 3. index path of the index (see OAK-4152)
> 4. CommitInfo (see OAK-4640)
>
> For #1 and #3 we have done some kind of workaround but it would be
> better to have a first class support for that.
>
> So we would need to introduce some sort of IndexingContext and have
> the api for IndexEditorProvider like below
>
> =
> @CheckForNull
> Editor getIndexEditor(
> @Nonnull String type, @Nonnull NodeBuilder definition,
> @Nonnull NodeState root,
> @Nonnull IndexingContext context) throws CommitFailedException;
> =
>
> To introduce such a change I see 3 options
>
> * O1 - Introduce a new interface which takes an {{IndexingContext}}
> instance which provide access to such datapoints. This would require
> some broader change
> ** Whereever the IndexEditorProvider is invoked it would need to check
> if the instance implements new interface. If yes then new method needs
> to be used
>
> Overall it introduces noise.
>
> * O2 - Here we can introduce such data points as part of callback
> interface. With this we would need to implement such methods in places
> where code constructs the callback
>
> * O3 - Make a backward incompatible change and just modify the
> existing interface and adapt the various implementation
>
> I am in favour of going for O3 and make this backward compatible change
>
> Thoughts?
>
> Chetan Mehrotra


Re: Property index replacement / evolution

2016-08-07 Thread Chetan Mehrotra
Would add one more

4. Write throughput degradation - For non-unique property indexes which
make use of ContentMirrorStoreStrategy we have seen a loss in
throughput due to contention arising from conflicts while
entries are made in the index (OAK-2673, OAK-3380).


Chetan Mehrotra


On Fri, Aug 5, 2016 at 10:34 PM, Michael Marth  wrote:
> Hi,
>
> I have noticed OAK-4638 and OAK-4412 – which both deal with particular 
> problematic aspects of property indexes. I realise that both issues deal with 
> slightly different problems and hence come to different suggested solutions.
> But still I felt it would be good to take a holistic view on the different 
> problems with property indexes. Maybe there is a unified approach we can take.
>
> To my knowledge there are 3 areas where property indexes are problematic or 
> not ideal:
>
> 1. Number of nodes: Property indexes can create a large number of nodes. For 
> properties that are very common the number of index nodes can be almost as 
> large as the number of the content nodes. A large number of nodes is not 
> necessarily a problem in itself, but if the underlying persistence is e.g. 
> MongoDB then those index nodes (i.e. MongoDB documents) cause pressure on 
> MongoDB’s mmap architecture which in turn affects reading content nodes.
>
> 2. Write performance: when the persistence (i.e. MongoDB) and Oak are “far 
> away from each other” (i.e. high network latency or low throughput) then 
> synchronous property indexes affect the write throughput as they may cause 
> the payload to double in size.
>
> 3. I have no data on this one – but think it might be a topic: property index 
> updates usually cause commits to have / as the commit root. This results on 
> pressure on the root document.
>
> Please correct me if I got anything wrong  or inaccurate in the above.
>
> My point is, however, that at the very least we should have clarity which one 
> go the items above we intend to tackle with Oak improvements. Ideally we 
> would have a unified approach.
> (I realize that property indexes come in various flavours like unique index 
> or not, which makes the discussion more complex)
>
> my2c
> Michael


Re: Usecases around Binary handling in Oak

2016-08-10 Thread Chetan Mehrotra
This can be done at the Sling level, yes. But then any code which makes use
of the JCR API would not be able to access the binary. One way to have it
implemented at the Oak level would be to introduce some sort of
'ExternalBinary' and open up an extension in the BlobStore implementation
to delegate the binary lookup call to some provider. It just needs to
honor the contracts of the Binary and Blob APIs.

That part is easy.

The problem comes on the management side where you need to decide on GC.
Probably Oak would need to expose an API to provide a list (iterator) of
all such external binaries it refers to, and then the external system
can manage the GC.
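
Purely as an illustration of the idea (no such extension point exists in Oak;
all names below are made up for this sketch):

import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.annotation.CheckForNull;

public interface ExternalBinaryProvider {

    // resolve the content of a binary that lives outside the Oak managed DataStore
    @CheckForNull
    InputStream getInputStream(String externalBlobId) throws IOException;

    // enumerate the external binaries still referenced, so that the owning
    // system can run its own garbage collection
    Iterator<String> getReferencedBlobIds();
}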
Chetan Mehrotra


On Wed, Aug 10, 2016 at 3:26 PM, Ian Boston  wrote:
> Hi,
>
> On 10 August 2016 at 10:29, Bertrand Delacretaz 
> wrote:
>
>> Hi,
>>
>> On Tue, Jul 26, 2016 at 4:36 PM, Bertrand Delacretaz
>>  wrote:
>> > ...I've thought about adding an "adopt-a-binary" feature to Sling
>> > recently, to allow it to serve existing (disk or cloud) binaries along
>> > with those stored in Oak
>>
>> I just noticed that the Git Large File Storage project uses a similar
>> approach, it "replaces large files such as audio samples, videos,
>> datasets, and graphics with text pointers inside Git, while storing
>> the file contents on a remote server". Maybe there are ideas to
>> steal^H^H^H^H^H borrow from there.
>>
>
> Would that be something to do at the Sling level on upload of a large file?
>
> I am working on a patch to use the Commons File Upload streaming API in
> Sling servlets/post as a Operation impl.
> I know this is oak-dev, so the question might not be appropriate here.
>
> Best Regards
> Ian
>
>
>>
>> -Bertrand
>>
>> [1] https://git-lfs.github.com/
>>


Re: Oak Indexing. Was Re: Property index replacement / evolution

2016-08-11 Thread Chetan Mehrotra
Couple of points around the motivation, target usecase around Hybrid
Indexing and Oak indexing in general.

Based on my understanding of various deployments, any application
based on Oak has 2 types of query requirements:

QR1. Application query - These mostly involve some property
restrictions and are invoked by the code itself to perform some operation.
The property involved here in most cases would be sparse, i.e. present
in a small subset of the whole repository content. Such queries need to be
very fast and they might be invoked very frequently. Such queries
should also be more accurate and the result should not lag the repository
state much.

QR2. User provided query - These queries would consist of both or
either of property restrictions and fulltext constraints. The target
nodes may form a majority of the overall repository content. Such
queries need to be fast but, being user driven, need not be very fast.

Note that speed criteria is very subjective and relative here.

Further, Oak needs to support deployments

1. On a single setup - for dev, prod on SegmentNodeStore
2. Cluster setup on premise
3. Deployment in some DataCenter

So Oak should enable deployments where for smaller setups it does not
require any third-party system while still allowing plugging in a dedicated
system like ES/Solr if the need arises. So both usecases need to be
supported.

And further, even if it has access to such a third-party server it might
be fine to rely on embedded Lucene for #QR1 and just delegate queries
under #QR2 to the remote one. This would ensure that query results are still
fast for usage falling under #QR1.

Hybrid Index Usecase
-

So far for #QR1 we only had property indexes and, to an extent, Lucene
based property indexes where the results lag the repository state and the lag
might be significant depending on load.

Hybrid indexes aim to support queries under #QR1 and can be seen as a
replacement for the existing non-unique property indexes. Such indexes
would have lower storage requirements and would not put much load on the
remote storage for execution. It is not meant as a replacement for
ES/Solr but intends to address a different type of usage.

Very large Indexes
-

For deployments having a very large repository, Solr or ES based indexes
would be preferable and there oak-solr can be used (some day oak-es!).

So in brief, Oak should be self sufficient for smaller deployments and
still allow plugging in Solr/ES for large deployments, and there also
provide a choice to the admin to configure a subset of indexes for such
usage depending on the size.






Chetan Mehrotra


On Thu, Aug 11, 2016 at 1:59 PM, Ian Boston  wrote:
> Hi,
>
> On 11 August 2016 at 09:14, Michael Marth  wrote:
>
>> Hi Ian,
>>
>> No worries - good discussion.
>>
>> I should point out though that my reply to Davide was based on a
>> comparison of the current design vs the Jackrabbit 2 design (in which
>> indexes were stored locally). Maybe I misunderstood Davide’s comment.
>>
>> I will split my answer to your mail in 2 parts:
>>
>>
>> >
>> >Full text extraction should be separated from indexing, as the DS blobs
>> are
>> >immutable, so is the full text. There is code to do this in the Oak
>> >indexer, but it's not used to write to the DS at present. It should be
>> done
>> >in a Job, distributed to all nodes, run only once per item. Full text
>> >extraction is hugely expensive.
>>
>> My understanding is that Oak currently:
>> A) runs full text extraction in a separate thread (separate form the
>> “other” indexer)
>> B) runs it only once per cluster
>> If that is correct then the difference to what you mention above would be
>> that you would like the FT indexing not be pinned to one instance but
>> rather be distributed, say round-robin.
>> Right?
>>
>
>
> Yes.
>
>
>>
>>
>> >Building the same index on every node doesn't scale for the reasons you
>> >point out, and eventually hits a brick wall.
>> >http://lucene.apache.org/core/6_1_0/core/org/apache/
>> lucene/codecs/lucene60/package-summary.html#Limitations.
>> >(Int32 on Document ID per index). One of the reasons for the Hybrid
>> >approach was the number of Oak documents in some repositories will exceed
>> >that limit.
>>
>> I am not sure what you are arguing for with this comment…
>> It sounds like an argument in favour of the current design - which is
>> probably not what you mean… Could you explain, please?
>>
>
> I didn't communicate that very well.
>
> Currently Lucene (6.1) has a limit of Int32 to the number of documents it
> can store in an index, IIUC There is a long term desire to increase that
> but using Int64 but no long term commitment as its probably significant
> work given arrays in Java are indexed with Int32.
>
> The Hybrid approach doesn't help the potential Lucene brick wall, but one
> motivation for looking at it was the number of Oak Documents including
> those under /oak:index which is, in some cases, approaching that limit.
>
>
>
>>
>>
>> Thanks!
>> Michael
>>


Re: Oak Indexing. Was Re: Property index replacement / evolution

2016-08-11 Thread Chetan Mehrotra
On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston  wrote:
> Both Solr Cloud and ES address this by sharding and
> replicating the indexes, so that all commits are soft, instant and real
> time. That introduces problems.
...
> Both Solr Cloud and ES address this by sharding and
> replicating the indexes, so that all commits are soft, instant and real
> time.

This would really be useful. However I have a couple of aspects to clarify.

Index Update Gurantee


Let's say a commit succeeds and then we update the index, and the index
update fails for some reason. Would that update then be missed, or
can there be some mechanism to recover? I am not very sure about the WAL
here; that may be the answer, but still confirming.

In Oak, with the way the async index update works based on checkpoints, it is
ensured that the index would "eventually" contain the right data and no
update would be lost. If there is a failure in the index update then that cycle
would fail and the next cycle would start again from the same base state.

Order of index update
-

Let's say I have 2 cluster nodes where the same node is being modified:

Original state /a {x:1}

Cluster Node N1 - /a {x:1, y:2}
Cluster Node N2 - /a {x:1, z:3}

End State /a {x:1, y:2, z:3}

At the Oak level both commits would succeed as there is no conflict.
However N1 and N2 would not be seeing each other's updates immediately,
and that would depend on the background read. So in this case what would
the index update look like?

1. Would the index update for specific paths go to some master which would
order the updates?
2. Or would it end up with either of {x:1, y:2} or {x:1, z:3}?

Here the current async index update logic ensures that it sees the
eventually expected order of changes and hence would be consistent
with the repository state.

Backup and Restore
---

Would the backup now involve a backup of the ES index files from each
cluster node? Or, assuming full replication, would it involve a backup of the
files from any one of the nodes? Would the backup be in sync with the last
changes done in the repository (assuming a sudden shutdown where changes got
committed to the repository but not yet to any index)?

Here the current approach of storing index files as part of the MVCC storage
ensures that the index state is consistent with some "checkpointed" state in
the repository. And post restart it would eventually catch up with the
current repository state and hence would not require a complete rebuild
of the index in case of unclean shutdowns.


Chetan Mehrotra


Re: Oak Indexing. Was Re: Property index replacement / evolution

2016-08-11 Thread Chetan Mehrotra
> https://github.com/ieb/oak-es

btw this looks interesting and something we can build upon. This can
benefit from a refactoring of LuceneIndexEditor to separate the logic
of interpreting the Oak indexing config during editor invocation from
constructing Lucene document. If we decouple that logic then it would
be possible to plugin in a ES Editor which just converts those
properties per ES requirement. Hence it gets all benefits of
aggregation, relative property implementation etc (which is very Oak
specific stuff). This effort has been discussed but we never got time
to do that so far. Something on the lines which you are doing at [2]

Another approach: with the recent refactoring done in OAK-4566 my plan
was to plug in an ES based LuceneIndexWriter (ignore the name for now!)
and convert the Lucene Document to some ES Document counterpart, and
then provide just the query implementation. This would also allow us to
reuse most of the test cases we have in oak-lucene.
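A very rough sketch of that conversion step (the EsClient interface
below is a stand-in, not a real Elasticsearch API; the point is only to
show reusing whatever Lucene Document the existing editor logic
produces):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;

public class EsDocumentConverter {

    // Stand-in for whatever ES client abstraction would be used.
    interface EsClient {
        void index(String id, Map<String, Object> source);
    }

    // Flatten the Lucene Document built by the editor into a JSON-style
    // map keyed by field name and hand it to ES, using the node path as id.
    void write(EsClient client, String path, Document doc) {
        Map<String, Object> source = new HashMap<String, Object>();
        for (IndexableField field : doc.getFields()) {
            Object value = field.numericValue() != null
                    ? field.numericValue()
                    : field.stringValue();
            source.put(field.name(), value);
        }
        client.index(path, source);
    }
}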

Chetan Mehrotra
[2] 
https://github.com/ieb/oak-es/blob/master/src/main/java/org/apache/jackrabbit/oak/plusing/index/es/index/take2/ESIndexEditorContext.java



Re: Oak Indexing. Was Re: Property index replacement / evolution

2016-08-11 Thread Chetan Mehrotra
On Thu, Aug 11, 2016 at 5:19 PM, Ian Boston  wrote:
> correct.
> Documents are shared by ID so all updates hit the same shard.
> That may result in network traffic if the shard is not local.

Focusing on the ordering part as that is the most critical aspect
compared to the others. (Backup and restore with a sharded index is a
separate problem, to be discussed later.)

So even if there is a single master for a given path, how would it
order the changes, given that local changes only give a partial view of
the end state?

Also, in such a setup would each query need to consider multiple shards
for the final result, or would each node "eventually" sync index changes
from the other nodes (complete replication) so that a query only uses
the local index?

For me, ensuring consistency between how index updates are sent to ES
and the Oak view of changes was the kind of blocking requirement that
prevented parallelizing the indexing process. It needs to be ensured
that for concurrent commits the end result in the index is in sync with
the repository state.

The current single-threaded async index update avoids all such race conditions.

Chetan Mehrotra


Re: Oak Indexing. Was Re: Property index replacement / evolution

2016-08-12 Thread Chetan Mehrotra
On Thu, Aug 11, 2016 at 7:33 PM, Ian Boston  wrote:
> That probably means the queue should only
> contain pointers to Documents and only index the Document as retrieved. I
> dont know if that can ever work.

That would not work, as what a document looks like would vary across
cluster nodes, and what is to be considered a valid entry is also not
defined at that level.

> Run a single thread on the master, that indexes into a co-located ES
> cluster.

While keeping things simple, that looks like the safe way.

> BTW, how does Hybrid manage to parallelise the indexing and maintain
> consistency ?

Hybrid indexing does not affect the async indexes. Under this approach
each cluster node maintains its own local indexes, which only contain
local changes [1]. These indexes are not aware of the similar indexes on
other cluster nodes. Further, the local indexes are supposed to only
contain entries from the last async indexing cycle; older entries are
purged [2]. A query then consults both indexes (an IndexSearcher backed
by a MultiReader: 1 reader from the async index and 1 (or 2) from the
local index).
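For illustration, the combined view looks roughly like this at the
Lucene level (Lucene 4.x API, index paths made up):

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class HybridSearcherExample {

    // One reader over the persisted async index and one over the
    // node-local index; MultiReader presents them as a single logical
    // index to the searcher.
    IndexSearcher openSearcher() throws Exception {
        DirectoryReader asyncReader = DirectoryReader.open(
                FSDirectory.open(new File("/repo/index/lucene-async")));
        DirectoryReader localReader = DirectoryReader.open(
                FSDirectory.open(new File("/repo/index/lucene-local-1")));
        return new IndexSearcher(new MultiReader(asyncReader, localReader));
    }
}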

Also note that the QueryEngine would enforce and re-evaluate the
property restrictions. So even if the index has an entry based on an old
state, the QE would filter it out if it does not match the criteria per
the current repository state. The aim here is to have the index provide
a superset of the result set.

In all this the async index logic remains the same (single threaded) and
based on diff, so it would remain consistent with the repository state.

Chetan Mehrotra
[1] They might also contain entries which are determined based on an
external diff. Read [3] for details
[2] Purging here is done by maintaining a different local index copy for
each async indexing cycle. At most 2 indexes are retained and older
indexes are removed. This keeps the index small
[3] 
https://issues.apache.org/jira/browse/OAK-4412?focusedCommentId=15405340&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15405340


Re: normalising the rdb database schema

2016-08-16 Thread Chetan Mehrotra
Hi Tomek,

I like the idea of revisiting our current schema based on the usage so
far. However, a couple of points around potential issues with such a
normalized approach:

- This approach would lead to a thin and long table. As noted in
[1], in a small repo of ~14 M nodes we have ~26 M properties. With
multiple revisions (GC takes some time) this can go higher. This would
then increase the memory requirement for the id index, and memory
consumption increases further with an id+key+revision index. For any db
to perform optimally the index should fit in RAM. So such a design would
possibly reduce the max size of repository which can be supported
(compared to the current one) for a given amount of memory.

- The read for a specific id can be done in 1 remote call, but that
would involve a select across multiple rows, which might increase the
time taken as it would involve 'm' index lookups and then 'm' reads of
row data for any node having 'n' properties (m > n assuming multiple
revisions per property are present).
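To make that read pattern concrete, a toy JDBC sketch against an
in-memory H2 database (table and column names are purely illustrative,
not a proposed schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class NormalizedReadExample {

    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:h2:mem:oak")) {
            try (Statement s = c.createStatement()) {
                // One row per (id, property name, revision) instead of
                // one serialized blob per document.
                s.execute("CREATE TABLE NODEPROPS("
                        + "ID VARCHAR(512), PROP VARCHAR(128), "
                        + "REV VARCHAR(64), VAL VARCHAR(4000))");
                s.execute("CREATE INDEX NODEPROPS_ID ON NODEPROPS(ID)");
            }
            // Reading one document is still a single remote call, but on
            // the db side it fans out to m index lookups and m row reads.
            try (PreparedStatement ps = c.prepareStatement(
                    "SELECT PROP, REV, VAL FROM NODEPROPS WHERE ID = ?")) {
                ps.setString(1, "2:/content/foo");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("PROP") + "@"
                                + rs.getString("REV") + " = "
                                + rs.getString("VAL"));
                    }
                }
            }
        }
    }
}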

Maybe we should explore the JSON support being introduced in multiple
dbs: DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySQL [6]. The
problem here is that we would need DB-specific implementations, which
also increases the testing effort!

> we can better use the database features, as now the DBE is aware about the 
> document internal structure (it’s not a blob anymore). Eg. we can fetch only 
> a few properties.

In most cases the properties stored in the blob part of the db row are
read as a whole anyway.

Chetan Mehrotra
[1] https://issues.apache.org/jira/browse/OAK-4471
[2] 
http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/
[3] https://msdn.microsoft.com/en-in/library/dn921897.aspx
[4] https://docs.oracle.com/database/121/ADXDB/json.htm
[5] https://www.postgresql.org/docs/9.3/static/functions-json.html
[6] https://dev.mysql.com/doc/refman/5.7/en/json.html


On Wed, Aug 17, 2016 at 7:19 AM, Michael Marth  wrote:
> Hi Tomek,
>
> I like the idea (agree with Vikas’ comments / cautions as well).
>
> You are hinting at expected performance differences (maybe faster or slower 
> than the current approach). That would probably be worthwhile to investigate 
> in order to assess your idea.
>
> One more (hypothetical at this point) advantage of your approach: we could 
> utilise DB-native indexes as a replacement for property indexes.
>
> Cheers
> Michael
>
>
>
> On 16/08/16 07:42, "Tomek Rekawek"  wrote:
>
>>Hi Vikas,
>>
>>thanks for the reply.
>>
>>> On 16 Aug 2016, at 14:38, Vikas Saurabh  wrote:
>>
>>> * It'd incur a very heavy migration impact on upgrade or RDB setups -
>>> that, most probably, would translate to us having to support both
>>> schemas. I don't feel that it'd easy to flip the switch for existing
>>> setups.
>>
>>That’s true. I think we should take a similar approach here as with the 
>>segment / segment-tar implementations (and we can use oak-upgrade to convert 
>>between them). At least for now.
>>
>>> * DocumentNodeStore implementation very freely touches prop:rev=value
>>> for a given id… […] I think this would get
>>> expensive for index (_id+propName+rev) maintenance.
>>
>>Indeed, probably we’ll have to analyse the indexing capabilities offered by 
>>different database engines more closely, choosing the one that offers good 
>>writing speed.
>>
>>Best regards,
>>Tomek
>>
>>--
>>Tomek Rękawek | Adobe Research | www.adobe.com
>>reka...@adobe.com


Re: Help with unit tests for JMX stats for S3DataStore

2016-08-18 Thread Chetan Mehrotra
Hi Matt,

It would be easier if you could open an issue and provide your patch
there, so that one can get a better understanding of what needs to be
tested.

In general you can use a MemoryDocumentStore (the default used by the
DocumentMK builder) and then possibly use Sling OSGi mocks to pick up
the registered MBean services. For an example have a look at
SegmentNodeStoreServiceTest, which uses OSGi mocks to activate the
service and then picks up the registered services to do the assertions.
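A minimal sketch of that pattern (the S3DataStoreStats /
S3DataStoreStatsMBean names are placeholders for whatever your patch
introduces; any mandatory service references would need to be registered
before activation):

import static org.junit.Assert.assertNotNull;

import org.apache.sling.testing.mock.osgi.junit.OsgiContext;
import org.junit.Rule;
import org.junit.Test;

public class S3DataStoreStatsMBeanTest {

    @Rule
    public final OsgiContext context = new OsgiContext();

    @Test
    public void statsMBeanGetsRegistered() {
        // Register the dependencies the component needs (e.g. a BlobStore)
        // via context.registerService(...) before activating it.
        // S3DataStoreStats is a placeholder for the component under test.
        context.registerInjectActivateService(new S3DataStoreStats());

        // The MBean should now be visible in the mock service registry,
        // no real JMX connection required.
        assertNotNull(context.getService(S3DataStoreStatsMBean.class));
    }
}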
Chetan Mehrotra


On Fri, Aug 19, 2016 at 6:14 AM, Matt Ryan  wrote:
> Hi,
>
> I’m working on a patch for Oak that would add some JMX stats for
> S3DataStore.  I’m adding code to register a new Mbean in
> DocumentNodeStoreService (also SegmentNodeStoreService, but let’s just
> worry about the first one for now).
>
> I wanted to create some unit tests to verify that my new JMX stats are
> available via JMX.  The idea I had would be that I would simply instantiate
> a DocumentNodeStoreService, create an S3DataStore, wrap it in a
> DataStoreBlobStore, and bind that in the DocumentNodeStoreService.  Then
> with a JMX connection I could check that my Mbean had been registered,
> which it should have been by this time.
>
>
> This was all going relatively fine until I hit a roadblock in
> DocumentNodeStoreService::registerNodeStore().  The DocumentMKBuilder uses
> a DocumentNodeStore object that I need to mock in order to do the test, and
> I cannot mock DocumentNodeStore because it is a final class.  I tried
> working around that, but ended up hitting another road block in the
> DocumentNodeStore constructor where I then needed to mock a NodeDocument -
> again, can’t mock it because it is a final class.
>
>
> I realize it is theoretically possible to mock final classes using
> PowerMock, although by this point I am starting to wonder if all this
> effort is a good way to use my time or if I should just test my code
> manually.
>
>
> Is it important that DocumentNodeStore be a final class?  If not, how would
> we feel about me simply making the class non-final?  If so, what
> suggestions do you have to help me unit test this thing?  I feel that it
> should be easier to unit test new code than this, so maybe I’m missing
> something.
>
>
> Thanks
>
>
> -Matt Ryan


RepositorySidegrade and commit hooks

2016-08-18 Thread Chetan Mehrotra
Hi,

Does RepositorySidegrade run all the commit hooks required for getting a
consistent JCR-level state, like the permission editor, property editor,
etc.?

I can see such hooks configured for RepositoryUpgrade but am not seeing
any such hook configured for RepositorySidegrade.

Probably we should also configure the same set of hooks?

Chetan Mehrotra


Re: RepositorySidegrade and commit hooks

2016-08-19 Thread Chetan Mehrotra
For a complete migration, yes, all the bits are there. However, people
also use this for partial incremental migration from a source system to
a target system. In that case include paths are provided for those paths
whose content needs to be updated. It can then happen that the derived
content for those paths (property index, permission store entries) does
not get updated, which would result in an inconsistent state.


On Fri, Aug 19, 2016 at 1:59 PM, Alex Parvulescu
 wrote:
> Hi,
>
> I don't think any extra hooks are needed here. Sidegrade is just a change
> in persistence format, all the bits should be there already in the old
> repository.
>
> best,
> alex
>
> On Fri, Aug 19, 2016 at 6:45 AM, Chetan Mehrotra 
> wrote:
>
>> Hi,
>>
>> Does RepositorySidegrade runs all the commit hooks required for
>> getting a consistent JCR level state like permission editor, property
>> editor etc
>>
>> I can such hooks configured for RepositoryUpgrade but not seeing any
>> such hook configured for RepositorySidegrade
>>
>> Probably we should also configure same set of hooks?
>>
>> Chetan Mehrotra
>>


Re: RepositorySidegrade and commit hooks

2016-08-19 Thread Chetan Mehrotra
Thanks Tomek for confirmation. Opened OAK-4684 to track that
Chetan Mehrotra


On Fri, Aug 19, 2016 at 3:52 PM, Tomek Rekawek  wrote:
> Hi Chetan,
>
> yes, it seems that this has been overlooked in the OAK-3239 (porting the 
> —include-paths support from RepositoryUpgrade). Feel free to create an issue 
> / commit a patch or let me know if you want me to do it.
>
> Best regards,
> Tomek
>
> --
> Tomek Rękawek | Adobe Research | www.adobe.com
> reka...@adobe.com
>
>> On 19 Aug 2016, at 10:38, Chetan Mehrotra  wrote:
>>
>> For complete migration yes all bits are there. However people also use
>> this for partial incremental migration from source system to target
>> system. In that case include paths are provide for those paths whose
>> content need to be updated. In such a case it can happen that derived
>> content for those paths (property index, permission store entries) do
>> not get updated and that would result in inconsistent state
>> Chetan Mehrotra
>>
>>
>> On Fri, Aug 19, 2016 at 1:59 PM, Alex Parvulescu
>>  wrote:
>>> Hi,
>>>
>>> I don't think any extra hooks are needed here. Sidegrade is just a change
>>> in persistence format, all the bits should be there already in the old
>>> repository.
>>>
>>> best,
>>> alex
>>>
>>> On Fri, Aug 19, 2016 at 6:45 AM, Chetan Mehrotra 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Does RepositorySidegrade runs all the commit hooks required for
>>>> getting a consistent JCR level state like permission editor, property
>>>> editor etc
>>>>
>>>> I can such hooks configured for RepositoryUpgrade but not seeing any
>>>> such hook configured for RepositorySidegrade
>>>>
>>>> Probably we should also configure same set of hooks?
>>>>
>>>> Chetan Mehrotra
>>>>
>

