Re: Safe read after write
Does the above setup involve multiple Oak instances (say in separate JVMs), or do the message producer and the queue consumer use the same Oak instance? Chetan Mehrotra On Thu, Dec 3, 2015 at 8:33 PM, wrote: > I am using Oak with a DocumentNodeStore. I am storing content then adding a > message onto a queue. The consumer of the message uses an id to retrieve > the content. I am seeing frequent failures in the consumer (node not > available/does not exist). If I add a Thread.sleep after I store the node I > do not see these failures. My initial thought was this was related to the > default Mongo WriteConcern of Acknowledged, so I changed my code: > > public Repository getRepository() throws ClassNotFoundException, > RepositoryException { > DB db = new MongoClient(mongoHost, mongoPort).getDB(mongoOakDbName); > db.setWriteConcern(WriteConcern.JOURNALED); // I also tried using > FSYNC > DocumentNodeStore ns = new > DocumentMK.Builder().setMongoDB(db).getNodeStore(); > return new Jcr(new Oak(ns)).createRepository(); > } > > but I still see the problem. Am I missing something? > > Thanks
Re: Lucene index speed
Hi Jim, How does the indexing perform if you, say, just run a single webapp node? Chetan Mehrotra On Sat, Dec 5, 2015 at 7:18 AM, Jim.Tully wrote: > We are using Oak embedded in a web application, and are now experiencing > significant delays in async indexing. New nodes added are sometimes not > available by query for up to an hour. I’m hoping you can identify areas I > might explore to improve this performance. > > We have multiple instances of the web application running with the same > Mongodb cluster connected via SSL. Our Repository constructor is: > > > > ns = new DocumentMK.Builder().setMongoDB(createMongoDB()).getNodeStore(); > > > Oak oak = new Oak(ns); > > > LuceneIndexProvider provider = new LuceneIndexProvider(); > > Jcr jcr = new Jcr(oak).with((QueryIndexProvider) provider).with((Observer) > provider) > > .with(new LuceneIndexEditorProvider()).withAsyncIndexing(); > > repository = jcr.createRepository(); > > > The web application creates the repository at start up, and disposes of it at > shutdown. We have no observers registered at all, but do have 6 lucene > indexes defined. The index that is currently giving me heartburn looks like > below. Where would I start to find what is dragging performance down so > drastically?
> [The index definition was a JCR system-view XML export (xmlns sv="http://www.jcp.org/jcr/sv/1.0", sv:name="PageIndex") that arrived garbled in the archive. The recoverable values are: jcr:primaryType oak:QueryIndexDefinition, type lucene, async, name PageIndex, path /pages/oak:index/PageIndex, several nt:unstructured index rules covering Date-typed properties with boolean flags set to true, and an analyzer configured with org.apache.lucene.analysis.standard.StandardAnalyzer and LUCENE_47.] > > Thanks, > > Jim >
Re: New Oak Example - Standalone Runnable Example based on Spring Boot
On Fri, Dec 4, 2015 at 11:33 AM, Torgeir Veimo wrote: > Do you have a similar example on how to configure with an actual embedded > osgi container running? Not yet, as that's not a very common use case! Feel free to open an issue for that. If there is wider demand for such a use case then it can be looked into. Chetan Mehrotra
Re: Safe read after write
Hi David, To elaborate a bit on what Vikas and Davide said: Oak has an MVCC storage model which is eventually consistent. So any change made on one cluster node would not be immediately visible on other cluster nodes. Instead, each node periodically polls for changes in the backend store (Mongo in the above case) and then updates its head revision. Only after that do changes made in those revisions become "visible" to that cluster node. So in the above setup, if on cluster node N1 you add a node, and that information is communicated to another cluster node N2 outside of Oak (here via a message queue), and the other cluster node reacts to it, then there is a chance that the change has not yet become visible on that cluster node. Currently there is no deterministic way around this other than introducing polling as part of the queue consumer logic. Chetan Mehrotra On Mon, Dec 7, 2015 at 7:20 PM, Vikas Saurabh wrote: > On Mon, Dec 7, 2015 at 7:14 PM, David Marginian wrote: >> Yes, each node however is referencing the same mongo instance. Is there a >> way to tell jackrabbit to grab the document from mongo instead of using the >> cluster cache (assuming that is what's going on). > > Each cluster node has a thread (background read thread) which, in a > crude sense, absorbs changes from other nodes. Simultaneous > conflicting writes are avoided, but the state of a node that's visible to > layers above (Jcr, etc.) doesn't get to see changes from other nodes > until the background read is done absorbing changes from other nodes. > > Thanks, > Vikas
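Since there is no deterministic way to know when the other cluster node's background read has caught up, the consumer side ends up polling. A minimal sketch of such a poll loop in plain Java (the `Supplier<Boolean>` stands in for whatever repository lookup checks that the node is visible; the names here are hypothetical, not an Oak API):

```java
import java.util.function.Supplier;

public class PollUntilVisible {

    /**
     * Polls the supplied visibility check until it returns true or the
     * timeout expires. Returns true if the change became visible in time.
     */
    public static boolean pollUntil(Supplier<Boolean> check, long timeoutMs, long intervalMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (check.get()) {
                return true; // change from the other cluster node is now visible
            }
            Thread.sleep(intervalMs); // wait for the next background read cycle
        }
        return check.get(); // one last check at the deadline
    }
}
```

The interval would typically be tuned to the backend's background-read frequency, so most waits finish after one or two iterations.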
Re: Safe read after write
On Mon, Dec 7, 2015 at 8:28 PM, wrote: > Are you recommending that my consumer attempts to retrieve the node until it > is present? Kind of. One approach I can think of: 1. If your code is adding nodes under a specific path, say /workItems, then have a JCR listener registered to monitor changes under that path. 2. The queue consumer, upon getting a message, checks whether the node is present. If not, it waits on a lock. 3. The listener, upon receiving any event (specifically an external event [1]), then notifies such waiting consumers. 4. The consumer checks if the required node is now found; if not, it goes back to sleep. Such a retry can be done 'n' times before giving up. Chetan Mehrotra [1] https://jackrabbit.apache.org/api/2.1/org/apache/jackrabbit/api/observation/JackrabbitEvent.html#isExternal()
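The steps above can be sketched as a small coordination helper in plain Java. This is only an illustration of the wait/notify pattern, not Oak code: `onExternalEvent` would be called from the JCR observation listener, and the `Supplier<Boolean>` is a hypothetical placeholder for the actual node-existence check:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class ExternalEventWaiter {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();

    /** Step 3: called from the JCR listener when an external event arrives. */
    public void onExternalEvent() {
        lock.lock();
        try {
            changed.signalAll(); // wake up waiting queue consumers
        } finally {
            lock.unlock();
        }
    }

    /**
     * Steps 2 and 4: the queue consumer checks for the node and, if absent,
     * waits for an external event, retrying up to maxRetries times.
     */
    public boolean awaitNode(Supplier<Boolean> nodePresent, int maxRetries, long waitMs)
            throws InterruptedException {
        for (int i = 0; i <= maxRetries; i++) {
            if (nodePresent.get()) {
                return true;
            }
            lock.lock();
            try {
                // bounded wait so a missed signal only costs one interval
                changed.await(waitMs, TimeUnit.MILLISECONDS);
            } finally {
                lock.unlock();
            }
        }
        return nodePresent.get();
    }
}
```

Using a timed `await` means the consumer still makes progress even if an event is missed between the existence check and the wait.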
Re: Lucene index speed
On Mon, Dec 7, 2015 at 9:06 PM, Jim.Tully wrote: > When running locally with similar data, the indexing is nearly > instantaneous. Okay, that's what I was expecting. The problem here is that the AsyncIndexer job is to be run as a singleton in a cluster. This is done at [1]. This is an undocumented dependency on the Sling way of scheduling things (SLING-2979), which allows one to schedule jobs as singletons in a cluster. The default scheduler used by Oak (outside of Sling) does not honor this contract, which causes this job to be executed concurrently on each cluster node, and that causes conflicts/retries etc. So in a way Oak is outsourcing job execution in a cluster to the embedding application. Would be good to document this aspect (if you can open an issue that would be helpful). Given the recent work on DocumentDiscoveryLiteService it might be possible for Oak to manage such things on its own (@Stefan thoughts?). But as of now this is not possible. So the only way out currently is to provide your own Whiteboard implementation which can handle such singleton scheduled jobs. Doing this is certainly non-trivial! Chetan Mehrotra [1] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/spi/whiteboard/WhiteboardUtils.java#L59
Re: Lucene index speed
Hi Jim, The proper way to do this would be to have your own Whiteboard implementation: implement the logic as present in the Oak whiteboard [1] and then modify the logic around scheduling. However, given that currently only the async indexing task requires to be run as a singleton, you can disable the default async indexing and trigger it on your own. Just use IndexMBeanRegistration > Is there an optimal frequency for indexing that you would recommend? The default is 5 sec, which so far we have seen works fine > Why doesn’t the checkpoints prevent resource contention? It would appear to > me that they should. Checkpoints are not meant to prevent contention. AsyncIndexer has inbuilt "lease" support to prevent concurrent runs, but there have been some issues like OAK-3436 which can result in complete reindexing at times! They should be addressed soon Chetan Mehrotra [1] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/Oak.java#L247 On Tue, Dec 8, 2015 at 8:45 AM, Jim.Tully wrote: > Chetan, > > It appears that I can at least trigger the async indexing from within my > application. That leaves me with two questions that I hope you can find > time to answer: > > 1. Is there an optimal frequency for indexing that you would recommend? > 2. Why doesn’t the checkpoints prevent resource contention? It would > appear to me that they should. > > Many thanks, > > Jim Tully > > > > > > On 12/7/15, 10:50 AM, "Jim.Tully" wrote: > >>Chetan, >> >>I really appreciate the quick response. Our application is capable of >>running singleton scheduled jobs already, so I believe I can take care of >>that aspect.
Would it be as simple as omitting the withAsyncIndexing() >>argument to the constructor, and then >> >>- create an AsyncIndexUpdate instance >>- schedule the instance to invoke its run() method >> >> >>Jim >> >> >> >> >> >>On 12/7/15, 9:50 AM, "Chetan Mehrotra" wrote: >> >>>On Mon, Dec 7, 2015 at 9:06 PM, Jim.Tully wrote: >>>> When running locally with similar data, the indexing is nearly >>>> instantaneous. >>> >>>Okie thats what I was expecting. The problem here is that AsyncIndexer >>>job is to be run as a singleton in a cluster. This is done at [1]. >>>This is undocumented dependency on Sling way of scheduling things >>>(SLING-2979) which allows one to schedule jobs as singleton in a >>>cluster. >>> >>>The default scheduler used by Oak (outside of Sling) does not honor >>>this contract which causes this job to be executed concurrently on >>>each cluster node and that causes conflict/retries etc. So in a way >>>Oak is outsourcing the job execution in cluster to embedding >>>application. Would be good to document this aspect (if you can open an >>>issue that would be helpful) >>> >>>Given the recent work on DocumentDiscoveryLiteService it might be >>>possible for Oak to manage such thing on its own (@Stefan thoughts?). >>>But as of now this is not possible. So only way out currently is to >>>provide your own Whiteboard implementation which can handle such kind >>>of singleton scheduled jobs. Doing this is certainly non trivial! >>> >>>Chetan Mehrotra >>>[1] >>>https://github.com/apache/jackrabbit-oak/blob/trunk/oak-core/src/main/jav >>>a >>>/org/apache/jackrabbit/oak/spi/whiteboard/WhiteboardUtils.java#L59 >>> >> >
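Jim's two bullet points above (create an AsyncIndexUpdate instance and schedule its run() method) can be sketched with a plain ScheduledExecutorService. The sketch below uses a generic Runnable in place of Oak's AsyncIndexUpdate, and the AtomicBoolean only prevents overlapping runs inside one JVM; the cluster-wide singleton guarantee still has to come from the application's own job framework, as discussed above:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class SingletonIndexScheduler {
    private final AtomicBoolean running = new AtomicBoolean();

    /**
     * Schedules the given indexing task (e.g. an AsyncIndexUpdate's run())
     * at a fixed delay, skipping a run if the previous one is still active.
     */
    public ScheduledExecutorService schedule(Runnable indexTask, long intervalMs) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(() -> {
            if (running.compareAndSet(false, true)) { // crude in-JVM "lease"
                try {
                    indexTask.run();
                } finally {
                    running.set(false);
                }
            }
        }, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
        return ses;
    }
}
```

With the default Oak settings the interval would be around 5 seconds, per the recommendation above.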
Re: [VOTE] Release Apache Jackrabbit Oak 1.0.25
On Tue, Dec 8, 2015 at 10:43 AM, Amit Jain wrote: > [X] +1 Release this package as Apache Jackrabbit Oak 1.0.25 Chetan Mehrotra
Re: fixVersions in jira
On Tue, Dec 8, 2015 at 9:36 PM, Julian Reschke wrote: > So what's the correct JIRA state for something that has been fixed in 1.3.x, > which is intended to be backported to 1.2, but hasn't been backported yet? > Can I still set that to "resolved"? So far the practice some of us follow is to add label candidate_oak_1_0 or candidate_oak_1_2. See [1] for some earlier discussion around this Chetan Mehrotra [1] http://markmail.org/thread/7sbse6lpgxaqgplv
Re: svn commit: r1718848 - /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeStateUtils.java
On Wed, Dec 9, 2015 at 6:36 PM, wrote: > +private static String multiplier(String in, int times) { > +StringBuilder sb = new StringBuilder(); > +for (int i = 0; i < times; i++) { > +sb.append(in); > +} > +return sb.toString(); > +} > + Maybe use com.google.common.base.Strings#repeat? Chetan Mehrotra
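For reference, the hand-rolled loop in the commit is equivalent to Guava's Strings.repeat; with only the JDK (before Java 11 added String.repeat(int)) the same can be written in one line:

```java
import java.util.Collections;

public class Repeat {
    // JDK-only equivalent of com.google.common.base.Strings.repeat(in, times):
    // join 'times' copies of the input with an empty separator.
    static String repeat(String in, int times) {
        return String.join("", Collections.nCopies(times, in));
    }
}
```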
Re: Oak crypto API
On Tue, Dec 1, 2015 at 6:21 PM, Timothée Maret wrote: > The API would > > - Cover encrypting/decrypting data using an instance global master key > - Make no assumption regarding how to get the master key This looks useful! I think having just the decryption part should be sufficient as part of the API which needs to be used by Oak code. The encryption method can be part of the implementation, so that test cases can use it to create test data. Since how the encrypted data is created might vary depending on the embedding application, making that method part of the API may pose some problems. Chetan Mehrotra
Remove/Disable ordered property indexes in trunk
Given that ordered property indexes have been deprecated a long time back, it would be better if we remove the corresponding code or at least disable the OSGi components for it. I would prefer removal of the code, as otherwise we need to take care of that code too whenever any cross-cutting refactoring is performed (say some change which touches all indexes). Thoughts? Chetan Mehrotra
Re: Oak crypto API
Hi Timothée, On Thu, Dec 10, 2015 at 4:59 PM, Timothée Maret wrote: > However, I think that encryption and decryption go in pair (use the same > algo) and maybe it would be best to reflect it in the API. From what I understand, its main use case is to allow components in Oak to make use of encrypted credentials while interacting with third-party services. For example, in LDAP the password to access the LDAP server can be encrypted, and the Oak LDAP logic would need some API to decrypt it. How it is encrypted and what the encryption algo is are not the concern of this logic. The encryption algo might be encoded in the encrypted config itself. So just having support for the following call would be sufficient - //one can use a byte[] also as argument type but keeping it string //as the key would be provided via some property file/OSGi //config and hence would be expected to be encoded say in base64 byte[] decrypt(String cipherText) - And this is the API which can be used in other places also (say decrypting Mongo connection credentials). Further, the implementation might vary quite a bit: 1. Credentials obtained from a third-party service - cipherText might be a logical name of some credential config, say prod1LdapPwd. In that deployment there is support for some third-party credential storage server which can provide the credentials at runtime. In such a deployment even the encrypted key would not be present in the local system, and the crypto implementation would use that service's SDK to fetch the credential at runtime (using some off-band authentication to that service). 2. cipherText having the algo encoded - For some implementations the cipherText would be like '{AES/CBC/PKCS5Padding}' - the implementation can then decode the value as per requirement. So how the encrypted key is created and managed is not a concern for Oak logic. Oak just needs a way to get a plain-text credential given some opaque key data.
Any method related to encryption would not be used by other parts of Oak, so it need not be part of the API which we expose as an extension point. Chetan Mehrotra
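The second variation above (the algo encoded in the cipherText itself) could be parsed roughly as follows. This is a hypothetical sketch, not the proposed Oak API: it only splits the '{algo}' prefix and base64-decodes the remainder; a real implementation would then apply the named cipher with the instance master key:

```java
import java.util.Base64;

public class CipherTextParser {

    /** Hypothetical parsed form of a '{algo}base64data' cipher text. */
    public static class Parsed {
        public final String algorithm;
        public final byte[] data;
        Parsed(String algorithm, byte[] data) {
            this.algorithm = algorithm;
            this.data = data;
        }
    }

    /**
     * Splits a cipher text like '{AES/CBC/PKCS5Padding}...' into the algorithm
     * name and the raw encrypted bytes (base64-decoded). Decryption with the
     * master key would happen after this step.
     */
    public static Parsed parse(String cipherText) {
        if (!cipherText.startsWith("{") || cipherText.indexOf('}') < 0) {
            throw new IllegalArgumentException("missing algorithm prefix: " + cipherText);
        }
        int end = cipherText.indexOf('}');
        String algo = cipherText.substring(1, end);
        byte[] data = Base64.getDecoder().decode(cipherText.substring(end + 1));
        return new Parsed(algo, data);
    }
}
```

Keeping the algorithm in the value itself means the decrypt(String) API stays opaque to callers, matching the point above.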
Re: Remove/Disable ordered property indexes in trunk
On Thu, Dec 10, 2015 at 7:43 PM, Davide Giannella wrote: > Can any of you please file an issue and assign it to myself? Done with OAK-3768 Chetan Mehrotra
Re: Missing SessionStatistics Mbeans
Hi Marc, Thanks for reporting this. It looks like a regression due to changes done for OAK-3477 (affects 1.3.11). Opened OAK-3802 for that. Chetan Mehrotra On Wed, Dec 16, 2015 at 9:51 PM, Marc Pfaff wrote: > Hi > > Using oak-1.3.11.r1716789, I have a situation, where I see the session > counter, as per RepositoryStats#SessionCount, constantly increasing over > time. > > This makes me wonder if I stumbled over a session leak. So far, I > consulted the SessionStatistics beans in the system console in those cases > to get an idea of suspicious sessions, by looking at > SessionStatistics#InitStackTrace. But it looks like there are no > SessionStatistics mbeans no more in the system console. > > Now I wonder where have the SessionStatistics mbeans gone? Or is there an > issue in the value reported by RepositoryStats#SessionCount and I don't > have a session leak at all? What other options do I have to find a session > leak in my code? > > The last checkpoint where I still have the SessionStatistics beans is with > oak-1.3.10.r1713699. > > Thanks a lot. > > Regards > Marc > > >
Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 634 - Still Failing
On Thu, Dec 17, 2015 at 9:45 AM, Apache Jenkins Server wrote: > Stack Trace: > junit.framework.ComparisonFailure: expected: hallo (1)], text:[hallo (1), hello (1), oh hallo (1)], text:[hallo (1), hello > (1), oh hallo (1)]]> but was: > at junit.framework.Assert.assertEquals(Assert.java:100) > at junit.framework.Assert.assertEquals(Assert.java:107) > at junit.framework.TestCase.assertEquals(TestCase.java:269) > at > org.apache.jackrabbit.oak.jcr.query.FacetTest.testFacetRetrievalWithAnonymousUser(FacetTest.java:102) Looks like most failures are in the new Facet tests. @Tommaso - can you have a look? Chetan Mehrotra
Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 634 - Still Failing
On Thu, Dec 17, 2015 at 12:07 PM, Tommaso Teofili wrote: > can you guys reproduce locally? I tried but it passes. Looking at failure [1], it appears to be coming in oak-solr-core (and not in oak-lucene). Just wondering if there is some async behavior involved due to Solr. In such a case, if we make any commit, it might happen that changes made to the index are not yet reflected to index readers. I also remember quite a few failures in spell check support (OAK-3355) which might have the same root cause. So Solr support for facets, spell check and suggester might have some race condition involved. Chetan Mehrotra [1] https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/634/jdk=jdk1.8.0_11,label=Ubuntu,nsfixtures=SEGMENT_MK,profile=unittesting/console
JIRA issue not showing associated commits
Hi Team, Earlier the JIRA for Oak used to show commits related to issues via Fisheye integration. Now there is no such link, which makes it difficult to determine what changes were done for an issue. Any idea how to get that integration back? Probably it got lost when we moved to the Epic-based model. Chetan Mehrotra
Re: svn commit: r1722496 - /jackrabbit/oak/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/xml/ImporterImpl.java
On Fri, Jan 1, 2016 at 7:56 PM, wrote: > } > -} else if (getDefinition(parent).isProtected()) { > -if (pnImporter != null) { > -pnImporter.end(parent); > -// and reset the pnImporter field waiting for the next > protected > -// parent -> selecting again from available importers > -pnImporter = null; > -} > +} else if ((pnImporter != null) && > getDefinition(parent).isProtected()) { > +pnImporter.end(parent); > +// and reset the pnImporter field waiting for the next > protected > +// parent -> selecting again from available importers > +pnImporter = null; > } > The above change is causing a couple of test failures in CUG === Failed tests: testNestedCug(org.apache.jackrabbit.oak.spi.security.authorization.cug.impl.CugImportIgnoreTest) testNestedCug(org.apache.jackrabbit.oak.spi.security.authorization.cug.impl.CugImportAbortTest) testNestedCug(org.apache.jackrabbit.oak.spi.security.authorization.cug.impl.CugImportBesteffortTest) === It happens because `getDefinition(parent).isProtected()` has a side effect of triggering an exception. With the above code change that call is not made if 'pnImporter' is null, which causes a change in behaviour. So it is better to revert that change. Chetan Mehrotra
Re: [Oak origin/trunk] Apache Jackrabbit Oak matrix - Build # 653 - Failure
On Wed, Jan 6, 2016 at 11:13 AM, Apache Jenkins Server wrote: > Stack Trace: > java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils > at > org.apache.jackrabbit.oak.fixture.DocumentRdbFixture.toString(DocumentRdbFixture.java:82) Looks like commons-lang is not available in oak-jcr. Added it as a test dependency to see if this gets resolved Chetan Mehrotra
Re: svn commit: r1724598 - in /jackrabbit/oak/trunk/oak-core/src: main/java/org/apache/jackrabbit/oak/api/ main/java/org/apache/jackrabbit/oak/plugins/document/rdb/ main/java/org/apache/jackrabbit/oak
On Thu, Jan 14, 2016 at 6:40 PM, wrote: > > jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/api/Blob.java > > jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBDocumentStore.java > > jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/value/BinaryImpl.java I see some changes to Blob/BinaryImpl. Are those changes related to this issue? Most likely just noise, but I wanted to confirm. Chetan Mehrotra
Re: svn commit: r1725250 - in /jackrabbit/oak/trunk: oak-core/src/main/java/org/apache/jackrabbit/oak/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/atomic/ oak-core/src/test/java/org/apach
Hi Davide, On Mon, Jan 18, 2016 at 5:46 PM, wrote: > + */ > +public AtomicCounterEditorProvider() { > +clusterSupplier = new Supplier() { > +@Override > +public Clusterable get() { > +return cluster.get(); > +} > +}; > +schedulerSupplier = new Supplier() { > +@Override > +public ScheduledExecutorService get() { > +return scheduler.get(); > +} > +}; > +storeSupplier = new Supplier() { > +@Override > +public NodeStore get() { > +return store.get(); > +} > +}; > +wbSupplier = new Supplier() { > +@Override > +public Whiteboard get() { > +return whiteboard.get(); > +} > +}; > +} Just curious about the use of the above approach. Is it for keeping the dependencies non-static, or for using final instance variables? If you mark the references as static then all those bind and unbind methods would not be required, as by the time the component is active the dependencies would be set. Chetan Mehrotra
Re: Restructure docs
On Wed, Jan 20, 2016 at 2:46 PM, Davide Giannella wrote: > When you change/add/remove an item from the left-hand menu, you'll have > to redeploy the whole site as it will be hardcoded within the html of > each page. Deploying the whole website is a long process. Therefore > limiting the changes over there make things faster. I mostly do a partial commit, i.e. only the modified page, and it has worked well. Changing the left-side menu is not a very frequent task, and for that I think doing a full deploy of the site is fine for now. Chetan Mehrotra
Re: Issue using the text extraction with lucene
On Sat, Jan 23, 2016 at 9:34 PM, Stephan Becker wrote: > Exception in thread "main" java.lang.NoSuchMethodError: > org.apache.commons.csv.CSVFormat.withIgnoreSurroundingSpaces()Lorg/apache/commons/csv/CSVFormat; Looks like tika-app-1.11 is using commons-csv 1.0 [1] while Oak uses 1.1, and CSVFormat.withIgnoreSurroundingSpaces was added in v1.1. We tested it earlier with Tika 1.6. So you can try adding the commons-csv jar as the first one in the classpath: java -cp commons-csv-1.1.jar:tika-app-1.11.jar:oak-run-1.2.4.jar Chetan Mehrotra [1]http://svn.apache.org/viewvc/tika/tags/1.11-rc1/tika-parsers/pom.xml?view=markup#l328
Re: Issue using the text extraction with lucene
On Sun, Jan 24, 2016 at 2:28 AM, Stephan Becker wrote: > How does it then further extract the > text from added documents? Currently the extracted-text support does not allow updates, i.e. it only has the text that was extracted at the time the extraction was done via the tool; text extracted later would not be added. The primary aim was to speed up indexing time during migration. Chetan Mehrotra
Re: JUnit tests with FileDataStore
To make use of FileDataStore you would need to configure a SegmentNodeStore, as MemoryNodeStore does not allow plugging in a custom BlobStore. Have a look at the snippet [1] for a possible approach. Chetan Mehrotra [1] https://gist.github.com/chetanmeh/6242d0a7fe421955d456 On Wed, Jan 27, 2016 at 6:42 AM, Tobias Bocanegra wrote: > Hi, > > I have some tests in filevault that I want to run with the > FileDataStore, but I couldn't figure out how to setup the repository > correctly here [0]. I also looked at the tests in oak, but I couldn't > find a valid reference. > > The reason for this is to test the binary references, which afaik only > work with the FileDataStore. > at least my test [1] works with jackrabbit, but not for oak. > > thanks. > regards, toby > > [0] > https://github.com/apache/jackrabbit-filevault/blob/trunk/vault-core/src/test/java/org/apache/jackrabbit/vault/packaging/integration/IntegrationTestBase.java#L118-L120 > [1] > https://github.com/apache/jackrabbit-filevault/blob/trunk/vault-core/src/test/java/org/apache/jackrabbit/vault/packaging/integration/TestBinarylessExport.java
Re: svn commit: r1727311 - in /jackrabbit/oak/trunk/oak-core/src: main/java/org/apache/jackrabbit/oak/osgi/OsgiWhiteboard.java test/java/org/apache/jackrabbit/oak/osgi/OsgiWhiteboardTest.java
On Fri, Jan 29, 2016 at 4:08 PM, Michael Dürig wrote: > > Shouldn't we make this volatile? Ack. Would do that Chetan Mehrotra
Re: svn commit: r1728341 - /jackrabbit/oak/trunk/oak-segment/src/main/java/org/apache/jackrabbit/oak/plugins/segment/SegmentGraph.java
On Wed, Feb 3, 2016 at 10:17 PM, wrote: > +private static String toString(Throwable e) { > +StringWriter sw = new StringWriter(); > +PrintWriter pw = new PrintWriter(sw, true); > +try { > +e.printStackTrace(pw); > +return sw.toString(); > +} finally { > +pw.close(); > +} > } > + Maybe use com.google.common.base.Throwables#getStackTraceAsString? Chetan Mehrotra
Re: svn commit: r1728341 - /jackrabbit/oak/trunk/oak-segment/src/main/java/org/apache/jackrabbit/oak/plugins/segment/SegmentGraph.java
On Fri, Feb 5, 2016 at 2:54 PM, Michael Dürig wrote: > There's always another library ;-) For utility stuff, well, almost! Chetan Mehrotra
Re: R: info about jackrabbitoak.
On Wed, Feb 24, 2016 at 2:46 PM, Ancona Francesco wrote: > that the project depends on felix (osgi) dependency. It does not depend on the Felix framework, only on some modules from the Felix project. There is a webapp example [1] where you can deploy the war on Tomcat/a web container and have your code in the war access the repository instance. Chetan Mehrotra [1] https://github.com/apache/jackrabbit-oak/tree/trunk/oak-examples/webapp
Re: testing blob equality
On Mon, Feb 29, 2016 at 6:42 PM, Tomek Rekawek wrote: > I wonder if we can switch the order of length and identity comparison in > AbstractBlob#equal() method. Is there any case in which the > getContentIdentity() method will be slower than length()? That can be switched, but I am afraid it would not work as expected: in JackrabbitNodeState#createBlob, determining the contentIdentity involves determining the length. You can give org.apache.jackrabbit.oak.upgrade.blob.LengthCachingDataStore a try (see OAK-2882 for details). Chetan Mehrotra
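For illustration, the reordering Tomek proposes would look roughly like the sketch below. The Blob interface here is a hypothetical minimal stand-in, not Oak's AbstractBlob, and the final length comparison is a placeholder for the real stream-based content comparison:

```java
import java.util.Objects;

public class BlobEquality {

    /** Hypothetical minimal Blob view for illustration only. */
    interface Blob {
        String getContentIdentity(); // may be null; may be expensive (see upgrade case)
        long length();               // may also be expensive to compute
    }

    /** Convenience factory for building test blobs. */
    static Blob of(String id, long len) {
        return new Blob() {
            public String getContentIdentity() { return id; }
            public long length() { return len; }
        };
    }

    /**
     * Sketch of the proposed ordering: compare content identities first and
     * only fall back to length (and, in real code, content) comparison when
     * identities are unavailable.
     */
    static boolean equal(Blob a, Blob b) {
        String idA = a.getContentIdentity();
        String idB = b.getContentIdentity();
        if (idA != null && idB != null) {
            return Objects.equals(idA, idB); // cheap short-circuit when identities exist
        }
        return a.length() == b.length(); // placeholder for full content comparison
    }
}
```

As noted above, this only helps when getContentIdentity() is genuinely cheaper than length(), which is not the case in the upgrade scenario.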
Re: [1.4.0][blocked] oak-examples and circular dependencies on oak itself
On Tue, Mar 1, 2016 at 10:51 PM, Davide Giannella wrote: > I'm kind-of stuck in the release process as oak-examples contains > dependencies to oak-1.4-SNAPSHOT. The problem is the -SNAPSHOT bit. Wondering how it has worked so far for the various 1.3.x releases. One approach we can try is to get rid of oak.version and make use of project.version. That way it should be similar to how oak-lucene depends on oak-core, and hence should work. Chetan Mehrotra
Re: oak-resilience
Cool stuff Tomek! This was something which was discussed at the last Oakathon, so it's great to have a way to do resilience testing programmatically. Would give it a try. Chetan Mehrotra On Mon, Mar 7, 2016 at 1:49 PM, Stefan Egli wrote: > Hi Tomek, > > Would also be interesting to see the effect on the leases and thus > discovery-lite under high memory load and network problems. > > Cheers, > Stefan > > On 04/03/16 11:13, "Tomek Rekawek" wrote: > >>Hello, >> >>For some time I've worked on a little project called oak-resilience. It >>aims to be a resilience testing framework for the Oak. It uses >>virtualisation to run Java code in a controlled environment, that can be >>spoilt in different ways, by: >> >>* resetting the machine, >>* filling the JVM memory, >>* filling the disk, >>* breaking or deteriorating the network. >> >>I described currently supported features in the README file [1]. >> >>Now, once I have a hammer I'm looking for a nail. Could you share your >>thoughts on areas/features in Oak which may benefit from being >>systematically tested for the resilience in the way described above? >> >>Best regards, >>Tomek >> >>[1] >>https://github.com/trekawek/jackrabbit-oak/tree/resilience/oak-resilience >> >>-- >>Tomek Rękawek | Adobe Research | www.adobe.com >>reka...@adobe.com >> > >
Re: [VOTE] Release Apache Jackrabbit Oak 1.4.0 (take 3)
On Mon, Mar 7, 2016 at 4:21 PM, Davide Giannella wrote: > [ ] +1 Release this package as Apache Jackrabbit Oak 1.4.0 All checks ok including integration tests [1] Chetan Mehrotra [1] Run check-release.sh with the following mvn command: mvn verify -fn -PintegrationTesting,unittesting,rdb-derby -Drdb.jdbc-url=jdbc:derby:foo\;create=true
Re: parent pom env.OAK_INTEGRATION_TESTING
On Tue, Mar 22, 2016 at 9:49 PM, Davide Giannella wrote: > I can't really recall why and if we use this. It's referred to in the main README.md, to allow a developer to always enable running of integration tests. Chetan Mehrotra
Re: [VOTE] Release Apache Jackrabbit Oak 1.4.1
On Thu, Mar 24, 2016 at 8:02 PM, Davide Giannella wrote: > [ ] +1 Release this package as Apache Jackrabbit Oak 1.4.1 +1 (ALL CHECKS OK) Chetan Mehrotra
Re: Extracting subpaths from a DocumentStore repo
Hi Robert, On Mon, Mar 28, 2016 at 7:59 PM, Robert Munteanu wrote: > - create a repository (R1) , populate /foo and /bar with some content > - extract data for /foo and /bar from R1 > - pre-populate a DS 'storage area' ( MongoDB collection or RDB table ) > with the data extracted above > - configure a new repository (R2) to mount /foo and /bar with the data > from above Instead of relying on the DocumentStore API for "cloning" certain paths, it might be easier to use Repository Sidegrade [1]-style logic, which works at the NodeState level. In that case you would not need to rely on Document-level details. Chetan Mehrotra [1] https://jackrabbit.apache.org/oak/docs/migration.html
Re: svn commit: r1737349 - /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBConnectionHandler.java
Hi Julian, On Fri, Apr 1, 2016 at 5:19 PM, wrote: > +@Nonnull > +private Connection getConnection() throws IllegalStateException, > SQLException { > +long ts = System.currentTimeMillis(); > +Connection c = getDataSource().getConnection(); > +if (LOG.isDebugEnabled()) { > +long elapsed = System.currentTimeMillis() - ts; > +if (elapsed >= 100) { > +LOG.debug("Obtaining a new connection from " + this.ds + " > took " + elapsed + "ms"); > +} > +} > +return c; > +} You can also use PerfLogger here which is also used in other places in DocumentNodeStore --- final PerfLogger PERFLOG = new PerfLogger( LoggerFactory.getLogger(DocumentNodeStore.class.getName() + ".perf")); final long start = PERFLOG.start(); Connection c = getDataSource().getConnection(); PERFLOG.end(start, 100, "Obtaining a new connection from {} ", ds); --- This would also avoid the call to System.currentTimeMillis() if debug log is not enabled Chetan Mehrotra
Re: svn commit: r1737349 - /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBConnectionHandler.java
On Fri, Apr 1, 2016 at 6:40 PM, Julian Reschke wrote: > Did you benchmark System.currentTimeMillis() as opposed to checking the log > level? Well, the time taken by a single isDebugEnabled call would always be less than System.currentTimeMillis() + isDebugEnabled! In this case it anyway does not matter much, as the remote call would have much more overhead. The suggestion here was more about having a consistent way of doing such things, not a hard requirement per se ... Chetan Mehrotra
Re: [VOTE] Release Apache Jackrabbit Oak 1.2.14
On Wed, Apr 20, 2016 at 10:25 AM, Amit Jain wrote: > [ ] +1 Release this package as Apache Jackrabbit Oak 1.2.14 All checks ok Chetan Mehrotra
Re: [VOTE] Please vote for the final name of oak-segment-next
Missed sending a nomination on the earlier thread. If it's not too late, one more proposal: oak-segment-v2. This is somewhat similar to the names used in Mongo, mmapv1 and mmapv2. Chetan Mehrotra On Tue, Apr 26, 2016 at 2:32 PM, Tommaso Teofili wrote: > oak-segment-store +1 > > Regards, > Tommaso > > Il giorno lun 25 apr 2016 alle ore 16:52 Vikas Saurabh < > vikas.saur...@gmail.com> ha scritto: > > > > oak-embedded-store +1 > > > > > > Thanks, > > Vikas > > >
API proposal for - Expose URL for Blob source (OAK-1963)
Hi Team,

For OAK-1963 we need to allow access to the actual Blob location, say in the form of a File instance or an S3 object id etc. This access is needed to perform optimized IO operations around the binary object, e.g.

1. The File object can be used to spool the file content with zero copy using NIO by accessing the FileChannel directly [1]
2. Client code can efficiently replicate a binary stored in S3 by having direct access to the S3 object using a copy operation

To allow such access we would need a new API in the form of AdaptableBinary.

API
===

public interface AdaptableBinary {

    /**
     * Adapts the binary to another type like File, URL etc
     *
     * @param <AdapterType> The generic type to which this binary is adapted to
     * @param type The Class object of the target type, such as File.class
     * @return The adapter target or null if the binary cannot
     *         adapt to the requested type
     */
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

Usage
=====

Binary binProp = node.getProperty("jcr:data").getBinary();

// Check if Binary is of type AdaptableBinary
if (binProp instanceof AdaptableBinary) {
    AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;

    // Adapt it to a File instance
    File file = adaptableBinary.adaptTo(File.class);
}

The Binary instance returned by Oak, i.e. org.apache.jackrabbit.oak.plugins.value.BinaryImpl, would then implement this interface, and calling code can check the type, cast it and then adapt it.

Key Points

1. Depending on the backing BlobStore the binary can be adapted to various types. For FileDataStore it can be adapted to File. For S3DataStore it can either be adapted to URL or some S3DataStore specific type.
2. Security - Thomas suggested that for better security the ability to adapt should be restricted based on session permissions. So adaptation would only work if the user has the required permission; otherwise null would be returned.
3. The adaptation proposal is based on Sling Adaptable [2]
4. This API is for now exposed only at the JCR level.
I am not sure whether we should do it at the Oak level, as Blob instances are currently not bound to any session. So the proposal is to place this in the 'org.apache.jackrabbit.oak.api' package.

Kindly provide your feedback! Also, any suggestion/guidance around how the access control should be implemented would be welcome.

Chetan Mehrotra
[1] http://www.ibm.com/developerworks/library/j-zerocopy/
[2] https://sling.apache.org/apidocs/sling5/org/apache/sling/api/adapter/Adaptable.html
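To make the proposal above concrete, here is a minimal sketch of how a FileDataStore-backed Binary might implement the proposed interface. FileBackedBinary and the sessionHasPermission flag are illustrative assumptions for this sketch only — in Oak, BinaryImpl would wire the permission check into the session:

```java
import java.io.File;

// The proposed API, with the generics restored.
interface AdaptableBinary {
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

// Hypothetical File-backed implementation, not Oak's actual BinaryImpl.
class FileBackedBinary implements AdaptableBinary {
    private final File file;
    private final boolean sessionHasPermission;

    FileBackedBinary(File file, boolean sessionHasPermission) {
        this.file = file;
        this.sessionHasPermission = sessionHasPermission;
    }

    @Override
    public <AdapterType> AdapterType adaptTo(Class<AdapterType> type) {
        // Key Point 2: adaptation only works with the required permission,
        // otherwise null is returned.
        if (!sessionHasPermission) {
            return null;
        }
        if (type == File.class) {
            return type.cast(file);
        }
        return null; // unsupported target type
    }
}
```

An S3-backed implementation would follow the same shape, answering `adaptTo(URL.class)` or some S3-specific type instead of File.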
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On Wed, May 4, 2016 at 10:07 PM, Ian Boston wrote:
> If the File or URL is writable, will writing to the location cause issues
> for Oak ?

Yes, that would cause problems. The expectation here is that code using a direct location needs to behave responsibly.

Chetan Mehrotra
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On Thu, May 5, 2016 at 1:31 PM, Davide Giannella wrote:
> Would it be possible to avoid the `instaceof`? Which means, in my
> opinion, all our binaries should be Adaptable.

The Binary interface is part of the JCR API, so it cannot be modified to extend Adaptable. Hence the client code would need to cast and special case it.

> Plus I would add anyhow an oak.api interface Adaptable so that we can
> then, if needed, apply the same concept anywhere else.

That can also be done. For now I was being conservative in the API being introduced. If later we find that Adaptable kind of support is needed for other places, it can be introduced as a first-class API.

Chetan Mehrotra
Re: API proposal for - Expose URL for Blob source (OAK-1963)
> This proposal introduces a huge leak of abstractions and has deep
> security implications.

I understand the leak of abstractions concern. However, I would like to understand the security concern a bit more. One way I can think of that it can cause a security concern is that you have some malicious code running in the same JVM which can then do bad things with the file handle. Do note that the File handle would not get exposed via any remoting API we currently support. Now in this case, if malicious code is already running in the same JVM then security is breached anyway, and the code can make use of reflection to access internal details. So if there is any other possible security concern then I would like to discuss it.

Coming to use cases:

Usecase A - Image rendition generation - We have some bigger deployments where lots of images get uploaded to the repository, and there are some conversions (rendition generation) which are performed by OS specific native executables. Such programs work directly on a file handle. Without this change we currently need to first spool the file content into some temporary location and then pass that to the other program. This adds unnecessary overhead, and is something which can be avoided in case a FileDataStore is being used, where we can provide direct access to the file.

Usecase B - Efficient replication across regions in S3 - This is for an AEM based setup which is running on Oak with S3DataStore. There we have a global deployment where the author instance is running in one region and binary content is to be distributed to publish instances running in different regions. The DataStore size is huge, say 100TB, and for efficient operation we need to use binary-less replication. In most cases only a very small subset of the binary content would need to be present in other regions. The current way (via a shared DataStore) to support that would involve synchronizing the S3 bucket across all such regions, which would increase the storage cost considerably.
Instead of that, the plan is to replicate the specific assets via an S3 copy operation. This would ensure that big assets can be copied efficiently at the S3 level, and that would require direct access to the S3 object.

Again, in all such cases one can always resort to the current level of support, i.e. copy over all the content via an InputStream into some temporary store and then use that. But that would add considerable overhead when assets are of 100MB sizes or more. So the approach proposed would allow client code to do this efficiently depending on the underlying storage capability.

> To me sounds like breaching the JCR and NodeState layers to directly
> manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
> replication across different instances, but imho the right way to address
> that is extending one of the current DataStore implementations or create a
> new one.

The original proposed approach in OAK-1963 was like that, i.e. introduce this access method on BlobStore, which works on a reference. But in that case client code would need to deal with the BlobStore API. In either case access to the actual binary storage data would be required.

Chetan Mehrotra

On Thu, May 5, 2016 at 2:49 PM, Tommaso Teofili wrote:
> +1 to Francesco's concerns, exposing the location of a binary at the
> application level doesn't sound good from a security perspective.
> To me sounds like breaching the JCR and NodeState layers to directly
> manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
> replication across different instances, but imho the right way to address
> that is extending one of the current DataStore implementations or create a
> new one.
> I am also concerned that this Adaptable pattern would open room for other
> such hacks into the stack.
> > My 2 cents, > Tommaso > > > Il giorno gio 5 mag 2016 alle ore 11:00 Francesco Mari < > mari.france...@gmail.com> ha scritto: > > > This proposal introduces a huge leak of abstractions and has deep > security > > implications. > > > > I guess that the reason for this proposal is that some users of Oak would > > like to perform some operations on binaries in a more performant way by > > leveraging the way those binaries are stored. If this is the case, I > > suggest those users to evaluate an applicative solution implemented on > top > > of the JCR API. > > > > If a user needs to store some important binary data (files, images, etc.) > > in an S3 bucket or on the file system for performance reasons, this > > shouldn't affect how Oak handles blobs internally. If some assets are of > > special interest for the user, then the user should bypass Oak and take > > care of the storage of those assets directly. Oak can be used to store
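As background for Usecase A, the "zero copy" spooling that the original proposal refers to can be sketched as below. This is a hedged, generic illustration — ZeroCopySpool is not an Oak class — of what client code could do once it holds a direct File handle: FileChannel.transferTo lets the kernel move the bytes to a socket (or any writable channel) without copying them through user space.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

public class ZeroCopySpool {

    /** Spools the whole file to the target channel via transferTo,
     *  returning the number of bytes transferred. */
    public static long spool(File source, WritableByteChannel target) throws IOException {
        try (FileInputStream fis = new FileInputStream(source);
             FileChannel in = fis.getChannel()) {
            long size = in.size();
            long transferred = 0;
            // transferTo may move fewer bytes than requested; loop until done.
            while (transferred < size) {
                transferred += in.transferTo(transferred, size - transferred, target);
            }
            return transferred;
        }
    }
}
```

Without the File handle, the same operation has to go through an InputStream and an intermediate buffer, which is exactly the overhead Usecase A wants to avoid.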
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On Thu, May 5, 2016 at 4:38 PM, Francesco Mari wrote:
> The security concern is quite easy to explain: it's a bypass of our
> security model. Imagine that, using a session with the appropriate
> privileges, a user accesses a Blob and adapts it to a file handle, an S3
> bucket or a URL. This code passes this reference to another piece of code
> that modifies the data directly even if - in the same deployment - it
> shouldn't be able to access the Blob instance to begin with.

How is this different from the case where code obtains a Node via an admin session and passes that Node instance to other code which, say, deletes important content via it? In the end we have to trust the client code to do the correct thing when given the appropriate rights. So in the current proposal the code can only adapt the binary if the session has the expected permissions. Beyond that we need to trust the code to behave properly.

> In both the use case, the customer is coupling the data with the most
> appropriate storage solution for his business case. In this case, customer
> code - and not Oak - should be responsible for the management of that data.

Well, then it means that the customer implements their very own DataStore-like solution, and the application code does not make use of JCR Binary but instead uses another service to resolve the references. This would greatly reduce the usefulness of JCR for asset heavy applications which use JCR to manage binary content along with its metadata.

Chetan Mehrotra
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On Thu, May 5, 2016 at 5:07 PM, Francesco Mari wrote:
> This is a totally different thing. The change to the node will be committed
> with the privileges of the session that retrieved the node. If the session
> doesn't have enough privileges to delete that node, the node will not be
> deleted. There is no escape from the security model.

"Bad code", when passed a node backed by an admin session, can still do bad things, as the admin session has all the privileges. In the same way, if bad code is passed a file handle then it can cause issues. So I am still not sure about the attack vector which we are defending against.

Chetan Mehrotra
Re: API proposal for - Expose URL for Blob source (OAK-1963)
To highlight - as mentioned earlier, the user of the proposed API is tying itself to implementation details of Oak, and if those change later then that code would also need to be changed. Or as Ian summed it up:

> if the API is introduced it should create an out of band agreement with
> the consumers of the API to act responsibly.

The method is to be used for those important cases where you do rely on implementation detail to get optimal performance in very specific scenarios. It's like DocumentNodeStore making use of some Mongo specific API to perform some critical operation to achieve better performance by checking if the underlying DocumentStore is Mongo based.

I have seen the discussion of JCR-3534 and other related issues but still do not see any conclusion on how to answer such queries where direct access to blobs is required for performance reasons. This issue is not about exposing the blob reference for remote access, but more about an optimal path for in-VM access.

> who owns the resource? Who coordinates (concurrent) access to it and how?
> What are the correctness and performance implications here (races,
> deadlock, corruptions, JCR semantics)?

The client code would need to be implemented in a proper way. It's more like implementing a CommitHook. If implemented in an incorrect way it would cause issues, deadlocks etc. But then we assume that anyone implementing that interface would take proper care in the implementation.

> it limits implementation freedom and hinders further evolution
> (chunking, de-duplication, content based addressing, compression, gc, etc.)
> for data stores.

As mentioned earlier, some parts of an API indicate a closer dependency on how things work (like an SPI, or a ConsumerType API in OSGi terms). By using such an API client code definitely ties itself to Oak implementation details, but that should not limit how the Oak implementation detail evolves. When it changes, client code needs to adapt itself accordingly.
Oak can express that by incrementing the minor version of the exported package to indicate a change in behavior.

> bypassing JCR's security model

I do not yet see the attack vector which we need to defend against differently here. Again, the blob URL is not being exposed, say, as part of WebDAV or any other remote call. So I would like to understand the security concern better here (unless it is defending against malicious or badly implemented client code, which we discussed above).

> Can't we come up with an API that allows the blobs to stay under control of Oak?

The code needs to work either at the OS level, say with a file handle, or say with an S3 object. So I do not see a way where it can work without having access to those details.

FWIW there is code out there which reverse engineers the blobId to access the actual binary. People do it so as to get decent throughput in image rendition logic for large scale deployments. The proposal here was to formalize that approach by providing a proper API. If we do not provide such an API then the only way for them would be to continue relying on reverse engineering the blobId!

> If not, this is probably an indication that those blobs shouldn't go into Oak but just references to it as Francesco already proposed. Anything else is neither fish nor fowl: you can't have the JCR goodies but at the same time access underlying resources at will.

That's a fine argument to make. But the users here have real problems to solve which we should not ignore. Oak based systems are being proposed for large asset deployments where one of the primary requirements is asset handling/processing of hundreds of TB of binary data. So we would then have to recommend for such cases to not use the JCR Binary abstraction and manage the binaries on your own. That would then solve both the problems (though it might break lots of tooling built on top of the JCR API to manage those binaries)!
Thinking more - another approach that I can suggest is that people implement their own BlobStore (maybe by extending ours) and provide this API there, i.e. one which takes a blob id and provides the required details. This way we "outsource" the problem. Would that be acceptable?

Chetan Mehrotra

On Mon, May 9, 2016 at 2:28 PM, Michael Dürig wrote:
>
> Hi,
>
> I very much share Francesco's concerns here. Unconditionally exposing
> access to operation system resources underlying Oak's inner working is
> troublesome for various reasons:
>
> - who owns the resource? Who coordinates (concurrent) access to it and
> how? What are the correctness and performance implications here (races,
> deadlock, corruptions, JCR semantics)?
>
> - it limits implementation freedom and hinders further evolution
> (chunking, de-duplication, content based addressing, compression, gc, etc.)
> for data stores.
>
> - bypassing JCR's security model
>
> Pretty much all of this has been discussed in the scope of
> https://issues.apache.org/jira/browse/JCR-3534 and
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Had an offline discussion with Michael on this and explained the use case requirements in more detail. One concern that has been raised is that such a generic adaptTo API is too inviting for improper use, and Oak does not have any context around when this URL is exposed and for how long it is used.

So instead of having a generic adaptTo API at the JCR level we can have a BlobProcessor callback (Approach #B). Below is more of a strawman proposal. Once we have a consensus we can go over the details.

interface BlobProcessor {
    void process(AdaptableBlob blob);
}

Where AdaptableBlob is

public interface AdaptableBlob {
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

The BlobProcessor instance can be passed via the BlobStore API. So the client would look for a BlobStore service (i.e. use the Oak level API) and pass it the ContentIdentity of the JCR Binary, aka the blobId:

interface BlobStore {
    void process(String blobId, BlobProcessor processor);
}

The approach ensures:

1. That any blob handle exposed is only guaranteed for the duration of the 'process' invocation
2. That there is no guarantee on the utility of the blob handle (File, S3 Object) beyond the callback. So one should not collect the passed File handle for later use.

Hopefully this should address some of the concerns raised in this thread. Looking forward to feedback :)

Chetan Mehrotra

On Mon, May 9, 2016 at 6:24 PM, Michael Dürig wrote:
>
> On 9.5.16 11:43 , Chetan Mehrotra wrote:
>
>> To highlight - As mentioned earlier the user of proposed api is tying
>> itself to implementation details of Oak and if this changes later then
>> that code would also need to be changed. Or as Ian summed it up
>>
>>> if the API is introduced it should create an out of band agreement with
>>> the consumers of the API to act responsibly.
>>
> So what does "to act responsibly" actually means? Are we even in a
> position to precisely specify this?
Experience tells me that we only find > out about those semantics after the fact when dealing with painful and > expensive customer escalations. > > And even if we could, it would tie Oak into very tight constraints on how > it has to behave and how not. Constraints that would turn out prohibitively > expensive for future evolution. Furthermore a huge amount of resources > would be required to formalise such constraints via test coverage to guard > against regressions. > > > >> The method is to be used for those important case where you do rely on >> implementation detail to get optimal performance in very specific >> scenarios. Its like DocumentNodeStore making use of some Mongo specific >> API >> to perform some important critical operation to achieve better performance >> by checking if the underlying DocumentStore is Mongo based. >> > > Right, but the Mongo specific API is a (hopefully) well thought through > API where as with your proposal there are a lot of open questions and > concerns as per my last mail. > > Mongo (and any other COTS DB) for good reasons also don't give you direct > access to its internal file handles. > > > >> I have seen discussion of JCR-3534 and other related issue but still do >> not >> see any conclusion on how to answer such queries where direct access to >> blobs is required for performance aspect. This issue is not about exposing >> the blob reference for remote access but more about optimal path for in VM >> access >> > > One bottom line of the discussions in that issue is that we came to a > conclusion after clarifying the specifics of the use case. Something I'm > still missing here. The case you brought forward is too general to serve as > a guideline for a solution. Quite to the contrary, to me it looks like a > solution to some problem (I'm trying to understand). > > > >> who owns the resource? Who coordinates (concurrent) access to it and how? 
>>> >> What are the correctness and performance implications here (races, >> deadlock, corruptions, JCR semantics)? >> >> The client code would need to be implemented in a proper way. Its more >> like >> implementing a CommitHook. If implemented in incorrect way it would cause >> issues deadlocks etc. But then we assume that any one implementing that >> interface would take proper care in implementation. >> > > But a commit hook is an internal SPI. It is not advertised to the whole > world as a public API. > > > >> it limits implementation freedom and hinders further evolution >>> >> (chunking, de-duplication, content based addressing, compression, gc, >> etc.) >> for data stores. >> >> As mentioned earlier. Some part of API indicates a closer depend
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On Mon, May 9, 2016 at 8:27 PM, Ian Boston wrote:
> I thought the consumers of this api want things like the absolute path of
> the File in the BlobStore, or the bucket and key of the S3 Object, so that
> they could transmit it and use it for processing independently of Oak
> outside the callback ?

Most cases can still be done, just do it within the callback:

blobStore.process("xxx", new BlobProcessor() {
    public void process(AdaptableBlob blob) {
        File file = blob.adaptTo(File.class);
        transformImage(file);
    }
});

Doing this within the callback would allow Oak to enforce some safeguards (more on that in the next mail) and still allows the user to perform optimal binary processing.

Chetan Mehrotra
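To show how the callback-scoped lifetime could actually be enforced, here is a strawman sketch of a file-backed store implementing the proposed process() API. All names here (SimpleFileBlobStore, the open flag) are illustrative assumptions, not Oak code — the point is only that the handle can be invalidated the moment the callback returns:

```java
import java.io.File;

// The strawman interfaces from the proposal.
interface BlobProcessor {
    void process(AdaptableBlob blob);
}

interface AdaptableBlob {
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

// Hypothetical file-backed store enforcing the callback boundary.
class SimpleFileBlobStore {
    private final File root;

    SimpleFileBlobStore(File root) {
        this.root = root;
    }

    void process(String blobId, BlobProcessor processor) {
        final File file = new File(root, blobId); // resolve blobId to its backing file
        final boolean[] open = { true };          // handle valid only inside the callback
        AdaptableBlob blob = new AdaptableBlob() {
            @Override
            public <AdapterType> AdapterType adaptTo(Class<AdapterType> type) {
                if (!open[0]) {
                    throw new IllegalStateException("blob handle used outside process()");
                }
                return type == File.class ? type.cast(file) : null;
            }
        };
        try {
            processor.process(blob);
        } finally {
            open[0] = false; // no guarantee on the handle beyond the callback
        }
    }
}
```

A client that leaks the AdaptableBlob out of the callback and adapts it later gets an exception instead of a stale file handle.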
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Some more points around the proposed callback based approach:

1. Possible security, or enforcing read-only access to the exposed file - The file provided within the BlobProcessor callback can be a symlink created with an OS user account which only has read-only access. The symlink can be removed once the callback returns.

2. S3DataStore security concern - For S3DataStore we would only be exposing the S3 object identifier, and the client code would still need the AWS credentials to connect to the bucket and perform the required copy operation.

3. Possibility of further optimization in S3DataStore processing - Currently when reading a binary from S3DataStore the binary content is *always* spooled to some local temporary file (in the local cache) and then an InputStream is opened on that file. So even if the code needs to read only the initial few bytes of the stream, the whole file would have to be read. This happens because with the current JCR Binary API we are not in control of the lifetime of the exposed InputStream. So if, say, we expose the InputStream we cannot determine until when the backing S3 SDK resources need to be held.

Also, the current S3DataStore always creates a local copy. With a callback based approach we can safely expose this file, which would allow layers above to avoid spooling the content again locally for processing. And with the callback boundary we can later do the required cleanup.

Chetan Mehrotra

On Mon, May 9, 2016 at 7:15 PM, Chetan Mehrotra wrote:
> Had an offline discussion with Michael on this and explained the usecase
> requirement in more details. One concern that has been raised is that such
> a generic adaptTo API is too inviting for improper use and Oak does not
> have any context around when this url is exposed for what time it is used.
>
> So instead of having a generic adaptTo API at JCR level we can have a
> BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
> Once we have a consensus then we can go over the details > > interface BlobProcessor { >void process(AdaptableBlob blob); > } > > Where AdaptableBlob is > > public interface AdaptableBlob { > AdapterType adaptTo(Class type); > } > > The BlobProcessor instance can be passed via BlobStore API. So client > would look for a BlobStore service (so use the Oak level API) and pass it > the ContentIdentity of JCR Binary aka blobId > > interface BlobStore{ > void process(String blobId, BlobProcessor processor) > } > > The approach ensures > > 1. That any blob handle exposed is only guaranteed for the duration > of 'process' invocation > 2. There is no guarantee on the utility of blob handle (File, S3 Object) > beyond the callback. So one should not collect the passed File handle for > later use > > Hopefully this should address some of the concerns raised in this thread. > Looking forward to feedback :) > > Chetan Mehrotra > > On Mon, May 9, 2016 at 6:24 PM, Michael Dürig wrote: > >> >> >> On 9.5.16 11:43 , Chetan Mehrotra wrote: >> >>> To highlight - As mentioned earlier the user of proposed api is tying >>> itself to implementation details of Oak and if this changes later then >>> that >>> code would also need to be changed. Or as Ian summed it up >>> >>> if the API is introduced it should create an out of band agreement with >>>> >>> the consumers of the API to act responsibly. >>> >> >> So what does "to act responsibly" actually means? Are we even in a >> position to precisely specify this? Experience tells me that we only find >> out about those semantics after the fact when dealing with painful and >> expensive customer escalations. >> >> And even if we could, it would tie Oak into very tight constraints on how >> it has to behave and how not. Constraints that would turn out prohibitively >> expensive for future evolution. Furthermore a huge amount of resources >> would be required to formalise such constraints via test coverage to guard >> against regressions. 
>> >> >> >>> The method is to be used for those important case where you do rely on >>> implementation detail to get optimal performance in very specific >>> scenarios. Its like DocumentNodeStore making use of some Mongo specific >>> API >>> to perform some important critical operation to achieve better >>> performance >>> by checking if the underlying DocumentStore is Mongo based. >>> >> >> Right, but the Mongo specific API is a (hopefully) well thought through >> API where as with your proposal there are a lot of open questions and >> concerns as per my last mail. >> >> Mongo (and any other COTS DB) for good reasons
Re: API proposal for - Expose URL for Blob source (OAK-1963)
> what guarantees do/can we give re. this file handle within this context. Can it suddenly go away (e.g. because of gc or internal re-organisation)? How do we establish, test and maintain (e.g. from regressions) such guarantees?

Logically it should not go away suddenly. So the GC logic should be aware of such "inUse" instances (there is already such support for inUse cases). Such a requirement can be validated via an integration testcase.

> and more concerningly, how do we protect Oak from data corruption by misbehaving clients? E.g. clients writing on that handle or removing it? Again, if this is public API we need ways to test this.

I am not sure what is meant by a misbehaving client - is it malicious (by design) or badly written code? For the latter, yes, that might pose a problem, but we can have some defense. I would expect the code making use of the API to behave properly. In addition, as proposed above [1], for FileDataStore we can provide a symlinked file reference which exposes a read-only file handle. For S3DataStore the code would need access to the AWS credentials to perform any write operation, which should be a sufficient defense.

> In an earlier mail you quite fittingly compared this to commit hooks, which for good reason are an internal SPI.

A bit of a nitpick here ;) As per the Jcr class [2] one can provide a CommitHook instance, so I am not sure we can term it internal. However, the point that I wanted to emphasize is that Oak does provide some critical extension points, and with misbehaving code one can shoot oneself in the foot; as an implementation only so much can be done.

regards
Chetan Mehrotra

[1] http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:237kzuhor5y3tpli+state:results
[2] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/Jcr.java#L190
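The symlink idea mentioned above can be sketched as follows. This is a hedged illustration only — SymlinkScope is not an Oak class, and the actual read-only enforcement via a separate OS account is an OS-level concern that plain Java cannot demonstrate. What the sketch does show is the lifecycle: expose a symlink to the datastore file for the duration of a callback, then remove it:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

public class SymlinkScope {

    /** Creates a symlink to target inside linkDir, hands it to the
     *  callback, and removes it once the callback returns. */
    public static void withSymlink(Path target, Path linkDir, Consumer<Path> callback)
            throws IOException {
        Path link = linkDir.resolve("blob-" + System.nanoTime()); // hypothetical naming scheme
        Files.createSymbolicLink(link, target);
        try {
            callback.accept(link);
        } finally {
            Files.deleteIfExists(link); // symlink gone once the callback returns
        }
    }
}
```

Note that Files.createSymbolicLink requires appropriate OS privileges (and is not supported on all platforms), so a real implementation would need a fallback.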
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi Angela,

On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber wrote:
> Quite frankly I would very much appreciate if took the time to collect
> and write down the required (i.e. currently known and expected)
> functionality.
>
> Then look at the requirements and look what is wrong with the current
> API that we can't meet those requirements:
> - is it just missing API extensions that can be added with moderate effort?
> - are there fundamental problems with the current API that we needed to
> address?
> - maybe we even have intrinsic issues with the way we think about the role
> of the repo?
>
> IMHO, sticking to kludges might look promising on a short term but
> I am convinced that we are better off with a fundamental analysis of
> the problems... after all the Binary topic comes up on a regular basis.
> That leaves me with the impression that yet another tiny extra and
> adaptables won't really address the core issues.

Makes sense. Have a look at the initial mail in the thread at [1], which talks about the 2 use cases I know of. The image rendition use case manifests itself in one form or another, basically requiring access for native programs via a file path reference.

The approach proposed so far would be able to address them, and hence is closer to "is it just missing API extensions that can be added with moderate effort?". If there is any other approach with which we can address both of the referred use cases then we can implement that.

Let me know if more details are required. If needed I can put it up on a wiki page also.

Chetan Mehrotra
[1] http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:zv5dzsgmoegupd7l+state:results
Usecases around Binary handling in Oak
Hi Team,

Recently we had a discussion around a new API proposal for binary access [1]. From the discussion it was determined that we should first have a collection of the kinds of use cases which cannot be easily met by the current JCR Binary support in Oak, so as to get a better understanding of the various requirements. That would help us in coming up with a proper solution to enable such use cases going forward.

To move forward on that I have tried to collect the various use cases at [2] which I have seen in the past:

UC1 - Processing a binary in JCR with a native library that only has access to the file system
UC2 - Efficient replication across regions in S3
UC3 - Text extraction without a temporary file with Tika
UC4 - Spooling the binary content to a socket output via NIO
UC5 - Transferring the file to FileDataStore with minimal overhead
UC6 - S3 import
UC7 - Random write access in binaries
UC8 - X-SendFile

I would like to get the team's feedback on the various use cases and then come up with the list of use cases which we would like to properly support in Oak. Once that is determined we can discuss the possible solutions and decide on how it gets finally implemented.

Kindly provide your feedback!

Chetan Mehrotra
[1] http://markmail.org/thread/6mq4je75p64c5nyn
[2] https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase
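As a concrete illustration of UC7 (random write access in binaries): with a direct file handle, a client could patch a byte range in place via FileChannel instead of rewriting the whole binary through an InputStream. This is a generic, hypothetical sketch — RandomPatch is not part of any Oak API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RandomPatch {

    /** Overwrites the bytes starting at offset with data, leaving the
     *  rest of the file untouched (no full rewrite). */
    public static void patch(Path file, long offset, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                offset += ch.write(buf, offset); // positional write; loop in case of partial writes
            }
        }
    }
}
```

Through the plain JCR Binary API the same change would require streaming the entire binary out, modifying it, and storing it back.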
Re: API proposal for - Expose URL for Blob source (OAK-1963)
I have started a new mail thread around "Usecases around Binary handling in Oak" so as to first collect the kinds of use cases we need to support. Once we decide that, we can discuss the possible solutions. So let's continue the discussion on that thread.

Chetan Mehrotra

On Tue, May 17, 2016 at 12:31 PM, Angela Schreiber wrote:
> Hi Oak-Devs
>
> Just for the record: This topic has been discussed in an Adobe
> internal Oak-coordination call last Wednesday.
>
> Michael Marth first provided some background information and
> we discussed the various concerns mentioned in this thread
> and tried to identify the core issue(s).
>
> Marcel, Michael Duerig and Thomas proposed alternative approaches
> on how to address the original issues that lead to the API
> proposal, which all would avoid leaking out information about
> the internal blob handling.
>
> Unfortunately we ran out of time and didn't conclude the call
> with an agreement on how to proceed.
>
> From my perception the concerns raised here could not be resolved
> by the additional information.
>
> I would suggest that we try to continue the discussion here
> on the list. Maybe with a summary of the alternative proposals?
>
> Kind regards
> Angela
>
> On 11/05/16 15:38, "Ian Boston" wrote:
>
> >Hi,
> >
> >On 11 May 2016 at 14:21, Marius Petria wrote:
> >
> >> Hi,
> >>
> >> I would add another use case in the same area, even if it is more
> >> problematic from the point of view of security. To better support load
> >> spikes an application could return 302 redirects to (signed) S3 urls
> >> such that binaries are fetched directly from S3.
> >>
> >
> >Perhaps that question exposes the underlying requirement for some
> >downstream users.
> > > >This is a question, not a statement: > > > >If the application using Oak exposed a RESTfull API that had all the same > >functionality as [1], and was able to perform at the scale of S3, and had > >the same security semantics as Oak, would applications that are needing > >direct access to S3 or a File based datastore be able to use that API in > >preference ? > > > >Is this really about issues with scalability and performance rather than a > >fundamental need to drill deep into the internals of Oak ? If so, > >shouldn't > >the scalability and performance be fixed ? (assuming its a real concern) > > > > > > > > > >> > >> (if this can already be done or you think is not really related to the > >> other two please disregard). > >> > > > >AFAIK this is not possible at the moment. If it was deployments could use > >nginX X-SendFile and other request offloading mechanisms. > > > >Best Regards > >Ian > > > > > >1 http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html > > > > > >> > >> Marius > >> > >> > >> > >> On 5/11/16, 1:41 PM, "Angela Schreiber" wrote: > >> > >> >Hi Chetan > >> > > >> >IMHO your original mail didn't write down the fundamental analysis > >> >but instead presented the solution for every the 2 case I was > >> >lacking the information _why_ this is needed. > >> > > >> >Both have been answered in private conversions only (1 today in > >> >the oak call and 2 in a private discussion with tom). And > >> >having heard didn't make me more confident that the solution > >> >you propose is the right thing to do. > >> > > >> >Kind regards > >> >Angela > >> > > >> >On 11/05/16 12:17, "Chetan Mehrotra" > wrote: > >> > > >> >>Hi Angela, > >> >> > >> >>On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber > >> >>wrote: > >> >> > >> >>> Quite frankly I would very much appreciate if took the time to > >>collect > >> >>> and write down the required (i.e. currently known and expected) > >> >>> functionality. 
> >> >>> > >> >>> Then look at the requirements and look what is wrong with the > >>current > >> >>> API that we can't meet those requirements: > >> >>> - is it just missing API extensions that can be added with moderate > >> >>>effort? > >> >>> - are there fundamental problems with the current API that we > >>needed to > >>
Requirement to support multiple NodeStore instance in same setup (OAK-4490)
Hi Team, As part of the OAK-4180 feature around using another NodeStore as a local cache for a remote Document store, I would need to register another NodeStore instance (for now a SegmentNodeStore - OAK-4490) with the OSGi service registry. This instance would then be used by SecondaryStoreCacheService to save NodeStates under certain paths locally and use them later for reads. With this change we would have multiple NodeStore instances in the same service registry. This can confuse components which have a dependency on NodeStore as a reference, and we need to ensure they bind to the correct NodeStore instance.

Proposal A - Use a 'type' service property to distinguish
==
Register the NodeStore with a 'type' property; for now the value can be 'primary' or 'secondary'. Whenever a component registers the NodeStore it also provides the type property, and on the consumer side the reference specifies which type of NodeStore it needs to be bound to. This would ensure that users of NodeStore get bound to the correct type. (If we instead used service.ranking, it could cause a race condition where the secondary instance gets bound until the primary comes up.)

Looking for feedback on what approach to take. Chetan Mehrotra
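A self-contained model of Proposal A (this mocks the registry with a plain list; in real code the property would go into the Dictionary passed to BundleContext.registerService, and the consumer would use a Declarative Services reference target filter such as "(type=primary)"):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TypePropertyDemo {
    interface NodeStore {}

    static class Registration {
        final NodeStore service;
        final Map<String, Object> props;
        Registration(NodeStore s, Map<String, Object> p) { service = s; props = p; }
    }

    // Consumers select by the explicit 'type' property, as in Proposal A,
    // instead of relying on service.ranking ordering.
    static NodeStore lookup(List<Registration> registry, String type) {
        for (Registration r : registry) {
            if (type.equals(r.props.get("type"))) {
                return r.service;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        NodeStore primary = new NodeStore() {};
        NodeStore secondary = new NodeStore() {};
        List<Registration> registry = new ArrayList<>();
        // the secondary registers first: with service.ranking a consumer could
        // race and bind to it; with an explicit type filter it cannot
        registry.add(new Registration(secondary, Map.of("type", "secondary")));
        registry.add(new Registration(primary, Map.of("type", "primary")));
        System.out.println(lookup(registry, "primary") == primary); // true
    }
}
```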
Re: Requirement to support multiple NodeStore instance in same setup (OAK-4490)
On Tue, Jun 21, 2016 at 4:52 PM, Julian Sedding wrote: > Not exposing the secondary NodeStore in the service registry would be > backwards compatible. Introducing the "type" property potentially > breaks existing consumers, i.e. is not backwards compatible. I had a similar concern, so I proposed a new interface as part of OAK-4369. However, further discussion made me realize that we might have a similar requirement going forward, i.e. the presence of multiple NodeStore implementations, so it might be better to make the setup handle such cases. So at this stage we have 2 options: 1. Use a new interface to expose such "secondary" NodeStores 2. OR use a new service property to distinguish between different roles. Not sure which way to go. Maybe we go for a merged approach, i.e. have a new interface as in #1 but also mandate that it provides its "role/type" as a service property to allow clients to select the correct one. Thoughts? Chetan Mehrotra
Re: Requirement to support multiple NodeStore instance in same setup (OAK-4490)
Okie would go with SecondaryNodeStoreProvider approach and also have a role property for that. For now this interface would live in plugins package and exported as it needs to be used in oak-segment and oak-segment-tar. Later we can decide if we need to move it to SPI package as supported extension point Chetan Mehrotra On Wed, Jun 22, 2016 at 4:44 PM, Stefan Egli wrote: > On 22/06/16 12:21, "Chetan Mehrotra" wrote: > >>On Tue, Jun 21, 2016 at 4:52 PM, Julian Sedding >>wrote: >>> Not exposing the secondary NodeStore in the service registry would be >>> backwards compatible. Introducing the "type" property potentially >>> breaks existing consumers, i.e. is not backwards compatible. >> >>I had similar concern so proposed a new interface as part of OAK-4369. >>However later with further discussion realized that we might have >>similar requirement going forward i.e. presence of multiple NodeStore >>impl so might be better to make setup handle such case. >> >>So at this stage we have 2 options >> >>1. Use a new interface to expose such "secondary" NodeStore >>2. OR Use a new service property to distinguish between different roles >> >>Not sure which one to go. May be we go for merged i.e. have a new >>interface as in #1 but also mandate that it provides its "role/type" >>as a service property to allow client to select correct one >> >>Thoughts? > > If the 'SecondaryNodeStoreProvider' is a non-public interface which can > later 'easily' be replaced with another mechanism, then for me this would > sound more straight forward at this stage as it would not break any > existing consumers (as mentioned by Julian). > > Perhaps once those 'other use cases going forward' of multiple NodeStores > become more clear, then it might be more obvious as to how the > generalization into perhaps a type property should look like. > > my 2cents, > Cheers, > Stefan > >
Re: [VOTE] Release Apache Jackrabbit Oak 1.4.4
On Mon, Jun 27, 2016 at 10:43 AM, Amit Jain wrote: [X] +1 Release this package as Apache Jackrabbit Oak 1.4.4 Chetan Mehrotra
Re: [Oak origin/1.4] Apache Jackrabbit Oak matrix - Build # 992 - Still Failing
On Sat, Jun 25, 2016 at 10:24 AM, Apache Jenkins Server wrote: > Caused by: java.lang.IllegalArgumentException: No enum constant > org.apache.jackrabbit.oak.commons.FixturesHelper.Fixture.SEGMENT_TAR > at java.lang.Enum.valueOf(Enum.java:238) > at > org.apache.jackrabbit.oak.commons.FixturesHelper$Fixture.valueOf(FixturesHelper.java:45) > at > org.apache.jackrabbit.oak.commons.FixturesHelper.(FixturesHelper.java:58) The tests are failing due to the above issue. Is this related to the presence of the new segment-tar module in trunk but not in the branch? Chetan Mehrotra
Re: [Oak origin/1.4] Apache Jackrabbit Oak matrix - Build # 992 - Still Failing
Thanks for the link. Would followup on the issue and have it fixed in branches Chetan Mehrotra On Mon, Jun 27, 2016 at 5:11 PM, Julian Reschke wrote: > On 2016-06-27 13:31, Chetan Mehrotra wrote: >> >> On Sat, Jun 25, 2016 at 10:24 AM, Apache Jenkins Server >> wrote: >>> >>> Caused by: java.lang.IllegalArgumentException: No enum constant >>> org.apache.jackrabbit.oak.commons.FixturesHelper.Fixture.SEGMENT_TAR >>> at java.lang.Enum.valueOf(Enum.java:238) >>> at >>> org.apache.jackrabbit.oak.commons.FixturesHelper$Fixture.valueOf(FixturesHelper.java:45) >>> at >>> org.apache.jackrabbit.oak.commons.FixturesHelper.(FixturesHelper.java:58) >> >> >> The test are failing due to above issue. Is this related to presence >> of new segment-tar module in trunk but not in branch? >> >> Chetan Mehrotra > > > -> <https://issues.apache.org/jira/browse/OAK-4475>
[multiplex] - Review the proposed SPI interface MountInfoProvider and Mount for OAK-3404
Hi Team, As we start integrating the multiplexing support work into trunk, I would like your thoughts on the new SPI interface MountInfoProvider [1] being proposed as part of OAK-3404. It would be used by various parts of Oak to determine mount information. Kindly provide your feedback on the issue. Chetan Mehrotra [1] https://github.com/rombert/jackrabbit-oak/tree/features/docstore-multiplex/oak-core/src/main/java/org/apache/jackrabbit/oak/spi/mount
Re: svn commit: r1750601 - in /jackrabbit/oak/trunk: oak-segment-tar/ oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/ oak-segment-tar/src/test/java/org/apache/jackrabbit/oak/segment/
Hi Francesco, On Wed, Jun 29, 2016 at 12:49 PM, Francesco Mari wrote: > Please do not change the "oak.version" property to a snapshot version. If > your change relies on code that is only available in the latest snapshot of > Oak, please revert this commit and hold it back until a proper release of > Oak is performed. I can do that, but I want to understand the impact of switching to a SNAPSHOT version. For example, in the past when we made changes in Jackrabbit that were needed in Oak, we switched to a snapshot version of JR2 and later reverted to the released version once the JR2 release was done. That has worked fine so far and we did not have to hold back the feature work for it. So I want to understand why it should be different here. Chetan Mehrotra
Re: svn commit: r1750601 - in /jackrabbit/oak/trunk: oak-segment-tar/ oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/ oak-segment-tar/src/test/java/org/apache/jackrabbit/oak/segment/
On Wed, Jun 29, 2016 at 1:25 PM, Francesco Mari wrote: > oak-segment-tar should be releasable at any time. If I had to launch a quick patch release this morning, I would have to either revert your commit or postpone my release until Oak is released. Given the current release frequency on trunk (every 2 weeks) I do not think it should be a big problem, and holding off commits breaks continuity and increases work. But then that might just be an issue for me! For now I have reverted the changes from oak-segment-tar. Chetan Mehrotra
OAK-4475 - CI failing on branches due to unknown fixture SEGMENT_TAR
Hi Team, Some time back the build was failing for branches because of a trunk-only fixture usage of SEGMENT_TAR. As this fixture was not present on the branch, it caused the build to fail. My initial attempt to fix this was to ignore the exception when FixturesHelper resolves an enum like SEGMENT_TAR on a branch [1]. With this the build passes, but I have a hunch that the current fix would lead to all fixtures getting activated, and that would waste time.

A - Which solution to use
==
So we have 2 options:
1. Treat SEGMENT_TAR as SEGMENT_MK for branches - this would cause the tests to run twice against SEGMENT_MK
2. Create a separate build profile for branches

B - Use of the nsfixtures system property
==
However, before doing that I am trying to understand how the fixture gets set. From the CI logs the command that gets fired is --- /home/jenkins/tools/maven/apache-maven-3.2.1/bin/mvn -Dnsfixtures=DOCUMENT_NS -Dlabel=Ubuntu -Djdk=jdk1.8.0_11 -Dprofile=integrationTesting clean verify -PintegrationTesting -Dsurefire.skip.ut=true -Prdb-derby -DREMOVEMErdb.jdbc- --- It sets the system property 'nsfixtures' to the required fixture. However, in our parent pom we rely on the system property 'fixtures', which defaults to SEGMENT_MK, and nowhere in our CI do we override 'fixtures'. Looking at all this, it appears to me that currently all tests only run against the SEGMENT_MK fixture and the other fixtures are not getting used. But then the exception should not have occurred with the usage of SEGMENT_TAR, so I am missing some connection in the build process. From my test it appears that if we specify a system property on the mvn command line and the same property is configured in maven-surefire-plugin, then the property specified on the command line is used and the one in pom.xml is ignored. That would explain why the settings in pom.xml are not used for fixtures. So what should we opt for in #A? My vote would be for A1!
Chetan Mehrotra
[1] https://github.com/apache/jackrabbit-oak/commit/319433e9400429592065d4b3997dd31f93b6c549
[2] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-parent/pom.xml#L289 - the maven-failsafe-plugin configuration there, whose values include ${test.opts}, ${known.issues}, ${mongo.host}, ${mongo.port}, ${mongo.db}, ${mongo.db2}, ${fixtures} and ${project.build.directory}/derby.log
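The branch fix in [1] essentially boils down to tolerating unknown enum names when parsing the fixtures value. A minimal sketch of that lenient parsing (enum values and method shape are illustrative; the real FixturesHelper differs):

```java
import java.util.EnumSet;
import java.util.Set;

public class FixtureParsing {
    // A branch-side view of the fixtures: no SEGMENT_TAR constant exists here.
    enum Fixture { DOCUMENT_NS, SEGMENT_MK, DOCUMENT_RDB }

    // Parse a comma/space separated fixture list, skipping names unknown on
    // this branch instead of failing with IllegalArgumentException.
    static Set<Fixture> parse(String value) {
        Set<Fixture> result = EnumSet.noneOf(Fixture.class);
        for (String name : value.split("[,\\s]+")) {
            try {
                result.add(Fixture.valueOf(name.trim().toUpperCase()));
            } catch (IllegalArgumentException ignored) {
                // unknown fixture (e.g. SEGMENT_TAR on the 1.4 branch) - skip it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("SEGMENT_TAR,DOCUMENT_NS")); // [DOCUMENT_NS]
    }
}
```

Note that the concern above still applies: if skipping unknown names leaves the set empty, a fallback to "all fixtures" would waste build time, so the fallback behaviour needs care.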
Re: svn commit: r1750809 - /jackrabbit/oak/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LucenePropertyIndex.java
Hi Tommaso, On Thu, Jun 30, 2016 at 8:20 PM, wrote: > Modified: > > jackrabbit/oak/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LucenePropertyIndex.java Can we have some backing testcase for this? It would ensure future refactoring does not break this requirement Chetan Mehrotra
Re: multilingual content and indexing
On Tue, Jul 12, 2016 at 3:53 PM, Lukas Kahwe Smith wrote: >> Alternatively, you can create different index definitions for each subtree >> (see [1]), e.g. Using the “includedPaths” property. This would lead to >> smaller indexes at the downside that you would have to create an index >> definition if you add a new language tree. Another way would be to have your index definition under each language node:

/content/en/oak:index/fooIndex
/content/jp/oak:index/fooIndex

and have each index's analyzer configured per the respective language. Chetan Mehrotra
Re: svn commit: r1752601 - in /jackrabbit/oak/trunk/oak-segment-tar: pom.xml src/main/java/org/apache/jackrabbit/oak/segment/SegmentWriter.java
On Thu, Jul 14, 2016 at 2:04 PM, wrote: > > +commons-math3 commons-math is a 2.1 MB jar. Would it be possible to avoid embedding it whole and instead embed/copy only the parts needed? (See [1] for an example.) Chetan Mehrotra [1] https://issues.apache.org/jira/browse/SLING-2361
[proposal] New oak:Resource nodetype as alternative to nt:resource
In most cases where code uses JcrUtils.putFile [1] it leads to the creation of the content structure below:

+ foo.jpg (nt:file)
  + jcr:content (nt:resource)
    - jcr:data

Due to the usage of nt:resource, each nt:file node creates an entry in the uuid index, as nt:resource is referenceable [2]. So if a system has 1M nt:file nodes, we would have 1M entries in /oak:index/uuid, since in most cases the files are created via [1] and hence all such files are referenceable. The nodetype definition for nt:file [3] does not mandate that jcr:content be nt:resource. So should we register a new oak:Resource nodetype which is the same as nt:resource but not referenceable? This would be similar to oak:Unstructured. Also, what should we do for [1]? Should we provide an overloaded method which also accepts a nodetype for the jcr:content node, as it cannot use oak:Resource? Chetan Mehrotra [1] https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-jcr-commons/src/main/java/org/apache/jackrabbit/commons/JcrUtils.java#L1062 [2] [nt:resource] > mix:lastModified, mix:mimeType, mix:referenceable primaryitem jcr:data - jcr:data (binary) mandatory [3] [nt:file] > nt:hierarchyNode primaryitem jcr:content + jcr:content (nt:base) mandatory
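Based on the nt:resource definition in [2], the proposed oak:Resource would simply drop mix:referenceable. A CND sketch (the final definition would be settled in the follow-up issue):

```
[oak:Resource]
  > mix:lastModified, mix:mimeType
  primaryitem jcr:data
  - jcr:data (binary) mandatory
```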
Re: [proposal] New oak:Resource nodetype as alternative to nt:resource
Thanks for the feedback. Opened OAK-4567 to track the change. On Mon, Jul 18, 2016 at 12:14 PM, Angela Schreiber wrote: > Additionally or alternatively we could create a separate method (e.g. > putOakFile > or putOakResource or something explicitly mentioning the non-referenceable > nature of the content) that uses 'oak:Resource' and state that it requires > the > node type to be registered and will fail otherwise... that would be as easy > to use as 'putFile', which is IMO important. @Angela - What about Justin's later suggestion of changing the current putFile implementation: have it use oak:Resource if present, otherwise fall back to nt:resource? This could lead to a compatibility issue though, as the javadoc of putFile says it would use nt:resource. Chetan Mehrotra
Specifying threadpool name for periodic scheduled jobs (OAK-4563)
Hi Team, While running Oak in Sling we rely on the Sling Scheduler [1] to execute periodic jobs. By default the Sling Scheduler uses a pool of 5 threads to run all such periodic jobs in the system. Recently we saw an issue, OAK-4563, where for some reason the pool got exhausted for a long time; this prevented the async indexing job from running for a long time and hence affected query results. To address that, Sling now provides a new option (SLING-5831) where one can specify the pool name to be used to execute a specific job, so we can specify a custom pool for Oak-related jobs. Currently Oak uses the following types of periodic jobs:

1. Async indexing (cluster singleton)
2. Document Store - journal GC (cluster singleton)
3. Document Store - lastRevRecovery
4. Statistics collection - for timeseries data updates in ChangeProcessor, SegmentNodeStore GCMonitor

Now should we use:
A - one single pool for all of the above, or
B - the dedicated pool only for 1-3, leaving #4 on the default pool of 5 threads; even if #2 and #3 are running it would not hamper #1. This assumes #4 is not that critical to run and may consist of lots of jobs.

My suggestion would be to go for #B. Chetan Mehrotra [1] https://sling.apache.org/documentation/bundles/scheduler-service-commons-scheduler.html
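Option #B, expressed with plain JDK executors rather than the Sling scheduler API (pool names and sizes are illustrative only):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DedicatedPoolDemo {
    public static void main(String[] args) throws InterruptedException {
        // Dedicated pool for the critical repository jobs (#1-#3): async
        // indexing, journal GC, lastRev recovery. Small and easy to size.
        ScheduledExecutorService oakPool = Executors.newScheduledThreadPool(2,
                r -> new Thread(r, "oak"));
        // Default/shared pool keeps the many short, non-blocking stats tasks (#4).
        ScheduledExecutorService defaultPool = Executors.newScheduledThreadPool(5);

        CountDownLatch ran = new CountDownLatch(1);
        // Even if the shared pool were exhausted by stats tasks, the critical
        // job still gets a thread from its own pool.
        oakPool.schedule(ran::countDown, 10, TimeUnit.MILLISECONDS);
        System.out.println(ran.await(5, TimeUnit.SECONDS)); // true

        oakPool.shutdown();
        defaultPool.shutdown();
    }
}
```

In the Sling deployment the equivalent would be to pass the custom pool name via the SLING-5831 job property rather than creating executors directly.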
Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)
On Tue, Jul 19, 2016 at 12:54 PM, Michael Dürig wrote: > For blocking or time intensive tasks I would go for a dedicated thread pool. So with regard to the current issue, that means option #B? Chetan Mehrotra
Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)
On Tue, Jul 19, 2016 at 1:21 PM, Michael Dürig wrote: > Not sure as I'm confused by your description of that option. I don't > understand which of 1, 2, 3 and 4 would run in the "default pool" and which > should run in its own dedicated pool. #1, #2 and #3 would run in a dedicated pool, all sharing the same pool, named 'oak'. Also see OAK-4563 for the patch. For #4 the default pool would be used, as those are non-blocking and short tasks. Chetan Mehrotra
Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)
On Tue, Jul 19, 2016 at 1:44 PM, Stefan Egli wrote: > I'd go for #A to limit cross-effects between oak and other layers. Note that for #4 there can be multiple tasks scheduled: if a system has 100 JCR listeners then there would be 1 task per listener to manage the time series stats. These should be quick and non-blocking though. All the other tasks are much more critical for the repository to function properly, hence the thought to go for #B, where we have a dedicated pool for those 'n' tasks, where n is much smaller (number of async lanes + 2 from DocumentNodeStore so far) and thus easy to size. Chetan Mehrotra
Re: Why is nt:resource referencable?
On Wed, Jul 20, 2016 at 2:49 PM, Bertrand Delacretaz wrote: > but the JCR spec (JSR 283 10 August 2009) only has > > [nt:resource] > mix:mimeType, mix:lastModified > primaryitem jcr:data > - jcr:data (BINARY) mandatory That's interesting. I did not know it's not mandated in JCR 2.0. However, it looks like we need to support it for backward compatibility; see [1] where this was changed. @Marcel - I did not understand JCR-2170 properly, but is there any chance we can switch to a newer version of nt:resource, not modify existing nodes, and let the new definition be enforced only on new nodes? Chetan Mehrotra [1] https://issues.apache.org/jira/browse/JCR-2170?focusedCommentId=12754941&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12754941
Re: Why is nt:resource referencable?
On Wed, Jul 20, 2016 at 4:04 PM, Marcel Reutegger wrote: > Maybe we would keep the jcr:uuid property on the referenceable node and add > the mixin? What if we do not add any mixin and just have the jcr:uuid property present? The node would be indexed anyway, so search would still work. I am not sure whether the API semantics require that nodes looked up by UUID be referenceable. For now I think oak:Resource is the safest way, but I am just exploring other options! Chetan Mehrotra
Re: Why is nt:resource referencable?
Thanks for all the details, Marcel and Angela. That helps... so it looks like oak:Resource is the way to go. On Wed, Jul 20, 2016 at 6:17 PM, Angela Schreiber wrote: > I am pretty sure that there was good > intention behind the change in nt-definition between JCR 1.0 and > JCR 2.0... but maybe not fully thought through when it comes to > backwards compatibility Digging further, it appears this concern was raised but not answered [1]:

===
Since referenceable nodes are optional the following changes should be made (decided at F2F):
nt:resource change to NOT referenceable
mix:simpleVerionable change to NOT referenceable
mix:versionable change to referenceable
nt:frozenNode property jcr:frozenUuid change to NOT mandatory
===

Chetan Mehrotra [1] https://java.net/jira/browse/JSR_283-428
Using same index definition for both async and sync indexing
Hi Team, Currently one can set the "async" flag on an index definition to indicate whether a given index should be effective for synchronous commits or be used for async indexing. For the hybrid Lucene indexing case [1] I need a way for the same index definition to be used in both. So if an index definition at /oak:index/fooLuceneIndex is marked as "hybrid" [2], then we need LuceneIndexEditorProvider invoked for both:

1. Commit time - here the editor would just create the Document and not add it to the index
2. Async indexing time - here the currently implemented approach to indexing would happen

In doing that, LuceneIndexEditorProvider needs to be informed in which mode it is being invoked. So to support this we need some enhancement in the IndexUpdate logic whereby the same index definition is used in both modes and the editor knows the indexing mode. Probably this would require a new interface for IndexEditorProvider. Looking for thoughts on how this can be implemented! Chetan Mehrotra [1] https://issues.apache.org/jira/browse/OAK-4412?focusedCommentId=15405340&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15405340 [2] Naming convention to be decided/discussed
Way to capture metadata related to commit as part of CommitInfo from within CommitHook
Hi Team, Currently, as part of a commit the caller can provide a CommitInfo instance which captures some metadata related to the commit being performed. Note that the CommitInfo instance passed to the NodeStore is immutable. For some usecases we need a way to add more metadata to the ongoing commit from within a CommitHook:

OAK-4586 - Collect affected node types on commit. Here we need to record the nodetypes of nodes which got modified as part of the current commit.

OAK-4412 - Lucene hybrid index. Here we want to generate Documents for modified nodestates (per index definition) and "attach" them to the current commit.

This meta information would later be used by an Observer. Currently there is no standard way in the API to achieve that.

#A - Probably we can introduce a new type CommitAttributes which can be attached to CommitInfo and which can be modified by the CommitHooks. The CommitAttributes can then later be accessed by an Observer.

OR

#B - We can just add a mutable attribute map to the CommitInfo instance, which can be populated by CommitHooks.

Thoughts on which approach to go forward with? Chetan Mehrotra
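A minimal, self-contained sketch of the #A idea (all class and key names here are hypothetical, not the actual Oak API): an otherwise-immutable CommitInfo carries one mutable attributes holder that hooks populate during the commit and observers read afterwards.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CommitAttributesDemo {
    // Mutable holder attached to the otherwise immutable CommitInfo.
    static class CommitAttributes {
        private final Map<String, Object> attrs = new ConcurrentHashMap<>();
        void set(String key, Object value) { attrs.put(key, value); }
        Object get(String key) { return attrs.get(key); }
    }

    // Stand-in for Oak's CommitInfo: everything else stays immutable.
    static class CommitInfo {
        private final CommitAttributes attributes = new CommitAttributes();
        CommitAttributes getAttributes() { return attributes; }
    }

    public static void main(String[] args) {
        CommitInfo info = new CommitInfo();
        // a CommitHook records affected node types during commit traversal
        info.getAttributes().set("affectedNodeTypes", Set.of("nt:file", "oak:Resource"));
        // later, an Observer reads them from the same CommitInfo instance
        Object types = info.getAttributes().get("affectedNodeTypes");
        System.out.println(types != null); // true
    }
}
```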
Re: Using same index definition for both async and sync indexing
On Wed, Aug 3, 2016 at 2:23 PM, Alex Parvulescu wrote: > extend the current index definition > for the 'async' property and allow multiple values. That should work and looks like a natural extension of the flag. Just that having an empty value in the array does not look good (it might confuse people in the UI), so we could have a marker value instead of the empty value. >What about overloading the 'IndexUpdateCallback' with a 'isSync()' method > coming from the 'IndexUpdate' component. This will reduce the change > footprint and only components that need to know this information will use > it. That can be done. Going forward we also need to pass in CommitInfo or something like that (see other mail). Another option would be a new interface for IndexEditorProvider (along the same lines as AdvancedQueryIndex > QueryIndex). The editor implementing the new interface would have the extra params passed in, and there we could introduce something like IndexingContext which folds in IndexUpdateCallback, indexing mode, index path, CommitInfo etc. Chetan Mehrotra
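The multi-value 'async' idea might look like the following index definition sketch (the values, and in particular the marker for synchronous indexing, are illustrative; naming is still to be decided):

```
/oak:index/fooLuceneIndex
  - type = "lucene"
  - async = ["async", "sync"]   <- hypothetical: index both at commit time and asynchronously
```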
Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook
So would it be ok to make the map within CommitInfo mutable ? Chetan Mehrotra On Wed, Aug 3, 2016 at 7:29 PM, Michael Dürig wrote: > >> >> #A -Probably we can introduce a new type CommitAttributes which can be >> attached to CommitInfo and which can be modified by the CommitHooks. >> The CommitAttributes can then later be accessed by Observer > > > This is already present via the CommitInfo.info map. It is even used in a > similar way. See CommitInfo.getPath() and its usages. AFAIU the only part > where your cases would differ is that the information is assembled by some > commit hooks instead of being provided at the point the commit was > initiated. > > > Michael
Re: Using same index definition for both async and sync indexing
On Wed, Aug 3, 2016 at 7:52 PM, Alex Parvulescu wrote: > sounds interesting, this looks like a good option. > Now comes the hard part ... what should be the name of this new interface ;) ContextualIndexEditorProvider? Chetan Mehrotra
Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook
That would depend on the CommitHook impl, which client code would not be aware of. And the commit hook itself would only know what to collect as the commit traversal is done. So it needs to be some mutable state. Chetan Mehrotra On Wed, Aug 3, 2016 at 8:27 PM, Michael Dürig wrote: > > Couldn't we keep the map immutable and instead add some "WhateverCollector" > instances as values? E.g. add a AffectedNodeTypeCollector right from the > beginning? > > Michael > > > > On 3.8.16 4:06 , Chetan Mehrotra wrote: >> >> So would it be ok to make the map within CommitInfo mutable ? >> Chetan Mehrotra >> >> >> On Wed, Aug 3, 2016 at 7:29 PM, Michael Dürig wrote: >>> >>> >>>> >>>> #A -Probably we can introduce a new type CommitAttributes which can be >>>> attached to CommitInfo and which can be modified by the CommitHooks. >>>> The CommitAttributes can then later be accessed by Observer >>> >>> >>> >>> This is already present via the CommitInfo.info map. It is even used in a >>> similar way. See CommitInfo.getPath() and its usages. AFAIU the only part >>> where your cases would differ is that the information is assembled by >>> some >>> commit hooks instead of being provided at the point the commit was >>> initiated. >>> >>> >>> Michael
Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook
On Wed, Aug 3, 2016 at 8:57 PM, Michael Dürig wrote: > I would suggest to add an new, internal mechanism to CommitInfo for your > purpose. So introduce a new CommitAttributes instance which would be returned by CommitInfo ... ? Chetan Mehrotra
Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook
Opened OAK-4640 to track this Chetan Mehrotra On Wed, Aug 3, 2016 at 9:36 PM, Michael Dürig wrote: > > > On 3.8.16 5:58 , Chetan Mehrotra wrote: >> >> On Wed, Aug 3, 2016 at 8:57 PM, Michael Dürig wrote: >>> >>> I would suggest to add an new, internal mechanism to CommitInfo for your >>> purpose. >> >> >> So introduce a new CommitAttributes instance which would be returned >> by CommitInfo ... ? > > > Probably the best of all ugly solutions yes ;-) (Meaning I don't have a > better idea neither...) > > Michael > >> >> Chetan Mehrotra >> >
Re: Using same index definition for both async and sync indexing
Opened OAK-4641 for this enhancement Chetan Mehrotra On Wed, Aug 3, 2016 at 8:00 PM, Chetan Mehrotra wrote: > On Wed, Aug 3, 2016 at 7:52 PM, Alex Parvulescu > wrote: >> sounds interesting, this looks like a good option. >> > > Now comes the hard part ... what should be the name of this new > interface ;) ContextualIndexEditorProvider? > > Chetan Mehrotra
Provide a way to pass indexing related state to IndexEditorProvider (OAK-4642)
Hi Team, As a follow-up to the previous mail around "Using same index definition for both async and sync indexing", I wanted to discuss the next step. We need to provide a way to pass indexing-related state to IndexEditorProvider (OAK-4642). Over time I have seen the need for extra state such as:

1. reindexing - currently the index implementations use heuristics, like checking whether the before root state is empty, to determine if they are running in reindexing mode
2. indexing mode - sync or async
3. index path of the index (see OAK-4152)
4. CommitInfo (see OAK-4640)

For #1 and #3 we have done some kind of workaround, but it would be better to have first-class support for them. So we would need to introduce some sort of IndexingContext and have the API for IndexEditorProvider look like:

=
@CheckForNull
Editor getIndexEditor(
    @Nonnull String type, @Nonnull NodeBuilder definition,
    @Nonnull NodeState root,
    @Nonnull IndexingContext context) throws CommitFailedException;
=

To introduce such a change I see 3 options:

* O1 - Introduce a new interface which takes an {{IndexingContext}} instance providing access to such datapoints. This would require some broader change: wherever the IndexEditorProvider is invoked, the code would need to check whether the instance implements the new interface and, if yes, use the new method. Overall it introduces noise.

* O2 - Introduce such data points as part of the callback interface. With this we would need to implement such methods wherever code constructs the callback.

* O3 - Make a backward incompatible change: just modify the existing interface and adapt the various implementations.

I am in favour of going for O3 and making this backward incompatible change. Thoughts? Chetan Mehrotra
Re: Provide a way to pass indexing related state to IndexEditorProvider (OAK-4642)
I have updated OAK-4642 with one more option. === O4 - Similar to O2 but here instead of modifying the existing IndexUpdateCallback we can introduce a new interface ContextualCallback which extends IndexUpdateCallback and provide access to IndexingContext. Editor provider implementation can then check if the callback implements this new interface and then cast it and access the context. So only those client which are interested in new capability make use of this === So provide your feedback there or in this thread Chetan Mehrotra On Thu, Aug 4, 2016 at 12:35 PM, Chetan Mehrotra wrote: > Hi Team, > > As a follow up to previous mail around "Using same index definition > for both async and sync indexing" wanted to discuss the next step. We > need to provide a way to pass indexing related state to > IndexEditorProvider (OAK-4642) > > Over the period of time I have seen need for extra state like > > 1. reindexing - Currently the index implementation use some heuristic > like check before root state being empty to determine if they are > running in reindexing mode > 2. indexing mode - sync or async > 3. index path of the index (see OAK-4152) > 4. CommitInfo (see OAK-4640) > > For #1 and #3 we have done some kind of workaround but it would be > better to have a first class support for that. > > So we would need to introduce some sort of IndexingContext and have > the api for IndexEditorProvider like below > > = > @CheckForNull > Editor getIndexEditor( > @Nonnull String type, @Nonnull NodeBuilder definition, > @Nonnull NodeState root, > @Nonnull IndexingContext context) throws CommitFailedException; > = > > To introduce such a change I see 3 options > > * O1 - Introduce a new interface which takes an {{IndexingContext}} > instance which provide access to such datapoints. This would require > some broader change > ** Whereever the IndexEditorProvider is invoked it would need to check > if the instance implements new interface. 
If yes then the new method needs > to be used. > > Overall it introduces noise. > > * O2 - Here we can introduce such data points as part of the callback > interface. With this we would need to implement such methods in places > where code constructs the callback > > * O3 - Make a backward incompatible change and just modify the > existing interface and adapt the various implementations > > I am in favour of going for O3 and making this backward incompatible change > > Thoughts? > > Chetan Mehrotra
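A rough sketch of option O4 in code. The type and method names here are assumptions taken from this thread (IndexUpdateCallback is the existing Oak interface; ContextualCallback and IndexingContext are only proposals, not the final API):

```java
// Rough sketch of option O4, using names assumed from this thread.
// IndexUpdateCallback is the existing Oak interface; ContextualCallback
// and IndexingContext are the proposed additions, not a final API.
interface ContextualCallback extends IndexUpdateCallback {
    IndexingContext getIndexingContext();
}

interface IndexingContext {
    String getIndexPath();      // #3 - index path (OAK-4152)
    boolean isReindexing();     // #1 - reindexing mode
    boolean isAsync();          // #2 - sync vs async indexing
    CommitInfo getCommitInfo(); // #4 - commit info (OAK-4640)
}

// Inside an IndexEditorProvider implementation, only interested clients
// check for and use the new capability; everyone else is unaffected:
//
//   if (callback instanceof ContextualCallback) {
//       IndexingContext ctx = ((ContextualCallback) callback).getIndexingContext();
//       boolean reindex = ctx.isReindexing();
//       ...
//   }
```

This is what makes O4 backward compatible: existing editor providers never see the new interface unless they ask for it.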
Re: Property index replacement / evolution
Would add one more: 4. Write throughput degradation - For non-unique property indexes which make use of ContentMirrorStoreStrategy we have seen a loss in throughput due to contention arising from conflicts while entries are made in the index. (OAK-2673, OAK-3380) Chetan Mehrotra On Fri, Aug 5, 2016 at 10:34 PM, Michael Marth wrote: > Hi, > > I have noticed OAK-4638 and OAK-4412 – which both deal with particular > problematic aspects of property indexes. I realise that both issues deal with > slightly different problems and hence come to different suggested solutions. > But still I felt it would be good to take a holistic view on the different > problems with property indexes. Maybe there is a unified approach we can take. > > To my knowledge there are 3 areas where property indexes are problematic or > not ideal: > > 1. Number of nodes: Property indexes can create a large number of nodes. For > properties that are very common the number of index nodes can be almost as > large as the number of the content nodes. A large number of nodes is not > necessarily a problem in itself, but if the underlying persistence is e.g. > MongoDB then those index nodes (i.e. MongoDB documents) cause pressure on > MongoDB’s mmap architecture which in turn affects reading content nodes. > > 2. Write performance: when the persistence (i.e. MongoDB) and Oak are “far > away from each other” (i.e. high network latency or low throughput) then > synchronous property indexes affect the write throughput as they may cause > the payload to double in size. > > 3. I have no data on this one – but think it might be a topic: property index > updates usually cause commits to have / as the commit root. This results in > pressure on the root document. > > Please correct me if I got anything wrong or inaccurate in the above. > > My point is, however, that at the very least we should have clarity on which > of the items above we intend to tackle with Oak improvements.
Ideally we > would have a unified approach. > (I realize that property indexes come in various flavours like unique index > or not, which makes the discussion more complex) > > my2c > Michael
Re: Usecases around Binary handling in Oak
This can be done at the Sling level, yes. But then any code which makes use of the JCR API would not be able to access the binary. One way to have it implemented at the Oak level would be to introduce some sort of 'ExternalBinary' and open up an extension in the BlobStore implementation to delegate the binary lookup call to some provider. Just that it needs to honor the contract of the Binary and Blob APIs. That part is easy. The problem comes on the management side, where you need to decide on GC. Probably Oak would need to expose an API to provide a list (iterator) of all such external binaries it refers to, and then the external system can manage the GC. Chetan Mehrotra On Wed, Aug 10, 2016 at 3:26 PM, Ian Boston wrote: > Hi, > > On 10 August 2016 at 10:29, Bertrand Delacretaz > wrote: > >> Hi, >> >> On Tue, Jul 26, 2016 at 4:36 PM, Bertrand Delacretaz >> wrote: >> > ...I've thought about adding an "adopt-a-binary" feature to Sling >> > recently, to allow it to serve existing (disk or cloud) binaries along >> > with those stored in Oak >> >> I just noticed that the Git Large File Storage project uses a similar >> approach, it "replaces large files such as audio samples, videos, >> datasets, and graphics with text pointers inside Git, while storing >> the file contents on a remote server". Maybe there are ideas to >> steal^H^H^H^H^H borrow from there. >> > > Would that be something to do at the Sling level on upload of a large file? > > I am working on a patch to use the Commons File Upload streaming API in > Sling servlets/post as a Operation impl. > I know this is oak-dev, so the question might not be appropriate here. > > Best Regards > Ian > > >> >> -Bertrand >> >> [1] https://git-lfs.github.com/ >>
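To make the 'ExternalBinary' idea concrete, here is a purely hypothetical sketch; none of these types exist in Oak, and every name below is invented for illustration. The shape is: a provider that resolves externally stored binaries, plus an iterator over external references so the external system can drive its own GC:

```java
// Purely hypothetical sketch of the 'ExternalBinary' idea from this thread;
// none of these types exist in Oak. The point is only to illustrate the
// extension shape: lookup delegation plus a reference iterator for GC.
interface ExternalBinaryProvider {
    // Resolve a reference recorded in the node store to the actual bytes
    InputStream getStream(String externalReference) throws IOException;

    long getLength(String externalReference) throws IOException;
}

interface ExternalBinaryAwareBlobStore /* would extend BlobStore */ {
    // Delegate lookup of external references to the registered provider,
    // while still honoring the JCR Binary / Oak Blob contracts
    void setExternalBinaryProvider(ExternalBinaryProvider provider);

    // Expose all external references currently referred to by the
    // repository, so the external system can manage garbage collection
    Iterator<String> getAllExternalReferences();
}
```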
Re: Oak Indexing. Was Re: Property index replacement / evolution
Couple of points around the motivation and target usecase of Hybrid Indexing, and Oak indexing in general, based on my understanding of various deployments. Any application based on Oak has 2 types of query requirements QR1. Application Query - These mostly involve some property restrictions and are invoked by code itself to perform some operation. The property involved here in most cases would be sparse i.e. present in a small subset of the whole repository content. Such queries need to be very fast and they might be invoked very frequently. Such queries should also be more accurate and results should not lag the repository state much. QR2. User provided query - These queries would consist of both or either of property restrictions and fulltext constraints. The target nodes may form the majority of the overall repository content. Such queries need to be fast but, being user driven, need not be very fast. Note that the speed criterion is very subjective and relative here. Further Oak needs to support deployments 1. On single setup - For dev, prod on SegmentNodeStore 2. Cluster setup on premise 3. Deployment in some datacenter So Oak should enable deployments where for smaller setups it does not require any third-party system while still allowing plugging in a dedicated system like ES/Solr if the need arises. So both usecases need to be supported. And further, even if it has access to such a third-party server it might be fine to rely on embedded Lucene for #QR1 and just delegate queries under #QR2 to the remote. This would ensure that query results are still fast for usage falling under #QR1. Hybrid Index Usecase - So far for #QR1 we only had property indexes and, to an extent, the Lucene based property index, where results lag the repository state and the lag might be significant depending on load. Hybrid indexes aim to support queries under #QR1 and can be seen as a replacement for existing non unique property indexes.
Such indexes would have lower storage requirements and would not put much load on remote storage for execution. It's not meant as a replacement for ES/Solr, but intends to address a different type of usage. Very large indexes - For deployments having a very large repository, Solr or ES based indexes would be preferable, and there oak-solr can be used (some day oak-es!) So in brief Oak should be self-sufficient for smaller deployments and still allow plugging in Solr/ES for large deployments, and there also provide a choice to the admin to configure a subset of indexes for such usage depending on the size. Chetan Mehrotra On Thu, Aug 11, 2016 at 1:59 PM, Ian Boston wrote: > Hi, > > On 11 August 2016 at 09:14, Michael Marth wrote: > >> Hi Ian, >> >> No worries - good discussion. >> >> I should point out though that my reply to Davide was based on a >> comparison of the current design vs the Jackrabbit 2 design (in which >> indexes were stored locally). Maybe I misunderstood Davide’s comment. >> >> I will split my answer to your mail in 2 parts: >> >> >> > >> >Full text extraction should be separated from indexing, as the DS blobs >> are >> >immutable, so is the full text. There is code to do this in the Oak >> >indexer, but it's not used to write to the DS at present. It should be >> done >> >in a Job, distributed to all nodes, run only once per item. Full text >> >extraction is hugely expensive. >> >> My understanding is that Oak currently: >> A) runs full text extraction in a separate thread (separate from the >> “other” indexer) >> B) runs it only once per cluster >> If that is correct then the difference to what you mention above would be >> that you would like the FT indexing not be pinned to one instance but >> rather be distributed, say round-robin. >> Right? >> > > > Yes. > > >> >> >> >Building the same index on every node doesn't scale for the reasons you >> >point out, and eventually hits a brick wall.
>> >http://lucene.apache.org/core/6_1_0/core/org/apache/ >> lucene/codecs/lucene60/package-summary.html#Limitations. >> >(Int32 on Document ID per index). One of the reasons for the Hybrid >> >approach was the number of Oak documents in some repositories will exceed >> >that limit. >> >> I am not sure what you are arguing for with this comment… >> It sounds like an argument in favour of the current design - which is >> probably not what you mean… Could you explain, please? >> > > I didn't communicate that very well. > > Currently Lucene (6.1) has a limit of Int32 on the number of documents it > can store in an index. IIUC there is a long term desire to increase that > to Int64, but no long term commitment, as it's probably significant > work given arrays in Java are indexed with Int32. > > The Hybrid approach doesn't help with the potential Lucene brick wall, but one > motivation for looking at it was the number of Oak documents, including > those under /oak:index, which is, in some cases, approaching that limit. > > > >> >> >> Thanks! >> Michael >>
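The per-index limit Ian mentions is exposed by recent Lucene versions as a constant, so a deployment can at least watch how close an index is getting to the wall. A minimal sketch (the reader wiring is elided and assumed to come from the deployment's own index setup):

```java
// IndexWriter.MAX_DOCS is Lucene's hard per-index document limit; it sits
// slightly below Integer.MAX_VALUE to leave headroom for internal use.
// Sketch of a monitoring check against an already-open IndexReader:
int limit = IndexWriter.MAX_DOCS;
int current = indexReader.maxDoc();
if (current > limit * 0.9) {
    // approaching the Int32 brick wall - time to think about sharding
    // or splitting the index
}
```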
Re: Oak Indexing. Was Re: Property index replacement / evolution
On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston wrote: > Both Solr Cloud and ES address this by sharding and > replicating the indexes, so that all commits are soft, instant and real > time. That introduces problems. ... > Both Solr Cloud and ES address this by sharding and > replicating the indexes, so that all commits are soft, instant and real > time. This would really be useful. However I have a couple of aspects to clarify. Index Update Guarantee Let's say a commit succeeds and then we update the index, and the index update fails for some reason. Would that update be missed, or can there be some mechanism to recover? I am not very sure about WAL here; that may be the answer, but still confirming. In Oak, with the way async index update works based on checkpoints, it is ensured that the index would "eventually" contain the right data and no update would be lost. If there is a failure in the index update then that cycle would fail and the next cycle would start again from the same base state. Order of index update - Let's say I have 2 cluster nodes where the same node is being modified Original state /a {x:1} Cluster Node N1 - /a {x:1, y:2} Cluster Node N2 - /a {x:1, z:3} End State /a {x:1, y:2, z:3} At the Oak level both the commits would succeed as there is no conflict. However N1 and N2 would not be seeing each other's updates immediately; that would depend on background read. So in this case how would the index update look? 1. Would index updates for specific paths go to some master which would order the updates 2. Or would it end up with either of {x:1, y:2} or {x:1, z:3} Here the current async index update logic ensures that it sees the eventually expected order of changes and hence would be consistent with the repository state. Backup and Restore --- Would the backup now involve backup of ES index files from each cluster node? Or, assuming full replication, would it involve backup of files from any one of the nodes?
Would the backup be in sync with the last changes done in the repository (assuming a sudden shutdown where changes got committed to the repository but not yet to any index)? Here the current approach of storing index files as part of MVCC storage ensures that the index state is consistent with some "checkpointed" state in the repository. And post restart it would eventually catch up with the current repository state and hence would not require a complete rebuild of the index in case of unclean shutdowns Chetan Mehrotra
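The checkpoint-based guarantee described above can be sketched roughly as follows. NodeStore.checkpoint/retrieve/release are real Oak APIs; the diff-and-index step is elided, and the real AsyncIndexUpdate additionally handles lease management, concurrent-update detection and more:

```java
// Rough sketch of one checkpoint-based async index cycle, as described in
// this mail. Simplified: the real AsyncIndexUpdate in Oak also manages
// leases, failure counters and concurrent updates.
void runOneAsyncIndexCycle(NodeStore store, String lastCheckpointName) {
    // the state the index already reflects
    NodeState before = store.retrieve(lastCheckpointName);

    // pin the current head so the cycle indexes a stable snapshot
    String newCheckpointName = store.checkpoint(TimeUnit.HOURS.toMillis(1));
    NodeState after = store.retrieve(newCheckpointName);

    try {
        // diff 'before' -> 'after' and apply the index updates (elided);
        // on success, remember newCheckpointName as the new base state
        store.release(lastCheckpointName);
    } catch (Exception e) {
        // on failure, drop the new checkpoint; the next cycle starts again
        // from the same base state, so no update is ever lost
        store.release(newCheckpointName);
    }
}
```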
Re: Oak Indexing. Was Re: Property index replacement / evolution
> https://github.com/ieb/oak-es BTW this looks interesting and something we can build upon. This can benefit from a refactoring of LuceneIndexEditor to separate the logic of interpreting the Oak indexing config during editor invocation from constructing the Lucene document. If we decouple that logic then it would be possible to plug in an ES editor which just converts those properties per ES requirements. Hence it gets all the benefits of aggregation, relative property implementation etc. (which is very Oak specific stuff). This effort has been discussed but we never got time to do it so far. Something on the lines of what you are doing at [2] Another approach - With the recent refactoring done in OAK-4566 my plan was to plug in an ES based LuceneIndexWriter (ignore the name for now!) and convert the Lucene Document to some ES Document counterpart. And then provide just the query implementation. This would also allow reusing most of the test cases we have in oak-lucene Chetan Mehrotra [2] https://github.com/ieb/oak-es/blob/master/src/main/java/org/apache/jackrabbit/oak/plusing/index/es/index/take2/ESIndexEditorContext.java On Thu, Aug 11, 2016 at 3:40 PM, Chetan Mehrotra wrote: > On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston wrote: >> Both Solr Cloud and ES address this by sharding and >> replicating the indexes, so that all commits are soft, instant and real >> time. That introduces problems. > ... >> Both Solr Cloud and ES address this by sharding and >> replicating the indexes, so that all commits are soft, instant and real >> time. > > This would really be useful. However I have couple of aspects to clear > > Index Update Gurantee > > > Lets say if commit succeeds and then we update the index and index > update fails for some reason. Then would that update be missed or > there can be some mechanism to recover. I am not very sure about WAL > here that may be the answer here but still confirming.
> > In Oak with the way async index update works based on checkpoint its > ensured that index would "eventually" contain the right data and no > update would be lost. if there is a failure in index update then that > would fail and next cycle would start again from same base state > > Order of index update > - > > Lets say I have 2 cluster nodes where same node is being performed > > Original state /a {x:1} > > Cluster Node N1 - /a {x:1, y:2} > Cluster Node N2 - /a {x:1, z:3} > > End State /a {x:1, y:2, z:3} > > At Oak level both the commits would succeed as there is no conflict. > However N1 and N2 would not be seeing each other updates immediately > and that would depend on background read. So in this case how would > index update would look like. > > 1. Would index update for specific paths go to some master which would > order the update > 2. Or it would end up with with either of {x:1, y:2} or {x:1, z:3} > > Here current async index update logic ensures that it sees the > eventually expected order of changes and hence would be consistent > with repository state. > > Backup and Restore > --- > > Would the backup now involve backup of ES index files from each > cluster node. Or assuming full replication it would involve backup of > files from any one of the nodes. Would the back be in sync with last > changes done in repository (assuming sudden shutdown where changes got > committed to repository but not yet to any index) > > Here current approach of storing index files as part of MVCC storage > ensures that index state is consistent to some "checkpointed" state in > repository. And post restart it would eventually catch up with the > current repository state and hence would not require complete rebuild > of index in case of unclean shutdowns > > > Chetan Mehrotra
Re: Oak Indexing. Was Re: Property index replacement / evolution
On Thu, Aug 11, 2016 at 5:19 PM, Ian Boston wrote: > correct. > Documents are sharded by ID so all updates hit the same shard. > That may result in network traffic if the shard is not local. Focusing on the ordering part as that is the most critical aspect compared to the others. (Backup and restore with a sharded index is a separate problem, to discuss later.) So even if there is a single master for a given path, how would it order the changes, given that local changes only give a partial view of the end state? Also, in such a setup would each query need to consider multiple shards for the final result, or would each node "eventually" sync index changes from other nodes (complete replication) so the query would only use the local index? For me, ensuring consistency in how index updates are sent to ES wrt the Oak view of changes was kind of a blocking feature for enabling parallelization of the indexing process. It needs to be ensured that for concurrent commits the end result in the index is in sync with the repository state. The current single threaded async index update avoids all such race conditions. Chetan Mehrotra
Re: Oak Indexing. Was Re: Property index replacement / evolution
On Thu, Aug 11, 2016 at 7:33 PM, Ian Boston wrote: > That probably means the queue should only > contain pointers to Documents and only index the Document as retrieved. I > dont know if that can ever work. That would not work, as what a document looks like would vary across cluster nodes, and what is to be considered a valid entry is also not defined at that level > Run a single thread on the master, that indexes into a co-located ES cluster. While keeping things simple that looks like the safe way > BTW, how does Hybrid manage to parallelise the indexing and maintain consistency ? Hybrid indexes do not affect async indexes. Under this, each cluster node maintains its own local indexes which only contain local changes [1]. These indexes are not aware of the similar indexes on other cluster nodes. Further, the local indexes are supposed to only contain entries from the last async indexing cycle. Older entries are purged [2]. A query would then consult both indexes (an IndexSearcher backed via a MultiReader: 1 reader from the async index and 1 (or 2) from the local index). Also note that the QueryEngine would enforce and reevaluate the property restrictions. So even if the index has an entry based on old state the QE would filter it out if it does not match the criteria per current repository state. So the aim here is to have the index provide a superset of the result set. In all this the async index logic remains the same (single threaded) and based on diff. So it would remain consistent with the repository state Chetan Mehrotra [1] They might also contain entries which are determined based on external diff. Read [3] for details [2] Purging here is done by maintaining a different local index copy for each async indexing cycle. At max only 2 indexes are retained and older indexes are removed. This keeps the indexes small [3] https://issues.apache.org/jira/browse/OAK-4412?focusedCommentId=15405340&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15405340
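The "IndexSearcher backed via MultiReader" part can be sketched with plain Lucene APIs. How each individual reader is opened is Oak-specific and elided here; MultiReader and IndexSearcher themselves are standard Lucene classes:

```java
// Sketch of combining the shared async index with the per-cycle local
// indexes, as described above. Reader wiring is Oak-specific and elided.
IndexSearcher createHybridSearcher(IndexReader asyncReader,
                                   IndexReader localCurrentReader,
                                   IndexReader localPreviousReader)
        throws IOException {
    // one reader from the async index, one (or two) from the local indexes
    IndexReader combined = new MultiReader(
            asyncReader, localCurrentReader, localPreviousReader);

    // queries against this searcher see a superset of the true result set;
    // the QueryEngine then re-evaluates property restrictions against the
    // current repository state to filter out stale entries
    return new IndexSearcher(combined);
}
```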
Re: normalising the rdb database schema
Hi Tomek, I like the idea of revisiting our current schema based on usage so far. However, a couple of points around potential issues with such a normalized approach - This approach would lead to a thin and long table. As noted in [1], in a small repo of ~14 M nodes we have ~26 M properties. With multiple revisions (GC takes some time) this can go higher. This would then increase the memory requirement for the id index. Memory consumption increases further with an id+key+revision index. For any DB to perform optimally the index should fit in RAM. So such a design would possibly reduce the max size of repository which can be supported (compared to the older one) for a given amount of memory - The read for a specific id can be done in 1 remote call. But that would involve a select across multiple rows, which might increase the time taken as it would involve 'm' index lookups and then 'm' reads of row data for any node having 'n' properties (m > n, assuming multiple revisions per property are present) Maybe we should explore the JSON support being introduced in multiple DBs: DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySql [6]. The problem here is that we would need DB specific implementations, which also increases the testing effort! > we can better use the database features, as now the DBE is aware about the > document internal structure (it’s not a blob anymore). Eg. we can fetch only > a few properties. In most cases the kind of properties stored in the blob part of a DB row are always read as a whole.
Chetan Mehrotra [1] https://issues.apache.org/jira/browse/OAK-4471 [2] http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/ [3] https://msdn.microsoft.com/en-in/library/dn921897.aspx [4] https://docs.oracle.com/database/121/ADXDB/json.htm [5] https://www.postgresql.org/docs/9.3/static/functions-json.html [6] https://dev.mysql.com/doc/refman/5.7/en/json.html On Wed, Aug 17, 2016 at 7:19 AM, Michael Marth wrote: > Hi Tomek, > > I like the idea (agree with Vikas’ comments / cautions as well). > > You are hinting at expected performance differences (maybe faster or slower > than the current approach). That would probably be worthwhile to investigate > in order to assess your idea. > > One more (hypothetical at this point) advantage of your approach: we could > utilise DB-native indexes as a replacement for property indexes. > > Cheers > Michael > > > > On 16/08/16 07:42, "Tomek Rekawek" wrote: > >>Hi Vikas, >> >>thanks for the reply. >> >>> On 16 Aug 2016, at 14:38, Vikas Saurabh wrote: >> >>> * It'd incur a very heavy migration impact on upgrade or RDB setups - >>> that, most probably, would translate to us having to support both >>> schemas. I don't feel that it'd easy to flip the switch for existing >>> setups. >> >>That’s true. I think we should take a similar approach here as with the >>segment / segment-tar implementations (and we can use oak-upgrade to convert >>between them). At least for now. >> >>> * DocumentNodeStore implementation very freely touches prop:rev=value >>> for a given id… […] I think this would get >>> expensive for index (_id+propName+rev) maintenance. >> >>Indeed, probably we’ll have to analyse the indexing capabilities offered by >>different database engines more closely, choosing the one that offers good >>writing speed. >> >>Best regards, >>Tomek >> >>-- >>Tomek Rękawek | Adobe Research | www.adobe.com >>reka...@adobe.com
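To make the row-count trade-off discussed above concrete, here is a sketch of the three access patterns side by side. All table and column names are invented for illustration (they are not the real RDBDocumentStore schema), and the JSON variant uses PostgreSQL-specific syntax:

```java
// Illustrative only - table/column names are invented, not the real RDB
// DocumentStore schema. Shows why a normalized layout multiplies row reads.
void illustrateAccessPatterns(Connection connection) throws SQLException {
    // Current blob-style schema: one row read returns the whole document
    PreparedStatement blobStyle = connection.prepareStatement(
            "SELECT DATA FROM NODES WHERE ID = ?");

    // Normalized schema: one row per (id, key, revision) - a node with 'n'
    // properties needs 'm' >= n index lookups and row reads
    PreparedStatement normalized = connection.prepareStatement(
            "SELECT PROP_KEY, REVISION, PROP_VALUE FROM NODE_PROPS WHERE ID = ?");

    // DB-native JSON support (PostgreSQL ->> operator): fetch a single
    // property without normalizing, at the cost of DB-specific SQL
    PreparedStatement jsonStyle = connection.prepareStatement(
            "SELECT DATA ->> 'jcr:primaryType' FROM NODES WHERE ID = ?");
}
```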
Re: Help with unit tests for JMX stats for S3DataStore
Hi Matt, It would be easier if you can open an issue and provide your patch there so that one can have a better understanding of what needs to be tested. In general you can use MemoryDocumentStore (the default used by the DocumentMK builder) and then possibly use Sling OSGi mocks to pick up the registered MBean services. For an example have a look at SegmentNodeStoreServiceTest, which uses OSGi mocks to activate the service and then pick up the registered services to do the assertions Chetan Mehrotra On Fri, Aug 19, 2016 at 6:14 AM, Matt Ryan wrote: > Hi, > > I’m working on a patch for Oak that would add some JMX stats for > S3DataStore. I’m adding code to register a new MBean in > DocumentNodeStoreService (also SegmentNodeStoreService, but let’s just > worry about the first one for now). > > I wanted to create some unit tests to verify that my new JMX stats are > available via JMX. The idea I had would be that I would simply instantiate > a DocumentNodeStoreService, create an S3DataStore, wrap it in a > DataStoreBlobStore, and bind that in the DocumentNodeStoreService. Then > with a JMX connection I could check that my MBean had been registered, > which it should have been by this time. > > > This was all going relatively fine until I hit a roadblock in > DocumentNodeStoreService::registerNodeStore(). The DocumentMKBuilder uses > a DocumentNodeStore object that I need to mock in order to do the test, and > I cannot mock DocumentNodeStore because it is a final class. I tried > working around that, but ended up hitting another road block in the > DocumentNodeStore constructor where I then needed to mock a NodeDocument - > again, can’t mock it because it is a final class. > > > I realize it is theoretically possible to mock final classes using > PowerMock, although by this point I am starting to wonder if all this > effort is a good way to use my time or if I should just test my code > manually. > > > Is it important that DocumentNodeStore be a final class?
If not, how would > we feel about me simply making the class non-final? If so, what > suggestions do you have to help me unit test this thing? I feel that it > should be easier to unit test new code than this, so maybe I’m missing > something. > > > Thanks > > > -Matt Ryan
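Following up on the SegmentNodeStoreServiceTest pointer, a sketch of what such a test could look like with Sling OSGi mocks. The configuration property name, the MBean type asserted on, and the dependency setup are all assumptions that would need adjusting to the actual patch:

```java
// Sketch only - the config property name and the asserted MBean type
// (S3DataStoreStatsMBean) are assumptions for Matt's hypothetical patch;
// adapt them to whatever the patch actually registers.
public class DocumentNodeStoreServiceMBeanTest {

    @Rule
    public final OsgiContext context = new OsgiContext();

    @Test
    public void mbeanRegisteredOnActivation() {
        // register any dependencies the service needs first, then activate
        // the service under test with a config that avoids a real MongoDB
        // (MemoryDocumentStore is the default used by the DocumentMK builder)
        context.registerInjectActivateService(new DocumentNodeStoreService(),
                Collections.<String, Object>emptyMap());

        // with OSGi mocks there is no need for a JMX connection: the MBean
        // is registered as an OSGi service, so assert on the mock registry
        assertNotNull(context.getService(S3DataStoreStatsMBean.class));
    }
}
```

This sidesteps the final-class problem entirely: nothing is mocked, the real service is activated against an in-memory store.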
RepositorySidegrade and commit hooks
Hi, Does RepositorySidegrade run all the commit hooks required for getting a consistent JCR level state, like the permission editor, property editor etc.? I can see such hooks configured for RepositoryUpgrade but am not seeing any such hook configured for RepositorySidegrade. Probably we should also configure the same set of hooks? Chetan Mehrotra
Re: RepositorySidegrade and commit hooks
For complete migration, yes, all the bits are there. However people also use this for partial incremental migration from a source system to a target system. In that case include paths are provided for those paths whose content needs to be updated. In such a case it can happen that derived content for those paths (property index, permission store entries) does not get updated and that would result in an inconsistent state Chetan Mehrotra On Fri, Aug 19, 2016 at 1:59 PM, Alex Parvulescu wrote: > Hi, > > I don't think any extra hooks are needed here. Sidegrade is just a change > in persistence format, all the bits should be there already in the old > repository. > > best, > alex > > On Fri, Aug 19, 2016 at 6:45 AM, Chetan Mehrotra > wrote: > >> Hi, >> >> Does RepositorySidegrade runs all the commit hooks required for >> getting a consistent JCR level state like permission editor, property >> editor etc >> >> I can such hooks configured for RepositoryUpgrade but not seeing any >> such hook configured for RepositorySidegrade >> >> Probably we should also configure same set of hooks? >> >> Chetan Mehrotra >>
Re: RepositorySidegrade and commit hooks
Thanks Tomek for confirmation. Opened OAK-4684 to track that Chetan Mehrotra On Fri, Aug 19, 2016 at 3:52 PM, Tomek Rekawek wrote: > Hi Chetan, > > yes, it seems that this has been overlooked in the OAK-3239 (porting the > —include-paths support from RepositoryUpgrade). Feel free to create an issue > / commit a patch or let me know if you want me to do it. > > Best regards, > Tomek > > -- > Tomek Rękawek | Adobe Research | www.adobe.com > reka...@adobe.com > >> On 19 Aug 2016, at 10:38, Chetan Mehrotra wrote: >> >> For complete migration yes all bits are there. However people also use >> this for partial incremental migration from source system to target >> system. In that case include paths are provide for those paths whose >> content need to be updated. In such a case it can happen that derived >> content for those paths (property index, permission store entries) do >> not get updated and that would result in inconsistent state >> Chetan Mehrotra >> >> >> On Fri, Aug 19, 2016 at 1:59 PM, Alex Parvulescu >> wrote: >>> Hi, >>> >>> I don't think any extra hooks are needed here. Sidegrade is just a change >>> in persistence format, all the bits should be there already in the old >>> repository. >>> >>> best, >>> alex >>> >>> On Fri, Aug 19, 2016 at 6:45 AM, Chetan Mehrotra >>> wrote: >>> >>>> Hi, >>>> >>>> Does RepositorySidegrade runs all the commit hooks required for >>>> getting a consistent JCR level state like permission editor, property >>>> editor etc >>>> >>>> I can such hooks configured for RepositoryUpgrade but not seeing any >>>> such hook configured for RepositorySidegrade >>>> >>>> Probably we should also configure same set of hooks? >>>> >>>> Chetan Mehrotra >>>> >