[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331433#comment-15331433 ]

Tomek Rękawek commented on OAK-4412:

[~egli], thanks for the suggestion. I improved the patch. At query time we check whether there's a repository change waiting to be processed. If there is, we wait for it. New incoming changes (committed after the user calls query()) are ignored and we won't wait for them. The new logic is mainly placed in the MonitoringBackgroundObserver.

> Lucene-memory property index
>
> Key: OAK-4412
> URL: https://issues.apache.org/jira/browse/OAK-4412
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Reporter: Tomek Rękawek
> Assignee: Tomek Rękawek
> Fix For: 1.6
> Attachments: OAK-4412.patch
>
> When running Oak in a cluster, each write operation is expensive. After performing some stress tests with a geo-distributed Mongo cluster, we found that updating property indexes accounts for a large part of the overall traffic.
> The asynchronous index would be an answer here (as the index update won't be made in the client request thread), but AEM requires the updates to be visible immediately in order to work properly.
> The idea here is to enhance the existing asynchronous Lucene index with a synchronous, locally stored counterpart that will persist only the data since the last Lucene background reindexing job.
> The new index can be stored in memory or (if necessary) in MMAPed local files. Once the "main" Lucene index is updated, the local index will be purged.
> Queries will use a union of results from the {{lucene}} and {{lucene-memory}} indexes.
> The {{lucene-memory}} index, as a locally stored entity, will be updated using an observer, so it'll get both local and remote changes.
> The original idea was suggested by [~chetanm] in the discussion for OAK-4233.
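The query-time wait described in the comment at the top of this message can be pictured with a small Java sketch. It is not the MonitoringBackgroundObserver from the patch: the class {{PendingChangeTracker}} and its methods are hypothetical names, and the sketch assumes the observer processes queued changes in FIFO order. A query snapshots the number of changes queued so far and waits only for those, so changes committed after the call are ignored.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper; NOT the MonitoringBackgroundObserver from the patch. */
class PendingChangeTracker {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition progressed = lock.newCondition();
    private long enqueued;   // changes handed to the observer queue so far
    private long processed;  // changes the observer has finished indexing (FIFO assumed)

    /** Called when a committed change is queued for the observer. */
    void changeEnqueued() {
        lock.lock();
        try {
            enqueued++;
        } finally {
            lock.unlock();
        }
    }

    /** Called by the observer thread once a queued change has been indexed. */
    void changeProcessed() {
        lock.lock();
        try {
            processed++;
            progressed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Waits only for changes queued before this call. Returns false on
     * timeout or interruption, true once the snapshot has been processed.
     */
    boolean awaitPending(long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            long target = enqueued; // snapshot: later changes are ignored
            while (processed < target) {
                if (nanos <= 0L) {
                    return false;
                }
                nanos = progressed.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch a query would call {{awaitPending}} before searching the local index. The timeout is an assumption added here so that a stalled observer cannot block queries indefinitely.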
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320275#comment-15320275 ]

Stefan Egli commented on OAK-4412:

[~tomek.rekawek], another point regarding sync commit-editor vs. async observation handling for updating the local index, and the problem that results from going via async observation: we could stick to async (with the advantage of not burdening commits) and handle the resulting issue, namely that the index would be slightly delayed, by delaying the query (if the index it would use is indeed 'behind') until the index is updated. This would move the potential performance hit from commit time to query time.
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316406#comment-15316406 ]

Tomek Rękawek commented on OAK-4412:

[~chetanm], thanks for the feedback. I'd be happier with using only the observer as well. My main concern is that the observer is informed about changes asynchronously, so it may happen that the user commits their changes and runs the JCR query() before the observer event is processed. Isn't this possible or likely?

Also, I've followed your advice about the indentation. Thanks, the patch is now much smaller.
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316394#comment-15316394 ]

Vikas Saurabh commented on OAK-4412:

Just to clarify, which problem are we trying to solve:
# sometimes, due to delays in async indexing and background read, a subsequent request from the same user gets a result that misses changes (not yet indexed or not yet background-read)?
# some code patterns which currently _expect_ the synchronous nature of the property index (do change -> save -> query -> expect the earlier save to show up) won't cope well with the async nature?

I understand the former problem statement has value and is worth solving. But my reading of this issue is that we are trying to address the latter (code expectations require sync behaviour BUT property indices are currently expensive). I think that, due to the different expectations with respect to timing (the gap between 2 user requests vs. the gap between 2 actions in the same call stack), we might want to discuss the 2 problems separately. It'd be great if the same solution solved both cases, but I think we shouldn't force the same solution on both.
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316248#comment-15316248 ]

Chetan Mehrotra commented on OAK-4412:

Interesting work, Tomek! Before I dig deeper into the patch I need some more understanding of the proposed approach.

*Use of CommitHook*

The changes done by a CommitHook may be rolled back in case of a conflict or if the branch is rebased. Further, a CommitHook can be invoked concurrently, which would cause issues with Lucene indexing as it's single-threaded by design. The patch looks like it makes the CommitHook synchronous, which would have an adverse impact on writes.

Instead of this, I think it would be better to just rely on an Observer, listen there only for local changes, and update the index in the observer call. This would ensure that the index sees only committed changes and also does not impact writes significantly. This approach has the downside that the indexes would lag a bit behind their sync property index counterparts, but that can be offset somewhat with sticky sessions. Consider the following flow:
# User U1 accesses cluster node N1 and performs some update to property "foo", which has a property index
# In a subsequent gesture the request hits N1 again and performs a query

With a property index (sync) the expectation here is that the nodes updated in #1 would be visible to the query. If we switch to the default "async" index then that would fail. However, if we switch to "hybrid" then the in-memory index would include that update and the result would be as expected.

This would work if there is a sticky session at a higher level (session here means user session), which is a suitable expectation for an eventually consistent deployment.

And a minor suggestion: if the patch can avoid significant code displacements, that would be better. So instead of re-indenting the code, maybe introduce a new method which delegates to the old method in some way; that would help to understand the change better (without noise).
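The observer-only approach suggested in this comment can be sketched roughly as follows. This is a simplified stand-in, not Oak's Observer interface: the class name, the {{contentChanged(String, boolean)}} signature, and the {{external}} flag are assumptions made for illustration. The callback is only invoked for committed changes, external changes are skipped, and a single-threaded executor keeps index writes off the commit thread and serialized.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an observer that indexes local changes only. */
class LocalChangeObserverSketch {
    // One thread: index updates are serialized, as Lucene writing expects here.
    private final ExecutorService indexThread = Executors.newSingleThreadExecutor();
    // Stands in for the local in-memory index.
    private final List<String> indexedPaths =
            Collections.synchronizedList(new ArrayList<>());

    /** Invoked after a commit succeeded, so rolled-back changes never arrive. */
    void contentChanged(String path, boolean external) {
        if (external) {
            return; // changes from other cluster nodes are left to the async index
        }
        // The commit thread only enqueues; indexing happens slightly later.
        indexThread.execute(() -> indexedPaths.add(path));
    }

    /** Stops accepting changes, waits for queued ones, returns what was indexed. */
    List<String> shutdownAndGetIndexed() {
        indexThread.shutdown();
        try {
            indexThread.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(indexedPaths);
    }
}
```

The gap between {{contentChanged}} returning and the queued task running is the lag mentioned above; sticky sessions make it less likely that a user notices it, but they do not remove it.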
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313814#comment-15313814 ]

Tomek Rękawek commented on OAK-4412:

Patch attached. {{LuceneHybridTest}} shows how it can be used. The only change required in the index definition to enable the hybrid feature is adding an extra property, {{hybridIndex}}, set to {{true}}.

[~chetanm], [~edivad], [~catholicon] - I'd be grateful for some feedback.
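For readers following the thread, the hybrid behaviour that the issue description asks for (query the union of the persisted index and the local one, purge the local one after the background run) can be pictured with plain maps. This is a conceptual sketch only: {{HybridIndexSketch}} and its methods are hypothetical, no Lucene or Oak classes are involved, and the background run is simplified to copying the local entries into the main map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Conceptual model of the lucene + lucene-memory pair; not code from the patch. */
class HybridIndexSketch {
    // Property value -> paths, as persisted by the asynchronous index.
    private final Map<String, Set<String>> main = new HashMap<>();
    // Property value -> paths changed since the last asynchronous run.
    private final Map<String, Set<String>> memory = new HashMap<>();

    /** Observer callback: record a committed change in the local index. */
    void onCommittedChange(String path, String value) {
        memory.computeIfAbsent(value, k -> new TreeSet<>()).add(path);
    }

    /** Background run: the main index catches up, then the local one is purged. */
    void asyncIndexRun() {
        memory.forEach((value, paths) ->
                main.computeIfAbsent(value, k -> new TreeSet<>()).addAll(paths));
        memory.clear();
    }

    /** Query: union of results from both indexes. */
    Set<String> query(String value) {
        Set<String> result =
                new TreeSet<>(main.getOrDefault(value, Collections.emptySet()));
        result.addAll(memory.getOrDefault(value, Collections.emptySet()));
        return result;
    }

    /** Number of distinct values currently held in the local index. */
    int memorySize() {
        return memory.size();
    }
}
```

A change is visible to {{query}} as soon as the observer callback has run, whether or not the background run has happened yet, which is the property the synchronous index provides today.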
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307659#comment-15307659 ]

Vikas Saurabh commented on OAK-4412:

Observation won't give 'immediate visibility' after session.save(), even for local commits. If immediate visibility of at least the local changes is a hard requirement, we might want to do a commit-hook-based update for local changes and consume only external events from observation. BUT that can lead to potential issues with the expected result set due to differing ordering of revision visibility and indexing, e.g.:
* T1 -> local change {{rL1}} happens and gets indexed
* T2 -> remote change {{rR2}} is read via background read and put into the observation queue
* T3 -> local change {{rL3}} happens and gets indexed
* T4 -> observation event for {{rR2}} is processed and indexed

With this scheduling, the code at T3 could see {{rR2}} when it committed {{rL3}}. A query between T3 and T4 can be issued by the same code expecting results from {{rL1, rR2, rL3}}, but it would actually just get {{rL1, rL3}}.

I can't think of a way to synchronize {{rR2}}'s visibility and indexing short of tying indexing to background read. Alternatively, we might just document it and leave it at that, but if we really want to match today's property indices, we would probably need to resolve this.
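The T1..T4 schedule in this comment can be replayed with a toy model. All names here are hypothetical and nothing is taken from Oak; the sketch only tracks which revisions are readable from the repository and which have reached the local index, assuming local commits are indexed synchronously and remote ones via an observation queue.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;

/** Toy replay of the schedule; revisions are plain strings. */
class VisibilityRaceSketch {
    final Set<String> visible = new LinkedHashSet<>();  // readable from the repository
    final Set<String> indexed = new LinkedHashSet<>();  // present in the local index
    private final Queue<String> observationQueue = new ArrayDeque<>();

    /** Local commit: becomes visible and is indexed by the commit hook at once. */
    void localCommit(String rev) {
        visible.add(rev);
        indexed.add(rev);
    }

    /** Background read: remote revision becomes visible, indexing is only queued. */
    void backgroundRead(String rev) {
        visible.add(rev);
        observationQueue.add(rev);
    }

    /** Observation event: the oldest queued remote revision is indexed. */
    void processObservation() {
        String rev = observationQueue.poll();
        if (rev != null) {
            indexed.add(rev);
        }
    }

    public static void main(String[] args) {
        VisibilityRaceSketch s = new VisibilityRaceSketch();
        s.localCommit("rL1");     // T1
        s.backgroundRead("rR2");  // T2
        s.localCommit("rL3");     // T3
        System.out.println(s.visible); // prints [rL1, rR2, rL3]
        System.out.println(s.indexed); // prints [rL1, rL3]
        s.processObservation();   // T4
        System.out.println(s.indexed); // prints [rL1, rL3, rR2]
    }
}
```

Between T3 and T4 the two sets differ, which is the window in which a query could miss {{rR2}} even though the querying code has already seen it.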
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307635#comment-15307635 ]

Tomek Rękawek commented on OAK-4412:

Branch: https://github.com/trekawek/jackrabbit-oak/tree/OAK-4412

A somewhat working PoC: [LuceneMemoryClusterTest|https://github.com/trekawek/jackrabbit-oak/blob/ef427db94b6122e737b6f61602ae1c97d9b5e397/oak-lucene/src/test/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneMemoryClusterTest.java#L157]